Synthetic Data Generation
Generate synthetic data using random generators, algorithms, statistical models, and Large Language Models (LLMs) to simulate real data for developing and testing solutions effectively.
Synthetic data can be a powerful tool when developing and testing solutions, especially when actual data doesn’t yet exist or isn’t allowed to be used. Synthetic, or artificially generated, data can have many of the characteristics of real data. Synthetic data is typically created using random generators, algorithms, or statistical models to simulate the characteristics and distributions of real data. However, this can be complex to produce or rely on tools with features of varying sophistication. With the availability of Large Language Models (LLMs), more relevant and schema-specific data may be generated that considers characteristics expressed in natural language.
Topics
- Benefits of Synthetic Data Generation
Synthetic data generation enables populating database metadata clones, supporting development, testing, and machine learning projects without using sensitive data from original tables. - Generate Synthetic Data
UseDBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA
function to generate synthetic data and query the data using Select AI actions. - Monitor and Troubleshoot Synthetic Data Generation
When generating large amounts of data across many tables, Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in parallel. The status of each chunk is tracked in theSYNTHETIC_DATA$<operation_id>_STATUS
table.
Benefits of Synthetic Data Generation
Synthetic data generation enables populating database metadata clones, supporting development, testing, and machine learning projects without using sensitive data from original tables.
-
Populating metadata clones with synthetic data: A metadata clone replicates the structure of a database or schema without including actual data. Select AI allows synthetic data generation to populate these clones, protecting sensitive data while enabling development, testing, and creating templates. This approach supports performance and scalability testing.
-
Starting new projects: When starting a new project, actual data may not be available. Synthetic data provides realistic samples to help demonstrate concepts and gain support for project proposals.
-
Validating user experience: Synthetic data aids in testing user interfaces by providing diverse data sets to uncover design flaws, performance, and scalability issues.
-
Supporting AI and machine learning projects: Synthetic data is useful for training AI and machine learning models when real data is unavailable or restricted. LLMs can generate data with specific patterns to facilitate model training and scoring.
Parent topic: Synthetic Data Generation
Generate Synthetic Data
Use DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA
function to
generate synthetic data and query the data using Select AI actions.
To learn more, see Example: Generate Synthetic Data.
Parent topic: Synthetic Data Generation
Monitor and Troubleshoot Synthetic Data Generation
When generating large amounts of data across many tables,
Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in
parallel. The status of each chunk is tracked in the SYNTHETIC_DATA$<operation_id>_STATUS
table.
Synthetic data generation operations are logged in the tables
DBA_LOAD_OPERATIONS
and USER_LOAD_OPERATIONS
. Use
these tables to monitor the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA
operation. See Track DBMS_CLOUD Load Operations for more details.
USER_LOAD_OPERATION
using the
following:SELECT max(id) FROM user_load_operations;
To view the synthetic data operations running in a different session, use the
DBA_LOAD_OPERATIONS
view.
View Status of Synthetic Data Generation
The status table of synthetic data generation operation shows the
progress of each table and its corresponding chunk. The
STATUS_TABLE
column in USER_LOAD_OPERATIONS
or
DBA_LOAD_OPERATIONS
shows the status table name. The table name
is SYNTHETIC_DATA$<operation_id>_STATUS
and it has following
columns:
Name | Datatype | Description |
---|---|---|
|
|
Unique identifier of the record. |
|
|
Qualified name of the table, such as
"ADB_USER"."EMPLOYEES" |
|
|
Desired number of records for this data generation task |
|
|
Actual number of records generated. |
|
|
Starting value for the primary key during this data generation task. |
|
|
Timestamp indicating when the record was last modified. |
|
|
Status of the data generation task. The valid values
are:
|
|
|
Error code, if the data generation task fails. |
|
|
Error message provided if the task fails. |
|
|
Timestamp marking the end of the data generation task. |
To check the number of records generated for each table, issue the following:
SELECT name, SUM(rows_loaded) FROM synthetic_data$<operation_id>_status group by name;
Query ROWS_LOADED
to confirm how many number of rows are
loaded for each chunk, and SUM(ROWS_LOADED)
for rows for each
table.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_list => '[{"owner": "ADB_USER", "name": "DIRECTOR","record_count":150},
{"owner": "ADB_USER", "name": "MOVIE_ACTOR","record_count":300},
{"owner": "ADB_USER", "name": "CLASSES", "user_prompt":"all in fall semester","record_count":5},
{"owner": "ADB_USER", "name": "ACTOR","record_count":220},
{"owner": "ADB_USER", "name": "MOVIE","record_count":50}]'
);
END;
/
-- Check loaded rows for each chunk
SQL> SELECT name, rows_loaded FROM synthetic_data$141_status order by name;
NAME ROWS_LOADED
------------------------------------------
"ADB_USER"."ACTOR" 188
"ADB_USER"."ACTOR" 32
"ADB_USER"."CLASSES" 5
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE" 50
"ADB_USER"."MOVIE_ACTOR" 38
"ADB_USER"."MOVIE_ACTOR" 114
"ADB_USER"."MOVIE_ACTOR" 148
-- Check loaded rows for each table
SQL> SELECT name, SUM(rows_loaded) FROM synthetic_data$141_status group by name;
NAME SUM(ROWS_LOADED)
------------------------------------------
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE_ACTOR" 300
"ADB_USER"."CLASSES" 5
"ADB_USER"."ACTOR" 220
"ADB_USER"."MOVIE" 50
Parent topic: Synthetic Data Generation