Step-by-Step Guide to Creating Synthetic Data with the Synthetic Data Vault (SDV)
Real-world data is often expensive to collect, messy, and subject to strict privacy regulations. Synthetic data offers a practical alternative: it lets businesses train large language models, simulate fraud detection scenarios, and pre-train vision models without compromising privacy.
What is the Synthetic Data Vault (SDV)?
The Synthetic Data Vault (SDV) is an open-source Python library that generates realistic tabular data using machine learning techniques. It learns patterns from an existing dataset and produces synthetic data with similar statistical properties, which makes the result better suited for sharing, testing, and model training than the original records.
Practical Steps to Use SDV
1. Installation of the SDV Library
Start by installing the SDV library with the following command:
pip install sdv
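To confirm the install worked before going further, a quick check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper name installed_version is made up for this guide and is not part of SDV.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


sdv_version = installed_version("sdv")
if sdv_version is None:
    print("sdv is not installed in this environment; run: pip install sdv")
else:
    print(f"sdv {sdv_version} is installed")
```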
2. Reading Your Dataset
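To read your dataset, import the CSV handler and point it at the folder containing your CSV files. Each file is loaded as a pandas DataFrame, and the DataFrames are returned in a dictionary keyed by file name without the extension. The example below therefore assumes the main dataset is stored as data.csv: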
from sdv.io.local import CSVHandler
connector = CSVHandler()
FOLDER_NAME = '.' # Adjust if necessary
data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']
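The key used in data['data'] depends on the file names in your folder. Assuming the handler names each table after its CSV file without the extension (so data.csv becomes 'data'), the sketch below lists the keys you can expect. The helper expected_table_names is hypothetical and only uses the standard library.

```python
from pathlib import Path


def expected_table_names(folder_name):
    """List the CSV files in a folder by name without extension, sorted."""
    folder = Path(folder_name)
    return sorted(path.stem for path in folder.glob("*.csv"))


# Prints the table names for the current folder, e.g. the same folder as FOLDER_NAME.
print(expected_table_names("."))
```

If the name you expect is missing from this list, check the folder path and the file extension before calling the handler.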
3. Importing Metadata
Next, import the metadata for your dataset from a JSON file. This metadata provides essential information about your data structure, including:
- Table name
- Primary key
- Data types of each column (e.g., categorical, numerical, datetime)
- Column formats (e.g., datetime patterns)
- Table relationships for multi-table setups
Here’s an example of the JSON format:
{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
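Before handing a metadata file to SDV, it can be worth running a basic structural check on it. The sketch below parses the example JSON with the standard library and reports two common mistakes: a primary key that is not a listed column, and a column without an sdtype. The function check_metadata is an illustration written for this guide, not an SDV API, and it does not replace SDV's own validation.

```python
import json


def check_metadata(metadata_dict):
    """Return a list of human-readable problems found in a metadata dict."""
    problems = []
    for table_name, table in metadata_dict.get("tables", {}).items():
        columns = table.get("columns", {})
        primary_key = table.get("primary_key")
        if primary_key is not None and primary_key not in columns:
            problems.append(
                f"{table_name}: primary key '{primary_key}' is not a listed column"
            )
        for column_name, spec in columns.items():
            if "sdtype" not in spec:
                problems.append(f"{table_name}.{column_name}: missing 'sdtype'")
    return problems


example = json.loads("""
{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
        "date_column": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
        "category_column": {"sdtype": "categorical"},
        "numeric_column": {"sdtype": "numerical"}
      }
    }
  }
}
""")

# An empty list means neither of the two checks found a problem.
print(check_metadata(example))
```

To load the file into SDV itself, recent SDV releases provide Metadata.load_from_json(filepath=...); check the documentation for the version you installed, since the metadata API has changed between releases.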
4. Automatically Detecting Metadata
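Alternatively, SDV can infer the metadata directly from the loaded DataFrames. Review the result before training, because detected column types are not always correct:
from sdv.metadata import Metadata
# 'data' is the dictionary of DataFrames returned by connector.read()
metadata = Metadata.detect_from_dataframes(data)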
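One way to review detected metadata is to compare each column's sdtype with what you expect. The sketch below works on a plain dictionary, such as the one returned by metadata.to_dict(). The helper sdtype_mismatches and the detection result in the example are illustrative assumptions, not output from a real run.

```python
def sdtype_mismatches(metadata_dict, table_name, expected_sdtypes):
    """Map column name -> (expected, detected) for every column that differs."""
    columns = metadata_dict["tables"][table_name]["columns"]
    mismatches = {}
    for column_name, expected in expected_sdtypes.items():
        # A column missing from the metadata is reported with detected=None.
        detected = columns.get(column_name, {}).get("sdtype")
        if detected != expected:
            mismatches[column_name] = (expected, detected)
    return mismatches


# Hypothetical detection result in which 'Date' was read as a category.
detected = {
    "tables": {
        "data": {
            "columns": {
                "Date": {"sdtype": "categorical"},
                "Sales": {"sdtype": "numerical"},
            }
        }
    }
}

expected = {"Date": "datetime", "Sales": "numerical"}
print(sdtype_mismatches(detected, "data", expected))
```

With a real SDV metadata object, you would pass metadata.to_dict() as the first argument. Any mismatch can then be corrected on the metadata object; SDV offers an update_column method for this, whose exact signature is best checked in the documentation for your version.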
5. Generating Synthetic Data
With the metadata and dataset ready, train a model to generate synthetic data. Specify the number of rows you want to create:
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)
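After sampling, a quick sanity check is to confirm that the synthetic table has the same columns as the original. The sketch below takes plain lists of column names so that it runs on its own; the helper column_differences is hypothetical, and the column names 'Date' and 'Sales' are the ones used later in this guide.

```python
def column_differences(real_columns, synthetic_columns):
    """Return (missing, unexpected) column names, each as a sorted list."""
    real_set = set(real_columns)
    synthetic_set = set(synthetic_columns)
    missing = sorted(real_set - synthetic_set)
    unexpected = sorted(synthetic_set - real_set)
    return missing, unexpected


# With pandas DataFrames, pass list(salesDf.columns) and
# list(synthetic_data.columns) instead of these literal lists.
missing, unexpected = column_differences(["Date", "Sales"], ["Date", "Sales"])
print("missing:", missing, "unexpected:", unexpected)
```

Two empty lists mean the column sets match. You can also compare len(synthetic_data) with the num_rows value you requested.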
6. Evaluating Synthetic Data Quality
Use SDV tools to evaluate the quality of your synthetic data by comparing it to the original dataset. Start with a quality report:
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)
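For intuition about what a distribution-similarity score measures on a numerical column, the sketch below computes one minus the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions. The SDMetrics library behind SDV's reports documents a similar metric called KSComplement. This version is a plain-Python illustration written for this guide, not SDV's implementation, and the sample values are made up.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 - KS statistic: 1.0 for identical samples, 0.0 for disjoint ones."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)
    if not real_sorted or not synthetic_sorted:
        raise ValueError("both samples must contain at least one value")

    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for value in set(real_sorted) | set(synthetic_sorted):
        real_cdf = bisect_right(real_sorted, value) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(real_cdf - synthetic_cdf))
    return 1.0 - max_gap


# Made-up sales figures, used only to show the call.
real_sales = [120.0, 135.5, 150.0, 160.0, 210.0]
synthetic_sales = [118.0, 140.0, 149.0, 165.0, 205.0]
print(round(ks_complement(real_sales, synthetic_sales), 3))
```

A score near 1 means the two samples have almost the same distribution; a score near 0 means they barely overlap.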
Additionally, visualize the comparisons for specific columns:
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)
fig.show()
7. Visualizing Average Monthly Sales Trends
Analyze average monthly sales trends for both datasets:
import pandas as pd
import matplotlib.pyplot as plt
# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')
# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)
# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')
# Merge the two series into one DataFrame; a month missing from either dataset is filled with 0
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')
plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()
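In this example, the two lines track each other closely: the average monthly sales in the synthetic data are quite similar to those in the original data. That suggests the synthesizer preserved this trend, although a single aggregate plot is only one indication of quality and does not replace the quality report.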
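To put a number on how similar the monthly averages are, you can compute the relative gap for each month. The sketch below takes two plain dictionaries keyed by month; with the pandas Series from the code above, actual_avg_monthly.to_dict() and synthetic_avg_monthly.to_dict() produce that shape. The helper monthly_relative_gaps and the figures in the example are made up for illustration.

```python
def monthly_relative_gaps(actual_by_month, synthetic_by_month):
    """Relative gap |synthetic - actual| / |actual| for months present in both."""
    gaps = {}
    for month, actual in actual_by_month.items():
        # Skip months missing from the synthetic data or with a zero baseline.
        if month in synthetic_by_month and actual != 0:
            gaps[month] = abs(synthetic_by_month[month] - actual) / abs(actual)
    return gaps


# Illustrative monthly averages, not results from a real dataset.
actual = {"2023-01": 200.0, "2023-02": 250.0}
synthetic = {"2023-01": 210.0, "2023-02": 240.0}

gaps = monthly_relative_gaps(actual, synthetic)
for month, gap in sorted(gaps.items()):
    print(f"{month}: {gap:.1%}")
```

Months that appear in only one dataset are skipped here rather than treated as zero, so the gaps describe only months that can be compared directly.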
Conclusion
This guide walked through loading a dataset, defining or detecting its metadata, and training an SDV synthesizer to generate synthetic data that mirrors the patterns in the original. It also covered evaluation and visualization techniques for checking that the synthetic data preserves key metrics. Synthetic data can help your business work around privacy and availability hurdles while supporting data analysis and machine learning workflows.