Creating Synthetic Data with the Synthetic Data Vault: A Step-by-Step Guide

In today’s data-driven world, real-world data often comes with challenges such as high costs, messiness, and strict privacy regulations. Synthetic data presents a viable solution, enabling businesses to train large language models, simulate fraud detection scenarios, and pre-train vision models without compromising privacy.

What is the Synthetic Data Vault (SDV)?

The Synthetic Data Vault (SDV) is an open-source Python library that generates realistic tabular data using machine learning techniques. It learns patterns from existing datasets and creates high-quality synthetic data, making it safe for sharing, testing, and model training.

Practical Steps to Use SDV

1. Installing the SDV Library

Start by installing the SDV library with the following command:

pip install sdv

2. Reading Your Dataset

To read your dataset, import the CSV handler and point it at the folder containing your CSV files. Each file is loaded as a pandas DataFrame and stored in a dictionary keyed by file name without the .csv extension, so a file named data.csv is accessed as follows:

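from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # Folder that holds the CSV files; adjust if necessary

# Read every CSV in the folder into a dict of DataFrames keyed by file name
data = connector.read(folder_name=FOLDER_NAME)

# 'data' is the key for data.csv; change it to match your own file name
salesDf = data['data']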

3. Importing Metadata

Next, import the metadata for your dataset from a JSON file. This metadata provides essential information about your data structure, including:

  • Table name
  • Primary key
  • Data types of each column (e.g., categorical, numerical, datetime)
  • Column formats (e.g., datetime patterns)
  • Table relationships for multi-table setups

Here’s an example of the JSON format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
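
To make the format fields concrete, the sketch below uses only the Python standard library to write a metadata file like the one above and to check sample values against its regex_format and datetime_format entries. The file name sales_metadata.json, the helper matches_column_format, and the sample values are placeholders for illustration and are not part of SDV. Recent SDV releases can load such a file with Metadata.load_from_json, which is worth confirming against the version you have installed.

```python
import json
import re
import tempfile
from datetime import datetime
from pathlib import Path

# Placeholder metadata mirroring the JSON layout shown above.
metadata_dict = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "your_table_name": {
            "primary_key": "your_primary_key_column",
            "columns": {
                "your_primary_key_column": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
                "date_column": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
                "category_column": {"sdtype": "categorical"},
                "numeric_column": {"sdtype": "numerical"},
            },
            "column_relationships": [],
        }
    },
}


def matches_column_format(column_spec, value):
    """Return True if a string value fits the column's declared format, if any."""
    if "regex_format" in column_spec:
        return re.fullmatch(column_spec["regex_format"], value) is not None
    if "datetime_format" in column_spec:
        try:
            datetime.strptime(value, column_spec["datetime_format"])
        except ValueError:
            return False
        return True
    return True  # no format declared, so there is nothing to check


# Write the metadata to a temporary folder and read it back.
metadata_path = Path(tempfile.mkdtemp()) / "sales_metadata.json"  # hypothetical file name
metadata_path.write_text(json.dumps(metadata_dict, indent=2))
loaded = json.loads(metadata_path.read_text())

table = loaded["tables"]["your_table_name"]
columns = table["columns"]

# The primary key should be one of the declared columns.
primary_key_declared = table["primary_key"] in columns

print(primary_key_declared)
print(matches_column_format(columns["your_primary_key_column"], "T000123"))
print(matches_column_format(columns["date_column"], "31-12-2023"))
print(matches_column_format(columns["date_column"], "2023-12-31"))
```

Running a check like this on a few real values before training is a cheap way to catch a datetime pattern that does not match how the dates are actually stored.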

4. Automatically Detecting Metadata

You can also use SDV to automatically infer the metadata. Detection is based on heuristics, so review the inferred column types, datetime formats, and primary key before training:

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

5. Generating Synthetic Data

With the metadata and dataset ready, train a model to generate synthetic data. Specify the number of rows you want to create:

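from sdv.single_table import GaussianCopulaSynthesizer

# Create the synthesizer from the metadata, then learn patterns from the real data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)

# Generate as many synthetic rows as you need
synthetic_data = synthesizer.sample(num_rows=10000)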

6. Evaluating Synthetic Data Quality

Use SDV tools to evaluate the quality of your synthetic data by comparing it to the original dataset. Start with a quality report:

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

Additionally, visualize the comparisons for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()
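
To give a feel for the kind of per-column score that sits behind a quality report, here is a minimal standalone sketch of two common similarity measures: the KS complement for numerical columns and the TV complement for categorical columns. SDV's documentation describes its column shape scores in these terms, but the code below is an independent re-implementation for illustration, not SDV's own code, and the function names and sample values are assumptions.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    A score of 1.0 means the two empirical distributions are identical.
    """
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The gap between the two empirical CDFs only changes at observed values.
    for x in set(real_sorted) | set(synth_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))
    return 1.0 - max_gap


def tv_complement(real_values, synthetic_values):
    """Return 1 minus the total variation distance between category frequencies."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    distance = 0.0
    for category in set(real_counts) | set(synth_counts):
        real_freq = real_counts[category] / len(real_values)
        synth_freq = synth_counts[category] / len(synthetic_values)
        distance += abs(real_freq - synth_freq)
    return 1.0 - 0.5 * distance


# Invented sample values, purely for illustration.
real_sales = [100, 120, 130, 150]
shifted_sales = [1000, 1200, 1300, 1500]

print(ks_complement(real_sales, real_sales))
print(ks_complement(real_sales, shifted_sales))
print(tv_complement(["A", "A", "B", "B"], ["A", "A", "A", "B"]))
```

Both scores range from 0 to 1, with higher values meaning the synthetic column is closer to the real one.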

7. Visualizing Average Monthly Sales Trends

Analyze average monthly sales trends for both datasets:

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

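# Merge the two series into a DataFrame; a month missing from one dataset
# stays NaN and shows as a gap in the plot instead of a misleading zero
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).sort_index()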

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()

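In this example, the average monthly sales in the two datasets track each other closely, which suggests that the synthesizer captured the overall sales pattern. Matching averages alone do not guarantee that other properties, such as variance or correlations between columns, are preserved, so read this chart alongside the quality report.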
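
A chart is easy to misread, so a numeric summary can back it up. The sketch below computes the relative gap between actual and synthetic monthly averages. The helper monthly_mean_gap is hypothetical, and the two small DataFrames stand in for salesDf and synthetic_data; their values are invented for illustration and are not results from the dataset used in this guide.

```python
import pandas as pd


def monthly_mean_gap(real_df, synthetic_df, date_col="Date", value_col="Sales",
                     date_format="%d-%m-%Y"):
    """Compare per-month averages of two tables without modifying the inputs."""

    def monthly_mean(df):
        dates = pd.to_datetime(df[date_col], format=date_format)
        months = dates.dt.to_period("M").astype(str)
        return df[value_col].groupby(months).mean()

    comparison = pd.concat(
        [monthly_mean(real_df).rename("actual"),
         monthly_mean(synthetic_df).rename("synthetic")],
        axis=1,
    ).sort_index()
    # Relative gap per month; months missing from either table stay NaN.
    comparison["relative_gap"] = (
        (comparison["synthetic"] - comparison["actual"]).abs() / comparison["actual"]
    )
    return comparison


# Invented stand-ins for salesDf and synthetic_data.
real_example = pd.DataFrame({
    "Date": ["01-01-2023", "15-01-2023", "10-02-2023"],
    "Sales": [100.0, 200.0, 300.0],
})
synthetic_example = pd.DataFrame({
    "Date": ["05-01-2023", "20-02-2023", "25-02-2023"],
    "Sales": [120.0, 330.0, 270.0],
})

result = monthly_mean_gap(real_example, synthetic_example)
print(result)
```

With your own data, pass salesDf and synthetic_data instead of the stand-ins and look for months where relative_gap is large.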

Conclusion

This guide walked through preparing your data for synthetic data generation with the SDV library. By training a model on your original dataset, SDV can produce synthetic data that mirrors the statistical patterns of the real data. We also explored evaluation and visualization techniques to check that the synthetic data preserves key metrics. Synthetic data can help your business work around privacy and availability hurdles while supporting data analysis and machine learning workflows.

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com
