Creating Synthetic Data with the Synthetic Data Vault: A Step-by-Step Guide

In today’s data-driven world, real-world data often comes with challenges such as high costs, messiness, and strict privacy regulations. Synthetic data presents a viable solution, enabling businesses to train large language models, simulate fraud detection scenarios, and pre-train vision models without compromising privacy.

What is the Synthetic Data Vault (SDV)?

The Synthetic Data Vault (SDV) is an open-source Python library that generates realistic tabular data using machine learning techniques. It learns patterns from an existing dataset and samples new synthetic rows with similar statistical properties, which makes the data easier to share, test with, and use for model training.

Practical Steps to Use SDV

1. Installation of the SDV Library

Start by installing the SDV library with the following command:

pip install sdv

2. Reading Your Dataset

To read your dataset, import the CSV handler and point it at the folder containing your CSV files. The read call returns a dictionary that maps each table name (the file name without the .csv extension) to a pandas DataFrame, so the example below assumes the folder contains a file named data.csv:

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # Adjust if necessary

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

3. Importing Metadata

Next, load the metadata for your dataset from a JSON file (recent SDV releases provide Metadata.load_from_json for this). The metadata describes the structure of your data, including:

Table name
Primary key
Data types of each column (e.g., categorical, numerical, datetime)
Column formats (e.g., datetime patterns)
Table relationships for multi-table setups

Here’s an example of the JSON format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
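
Before handing a hand-written metadata file to SDV, it can help to run a quick structural check. The following is a minimal sketch using only the standard library; the function name check_metadata and the two rules it applies (the primary key must be a listed column, and every column needs an sdtype) are assumptions made for illustration, not SDV's own validation.

```python
import json

# Same shape as the JSON example above, inlined so the sketch runs on its own.
RAW_METADATA = """
{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
"""


def check_metadata(metadata_dict):
    """Return a list of human-readable problems (hypothetical helper, not part of SDV)."""
    problems = []
    for table_name, table in metadata_dict.get("tables", {}).items():
        columns = table.get("columns", {})
        primary_key = table.get("primary_key")
        # Rule 1: a declared primary key must also appear in the column list.
        if primary_key is not None and primary_key not in columns:
            problems.append(f"{table_name}: primary key '{primary_key}' is not a listed column")
        # Rule 2: every column needs an sdtype so its kind is unambiguous.
        for column_name, spec in columns.items():
            if "sdtype" not in spec:
                problems.append(f"{table_name}.{column_name}: missing 'sdtype'")
    return problems


metadata_dict = json.loads(RAW_METADATA)
problems = check_metadata(metadata_dict)
print(problems if problems else "No structural problems found")
```

An empty list only means the file passes these two checks; SDV may still reject it for other reasons when the metadata is loaded or used.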

4. Automatically Detecting Metadata

Alternatively, SDV can infer the metadata from the loaded DataFrames. Review the detected column types and primary key before training, since automatic detection can misclassify columns:

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)
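
One way to double-check detected metadata is to compare it with the column types you expect. The sketch below assumes you have already converted the detected metadata to a plain dictionary (SDV metadata objects expose a to_dict() method) and pulled out the "columns" entry for your table. The helper name find_sdtype_mismatches and the sample values are hypothetical.

```python
def find_sdtype_mismatches(detected_columns, expected_sdtypes):
    """Map each mismatched column to a (detected, expected) pair.

    detected_columns: the "columns" dict of one table, e.g. {"Date": {"sdtype": "datetime"}}
    expected_sdtypes: your own expectations, e.g. {"Date": "datetime"}
    """
    mismatches = {}
    for column, expected in expected_sdtypes.items():
        # A column that was not detected at all shows up with detected=None.
        detected = detected_columns.get(column, {}).get("sdtype")
        if detected != expected:
            mismatches[column] = (detected, expected)
    return mismatches


# Illustrative values only: pretend the date column was detected as categorical.
detected_columns = {
    "Date": {"sdtype": "categorical"},
    "Sales": {"sdtype": "numerical"},
}
expected_sdtypes = {"Date": "datetime", "Sales": "numerical"}

print(find_sdtype_mismatches(detected_columns, expected_sdtypes))
# prints {'Date': ('categorical', 'datetime')}
```

Any column reported here is a candidate for a manual correction in the metadata before fitting a synthesizer.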

5. Generating Synthetic Data

With the metadata and dataset ready, train a model to generate synthetic data. Specify the number of rows you want to create:

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)
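
The synthesizer's name refers to a general statistical technique, the Gaussian copula. The following is a minimal sketch of that idea for two numeric columns, using only the standard library. It is not SDV's implementation, which does considerably more (for example, handling many column types). Here the column distributions are approximated by empirical quantiles, and all function names and toy values are made up for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for both transforms


def normal_scores(values):
    """Replace each value by a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined.
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_values, u):
    """Pick the observed value at probability level u (0 <= u <= 1)."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_two_columns(col_a, col_b, num_rows, seed=0):
    """Sample (a, b) pairs that keep each column's values and their dependence."""
    rng = random.Random(seed)
    # 1. Learn the dependence between the columns in normal-score space.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    residual_scale = math.sqrt(max(0.0, 1.0 - rho ** 2))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # 2. Draw a correlated pair of standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_scale * rng.gauss(0, 1)
        # 3. Map each normal back to the column's own scale via its quantiles.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows


# Toy data (hypothetical): units sold and the matching sales amount.
real_units = [3, 5, 2, 8, 7, 1, 6, 4]
real_sales = [30.0, 55.0, 18.0, 85.0, 66.0, 12.0, 61.0, 58.0]

synthetic_rows = sample_two_columns(real_units, real_sales, num_rows=200, seed=1)
print(synthetic_rows[:5])
```

Because this sketch maps back through empirical quantiles, it can only emit values that already occur in the input columns. A synthesizer that fits smooth distributions to each column can produce values that never appear in the original data.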

6. Evaluating Synthetic Data Quality

Use SDV tools to evaluate the quality of your synthetic data by comparing it to the original dataset. Start with a quality report:

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)
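
To build intuition for what a per-column quality score measures, here is a simplified sketch of two standard distribution-similarity measures: the complement of the Kolmogorov-Smirnov statistic for numeric columns and the complement of the total variation distance for categorical columns. The SDMetrics documentation describes metrics of this kind behind the report's column-shape scores, but the code below is an independent illustration with made-up function names and toy values, not the library's code.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs (1.0 = identical)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Toy columns (hypothetical values).
real_sales = [10, 20, 30, 40]
synthetic_sales = [12, 18, 33, 41]
real_region = ["N", "N", "S", "E"]
synthetic_region = ["N", "S", "S", "E"]

print(ks_complement(real_sales, synthetic_sales))
print(tv_complement(real_region, synthetic_region))  # prints 0.75
```

Both scores range from 0 to 1, with 1 meaning the real and synthetic columns have the same distribution under that measure.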

Additionally, visualize the comparisons for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()

7. Visualizing Average Monthly Sales Trends

Analyze average monthly sales trends for both datasets:

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

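# Align the two series on 'Month' and sort so the months plot in chronological order.
# A month missing from one dataset stays NaN and shows as a gap instead of a misleading zero.
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).sort_index()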

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()

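If the two lines track each other closely, the synthetic data has preserved the month-level average of Sales, which is one sign that the synthesizer captured the main patterns. A matching average does not show that variance, outliers, or relationships between columns were preserved, so read this plot alongside the quality report.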
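
To put a number on how close the two monthly curves are, you can compute the relative gap for each month. The sketch below works on plain month-to-average mappings, which you could obtain from the pandas Series above with .to_dict(). The helper name monthly_gaps and the sample figures are hypothetical.

```python
def monthly_gaps(actual_by_month, synthetic_by_month):
    """Relative gap |synthetic - actual| / |actual| for months present in both inputs."""
    gaps = {}
    for month in sorted(set(actual_by_month) & set(synthetic_by_month)):
        actual = actual_by_month[month]
        if actual == 0:
            continue  # a relative gap is undefined when the actual average is zero
        gaps[month] = abs(synthetic_by_month[month] - actual) / abs(actual)
    return gaps


# Illustrative averages only; real values would come from the grouped data.
actual_avg = {"2023-01": 100.0, "2023-02": 120.0}
synthetic_avg = {"2023-01": 110.0, "2023-02": 114.0, "2023-03": 90.0}

gaps = monthly_gaps(actual_avg, synthetic_avg)
worst_month = max(gaps, key=gaps.get)
print(worst_month, round(gaps[worst_month], 3))  # prints 2023-01 0.1
```

Months that appear in only one dataset are skipped here, so it is worth checking separately whether the synthetic data covers the same date range as the original.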

Conclusion

This guide walked through preparing a dataset and its metadata, training an SDV synthesizer on the original data, and sampling synthetic rows that follow similar statistical patterns. We also covered evaluation and visualization techniques for checking whether the synthetic data preserves key metrics. Synthetic data can help your business work around privacy and availability hurdles while supporting data analysis and machine learning workflows.

