📌 This repository is archived with Zenodo and can be cited using the DOI above.
A curated list of tools, models, datasets, and resources for generating, evaluating, and applying synthetic data — artificial data created to augment, protect, or replace real-world datasets for AI, analytics, and research.
Support ongoing maintenance and curation via GitHub Sponsors.
- Synthetic Data Generators
- Privacy & Compliance–Focused Tools
- Simulation Engines
- Image, Video & Multimodal Generators
- Evaluation & Benchmarking
- Datasets
- Learning Resources
- Related Awesome Lists
- SDV (Synthetic Data Vault) – Widely used open-source framework for generating synthetic tabular, relational, and time-series data.
- Gretel – Tools for privacy-preserving synthetic tabular and text data using ML/DL models.
- Synthesized.io – Synthetic tabular data generation with differential privacy.
- ydata-synthetic – GAN-based synthetic data generator for tabular and time-series data.
- CTGAN – Conditional GAN model from the SDV project for generating synthetic tabular data.
- Copulas – Library for modeling multivariate distributions for synthetic data generation.
- Synthetic Data from HuggingFace – LLM-based text generation for domain-specific corpora.
- OpenDP SmartNoise – Differential privacy tools for generating and evaluating synthetic data.
- Mostly AI – Commercial platform for privacy-preserving tabular synthetic data.
- Tonic.ai – Developer-focused synthetic data tool with privacy constraints.
- Hazy – Enterprise-grade platform for secure synthetic data pipelines.
- CARLA Simulator – Autonomous driving simulator for synthetic sensor data.
- AirSim – Drone, robotics, and autonomous vehicle simulation.
- NVIDIA Isaac Sim – High-fidelity robotics simulation with synthetic data generation.
- Unreal Engine – Game engine commonly used to render synthetic visual datasets in research.
- Unity Perception – Synthetic computer vision datasets using Unity.
- Stable Diffusion – Diffusion model for generating synthetic images for CV training.
- Stable Video Diffusion – Video generation for multimodal synthetic data.
- Diffusers – Library for training/customizing diffusion models for synthetic data use cases.
- Sora-based Synthetic Data Examples – High-fidelity video generation for simulation-like datasets.
- DreamSim – Perceptual image similarity metric trained on synthetic image triplets with human similarity judgments.
- GenAI Synthetic Speech Datasets – Tools and models for generating synthetic audio training data.
- RAGAS (for synthetic text evaluation) – Evaluation framework for RAG pipelines; useful when synthetic text feeds retrieval or generation steps.
- Gretel Eval – Built-in evaluation for fidelity, privacy, and memorization.
- SDMetrics – Quality, fidelity, and statistical similarity metrics for synthetic data.
- SmartNoise EVAL – Differential privacy checks for synthetic datasets.
- OpenAI Evals (for synthetic datasets) – LLM-based evaluation framework adaptable to synthetic corpora.
- SDV Demo Data – Starter datasets for synthetic generation experiments.
- Unity Perception Ground Truth – Pre-labeled computer vision synthetic datasets.
- CARLA Sample Datasets – Autonomous driving simulation datasets.
- Open Images + Synthetic Variants – Real + augmented imagery useful for CV pipelines.
- HuggingFace Synthetic Corpora – Curated synthetic and mixed datasets across domains.
- Synthetic Data 101 (SDV) – Introductory and advanced tutorials.
- Gretel Academy – Guides on synthetic text, tabular, and privacy.
- Differential Privacy for Synthetic Data – Research and practical applications.
- Unity Perception Tutorials – Building synthetic CV datasets.
- Autonomous Driving Synthetic Data Guides – Tutorials for generating AV-specific data.
- Awesome AI
- Awesome Machine Learning
- Awesome AI Research Papers
- Awesome AI Infrastructure
- Awesome Computer Vision
Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.