SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains
SynthTextEval is an open-source library built to enable comprehensive evaluation of synthetic text generated by large language models (LLMs). With privacy concerns on the rise, especially in high-stakes domains, ensuring the utility, fairness, and privacy of systems trained on synthetic data is crucial. SynthTextEval is a unified framework designed for evaluating synthetic text across multiple dimensions, and integrates both the synthetic text generation and evaluation into a single pipeline. Additionally, it supports text generation using differentially private methods and enables qualitative assessments of this data, making it the first framework to offer a holistic approach to synthetic text evaluation.
- Synthetic Text Generation 📄: Enables controllable text generation, using control codes to create targeted synthetic data based on specific attributes of the training data. It also supports differential privacy for additional privacy protection when training text generators.
- Downstream Utility Evaluation 📌: Assesses the effectiveness of synthetic text for tasks like classification and coreference resolution, allowing direct performance comparison between real and synthetic datasets.
- Fairness Evaluation ⚖️: Analyzes model performance across subgroups to detect distributional biases in models trained on the generated text.
- Automated Open-Ended Text Evaluation 🎯: Measures the quality of synthetic text with metrics such as Fréchet distance (FID), MAUVE, and perplexity, which capture distributional differences between synthetic and real text corpora.
- Visualization and Descriptive Text Analysis 📊: Provides tools for visualizing and analyzing key text features such as named entities, n-grams, TF-IDF, and topic models, giving deeper insight into text structure, diversity, and themes.
- Privacy and Memorization 🔐: Detects memorization in synthetic outputs and language models through privacy evaluations that include canary attacks and entity-based metrics. These checks help verify that sensitive information is not inadvertently reproduced and support compliance with data regulations.

Our GUI lets domain experts explore and annotate synthetic and real text samples, supporting more nuanced, qualitative evaluation.
Key Features:
- 🔍 Text Similarity Exploration: For any given synthetic text, display the most similar real text side-by-side.
- 🏷️ Entity-Based Comparison: For synthetic texts containing named entities, view real texts that contain the same entity.
- 📝 Annotation & Feedback: Write, save, and share free-form comments on text samples to support collaborative qualitative review.
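
The similarity view described above can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the GUI's actual implementation: the function name `most_similar_real_text` is hypothetical, and `difflib.SequenceMatcher` over word tokens is an assumed stand-in for whatever similarity measure the interface really uses (which may well be embedding-based).

```python
from difflib import SequenceMatcher


def most_similar_real_text(synthetic_text, real_texts):
    """Return (best_match, score) for the real text closest to synthetic_text.

    Similarity is the SequenceMatcher ratio over lowercased word tokens,
    an assumed stand-in for the GUI's own measure.
    """
    synth_tokens = synthetic_text.lower().split()
    best_text, best_score = None, -1.0
    for real in real_texts:
        score = SequenceMatcher(None, synth_tokens, real.lower().split()).ratio()
        if score > best_score:
            best_text, best_score = real, score
    return best_text, best_score


# Toy example: two "real" documents and one synthetic sample
real_corpus = [
    "The patient was admitted with chest pain and discharged after two days.",
    "The court dismissed the appeal on procedural grounds.",
]
match, score = most_similar_real_text(
    "The patient was admitted with severe chest pain.", real_corpus
)
print(match, round(score, 2))
```

A token-level ratio like this is cheap and easy to inspect, which suits a side-by-side review, but it will miss paraphrases that share little surface vocabulary.
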
| Platform | Download Link |
|---|---|
| 🖥️ Mac (arm64) | Download for Mac arm64 |
| 🪟 Windows (x64) | Download for Windows x64 |
| 🪟 Windows (arm64) | Download for Windows arm64 |
The following is a condensed structure of our repository and the functionality it enables.
synthtexteval
|____data # Dataset directory for data stored and used in the pipeline
|____scripts # Contains scripts for running the experiments in the system demonstration
|____synthtexteval
| |____eval
| | |____descriptive
| | | |____descriptor.py # Module for descriptive analysis and statistics of the text
| | |____text_quality
| | | |____metrics.py # Contains automated evaluation metrics for machine-generated text
| | |____downstream
| | | |____classify
| | | |____coref
| | |____privacy
| | | |____canary # Directory for canary-based privacy evaluations
| | | |____metrics.py # Contains metrics for entity leakage and span memorization
| | |____fairness
| | | |____metrics.py
| |____generation # Module for generating synthetic text with controllable generation and differential privacy
| | |____controllable
| |____utils
| | |____filtering.py # Filtering functionality provided in case the user wants to filter synthetic text
| | |____utils.py
|____demo.ipynb # A sample demo for running the functionality provided in our package.
|____README.md
|____requirements.txt
|____setup.py
The package can be installed directly through pip:
pip install SynthTextEval
Alternatively, you can clone the repository and run the commands below to set up the package.
git clone https://github.com/kr-ramesh/synthtexteval/
Then run the following commands to install the dependencies and install the package locally:
pip install -r requirements.txt
pip install -e .
To set up the directories, run the scripts/download_data/setup_dir.sh script. For downloading the Text Anonymization Benchmark (TAB) and MIMIC-III datasets, please follow the instructions in the /scripts/download_data folder. Note that accessing MIMIC-III requires a PhysioNet account and CITI certifications. You can preprocess and format the datasets by following the instructions in the respective dataset subdirectories under scripts/.
The scripts/ folder also includes demo scripts, which are also available in a more general form within their corresponding subdirectories.
The TextDescriptor class provides comprehensive text analysis capabilities, including named entity recognition, n-gram frequency analysis, TF-IDF computation, and topic modeling using LDA. Designed for evaluating synthetic and real text, it offers functionality to extract insights, visualize entity distributions, and save results for further analysis. Some of the functions can be called as follows:
from synthtexteval.eval.descriptive.descriptor import TextDescriptor
from synthtexteval.eval.descriptive.arguments import TextDescriptorArgs
desc_analyze = TextDescriptor(texts = texts, # A list of texts to analyze
args = TextDescriptorArgs(produce_plot=True), # Passes the arguments and the hyperparameters for the descriptor module
reference_texts = # (Optional) A list of reference texts, typically sourced from the real distribution.
)
# Example functionality:
desc_analyze.analyze_entities()
desc_analyze._topic_modeling_display()
desc_analyze._compute_tfidf()
desc_analyze._ngram_frequency()

If our synthetic data lacks label annotations, we can generate silver annotations for it using an existing pretrained model.
from synthtexteval.eval.downstream.classify.generate_silver_annotations import generate_silver_annotations
generate_silver_annotations(
model_name = "bert-base-uncased",
path_to_model = # path to model to use for silver annotations,
n_labels = # number of labels,
problem_type = # multiclass or multilabel classification,
data_path = # path to data to annotate,
text_column = # input text column,
label_column = # label column,
output_path = # path to the annotated CSV,
ckpt_exists = True # set to True when loading from local files
)

To prepare synthetic data for classification tasks, use create_classification_dataset to convert a CSV file into a structured dataset.
from synthtexteval.utils.utils import create_classification_dataset
# Create a classification dataset
_, _, _ = create_classification_dataset(
df,
label_column= # name of the label column,
output_json_path= # path to the output json file used for mapping the data to a numeric label,
output_dir= # path to where the data splits will be saved,
multilabel=False,
train_ratio=0.7, test_ratio=0.15, val_ratio=0.15
)

After creating the dataset, train a classifier using the synthtexteval.eval.downstream.classify module. We also provide a training script for this in the eval/downstream/classify subdirectory. This script can also be used to test the classifier using the same module. Additional detailed documentation on the module, including its functionalities for testing with synthetic data and augmenting the training set with synthetic data, can be found in this subdirectory.
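
To make the split arguments of the dataset-creation step concrete, here is a minimal standard-library sketch of a 70/15/15 split combined with a label-to-index mapping. It is an illustration under assumptions, not the code behind create_classification_dataset: the helper `split_with_label_map`, its arguments, and its return values are all hypothetical.

```python
import random


def split_with_label_map(rows, label_key, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Shuffle rows, map string labels to integer ids, and split into train/val/test.

    The test split receives whatever remains after train and val are taken.
    """
    labels = sorted({row[label_key] for row in rows})
    label_to_id = {label: idx for idx, label in enumerate(labels)}

    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    encoded = [{**row, label_key: label_to_id[row[label_key]]} for row in shuffled]

    n_train = round(len(encoded) * train_ratio)
    n_val = round(len(encoded) * val_ratio)
    train = encoded[:n_train]
    val = encoded[n_train:n_train + n_val]
    test = encoded[n_train + n_val:]
    return train, val, test, label_to_id


# Toy data: 20 rows with alternating labels
rows = [{"text": f"sample {i}", "label": "pos" if i % 2 else "neg"} for i in range(20)]
train, val, test, label_to_id = split_with_label_map(rows, "label")
print(len(train), len(val), len(test), label_to_id)
```

The label mapping plays the role of the JSON file written to output_json_path: it records which numeric id corresponds to which original label so that predictions can be decoded later.
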
# Command to execute the training script with the required parameters:
#
# Parameters:
# --model_name: Name of the pre-trained model to use (e.g., bert-base-uncased)
# --path_to_dataset: Path to the dataset directory (e.g., sst2)
# --path_to_model: Directory where the trained model will be saved (e.g., models/bert-base-uncased-sst2)
# --num_labels: The number of output labels for classification (e.g., 2 for binary classification)
# --is_train: Whether to train the model (True or False)
# --is_test: Whether to test the model (True or False)
#
# Usage:
# sh train.sh <model_name> <path_to_dataset> <path_to_model> <num_labels> <is_train> <is_test>
# Example:
sh train.sh bert-base-uncased "sst2" models/bert-base-uncased-sst2 2 True True

import os

temp_output_dir = './temp' # Define a temp output directory
model_dir = temp_output_dir + '/base_pretrained_model' # Path to where the base_pretrained_model is downloaded/saved
os.makedirs(temp_output_dir, exist_ok = True)

Download a pre-trained model (Instructions sourced from here):
export MODEL_DIR=<model_dir>
curl -L https://www.dropbox.com/sh/7hpw662xylbmi5o/AAC3nfP4xdGAkf0UkFGzAbrja?dl=1 > temp_model.zip
unzip temp_model.zip -d $MODEL_DIR
rm -rf temp_model.zip

In case our synthetic data does not have coreference annotations, we can generate silver annotations for this synthetic data using an existing pretrained model.
from synthtexteval.eval.downstream.coref.minimize_synth import minimize_file
synthetic_data_path = # Path to the synthetic data (a csv file)
output_path = "./temp" # Path to where outputs are saved
sample_size = 100 # Sample 100 examples from the synthetic data
minimize_file(synthetic_data_path, output_path, sample_size)

Next, fine-tune a model on these silver annotations and test it on gold data (the paths can be specified in the arguments).
from synthtexteval.eval.downstream.coref.run_coref_comparison import coref_train
from synthtexteval.eval.downstream.coref.arguments import set_default_coref_args
args = set_default_coref_args(output_dir= # path to where the outputs are saved,
base_model_dir = # directory where the base model is saved,
test_file = # path to the test.jsonlines file
)
coref_train(args)

Alternatively, we can run this with the following script provided in the eval/downstream/coref subdirectory.
python run_coref_comparison.py \
--output_dir=$temp_output_dir \
--model_type=longformer \
--base_model_name_or_path=$model_dir \
--tokenizer_name=allenai/longformer-large-4096 \
--test_file=$test_file \
--do_infer \
--num_train_epochs=$num_train_epochs \
--logging_steps=100 \
--save_steps=1000 \
--eval_steps=150 \
--max_seq_length=4000 \
--predict_file=$predict_file \
--predict_file_write=$predict_file_write \
--normalise_loss \
--max_total_seq_len=4000 \
--experiment_name=eval_model \
--warmup_steps=5600 \
--adam_epsilon=1e-6 \
--head_learning_rate=3e-4 \
--learning_rate=1e-5 \
--adam_beta2=0.98 \
--weight_decay=0.01 \
--dropout_prob=0.3 \
--save_if_best \
--top_lambda=0.4 \
--tensorboard_dir=$temp_output_dir/tb

Assessing fairness is crucial when evaluating synthetic text models. The analyze_group_fairness_performance() function helps analyze performance disparities across subgroups. The user can define any categorical column in their dataframe as the subgroup_type.
from synthtexteval.eval.downstream.classify.visualize import tabulate_results
path_to_test_output = # Path to the dataframe containing model predictions from the classification task
tabulate_results(csv_results_dir = path_to_test_output,
n_labels = # Number of classes,
print_fairness = # Set to True to calculate fairness metric scores,
subgroup_type = # Demographic attribute for subgroup/fairness analysis
problem_type = # Set classification type (single_label or multilabel)
)

The tabulated results include:
- Evaluation of the classifier's accuracy, precision, recall, and F1-score per subgroup.
- Fairness metric scores that quantify disparities in model performance across different subgroups.
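
As a rough illustration of what a per-subgroup breakdown involves, the sketch below computes accuracy for each subgroup and the largest gap between any two of them. It is a simplified stand-in written with the standard library only: the function names, the record layout, and the choice of an accuracy gap as the disparity measure are assumptions, not the toolkit's actual fairness metrics.

```python
from collections import defaultdict


def subgroup_accuracy(records, subgroup_key):
    """Compute accuracy per subgroup from dicts with 'label' and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[subgroup_key]
        total[group] += 1
        correct[group] += int(rec["label"] == rec["prediction"])
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Largest difference in accuracy between any two subgroups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy predictions with a hypothetical 'group' column as the subgroup attribute
predictions = [
    {"group": "A", "label": 1, "prediction": 1},
    {"group": "A", "label": 0, "prediction": 0},
    {"group": "A", "label": 1, "prediction": 0},
    {"group": "A", "label": 0, "prediction": 0},
    {"group": "B", "label": 1, "prediction": 0},
    {"group": "B", "label": 0, "prediction": 0},
]
per_group = subgroup_accuracy(predictions, "group")
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

A gap near zero suggests the classifier performs similarly across subgroups; a large gap flags a disparity worth inspecting, although small subgroups can produce noisy estimates.
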
Assuming the user already has access to a list of private entities from their original data, we can conduct a privacy evaluation as follows:
from synthtexteval.eval.privacy.metrics import entity_leakage
# Returns the overall percentage of leaked entities and a dictionary containing the entities leaked in each text
total_leakage, privacy_analysis = entity_leakage(paragraphs = # list of synthetic texts,
entities = # list of private entities provided by the user,
entity_leakage_result_path = # path to save the results
)

The search_and_compute_EPO() function helps identify occurrences of private entities in synthetically generated text and extracts surrounding context for analysis. It helps detect memorization in synthetically generated text by identifying instances where spans of text, including private entities, are regurgitated from the training data.
from synthtexteval.eval.privacy.metrics import search_and_compute_EPO
search_and_compute_EPO(synth_file = # Path to dataframe with synthetic text
reference_texts = # List of reference text files
synth_phrase_file_path = # Path to where the entity context spans from synthetic text is saved
entity_patterns = # User-defined list of private entities
max_window_len = # Maximum number of words around each entity to extract its surrounding context
text_field = # Name of the column corresponding to the text field in the dataframe
)

We also provide functionality to conduct canary-based evaluations for evaluating leakage in the generative model. Further details are provided in the eval.privacy subdirectory.
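
The entity-based checks described in this section can be pictured with a small standard-library sketch: scan each synthetic text for a user-supplied list of private entities and keep a short window of surrounding words for every hit. This is not the implementation of entity_leakage or search_and_compute_EPO. The function `find_leaked_entities` is hypothetical, and defining the leakage rate as the fraction of private entities that appear in at least one text is an assumption.

```python
import re


def find_leaked_entities(texts, private_entities, window=3):
    """Return (leak_rate, hits).

    leak_rate: fraction of private entities found in at least one text.
    hits: maps text index -> list of (entity, context) tuples, where context
    is the match plus up to `window` words on each side.
    """
    hits = {}
    leaked = set()
    for idx, text in enumerate(texts):
        found = []
        for entity in private_entities:
            pattern = r"\b" + re.escape(entity) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                before = text[:m.start()].split()[-window:]
                after = text[m.end():].split()[:window]
                found.append((entity, " ".join(before + [m.group(0)] + after)))
                leaked.add(entity)
        if found:
            hits[idx] = found
    leak_rate = len(leaked) / len(private_entities) if private_entities else 0.0
    return leak_rate, hits


# Toy example with made-up names
synthetic_texts = [
    "Mr. John Smith visited the clinic in Boston last week.",
    "The applicant lodged a complaint with the court.",
]
entities = ["John Smith", "Boston", "Jane Doe"]
leak_rate, hits = find_leaked_entities(synthetic_texts, entities, window=2)
print(round(leak_rate, 2), hits)
```

Keeping the surrounding context matters because an entity name on its own may be a coincidence, whereas an entity embedded in a span that also occurs in the training data points to memorization.
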
We can evaluate the quality of synthetic text by comparing it against real-world samples using the Fréchet Inception Distance, MAUVE, and perplexity metrics.
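
For intuition about what perplexity captures, the sketch below scores synthetic text under a unigram model fitted on reference text. It is only an illustration: the toolkit's perplexity metric is configured through LMArgs and presumably relies on a neural language model, whereas the add-one-smoothed unigram model and the function name `unigram_perplexity` here are assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter


def unigram_perplexity(reference_texts, synthetic_texts):
    """Perplexity of synthetic_texts under an add-one-smoothed unigram model
    fitted on reference_texts. Lower values mean the synthetic tokens look
    more like the reference distribution."""
    ref_tokens = [tok for text in reference_texts for tok in text.lower().split()]
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    total = len(ref_tokens)

    log_prob_sum, n_tokens = 0.0, 0
    for text in synthetic_texts:
        for tok in text.lower().split():
            prob = (counts[tok] + 1) / (total + vocab_size)
            log_prob_sum += math.log(prob)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("synthetic_texts contains no tokens")
    # Perplexity is the exponential of the average negative log-likelihood
    return math.exp(-log_prob_sum / n_tokens)


real = ["the cat sat on the mat", "the dog sat on the rug"]
ppl_close = unigram_perplexity(real, ["the cat sat on the rug"])
ppl_far = unigram_perplexity(real, ["quantum flux capacitor overload"])
print(ppl_close, ppl_far)
```

Text that reuses the reference vocabulary receives a lower perplexity than text made of unseen tokens. The toolkit's own interface for these metrics follows.
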
import pandas as pd
from synthtexteval.eval.text_quality.metrics import QualEval
from synthtexteval.eval.text_quality.arguments import MauveArgs, LMArgs, FrechetArgs, Arguments
# Prepare evaluation dataframe containing synthetic and real samples
df = pd.DataFrame({
'source': # Synthetic text
'reference': # Real text
})
args = Arguments(frechet = FrechetArgs, mauve = MauveArgs, perplexity = LMArgs)
qual_eval = QualEval(args)
qual_eval.calculate_fid_score(df) # Frechet Inception Distance (FID)
qual_eval.calculate_mauve_score(df) # MAUVE Score for distribution similarity
qual_eval.calculate_perplexity(df) # Perplexity score for fluency

We provide functionality to train models to generate synthetic data using control codes, which lets users generate synthetic text with attributes they specify. A training script is available in the generation/controllable subdirectory; it accepts the following arguments, some of which are optional (further details are given in that subdirectory):
# Command to execute the training of the synthetic data generator.
# Parameters:
# --model_name : The name of the pre-trained model (e.g., "princeton-nlp/Sheared-LLaMA-1.3B")
# --dir_to_save_model : The directory to save the trained model (e.g., "/data/projects/synthtexteval/models/")
# --disable_dp : Disable Differential Privacy (true/false)
# --epsilon_value : The epsilon value for DP (used when DP is enabled)
# --dataset_name : The identifier for the dataset to be used. Set to 'hf' when loading from HuggingFace.
# --path_to_dataset : Path to the dataset for training the generator model.
# --epochs : Number of epochs to train
# --gradient_accumulation_steps : Number of gradient accumulation steps
# --load_ckpt : Whether to load a checkpoint for continuing training (true/false)
# --path_to_load_model : Path to the checkpoint model (if loading, else set to "")
# --enable_lora : Whether to enable LoRA (true/false)
# Usage:
# sh train.sh <model_name> <dir_to_save_the_model> <disable_dp> <epsilon_value> <dataset_name> <path_to_the_dataset> <epochs> <gradient_accumulation_steps> <load_ckpt> <path_to_load_model> <enable_lora>
# Example:
sh train.sh "gpt2" "models/gpt2_DP_" true "inf" "tab" "dataset.csv" 5 1 false "" true

Our training script provides the functionality to train models with differential privacy and LoRA, a parameter-efficient fine-tuning method. This can be toggled in the run-train.sh or directly in the train.sh script with the disable_dp and enable_lora arguments. The hyperparameters for the privacy budget and LoRA can also be specified in the train.sh script.
sh train.sh "gpt2" "models/gpt_DP_" false 8 "tab" "dataset.csv" 5 64 false "" true

We also provide a script to run inference in the generation/controllable subdirectory.
# Command to generate synthetic data from the trained generator model.
# Parameters:
# --dir_to_save_test_output : Directory to save the test output
# --model_name : The name of the pre-trained model (e.g., "gpt2")
# --dir_to_load_model : Path to the directory where the model checkpoint for loading is stored
# --dataset_name : The identifier for the dataset to be used. Set to 'hf' when loading from HuggingFace.
# --path_to_dataset : Path to the dataset containing control codes for generating synthetic data. Set to None when you want to load from a csv file by specifying path_to_test_dataset
# --path_to_test_dataset : Path to the dataset containing control codes for generating synthetic data. Set to None when you want to load from a HF dataset.
# --disable_dp : Whether to disable Differential Privacy (true/false)
# --epsilon_value : The epsilon value for DP (used when DP is enabled)
# --enable_lora : Whether to enable LoRA (true/false)
# --num_return_seq : Number of synthetic generations to produce per control code.
# Usage:
# sh inf.sh <dir_to_save_test_output> <model_name> <dir_to_load_model> <dataset_name> <path_to_dataset> <path_to_test_dataset> <disable_dp> <epsilon_value> <enable_lora>
# Example:
sh inf.sh "inference.csv" "gpt2" "models/gpt2_DP_" "tab" None "dataset/test.csv" true "inf" true

If you use SynthTextEval in your research or project, please cite:
@inproceedings{ramesh-etal-2025-synthtexteval,
title = "{S}ynth{T}ext{E}val: Synthetic Text Data Generation and Evaluation for High-Stakes Domains",
author = "Ramesh, Krithika and
Smolyak, Daniel and
Zhao, Zihao and
Gandhi, Nupoor and
Agarwal, Ritu and
Bjarnad{\'o}ttir, Margr{\'e}t V. and
Field, Anjalie",
editor = {Habernal, Ivan and
Schulam, Peter and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-demos.35/",
pages = "487--499",
ISBN = "979-8-89176-334-0",
abstract = "We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit{'}s generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development."
}

A special thank you to Tianli Xu for developing the visual evaluation interface!

