kr-ramesh/synthtexteval

SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains

Live Demo of GUI EMNLP Demo 2025 arXiv:2507.07229

Introduction to SynthTextEval

Overview

SynthTextEval is an open-source library built to enable comprehensive evaluation of synthetic text generated by large language models (LLMs). With privacy concerns on the rise, especially in high-stakes domains, ensuring the utility, fairness, and privacy of systems trained on synthetic data is crucial. SynthTextEval is a unified framework designed for evaluating synthetic text across multiple dimensions, and integrates both the synthetic text generation and evaluation into a single pipeline. Additionally, it supports text generation using differentially private methods and enables qualitative assessments of this data, making it the first framework to offer a holistic approach to synthetic text evaluation.

Key Features of SynthTextEval

  • Synthetic Text Generation 📄: Enables controllable text generation using control codes to create targeted synthetic data based on specific training data attributes. It also supports differential privacy to ensure additional privacy protections when training text generators.

  • Downstream Utility Evaluation 📌: Assesses the effectiveness of synthetic text for tasks like classification and coreference resolution, allowing for direct performance comparison between real and synthetic datasets.

  • Fairness Evaluation ⚖️: Analyzes model performance across different subgroups to ensure fairness and detect any distributional biases in models trained on the generated text.

  • Automated Open-Ended Text Evaluation 🎯: Measures the quality of synthetic text and its distributional differences from real text corpora using metrics such as Fréchet Inception Distance (FID), MAUVE, and perplexity.

  • Visualization and Descriptive Text Analysis 📊: Provides tools for visualizing and analyzing key text features like named entities, n-grams, TF-IDF, and topic modeling, providing deeper insights into text structure, diversity, and themes.

  • Privacy and Memorization 🔐: Detects memorization in synthetic outputs and language models through privacy evaluations that include canary attacks and entity-based metrics. This helps ensure that sensitive information is not inadvertently reproduced and supports compliance with data regulations.

    Architectural Overview of SynthTextEval

Visual Evaluation Interface

Our GUI lets domain experts explore and annotate synthetic and real text samples, supporting more nuanced, qualitative evaluation.

Try Live Demo

Key Features:

  • 🔍 Text Similarity Exploration: For any given synthetic text, display the most similar real text side-by-side.
  • 🏷️ Entity-Based Comparison: For synthetic texts containing named entities, view real texts that contain the same entity.
  • 📝 Annotation & Feedback: Write, save, and share free-form comments on text samples to support collaborative qualitative review.
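
The similarity measure behind the Text Similarity Exploration view is not described in this README. As a rough illustration of the idea only, the sketch below pairs a synthetic text with its closest real text using Jaccard overlap of word sets. The function names and the choice of Jaccard similarity are assumptions made for this example, not the GUI's implementation.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def most_similar_real_text(synthetic: str, real_texts: list[str]) -> tuple[str, float]:
    """Return the real text most similar to `synthetic`, with its score."""
    best = max(real_texts, key=lambda real: jaccard(synthetic, real))
    return best, jaccard(synthetic, best)


# Toy data for illustration only.
real_texts = [
    "The patient was admitted with chest pain.",
    "The court dismissed the appeal on procedural grounds.",
]
match, score = most_similar_real_text(
    "The patient reported severe chest pain.", real_texts
)
print(match, round(score, 2))
```

An embedding-based similarity would usually capture paraphrases better than word overlap; Jaccard is used here only because it needs no external dependencies.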

⬇️ Interface Downloads

Platform | Download Link
🖥️ Mac (arm64) | Download for Mac arm64
🪟 Windows (x64) | Download for Windows x64
🪟 Windows (arm64) | Download for Windows arm64
Synthetic Text Review GUI Tool

Repository Structure

The following is a condensed structure of our repository and the functionality it enables.

synthtexteval
|____data # Dataset directory for data stored and used in the pipeline
|____scripts # Contains scripts for running the experiments in the system demonstration
|____synthtexteval
| |____eval
| | |____descriptive
| | | |____descriptor.py # Module for descriptive analysis and statistics of the text
| | |____text_quality
| | | |____metrics.py # Contains automated evaluation metrics for machine-generated text
| | |____downstream
| | | |____classify
| | | |____coref
| | |____privacy
| | | |____canary # Directory for canary-based privacy evaluations
| | | |____metrics.py  # Contains metrics for entity leakage and span memorization
| | |____fairness
| | | |____metrics.py
| |____generation # Module for generating synthetic text with controllable generation and differential privacy
| | |____controllable
| |____utils
| | |____filtering.py # Filtering functionality provided in case the user wants to filter synthetic text
| | |____utils.py
|____demo.ipynb # A sample demo for running the functionality provided in our package.
|____README.md
|____requirements.txt
|____setup.py


Installation Instructions:

The package can be installed directly through pip:

pip install SynthTextEval

Alternatively, you can clone the repository and run the commands below to set up the package.

git clone https://github.com/kr-ramesh/synthtexteval/

Setting up the environment

Run the following commands to install the dependencies and then install the package locally in editable mode:

pip install -r requirements.txt
pip install -e .

Dataset download and instructions

To set up the directories, run the scripts/download_data/setup_dir.sh script. To download the Text Anonymization Benchmark (TAB) and MIMIC-III datasets, follow the instructions in the scripts/download_data folder. Note that accessing MIMIC-III requires a PhysioNet account and CITI certifications. You can preprocess and format the datasets by following the instructions in the respective dataset subdirectories under scripts/.

The scripts/ folder also includes demo scripts; more general versions of these scripts are available in their corresponding subdirectories.


Evaluation Pipeline:

Generating Descriptive Statistics

The TextDescriptor class provides comprehensive text analysis capabilities, including named entity recognition, n-gram frequency analysis, TF-IDF computation, and topic modeling using LDA. Designed for evaluating synthetic and real text, it offers functionality to extract insights, visualize entity distributions, and save results for further analysis. Some of the functions can be called as follows:

from synthtexteval.eval.descriptive.descriptor import TextDescriptor
from synthtexteval.eval.descriptive.arguments import TextDescriptorArgs

# Replace the `...` placeholder with your own value.
desc_analyze = TextDescriptor(
    texts=texts,  # A list of texts to analyze
    args=TextDescriptorArgs(produce_plot=True),  # Arguments and hyperparameters for the descriptor module
    reference_texts=...,  # (Optional) A list of reference texts, typically sourced from the real distribution
)

# Example functionality:
desc_analyze.analyze_entities()
desc_analyze._topic_modeling_display()
desc_analyze._compute_tfidf()
desc_analyze._ngram_frequency()
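
To make the n-gram frequency analysis concrete, here is a minimal, dependency-free sketch of what such a computation involves. It is not the TextDescriptor implementation; the function name `ngram_frequencies` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(texts: list[str], n: int = 2) -> Counter:
    """Count word n-grams across a list of texts (whitespace tokenization)."""
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zipping n shifted copies of the token list yields every n-gram.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Toy data for illustration only.
freqs = ngram_frequencies(["the cat sat", "the cat ran"], n=2)
print(freqs.most_common(1))
```

Comparing these counts between synthetic and reference texts is one simple way to spot phrases that are over- or under-represented in the synthetic data.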

Evaluating Downstream Utility

Classification

Generating silver annotations

If our synthetic data lacks label annotations, we can generate silver annotations for it using an existing pretrained model.

from synthtexteval.eval.downstream.classify.generate_silver_annotations import generate_silver_annotations

# Replace each `...` placeholder with your own value.
generate_silver_annotations(
    model_name="bert-base-uncased",
    path_to_model=...,  # path to the model used for silver annotations
    n_labels=...,  # number of labels
    problem_type=...,  # multiclass or multilabel classification
    data_path=...,  # path to the data to annotate
    text_column=...,  # input text column
    label_column=...,  # label column
    output_path=...,  # path to the annotated CSV
    ckpt_exists=True,  # set to True when loading from local files
)

Creating a dataset

To prepare synthetic data for classification tasks, use create_classification_dataset to convert a CSV file into a structured dataset.

from synthtexteval.utils.utils import create_classification_dataset

# Create a classification dataset from a dataframe `df` loaded from the CSV file.
# Replace each `...` placeholder with your own value.
_, _, _ = create_classification_dataset(
    df,
    label_column=...,  # name of the label column
    output_json_path=...,  # path to the output JSON file that maps the data to numeric labels
    output_dir=...,  # path to where the data splits will be saved
    multilabel=False,
    train_ratio=0.7, test_ratio=0.15, val_ratio=0.15,
)
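
For intuition, the two steps this kind of dataset preparation involves (mapping labels to integers and splitting rows by ratio) can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind create_classification_dataset; `build_label_map` and `split_rows` are hypothetical helpers.

```python
import random


def build_label_map(labels: list[str]) -> dict[str, int]:
    """Map each distinct label to a stable integer id (sorted order)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def split_rows(rows: list, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=0):
    """Shuffle rows with a fixed seed and cut them into train/val/test."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# Toy data for illustration only.
rows = [(f"text {i}", "positive" if i % 2 else "negative") for i in range(20)]
label_map = build_label_map([label for _, label in rows])
train, val, test = split_rows(rows)
print(label_map, len(train), len(val), len(test))
```

Fixing the shuffle seed keeps the splits reproducible, which matters when comparing classifiers trained on real and synthetic data.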

Training a classifier

After creating the dataset, train a classifier using the synthtexteval.eval.downstream.classify module. We also provide a training script for this in the eval/downstream/classify subdirectory. This script can also be used to test the classifier using the same module. Additional detailed documentation on the module, including its functionalities for testing with synthetic data and augmenting the training set with synthetic data, can be found in this subdirectory.

# Command to execute the training script with the required parameters:
#
# Parameters:
#   --model_name: Name of the pre-trained model to use (e.g., bert-base-uncased)
#   --path_to_dataset: Path to the dataset directory (e.g., sst2)
#   --path_to_model: Directory where the trained model will be saved (e.g., models/bert-base-uncased-sst2)
#   --num_labels: The number of output labels for classification (e.g., 2 for binary classification)
#   --is_train: Whether to train the model (True or False)
#   --is_test: Whether to test the model (True or False)
#
# Usage:
#   sh train.sh <model_name> <path_to_dataset> <path_to_model> <num_labels> <is_train> <is_test>

# Example: 
sh train.sh bert-base-uncased "sst2" models/bert-base-uncased-sst2 2 True True

Coreference Resolution and Mention Annotation

import os

temp_output_dir = './temp' # Define a temp output directory
model_dir = temp_output_dir + '/base_pretrained_model' # Path to where the base_pretrained_model is downloaded/saved
os.makedirs(temp_output_dir, exist_ok = True)

Download a pre-trained model (Instructions sourced from here):

export MODEL_DIR=<model_dir>
curl -L https://www.dropbox.com/sh/7hpw662xylbmi5o/AAC3nfP4xdGAkf0UkFGzAbrja?dl=1 > temp_model.zip
unzip temp_model.zip -d $MODEL_DIR
rm -rf temp_model.zip

Generating silver annotations

In case our synthetic data does not have coreference annotations, we can generate silver annotations for this synthetic data using an existing pretrained model.

from synthtexteval.eval.downstream.coref.minimize_synth import minimize_file

synthetic_data_path = ...  # Path to the synthetic data (a CSV file); replace `...` with your own path
output_path = "./temp"  # Path to where outputs are saved
sample_size = 100  # Sample 100 examples from the synthetic data
minimize_file(synthetic_data_path, output_path, sample_size)

Fine-tuning and testing the model

Next, fine-tune a model on these silver annotations and test it on gold data (the paths are specified through the arguments).

from synthtexteval.eval.downstream.coref.run_coref_comparison import coref_train
from synthtexteval.eval.downstream.coref.arguments import set_default_coref_args

# Replace each `...` placeholder with your own value.
args = set_default_coref_args(
    output_dir=...,  # path to where the outputs are saved
    base_model_dir=...,  # directory where the base model is saved
    test_file=...,  # path to the test.jsonlines file
)
coref_train(args)

Alternatively, we can run this with the following script provided in the eval/downstream/coref subdirectory.

python run_coref_comparison.py \
        --output_dir=$temp_output_dir \
        --model_type=longformer \
        --base_model_name_or_path=$model_dir \
        --tokenizer_name=allenai/longformer-large-4096 \
        --test_file=$test_file \
        --do_infer \
        --num_train_epochs=$num_train_epochs \
        --logging_steps=100 \
        --save_steps=1000 \
        --eval_steps=150 \
        --max_seq_length=4000 \
        --predict_file=$predict_file \
        --predict_file_write=$predict_file_write \
        --normalise_loss \
        --max_total_seq_len=4000 \
        --experiment_name=eval_model \
        --warmup_steps=5600 \
        --adam_epsilon=1e-6 \
        --head_learning_rate=3e-4 \
        --learning_rate=1e-5 \
        --adam_beta2=0.98 \
        --weight_decay=0.01 \
        --dropout_prob=0.3 \
        --save_if_best \
        --top_lambda=0.4  \
        --tensorboard_dir=$temp_output_dir/tb

Fairness Evaluation

Assessing fairness is crucial when evaluating synthetic text models. The analyze_group_fairness_performance() function helps analyze performance disparities across subgroups. The user can define any categorical column in their dataframe as the subgroup_type.

from synthtexteval.eval.downstream.classify.visualize import tabulate_results

# Replace each `...` placeholder with your own value.
path_to_test_output = ...  # Path to the dataframe containing model predictions from the classification task

tabulate_results(
    csv_results_dir=path_to_test_output,
    n_labels=...,  # Number of classes
    print_fairness=...,  # Set to True to calculate fairness metric scores
    subgroup_type=...,  # Demographic attribute for subgroup/fairness analysis
    problem_type=...,  # Classification type (single_label or multilabel)
)

The tabulated results include:

  • Evaluation of the classifier's accuracy, precision, recall, and F1-score per subgroup.
  • Fairness metric scores that quantify disparities in model performance across different subgroups.
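
As a rough picture of what a per-subgroup breakdown looks like, the sketch below computes accuracy for each subgroup and the largest accuracy gap between subgroups. The README does not say which fairness metrics the toolkit reports, so the gap measure and both function names are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_subgroup(y_true, y_pred, subgroups) -> dict:
    """Accuracy computed separately for each subgroup value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, subgroups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group: dict) -> float:
    """Difference between the best- and worst-performing subgroups."""
    return max(per_group.values()) - min(per_group.values())


# Toy predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0]
groups = ["A", "A", "A", "B", "B", "B"]
per_group = accuracy_by_subgroup(y_true, y_pred, groups)
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

A gap near zero suggests similar performance across subgroups on this metric; it does not by itself establish fairness, since other metrics (precision, recall, F1) can diverge independently.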

Privacy Evaluation

Assuming the user already has access to a list of private entities from their original data, we can conduct a privacy evaluation as follows:

from synthtexteval.eval.privacy.metrics import entity_leakage

# Returns the overall percentage of leaked entities and a dictionary containing the entities leaked in each text.
# Replace each `...` placeholder with your own value.
total_leakage, privacy_analysis = entity_leakage(
    paragraphs=...,  # list of synthetic texts
    entities=...,  # list of private entities provided by the user
    entity_leakage_result_path=...,  # path to save the results
)
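
The idea behind an entity leakage check can be sketched without the library: scan each synthetic text for the private entities and report what fraction of them appear anywhere. This is a simplified illustration, not the entity_leakage implementation; the case-insensitive substring matching and the definition of the percentage are assumptions.

```python
def entity_leakage_sketch(paragraphs: list[str], entities: list[str]):
    """Return (% of entities found in any text, {text index: entities found})."""
    leaked_per_text = {}
    leaked_overall = set()
    for idx, paragraph in enumerate(paragraphs):
        lowered = paragraph.lower()
        # Assumption: a case-insensitive substring match counts as a leak.
        found = [entity for entity in entities if entity.lower() in lowered]
        leaked_per_text[idx] = found
        leaked_overall.update(found)
    percentage = 100.0 * len(leaked_overall) / len(entities) if entities else 0.0
    return percentage, leaked_per_text


# Toy data for illustration only.
pct, per_text = entity_leakage_sketch(
    ["John Smith visited Boston.", "No names here."],
    ["John Smith", "Jane Doe"],
)
print(pct, per_text)
```

Substring matching can over-count short entities that occur inside longer words, so a real evaluation would likely match on token or entity boundaries.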

The search_and_compute_EPO() function helps identify occurrences of private entities in synthetically generated text and extracts surrounding context for analysis. It helps detect memorization in synthetically generated text by identifying instances where spans of text, including private entities, are regurgitated from the training data.

from synthtexteval.eval.privacy.metrics import search_and_compute_EPO

# Replace each `...` placeholder with your own value.
search_and_compute_EPO(
    synth_file=...,  # Path to the dataframe with synthetic text
    reference_texts=...,  # List of reference text files
    synth_phrase_file_path=...,  # Path to where the entity context spans from the synthetic text are saved
    entity_patterns=...,  # User-defined list of private entities
    max_window_len=...,  # Maximum number of words around each entity to extract as its surrounding context
    text_field=...,  # Name of the column corresponding to the text field in the dataframe
)
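
To illustrate the span-based check described above, the sketch below extracts a window of words around each entity occurrence in a synthetic text and tests whether that span also appears verbatim in a reference text. It is a simplified stand-in, not search_and_compute_EPO itself; both helper names and the exact-match criterion are assumptions.

```python
def entity_context_spans(text: str, entity: str, max_window_len: int) -> list[str]:
    """Spans of up to `max_window_len` words on each side of every entity match."""
    tokens = text.split()
    entity_tokens = entity.split()
    n = len(entity_tokens)
    spans = []
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            lo = max(0, start - max_window_len)
            hi = min(len(tokens), start + n + max_window_len)
            spans.append(" ".join(tokens[lo:hi]))
    return spans


def spans_found_in_references(spans: list[str], reference_texts: list[str]) -> list[str]:
    """Spans that occur verbatim in at least one reference text."""
    return [span for span in spans if any(span in ref for ref in reference_texts)]


# Toy data for illustration only.
synthetic = "Yesterday Dr. Alice Wong reviewed the chart carefully"
references = ["Yesterday Dr. Alice Wong reviewed the discharge summary."]
spans = entity_context_spans(synthetic, "Alice Wong", max_window_len=2)
memorized = spans_found_in_references(spans, references)
print(spans, memorized)
```

Widening the window makes a verbatim match stronger evidence of memorization, since longer shared spans are less likely to arise by chance.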

We also provide functionality to conduct canary-based evaluations for evaluating leakage in the generative model. Further details are provided in the eval.privacy subdirectory.
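
This README does not specify how the canary evaluation is scored. One measure commonly used in the canary-attack literature is exposure: the inserted canary's loss is ranked against the losses of alternative candidate sequences that were never in the training data. The sketch below assumes those losses have already been computed by the model; `canary_exposure` is a hypothetical helper, not part of the package.

```python
import math


def canary_exposure(canary_loss: float, candidate_losses: list[float]) -> float:
    """Exposure = log2(number of sequences) - log2(rank of the canary).

    Rank 1 means the model assigns the canary a lower loss than every
    alternative candidate, which gives the maximum exposure.
    """
    rank = 1 + sum(1 for loss in candidate_losses if loss < canary_loss)
    total = len(candidate_losses) + 1  # candidates plus the canary itself
    return math.log2(total) - math.log2(rank)


# Toy losses for illustration only: one candidate beats the canary.
exposure = canary_exposure(canary_loss=1.0, candidate_losses=[2.0, 3.0, 0.5])
print(exposure)
```

Higher exposure means the model ranks the canary as unusually likely relative to comparable sequences it never saw, which is a signal of memorization.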


Qualitative Evaluation

We can evaluate the quality of synthetic text by comparing it against real-world samples using the Fréchet Inception Distance, MAUVE, and perplexity metrics.

import pandas as pd
from synthtexteval.eval.text_quality.metrics import QualEval
from synthtexteval.eval.text_quality.arguments import MauveArgs, LMArgs, FrechetArgs, Arguments

# Prepare evaluation dataframe containing synthetic and real samples.
# Replace each `...` placeholder with your own list of texts.
df = pd.DataFrame({
    'source': ...,  # Synthetic text
    'reference': ...,  # Real text
})

args = Arguments(frechet = FrechetArgs, mauve = MauveArgs, perplexity = LMArgs)

qual_eval = QualEval(args)

qual_eval.calculate_fid_score(df)     # Frechet Inception Distance (FID)
qual_eval.calculate_mauve_score(df)   # MAUVE Score for distribution similarity
qual_eval.calculate_perplexity(df)    # Perplexity score for fluency
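
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities, which in practice would come from a language model. It illustrates the formula only and is not how QualEval is implemented.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: every token has probability 0.25 under the model.
ppl = perplexity([math.log(0.25)] * 4)
print(ppl)
```

A model that assigns each token probability 0.25 is as uncertain as a uniform choice among four options, which is why lower perplexity is read as more fluent text under the scoring model.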

Training the Model to Generate Synthetic Data

We provide functionality to train models that generate synthetic data using control codes, so that users can specify the attributes of the generated text. The training script is located in the generation/controllable subdirectory and accepts the following arguments, some of which are optional. Further details on the arguments are given in that subdirectory.

# Command to execute the training of the synthetic data generator.

#   Parameters:
#   --model_name           : The name of the pre-trained model (e.g., "princeton-nlp/Sheared-LLaMA-1.3B")
#   --dir_to_save_model    : The directory to save the trained model (e.g., "/data/projects/synthtexteval/models/")
#   --disable_dp           : Disable Differential Privacy (true/false)
#   --epsilon_value        : The epsilon value for DP (used when DP is enabled)
#   --dataset_name         : The identifier for the dataset to be used. Set to 'hf' when loading from HuggingFace.
#   --path_to_dataset      : Path to the dataset for training the generator model.
#   --epochs               : Number of epochs to train
#   --gradient_accumulation_steps : Number of gradient accumulation steps
#   --load_ckpt            : Whether to load a checkpoint for continuing training (true/false)
#   --path_to_load_model   : Path to the checkpoint model (if loading, else set to "")
#   --enable_lora          : Whether to enable LoRA (true/false)


# Usage:
#   sh train.sh <model_name> <dir_to_save_the_model> <disable_dp> <epsilon_value> <dataset_name> <path_to_the_dataset> <epochs> <gradient_accumulation_steps> <load_ckpt> <path_to_load_model> <enable_lora>

# Example:
sh train.sh "gpt2" "models/gpt2_DP_" true "inf" "tab" "dataset.csv" 5 1 false "" true
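
The control-code format used by the training script is not shown in this README. As a generic illustration of the technique, control-code conditioning usually prepends the attribute string to each training text and later uses that same prefix as the generation prompt. The separator and both helper names below are hypothetical.

```python
def format_training_example(control_code: str, text: str, sep: str = " | ") -> str:
    """Prefix a training text with its control code (hypothetical format)."""
    return f"{control_code}{sep}{text}"


def build_generation_prompt(control_code: str, sep: str = " | ") -> str:
    """At inference time, the control code alone serves as the prompt."""
    return f"{control_code}{sep}"


# Toy data for illustration only.
example = format_training_example("domain: legal", "The court dismissed the appeal.")
prompt = build_generation_prompt("domain: legal")
print(example)
print(prompt)
```

Because the model sees the same prefix during training and generation, it learns to associate the control code with the attributes of the text that follows it.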

Using Differential Privacy to Generate Synthetic Data

Our training script provides the functionality to train models with differential privacy and LoRA, a parameter-efficient fine-tuning method. This can be toggled in the run-train.sh or directly in the train.sh script with the disable_dp and enable_lora arguments. The hyperparameters for the privacy budget and LoRA can also be specified in the train.sh script.

sh train.sh "gpt2" "models/gpt_DP_" false 8 "tab" "dataset.csv" 5 64 false "" true
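
This README does not state which differentially private training mechanism the script uses. A widely used approach is DP-SGD, which clips each example's gradient to a maximum norm and adds Gaussian noise before averaging; the privacy budget (epsilon) is then derived from the noise level and the number of training steps. The sketch below shows only the clip-and-noise step on plain Python lists, under that assumption; it is not the toolkit's training code.

```python
import math
import random


def clip_gradient(grad: list[float], max_norm: float) -> list[float]:
    """Scale a gradient down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [value / len(clipped) for value in noisy]


# Toy gradients for illustration only.
grads = [[3.0, 4.0], [0.0, 0.5]]
noiseless = dp_average_gradient(grads, 1.0, 0.0, random.Random(0))
noisy = dp_average_gradient(grads, 1.0, 1.0, random.Random(0))
print(noiseless, noisy)
```

Clipping bounds how much any single training example can influence an update, and the added noise masks what remains; a smaller epsilon corresponds to more noise and stronger privacy at some cost in utility.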

Generating Synthetic Data

We also provide a script to run inference in the generation/controllable subdirectory.

# Command to generate synthetic data from the trained generator model.

#   Parameters:
#   --dir_to_save_test_output   : Directory to save the test output
#   --model_name                : The name of the pre-trained model (e.g., "gpt2")
#   --dir_to_load_model         : Path to the directory where the model checkpoint for loading is stored
#   --dataset_name              : The identifier for the dataset to be used. Set to 'hf' when loading from HuggingFace.
#   --path_to_dataset           : Path to the dataset containing control codes for generating synthetic data. Set to None when you want to load from a csv file by specifying path_to_test_dataset
#   --path_to_test_dataset      : Path to the dataset containing control codes for generating synthetic data. Set to None when you want to load from a HF dataset.
#   --disable_dp                : Whether to disable Differential Privacy (true/false)
#   --epsilon_value             : The epsilon value for DP (used when DP is enabled)
#   --enable_lora               : Whether to enable LoRA (true/false)
#   --num_return_seq            : Number of synthetic generations to produce per control code.

# Usage:
#   sh inf.sh <dir_to_save_test_output> <model_name> <dir_to_load_model> <dataset_name> <path_to_dataset> <path_to_test_dataset> <disable_dp> <epsilon_value> <enable_lora>

# Example:
sh inf.sh "inference.csv" "gpt2" "models/gpt2_DP_" "tab" None "dataset/test.csv" true "inf" true

Citations

If you use SynthTextEval in your research or project, please cite:

@inproceedings{ramesh-etal-2025-synthtexteval,
    title = "{S}ynth{T}ext{E}val: Synthetic Text Data Generation and Evaluation for High-Stakes Domains",
    author = "Ramesh, Krithika  and
      Smolyak, Daniel  and
      Zhao, Zihao  and
      Gandhi, Nupoor  and
      Agarwal, Ritu  and
      Bjarnad{\'o}ttir, Margr{\'e}t V.  and
      Field, Anjalie",
    editor = {Habernal, Ivan  and
      Schulam, Peter  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.35/",
    pages = "487--499",
    ISBN = "979-8-89176-334-0",
    abstract = "We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit{'}s generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development."
}

Contributors

A special thank you to Tianli Xu for developing the visual evaluation interface!

About

SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data For High-Stakes Domains (EMNLP 2025 System Demonstration)
