Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
This is the PyTorch implementation for the following paper: Transplant Then Regenerate: A New Paradigm for Text Data Augmentation, which has been accepted to the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Oral).
This repo provides the code for reproducing the experiments in “Transplant Then Regenerate: A New Paradigm for Text Data Augmentation”. LMTransplant is a novel data augmentation paradigm based on LLM-driven text transplanting. It attaches realistic contextual scenarios to the original text, more effectively leveraging the knowledge embedded in LLMs and thereby producing higher-quality and more diverse text data.
We develop LMTransplant, a novel text data augmentation paradigm based on transplantation. The following outlines the overall pipeline. LMTransplant generates high-quality and diverse augmented text by combining a bidirectional text continuation strategy with masked text prediction. It generates contextually relevant scenes that align with the original text, making full use of the knowledge embedded in LLMs. We elaborate on each step in the following sections.
We obtain all experimental datasets from publicly available data sources and implement corresponding data preprocessing pipelines tailored to different task types. For text classification datasets, we perform label standardization by converting the original numerical category labels into textual labels. In processing the question-answering datasets, we filter samples based on text length, excluding those that are excessively long. All preprocessed data are stored in .jsonl file format to facilitate subsequent operations.
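A minimal sketch of these preprocessing steps, assuming an SST-2-style binary label map and a length cutoff of 512 characters (both are illustrative assumptions, not values taken from the repo):

```python
import json

# Hypothetical label map; the actual mapping depends on the dataset
# (e.g. SST-2 uses 0/1 for negative/positive).
LABEL_NAMES = {0: "negative", 1: "positive"}
MAX_QA_LENGTH = 512  # assumed cutoff for "excessively long" QA samples

def preprocess_classification(records):
    """Convert numeric category labels into textual labels."""
    return [
        {"text": r["text"], "label": LABEL_NAMES[r["label"]]}
        for r in records
    ]

def preprocess_qa(records, max_length=MAX_QA_LENGTH):
    """Drop question-answering samples whose context is too long."""
    return [r for r in records if len(r["context"]) <= max_length]

def write_jsonl(records, path):
    """Store preprocessed data in .jsonl format, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```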
git clone https://anonymous.4open.science/r/LMTransplant/
cd LMTransplant
pip install -r requirements.txt
cd data_augmentation/classification/utils
bash download_and_prepare_datasets.sh
cd data_augmentation
python datasets_preprocess.py

To simulate real-world low-data scenarios and assess the effectiveness of data augmentation methods, we adopt a subsampling strategy similar to existing studies (Kumar et al., 2020; Ubani et al., 2023). In the text classification tasks, we perform class-balanced subsampling on the original training and development sets, selecting 10 random samples from each class. In the question-answering task, we randomly subsample 50 samples from the original training and development sets to construct sub-training and sub-development sets.
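The subsampling described above can be sketched as follows (function names and the fixed seed are illustrative, not taken from the repo):

```python
import random
from collections import defaultdict

def subsample_balanced(records, per_class=10, seed=42):
    """Class-balanced subsampling: pick `per_class` random samples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)
    subset = []
    for group in by_label.values():
        subset.extend(rng.sample(group, min(per_class, len(group))))
    return subset

def subsample_random(records, k=50, seed=42):
    """For question answering: a plain random subsample of k examples."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```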
cd data_augmentation
python generate_seed_data.py

We use LMTransplant to generate augmented data for the seed data in the sub-training set. For a given text, LMTransplant instructs the LLM to craft its preceding and subsequent contexts through bidirectional text continuation. Subsequently, LMTransplant masks the original text within the transplanted text and directs the LLM to regenerate the missing part given the crafted contexts, thereby producing new variants of the original text. In this process, the context acts as a bridge between the original and the new text. By incorporating knowledge embedded in LLMs, the newly generated text not only preserves thematic similarity to the original but also exhibits greater innovation and diversity.
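The two-step transplant-then-regenerate loop described above can be sketched as below. The prompt wording is an illustrative assumption (the repo's actual prompts may differ), and `llm` stands in for any prompt-to-text callable (e.g. a wrapped chat-completion API):

```python
def transplant_then_regenerate(original, llm, mask_token="[MASK]"):
    """Sketch of the two LMTransplant steps; `llm` is any prompt -> str callable."""
    # Step 1 (transplant): bidirectional continuation — ask the LLM to craft
    # a preceding and a subsequent context around the original text.
    context_prompt = (
        "Write a short preceding context and a short subsequent context "
        "around the following text, keeping the text itself in place:\n"
        + original
    )
    transplanted = llm(context_prompt)

    # Step 2 (regenerate): mask the original text inside the transplanted
    # passage and regenerate the missing part from the crafted contexts.
    masked = transplanted.replace(original, mask_token)
    regenerate_prompt = (
        f"Fill in the {mask_token} in the passage below with new text that "
        "fits the surrounding context:\n" + masked
    )
    return llm(regenerate_prompt)
```

The crafted contexts survive both steps, which is what lets them act as the bridge between the original text and its regenerated variant.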
cd data_augmentation
python ours_l_r.py

We first evaluate the quality of the augmented samples using two commonly employed metrics, Distinct-n and Semantic Variability. By more effectively leveraging the knowledge embedded in LLMs, we expect the samples generated by our method to be thematically related to the original samples while being more innovative in content and demonstrating greater diversity.
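For reference, Distinct-n is the ratio of unique n-grams to total n-grams over a set of texts; a minimal sketch (whitespace tokenization is an assumption):

```python
def distinct_n(texts, n=3):
    """Distinct-n: unique n-grams / total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Higher values indicate more lexically diverse augmented samples.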
cd data_augmentation/eval_distinct_n
python distinct_3.py

cd data_augmentation/semantic_variability
python semantic_variability.py

We evaluate the improvement in downstream task performance achieved by the augmented samples generated through LMTransplant.
cd data_augmentation/classification
bash script/bert_sst2.sh

cd data_augmentation/question_answer
bash train.sh

cd data_augmentation/named_entity_recognition
bash train.sh

