Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
This is the PyTorch implementation for the following paper: Transplant Then Regenerate: A New Paradigm for Text Data Augmentation, which has been accepted to the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025, Oral).
This repo provides the code for reproducing the experiments in “Transplant Then Regenerate: A New Paradigm for Text Data Augmentation”. LMTransplant is a novel data augmentation paradigm based on LLM-driven text transplanting. It attaches realistic contextual scenarios to the original text, more effectively leveraging the knowledge embedded in LLMs and thereby producing higher-quality and more diverse text data.
We develop LMTransplant, a novel text data augmentation paradigm based on transplantation. The following outlines the overall pipeline. LMTransplant generates high-quality and diverse augmented text by combining a bidirectional text continuation strategy with masked text prediction. It generates contextually relevant scenes that align with the original text, making full use of the knowledge embedded in LLMs. We elaborate on each step in the following sections.
We obtain all experimental datasets from publicly available data sources and implement corresponding data preprocessing pipelines tailored to different task types. For text classification datasets, we perform label standardization by converting the original numerical category labels into textual labels. In processing the question-answering datasets, we filter samples based on text length, excluding those that are excessively long. All preprocessed data are stored in .jsonl file format to facilitate subsequent operations.
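A minimal sketch of these preprocessing steps, assuming an SST-2-style binary label map and a length cutoff of 512 characters (both are illustrative assumptions, not values taken from the repo):

```python
import json

# Hypothetical label map; the actual mapping depends on the dataset
# (e.g. SST-2 uses 0/1 for negative/positive).
LABEL_NAMES = {0: "negative", 1: "positive"}
MAX_QA_LENGTH = 512  # assumed cutoff for "excessively long" QA samples

def preprocess_classification(records):
    """Convert numeric category labels into textual labels."""
    return [
        {"text": r["text"], "label": LABEL_NAMES[r["label"]]}
        for r in records
    ]

def preprocess_qa(records, max_length=MAX_QA_LENGTH):
    """Drop question-answering samples whose context is too long."""
    return [r for r in records if len(r["context"]) <= max_length]

def write_jsonl(records, path):
    """Store preprocessed data in .jsonl format, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```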
git clone https://anonymous.4open.science/r/LMTransplant/
cd LMTransplant
pip install -r requirements.txt
cd data_augmentation/classification/utils
bash download_and_prepare_datasets.sh
cd data_augmentation
python datasets_preprocess.py

To simulate real-world low-data scenarios and assess the effectiveness of data augmentation methods, we adopt a subsampling strategy similar to existing studies (Kumar et al., 2020; Ubani et al., 2023). In the text classification tasks, we perform class-balanced subsampling on the original training and development sets, selecting 10 random samples from each class. In the question-answering task, we randomly subsample 50 samples from the original training and development sets to construct sub-training and sub-development sets.
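The subsampling described above can be sketched as follows (function names and the fixed seed are illustrative, not taken from the repo):

```python
import random
from collections import defaultdict

def subsample_balanced(records, per_class=10, seed=42):
    """Class-balanced subsampling: pick `per_class` random samples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)
    subset = []
    for group in by_label.values():
        subset.extend(rng.sample(group, min(per_class, len(group))))
    return subset

def subsample_random(records, k=50, seed=42):
    """For question answering: a plain random subsample of k examples."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```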
cd data_augmentation
python generate_seed_data.py

We use LMTransplant to generate augmented data for the seed data in the sub-training set. For a given text, LMTransplant instructs the LLM to craft its preceding and subsequent contexts through bidirectional text continuation. Subsequently, LMTransplant masks the original text within the transplanted text and directs the LLM to regenerate the missing part given the crafted contexts, thereby producing new variants of the original text. In this process, the context acts as a bridge between the original and the new text. By incorporating knowledge embedded in LLMs, the newly generated text not only preserves thematic similarity to the original but also exhibits greater innovation and diversity.
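The two-step transplant-then-regenerate loop described above can be sketched as below. The prompt wording is an illustrative assumption (the repo's actual prompts may differ), and `llm` stands in for any prompt-to-text callable (e.g. a wrapped chat-completion API):

```python
def transplant_then_regenerate(original, llm, mask_token="[MASK]"):
    """Sketch of the two LMTransplant steps; `llm` is any prompt -> str callable."""
    # Step 1 (transplant): bidirectional continuation — ask the LLM to craft
    # a preceding and a subsequent context around the original text.
    context_prompt = (
        "Write a short preceding context and a short subsequent context "
        "around the following text, keeping the text itself in place:\n"
        + original
    )
    transplanted = llm(context_prompt)

    # Step 2 (regenerate): mask the original text inside the transplanted
    # passage and regenerate the missing part from the crafted contexts.
    masked = transplanted.replace(original, mask_token)
    regenerate_prompt = (
        f"Fill in the {mask_token} in the passage below with new text that "
        "fits the surrounding context:\n" + masked
    )
    return llm(regenerate_prompt)
```

The crafted contexts survive both steps, which is what lets them act as the bridge between the original text and its regenerated variant.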
cd data_augmentation
python ours_l_r.py

We first evaluate the quality of the augmented samples using two commonly employed metrics, Distinct-n and Semantic Variability. By more effectively leveraging the knowledge embedded in LLMs, we expect the samples generated by our method to be thematically related to the original samples while being more innovative in content and demonstrating greater diversity.
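For reference, Distinct-n is the ratio of unique n-grams to total n-grams over a set of texts; a minimal sketch (whitespace tokenization is an assumption):

```python
def distinct_n(texts, n=3):
    """Distinct-n: unique n-grams / total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Higher values indicate more lexically diverse augmented samples.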
cd data_augmentation/eval_distinct_n
python distinct_3.py

cd data_augmentation/semantic_variability
python semantic_variability.py

We evaluate the improvement in downstream task performance achieved by the augmented samples generated through LMTransplant.
cd data_augmentation/classification
bash script/bert_sst2.sh

cd data_augmentation/question_answer
bash train.sh

cd data_augmentation/named_entity_recognition
bash train.sh

