- `reward_model\train_reward.py`: Python script to train the reward model. The reward model is a Llama2-7B with a regression head. Training is distributed via Accelerate FSDP (see the launch sketch after this listing).
- `run.sh`: Shell script to run the training of the reward model.
- `TnD\EPPOTrainer.py`: Custom PPOTrainer class, adapted from the `trl` `PPOTrainer` class.
- `trainer_util.py`: Training utility functions.
- `run_TnD.py`: Main script to run training and evaluation for TnD models.
- `run_experiments.sh`: Shell script to replicate the experiments in the paper.
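For reference, here is a minimal sketch of what the Accelerate FSDP launch inside `run.sh` might look like; the config file name and the lack of script arguments are assumptions, not the repository's actual invocation.

```sh
# Hypothetical Accelerate FSDP launch for the reward model trainer.
# The config file name is an assumption; see run.sh for the real
# command and for any arguments train_reward.py expects.
accelerate launch --config_file fsdp_config.yaml \
  reward_model/train_reward.py
```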
Before running any experiments, have the paths for reward model, training set, evaluation set, word set, teacher model, and output directory ready.
NOTE: the training requires 2 GPUs
- To run the main experiments on BabyLM and BookCorpus, change `TRAIN_SET_PATH`, `EVAL_SET_PATH`, and `WORD_SET_PATH` in `run_experiments.sh` to the paths of the training set, evaluation set, and word set, respectively.
- Then, change the paths for the corresponding teacher and reward models in `run_experiments.sh`.
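  For example (the path values are placeholders, and the two model-path variable names are hypothetical; match them to the actual variables in `run_experiments.sh`):

  ```sh
  # Placeholder paths; point these at your own data and models.
  TRAIN_SET_PATH=/path/to/train_set
  EVAL_SET_PATH=/path/to/eval_set
  WORD_SET_PATH=/path/to/word_set
  # The two names below are hypothetical; run_experiments.sh may name
  # its teacher- and reward-model paths differently.
  TEACHER_MODEL_PATH=/path/to/teacher_model
  REWARD_MODEL_PATH=/path/to/reward_model
  ```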
- To run the CLM baseline, simply set `clm_per_step` to any number greater than 10001 in `run_experiments.sh`.
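  A minimal sketch, assuming `clm_per_step` is set as a shell variable in `run_experiments.sh`:

  ```sh
  # Any value greater than 10001 turns the run into a pure CLM baseline.
  clm_per_step=20000
  ```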
- To run "teacher's demonstration only" training, set the `teacher_demo_only` flag to `True` in `run_experiments.sh`.
- To run "student's trial only" training, set both the `teacher_demo_only` and `use_ground_truth` flags to `False` in `run_experiments.sh`.
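  A sketch of the two ablation settings, assuming the flags are set as shell variables in `run_experiments.sh`; use one of the two per run:

  ```sh
  # (a) teacher's demonstration only:
  teacher_demo_only=True

  # (b) student's trial only:
  teacher_demo_only=False
  use_ground_truth=False
  ```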
- Use the same `run_experiments.sh` for all main experiments: TnD on BabyLM and BookCorpus, and the CLM baseline on BabyLM and BookCorpus.
- Set `n_head` and `n_embd` to the number of attention heads and the embedding size of the teacher model in `run_experiments.sh`. The configurations used in the paper are `n_head=12, n_embd=588`; `n_head=10, n_embd=360`; and `n_head=10, n_embd=250`.
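  For instance, the three paper configurations could be switched between as below (shell-variable syntax assumed):

  ```sh
  # Uncomment exactly one pair, matching the teacher model's size.
  n_head=12; n_embd=588
  # n_head=10; n_embd=360
  # n_head=10; n_embd=250
  ```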
- To run the masked-teacher experiments, which prevent the teacher model from generating certain tokens, set the `mask_type` flag to `mask` in `run_experiments.sh`.
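  A one-line sketch, assuming `mask_type` is a shell variable in `run_experiments.sh`:

  ```sh
  # Keeps the teacher from generating the masked tokens.
  mask_type=mask
  ```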
- To run the double CLM experiments, set the `double_clm` flag to `True` in `run_experiments.sh`.
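  Likewise for the double CLM setting (shell-variable syntax assumed):

  ```sh
  # Enables the double CLM experiment.
  double_clm=True
  ```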