An experiment in augmenting the IMDB dataset with GPT-2.
Read Deep_proj.pdf for more information.
Uses the following libraries:
torch 1.13.1
torchtext 0.14.1
transformers 4.21.0
scikit-learn 1.2.0
pandas 1.4.3
numpy 1.23.1
Run "genData.py" to generate additional samples using GPT2.
The samples are stored in "data.out".
The code assumes a GPU is available at cuda:0, but it contains instructions on modifying it to run on the CPU.
This step takes a while to run.
Run "cleanData.py" to clean the samples in "data.out" into a usable format.
The samples will be written to "clean_data.out".
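A minimal sketch of what such a cleaning pass might look like. The exact format of "data.out" is an assumption here; "cleanData.py" may do more (e.g. dropping truncated or duplicate samples):

```python
import re

def clean_sample(raw: str) -> str:
    """Strip GPT-2 special tokens and collapse whitespace into single spaces."""
    text = raw.replace("<|endoftext|>", " ")  # remove GPT-2's end-of-text marker
    text = re.sub(r"\s+", " ", text)          # collapse newlines/tabs/runs of spaces
    return text.strip()

print(clean_sample("Great film!<|endoftext|>\n  Loved it."))
# -> Great film! Loved it.
```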
Run the notebook to train a transformer on the reduced/augmented IMDB dataset.
The first cell contains some parameters that can control the execution.
ADD_DATA - If True, the generated samples from "clean_data.out" are added to the training set.
FORCE_BALANCE - If True, duplicates samples to balance the number of positive and negative reviews.
select_len - The maximum length of samples kept in the reduced dataset; 300 is the value used in "genData.py".
cut_off_len - Due to limited memory, all sequences (including the test set) are truncated to this length.
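The effect of these parameters can be illustrated with a short sketch. The function names are hypothetical and not the notebook's actual code; they only mirror what the descriptions above say each parameter does:

```python
import random

def select_short(samples, select_len=300):
    """select_len: keep only samples at most select_len tokens long (the reduced dataset)."""
    return [s for s in samples if len(s) <= select_len]

def truncate(tokens, cut_off_len=128):
    """cut_off_len: cut every sequence (train and test) down to cut_off_len tokens."""
    return tokens[:cut_off_len]

def force_balance(pos, neg, seed=0):
    """FORCE_BALANCE: duplicate random samples from the smaller class until sizes match."""
    rng = random.Random(seed)
    if len(pos) < len(neg):
        pos = pos + [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    else:
        neg = neg + [rng.choice(neg) for _ in range(len(pos) - len(neg))]
    return pos, neg

print(truncate(list(range(10)), cut_off_len=4))   # -> [0, 1, 2, 3]
p, n = force_balance(["good"], ["bad1", "bad2", "bad3"])
print(len(p), len(n))                             # -> 3 3
```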