pbehhe/GPT2Augmentation

GPT2Augmentation

An experiment in augmenting the IMDB dataset with GPT-2.
Read Deep_proj.pdf for more information.

Required library versions:
torch 1.13.1
torchtext 0.14.1
transformers 4.21.0
scikit-learn 1.2.0
pandas 1.4.3
numpy 1.23.1
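A small helper (hypothetical, not part of the repository) can verify that the installed packages match the pinned versions before running anything:

```python
# Check installed package versions against the pinned ones.
# This helper is illustrative; the repository does not ship it.
from importlib import metadata

PINNED = {
    "torch": "1.13.1",
    "torchtext": "0.14.1",
    "transformers": "4.21.0",
    "scikit-learn": "1.2.0",
    "pandas": "1.4.3",
    "numpy": "1.23.1",
}

def check_versions(pinned=PINNED):
    """Return {package: (installed, expected)} for missing or mismatched packages."""
    issues = {}
    for pkg, expected in pinned.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None  # package not installed at all
        if installed != expected:
            issues[pkg] = (installed, expected)
    return issues
```

An empty return value means the environment matches the pinned versions exactly.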

Run "genData.py" to generate additional samples using GPT-2.
The samples are stored in "data.out".
The code assumes a GPU is available at cuda:0; the script contains instructions for modifying it to run on CPU.
Generation takes a while to run.
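The generation step presumably looks something like the sketch below. Function names, the prompt, and sampling parameters are illustrative assumptions, not the script's actual code; only the cuda:0/CPU fallback and the 300-token length reflect the README:

```python
# Illustrative sketch of GPT-2 sample generation (not genData.py itself).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def pick_device() -> torch.device:
    """Use cuda:0 when available, otherwise fall back to CPU."""
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def generate_review(prompt: str, max_length: int = 300) -> str:
    """Sample one continuation from GPT-2 with top-k sampling (assumed settings)."""
    device = pick_device()
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate_review("This movie was"))
```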

Run "cleanData.py" to clean the samples in "data.out" into a usable format.
The samples will be written to "clean_data.out".
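The cleaning step is presumably along these lines (an illustrative sketch, not the repository's exact logic): normalize whitespace and drop the incomplete trailing sentence that GPT-2 generations often end with.

```python
# Illustrative sketch of cleaning raw GPT-2 output (not cleanData.py itself).
def clean_sample(raw: str) -> str:
    """Collapse whitespace and keep only complete sentences (assumed cleaning rule)."""
    text = " ".join(raw.split())
    # GPT-2 often stops mid-sentence; cut at the last sentence terminator.
    last = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    return text[: last + 1] if last != -1 else text

if __name__ == "__main__":
    with open("data.out") as f_in, open("clean_data.out", "w") as f_out:
        for line in f_in:
            cleaned = clean_sample(line)
            if cleaned:
                f_out.write(cleaned + "\n")
```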

Run the notebook to train a transformer on the reduced/augmented IMDB dataset.
The first cell contains parameters that control the execution:
ADD_DATA - If True, the samples from "clean_data.out" are added to the training set.
FORCE_BALANCE - If True, samples are duplicated to balance the number of positive and negative reviews.
select_len - The maximum length of samples in the reduced dataset. 300 is the value used in "genData.py".
cut_off_len - Due to limited memory, all sequences are truncated to this length. This includes the test set.
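The parameter cell described above might look like this. The names come from this README; the values, apart from select_len = 300, are illustrative defaults, not the notebook's actual settings:

```python
# Notebook parameter cell (names from the README; values are illustrative).
ADD_DATA = True        # include GPT-2 samples from "clean_data.out"
FORCE_BALANCE = True   # duplicate samples to balance positive/negative reviews
select_len = 300       # max sample length in the reduced dataset (matches genData.py)
cut_off_len = 256      # truncate all sequences (incl. the test set) to this length
```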
