An experiment in augmenting the IMDB dataset with GPT-2.
Read Deep_proj.pdf for more information.
Uses the following libraries:
torch 1.13.1
torchtext 0.14.1
transformers 4.21.0
scikit-learn 1.2.0
pandas 1.4.3
numpy 1.23.1
Run "genData.py" to generate additional samples using GPT2.
The samples are stored in "data.out".
The code assumes a GPU is available at cuda:0, but it contains instructions on modifying it to run on the CPU.
This step takes a while to run.
Run "cleanData.py" to clean the samples in "data.out" into a usable format.
The samples will be written to "clean_data.out".
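A minimal sketch of what such a cleaning pass might look like. The exact format of "data.out" is an assumption here; "cleanData.py" may do more (e.g. dropping truncated or duplicate samples):

```python
import re

def clean_sample(raw: str) -> str:
    """Strip GPT-2 special tokens and collapse whitespace into single spaces."""
    text = raw.replace("<|endoftext|>", " ")  # remove GPT-2's end-of-text marker
    text = re.sub(r"\s+", " ", text)          # collapse newlines/tabs/runs of spaces
    return text.strip()

print(clean_sample("Great film!<|endoftext|>\n  Loved it."))
# -> Great film! Loved it.
```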
Run the notebook to train a transformer on the reduced/augmented IMDB dataset.
The first cell contains some parameters that can control the execution.
ADD_DATA - If True, the generated samples from "clean_data.out" are added to the training set.
FORCE_BALANCE - If True, duplicates samples to balance the number of positive and negative reviews.
select_len - The maximum length of samples kept in the reduced dataset; 300 is the value used in "genData.py".
cut_off_len - Due to limited memory, all sequences (including the test set) are truncated to this length.
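The effect of these parameters can be illustrated with a short sketch. The function names are hypothetical and not the notebook's actual code; they only mirror what the descriptions above say each parameter does:

```python
import random

def select_short(samples, select_len=300):
    """select_len: keep only samples at most select_len tokens long (the reduced dataset)."""
    return [s for s in samples if len(s) <= select_len]

def truncate(tokens, cut_off_len=128):
    """cut_off_len: cut every sequence (train and test) down to cut_off_len tokens."""
    return tokens[:cut_off_len]

def force_balance(pos, neg, seed=0):
    """FORCE_BALANCE: duplicate random samples from the smaller class until sizes match."""
    rng = random.Random(seed)
    if len(pos) < len(neg):
        pos = pos + [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    else:
        neg = neg + [rng.choice(neg) for _ in range(len(pos) - len(neg))]
    return pos, neg

print(truncate(list(range(10)), cut_off_len=4))   # -> [0, 1, 2, 3]
p, n = force_balance(["good"], ["bad1", "bad2", "bad3"])
print(len(p), len(n))                             # -> 3 3
```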