For iNTUitive hackathon (27-28 Feb 2021)
Team members (That One): Leow Cong Sheng, Tay Kai Yang, Lin Min Htoo and Soo Jian Xian
Slide Deck: https://drive.google.com/file/d/1tt3_8DVBHQjPIkl-QvzQEOXdynA335rZ/view?usp=sharing
Video Pitch: https://youtu.be/cFYOaKjbBs4
Dev Post: https://devpost.com/software/tattletale
Check out our webapp here! https://ta-ttletale.herokuapp.com/
We hope to increase the accessibility of, and exposure to, reading in developing countries.
We strongly believe that education is a social leveller: reading books and stories has a tremendous impact on cognitive and linguistic development, yet it remains difficult to access for the illiterate.
We used the GPT-2 language model to generate stories through a simple, no-frills web application. To bias the model towards children's stories, we finetuned it on a dataset of children's stories that we scraped from the Web.
Frontend: https://github.com/xfated/intuitive
Backend: https://github.com/linminhtoo/that_one
Just run
bash -i setup.sh
The contents:
# ensure conda is already initialized
conda create -n tattletale python=3.7 -y
conda activate tattletale
pip install transformers torch
pip install beautifulsoup4
Run
python scrape.py
Feel free to add more web links to scrape. See the code for more details; each source is specified as a Tuple[link, pointer_for_posts, pointer_for_sentences].
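For illustration, here is a minimal scraping sketch consistent with that format; the example URL and CSS selectors are placeholders, not the actual ones used in scrape.py:

# Minimal scraping sketch; the URL and selectors below are placeholders.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Each source is a (link, pointer_for_posts, pointer_for_sentences) tuple.
SOURCES = [
    ("https://example.com/stories", "div.post", "p"),  # placeholder entry
]

stories = []
for link, post_selector, sentence_selector in SOURCES:
    soup = BeautifulSoup(urlopen(link).read(), "html.parser")
    for post in soup.select(post_selector):
        sentences = [p.get_text(strip=True) for p in post.select(sentence_selector)]
        text = " ".join(s for s in sentences if s)
        if text:
            stories.append(text)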
It takes ~5 minutes to finetune for 3 epochs on 1x RTX 2080 using fp16 (mixed-precision training); with full precision, it is almost 2x slower. Just 3 epochs of training gave us the best validation loss on our dataset, which is admittedly a little small but works well enough for a proof of concept. See finetune.py for the detailed arguments.
python finetune.py \
--fp16 --epochs 10 --patience 3 \
--train_data 'data/train_1024_n_80.txt' --eval_data 'data/eval_1024_n_80.txt' \
--ckpt_folder 'checkpoints/curr_best'
To do distributed training over multiple GPUs (replace $N_GPUS with the total number of GPUs per node)
python -m torch.distributed.launch \
--nproc_per_node $N_GPUS finetune.py \
--fp16 --epochs 10 --early_stop --patience 3 \
--train_data 'data/train_1024_n_80.txt' --eval_data 'data/eval_1024_n_80.txt' \
--ckpt_folder 'checkpoints/curr_best'
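For orientation, the core of finetune.py amounts to standard causal-LM fine-tuning with HuggingFace Transformers, roughly as sketched below; the exact arguments and special-token handling are assumptions, so refer to finetune.py for the real code.

# Rough fine-tuning sketch; see finetune.py for the actual arguments.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Special tokens assumed from the repo's setup (<BOS>, <EOS>, <PAD>).
tokenizer.add_special_tokens({"bos_token": "<BOS>", "eos_token": "<EOS>", "pad_token": "<PAD>"})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

train_dataset = TextDataset(tokenizer=tokenizer, file_path="data/train_1024_n_80.txt", block_size=1024)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="data/eval_1024_n_80.txt", block_size=1024)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="checkpoints/curr_best",
                         overwrite_output_dir=True,
                         num_train_epochs=3,
                         fp16=True)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
print(trainer.evaluate())  # report validation loss
trainer.save_model("checkpoints/curr_best")
tokenizer.save_pretrained("checkpoints/curr_best")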
First, download this entire folder from Google Drive.
Create a folder called 'checkpoints' and drop the downloaded folder inside as-is.
You need to specify the checkpoint folder of the weights you wish to use in generate.py; with the folder above, line 4 should read CHECKPOINT_FOLDER = "checkpoints/curr_best".
See the script itself for the parameters that you can modify, then run:
python generate.py
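At its core, generation just loads the finetuned weights and samples from them. A minimal sketch follows, where the sampling parameters and prompt handling are assumptions rather than the defaults in generate.py (which additionally filters bad-word tokens):

# Minimal generation sketch; see generate.py for the actual parameters and bad-word filtering.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

CHECKPOINT_FOLDER = "checkpoints/curr_best"
tokenizer = GPT2TokenizerFast.from_pretrained(CHECKPOINT_FOLDER)
model = GPT2LMHeadModel.from_pretrained(CHECKPOINT_FOLDER)
model.eval()

prompt = "Once upon a time"  # prompt/BOS handling is an assumption
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids,
                        max_length=200,
                        do_sample=True,
                        top_p=0.95,
                        temperature=0.9,
                        pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))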
Filtering out inappropriate words is especially important since this application is targeted at children. We downloaded a list of ~5000 common inappropriate English words from the CMU CS website and appended a few other words that we deemed too violent or rude for children. We have already run:
python get_bad_word_tokens.py
which uses the GPT-2 finetuned tokenizer (GPT-2 tokenizer + <BOS>, <EOS>, <PAD> tokens). You only need to re-run this script if you
use a different tokenizer, or wish to augment the existing list of bad words. If you change the name of the data/bad_tokens.txt file, you will need to modify line 12 of generate.py accordingly.
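Conceptually, the script encodes each bad word into token-ID sequences that can then be blocked at generation time (e.g. via the bad_words_ids argument of model.generate()). A rough sketch, where the input file name data/bad_words.txt is hypothetical:

# Sketch: convert bad words into token-ID sequences usable for blocking at generation time.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("checkpoints/curr_best")

with open("data/bad_words.txt") as f:  # hypothetical input file name
    bad_words = [line.strip() for line in f if line.strip()]

# GPT-2's BPE tokenizes a word differently with and without a leading space,
# so encode both variants.
bad_token_ids = []
for word in bad_words:
    for variant in (word, " " + word):
        bad_token_ids.append(tokenizer.encode(variant, add_special_tokens=False))

with open("data/bad_tokens.txt", "w") as f:
    for ids in bad_token_ids:
        f.write(" ".join(map(str, ids)) + "\n")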
The fine-tuning & generation code is mostly based on standard HuggingFace templates. We also referenced the code from this Towards Data Science article. All other code, such as the scraping & bad-word filtering, was written from scratch.
367 children's stories were scraped from two websites:
Bad words were downloaded from Carnegie Mellon University's Computer Science website.