Snippext is the extraction pipeline for mining opinions and customer experiences from user-generated content (e.g., online reviews).
Paper: Zhengjie Miao, Yuliang Li, Xiaolan Wang, and Wang-Chiew Tan, "Snippext: Semi-supervised Opinion Mining with Augmented Data", in The Web Conference (WWW) 2020.
- Python 3.7.5
- PyTorch 1.3
- HuggingFace Transformers
- spaCy with the `en_core_web_sm` model
- NLTK (stopwords, wordnet)
- Gensim
- NVIDIA Apex (fp16 training)
Install the required packages:
```bash
conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
```
Download the pre-trained BERT models and word2vec models (for data augmentation):
```bash
wget https://snippext.s3.us-east-2.amazonaws.com/finetuned_bert.zip
unzip finetuned_bert.zip
wget https://snippext.s3.us-east-2.amazonaws.com/word2vec.zip
unzip word2vec.zip
```
The baseline method performs BERT finetuning on a specific task:
```bash
CUDA_VISIBLE_DEVICES=0 python train_baseline.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --save_model \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 20 \
  --bert_path finetuned_bert/rest_model.bin
```
Parameters:
- `--task`: the name of the task (defined in `configs.json`)
- `--logdir`: the logging directory for Tensorboard
- `--save_model`: whether to save the best model
- `--batch_size`, `--lr`, `--n_epochs`: the batch size, learning rate, and number of epochs
- `--bert_path` (optional): the path to a fine-tuned BERT checkpoint; the base uncased model is used if not specified
- `--max_len` (optional): the maximum sequence length
New parameters (also available in MixDA and MixMatchNL):

- `--fp16` (optional): whether to train with fp16 acceleration
- `--lm` (optional): other language models, e.g., `distilbert` or `albert`
The train/dev/test sets of a task (tagging or span classification) are specified in the file `configs.json`. The file is a list of entries, where each entry has the following format:
```json
{
  "name": "hotel_tagging",
  "task_type": "tagging",
  "vocab": [
    "B-AS",
    "I-AS",
    "B-OP",
    "I-OP"
  ],
  "trainset": "combined_data/hotel/train.txt.combined",
  "validset": "combined_data/hotel/dev.txt",
  "testset": "combined_data/hotel/dev.txt",
  "unlabeled": "combined_data/hotel/unlabeled.txt"
}
```
Fields:
- `name`: the name of the task. A tagging task name should end with the suffix `_tagging`.
- `task_type`: either `tagging` or `classification`.
- `vocab`: the list of class labels. For tagging tasks, every label starts with `B-` or `I-`, marking the beginning or inside of a span. For classification tasks, the list contains all the possible class labels.
- `trainset`, `validset`, `testset`: the paths to the train/dev/test sets.
- `unlabeled` (optional): the path to the unlabeled dataset for semi-supervised learning. The file has the same format as the train/test sets, but the labels are simply ignored.
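For reference, a task entry can be loaded programmatically like this (a minimal sketch; the training scripts perform an equivalent lookup by `--task` name, though their exact code may differ):

```python
import json

# configs.json is a list of task configurations; index it by task name.
with open("configs.json") as f:
    configs = {conf["name"]: conf for conf in json.load(f)}

task = configs["hotel_tagging"]
print(task["task_type"])  # tagging
print(task["vocab"])      # ['B-AS', 'I-AS', 'B-OP', 'I-OP']
print(task["trainset"])   # combined_data/hotel/train.txt.combined
```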
- Build the augmentation index:
```bash
python augment_index_builder.py \
  --task restaurant_ae_tagging \
  --w2v_path word2vec/rest_w2v.model \
  --idf_path word2vec/rest_finetune.txt \
  --bert_path finetuned_bert/rest_model.bin \
  --index_output_path augment/rest_index.json
```
To generate the other indices, replace `restaurant_ae_tagging` with `laptop_ae_tagging`, `restaurant_asc`, or `laptop_asc`, and replace `rest` with `laptop` in the file paths for the two laptop tasks.
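If you want to script all four builds, a small sketch using `subprocess` could look like the following; note that the per-task `--index_output_path` naming here is our own choice for illustration (the example above writes `augment/rest_index.json`):

```python
import subprocess

# (task, domain prefix) pairs; the prefix selects the matching
# word2vec, IDF, and BERT files downloaded earlier.
TASKS = [
    ("restaurant_ae_tagging", "rest"),
    ("restaurant_asc", "rest"),
    ("laptop_ae_tagging", "laptop"),
    ("laptop_asc", "laptop"),
]

for task, domain in TASKS:
    subprocess.run([
        "python", "augment_index_builder.py",
        "--task", task,
        "--w2v_path", f"word2vec/{domain}_w2v.model",
        "--idf_path", f"word2vec/{domain}_finetune.txt",
        "--bert_path", f"finetuned_bert/{domain}_model.bin",
        # Hypothetical per-task output name; adjust as needed.
        "--index_output_path", f"augment/{task}_index.json",
    ], check=True)
```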
- Train with:
```bash
CUDA_VISIBLE_DEVICES=0 python train_mixda.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Parameters:
- `--alpha_aug`: the mixup parameter between the original example and the augmented example ({0.2, 0.5, 0.8} are usually good)
- `--augment_index`: the path to the augmentation index
- `--augment_op`: the name of the DA operator. We currently support the following 9 operators (an illustrative sketch follows the table):
| Operators | Details |
|---|---|
| token_del_tfidf | Token deletion by importance (measured by TF-IDF) |
| token_del | Token deletion (uniform) |
| token_repl_tfidf | Token replacement by importance |
| token_repl | Token replacement (uniform) |
| token_swap | Swapping two tokens |
| token_ins | Inserting new tokens |
| span_sim | Replacing a span with a similar one |
| span_freq | Replacing a span by frequency |
| span | Uniform span replacement |
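To make the operator semantics concrete, here is an illustrative sketch of TF-IDF-weighted token replacement in the spirit of `token_repl_tfidf`. The index layout (`idf` and `sim` keys) is hypothetical; the actual format is whatever `augment_index_builder.py` writes:

```python
import random

def token_repl_tfidf(tokens, index, num_repl=1):
    """Replace the lowest-importance tokens with similar ones.
    `index` is a hypothetical structure mapping tokens to TF-IDF
    weights ("idf") and to lists of similar words ("sim")."""
    weights = [index["idf"].get(tok, 0.0) for tok in tokens]
    # Low TF-IDF tokens are the safest to replace without
    # corrupting the tags/labels of the example.
    order = sorted(range(len(tokens)), key=lambda i: weights[i])
    new_tokens, replaced = list(tokens), 0
    for i in order:
        if replaced >= num_repl:
            break
        candidates = index["sim"].get(tokens[i], [])
        if candidates:
            new_tokens[i] = random.choice(candidates)
            replaced += 1
    return new_tokens

# Toy index for demonstration only.
index = {
    "idf": {"pasta": 3.2, "amazing": 1.5, "really": 0.4},
    "sim": {"really": ["very", "truly"]},
}
print(token_repl_tfidf("the pasta was really amazing".split(), index))
# e.g. ['the', 'pasta', 'was', 'very', 'amazing']
```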
MixMatchNL is our implementation of MixMatch combined with MixDA. To train with MixMatchNL:
```bash
CUDA_VISIBLE_DEVICES=0 python train_mixmatchnl.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --alpha 0.2 \
  --u_lambda 50.0 \
  --num_aug 2 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Additional parameters:

- `--alpha`: the mixup parameter between labeled and unlabeled data. We chose `alpha` from {0.2, 0.5, 0.8}.
- `--u_lambda`: the weight of the unlabeled data loss (chosen from {10.0, 25.0, 50.0})
- `--num_aug`: the number of augmented examples per unlabeled example (we chose from {2, 4})
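To show how these knobs fit together, here is a rough PyTorch sketch of a MixMatch-style objective. The function names and loss shapes are ours for illustration and are not taken from `train_mixmatchnl.py`:

```python
import torch
import torch.nn.functional as F

def guess_labels(model, aug_batches):
    """Average the model's predictions over the `num_aug` augmented
    copies of each unlabeled example, then sharpen the distribution
    (T = 0.5, i.e., squaring followed by renormalization)."""
    with torch.no_grad():
        p = torch.stack([F.softmax(model(b), dim=-1) for b in aug_batches]).mean(0)
    p = p ** 2
    return p / p.sum(-1, keepdim=True)

def mixup(h_a, y_a, h_b, y_b, alpha=0.2):
    """Interpolate two batches of encodings and (soft) labels with a
    coefficient drawn from Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)  # stay closer to the first batch
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b

def total_loss(logits_l, y_l, logits_u, q_u, u_lambda=50.0):
    """Supervised cross-entropy plus an L2 consistency term on the
    unlabeled predictions, weighted by --u_lambda."""
    loss_l = -(y_l * F.log_softmax(logits_l, dim=-1)).sum(-1).mean()
    loss_u = F.mse_loss(F.softmax(logits_u, dim=-1), q_u)
    return loss_l + u_lambda * loss_u
```

Here `--num_aug` controls how many augmented copies feed `guess_labels`, `--alpha` parameterizes the Beta distribution in `mixup`, and `--u_lambda` scales the unlabeled loss term.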