Snippext is the extraction pipeline for mining opinions and customer experiences from user-generated content (e.g., online reviews).
Paper: Zhengjie Miao, Yuliang Li, Xiaolan Wang, and Wang-Chiew Tan, "Snippext: Semi-supervised Opinion Mining with Augmented Data", in The Web Conference (WWW) 2020.
- Python 3.7.5
- PyTorch 1.3
- HuggingFace Transformers
- spaCy with the `en_core_web_sm` model
- NLTK (stopwords, wordnet)
- Gensim
- NVIDIA Apex (fp16 training)
Install the required packages:
```bash
conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
```
Download the pre-trained BERT models and word2vec models (for data augmentation):
```bash
wget https://snippext.s3.us-east-2.amazonaws.com/finetuned_bert.zip
unzip finetuned_bert.zip
wget https://snippext.s3.us-east-2.amazonaws.com/word2vec.zip
unzip word2vec.zip
```
The baseline method performs BERT finetuning on a specific task:
```bash
CUDA_VISIBLE_DEVICES=0 python train_baseline.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --save_model \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 20 \
  --bert_path finetuned_bert/rest_model.bin
```
Parameters:
- `--task`: the name of the task (defined in `configs.json`)
- `--logdir`: the logging directory for Tensorboard
- `--save_model`: whether to save the best model
- `--batch_size`, `--lr`, `--n_epochs`: the batch size, learning rate, and number of epochs
- `--bert_path` (optional): the path to a fine-tuned BERT checkpoint; the base uncased model is used if not specified
- `--max_len` (optional): the maximum sequence length
New parameters (also available in MixDA and MixMatchNL):

- `--fp16` (optional): whether to train with fp16 acceleration
- `--lm` (optional): other language models, e.g., `distilbert` or `albert`
The train/dev/test sets of a task (tagging or span classification) are specified in the file `configs.json`. The file is a list of entries, where each entry has the following format:
```json
{
  "name": "hotel_tagging",
  "task_type": "tagging",
  "vocab": [
    "B-AS",
    "I-AS",
    "B-OP",
    "I-OP"
  ],
  "trainset": "combined_data/hotel/train.txt.combined",
  "validset": "combined_data/hotel/dev.txt",
  "testset": "combined_data/hotel/dev.txt",
  "unlabeled": "combined_data/hotel/unlabeled.txt"
}
```
Fields:
- `name`: the name of the task. A tagging task name should end with the suffix `_tagging`.
- `task_type`: either `tagging` or `classification`.
- `vocab`: the list of class labels. For tagging tasks, every label starts with `B-` or `I-`, marking the beginning or inside of a span. For classification tasks, the list contains all the possible class labels.
- `trainset`, `validset`, `testset`: the paths to the train/dev/test sets.
- `unlabeled` (optional): the path to the unlabeled dataset for semi-supervised learning. The file has the same format as the train/test sets, but the labels are simply ignored.
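For reference, a task entry can be loaded programmatically like this (a minimal sketch; the training scripts perform an equivalent lookup by `--task` name, though their exact code may differ):

```python
import json

# configs.json is a list of task configurations; index it by task name.
with open("configs.json") as f:
    configs = {conf["name"]: conf for conf in json.load(f)}

task = configs["hotel_tagging"]
print(task["task_type"])  # tagging
print(task["vocab"])      # ['B-AS', 'I-AS', 'B-OP', 'I-OP']
print(task["trainset"])   # combined_data/hotel/train.txt.combined
```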
- Build the augmentation index:
```bash
python augment_index_builder.py \
  --task restaurant_ae_tagging \
  --w2v_path word2vec/rest_w2v.model \
  --idf_path word2vec/rest_finetune.txt \
  --bert_path finetuned_bert/rest_model.bin \
  --index_output_path augment/rest_index.json
```
To generate the other indices, replace `restaurant_ae_tagging` with `laptop_ae_tagging`, `restaurant_asc`, or `laptop_asc`, and replace `rest` with `laptop` in the file paths for the two laptop tasks.
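If you want to script all four builds, a small sketch using `subprocess` could look like the following; note that the per-task `--index_output_path` naming here is our own choice for illustration (the example above writes `augment/rest_index.json`):

```python
import subprocess

# (task, domain prefix) pairs; the prefix selects the matching
# word2vec, IDF, and BERT files downloaded earlier.
TASKS = [
    ("restaurant_ae_tagging", "rest"),
    ("restaurant_asc", "rest"),
    ("laptop_ae_tagging", "laptop"),
    ("laptop_asc", "laptop"),
]

for task, domain in TASKS:
    subprocess.run([
        "python", "augment_index_builder.py",
        "--task", task,
        "--w2v_path", f"word2vec/{domain}_w2v.model",
        "--idf_path", f"word2vec/{domain}_finetune.txt",
        "--bert_path", f"finetuned_bert/{domain}_model.bin",
        # Hypothetical per-task output name; adjust as needed.
        "--index_output_path", f"augment/{task}_index.json",
    ], check=True)
```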
- Train with:
```bash
CUDA_VISIBLE_DEVICES=0 python train_mixda.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Parameters:
- `--alpha_aug`: the mixup parameter between the original example and the augmented example ({0.2, 0.5, 0.8} are usually good)
- `--augment_index`: the path to the augmentation index
- `--augment_op`: the name of the DA operator. We currently support the following 9 operators (an illustrative sketch follows the table):
| Operators | Details |
|---|---|
| token_del_tfidf | Token deletion by importance (measured by TF-IDF) |
| token_del | Token deletion (uniform) |
| token_repl_tfidf | Token replacement by importance |
| token_repl | Token replacement (uniform) |
| token_swap | Swapping two tokens |
| token_ins | Inserting new tokens |
| span_sim | Replacing a span with a similar one |
| span_freq | Replacing a span by frequency |
| span | Uniform span replacement |
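To make the operator semantics concrete, here is an illustrative sketch of TF-IDF-weighted token replacement in the spirit of `token_repl_tfidf`. The index layout (`idf` and `sim` keys) is hypothetical; the actual format is whatever `augment_index_builder.py` writes:

```python
import random

def token_repl_tfidf(tokens, index, num_repl=1):
    """Replace the lowest-importance tokens with similar ones.
    `index` is a hypothetical structure mapping tokens to TF-IDF
    weights ("idf") and to lists of similar words ("sim")."""
    weights = [index["idf"].get(tok, 0.0) for tok in tokens]
    # Low TF-IDF tokens are the safest to replace without
    # corrupting the tags/labels of the example.
    order = sorted(range(len(tokens)), key=lambda i: weights[i])
    new_tokens, replaced = list(tokens), 0
    for i in order:
        if replaced >= num_repl:
            break
        candidates = index["sim"].get(tokens[i], [])
        if candidates:
            new_tokens[i] = random.choice(candidates)
            replaced += 1
    return new_tokens

# Toy index for demonstration only.
index = {
    "idf": {"pasta": 3.2, "amazing": 1.5, "really": 0.4},
    "sim": {"really": ["very", "truly"]},
}
print(token_repl_tfidf("the pasta was really amazing".split(), index))
# e.g. ['the', 'pasta', 'was', 'very', 'amazing']
```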
MixMatchNL is our implementation of MixMatch combined with MixDA. To train with MixMatchNL:
```bash
CUDA_VISIBLE_DEVICES=0 python train_mixmatchnl.py \
  --task restaurant_ae_tagging \
  --logdir results/ \
  --finetuning \
  --batch_size 32 \
  --lr 5e-5 \
  --n_epochs 5 \
  --bert_path finetuned_bert/rest_model.bin \
  --alpha_aug 0.8 \
  --alpha 0.2 \
  --u_lambda 50.0 \
  --num_aug 2 \
  --augment_index augment/rest_index.json \
  --augment_op token_repl_tfidf \
  --run_id 0
```
Additional parameters:

- `--alpha`: the mixup parameter between labeled and unlabeled data. We chose `alpha` from {0.2, 0.5, 0.8}.
- `--u_lambda`: the weight of the unlabeled data loss (chosen from {10.0, 25.0, 50.0})
- `--num_aug`: the number of augmented examples per unlabeled example (we chose from {2, 4})
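To show how these knobs fit together, here is a rough PyTorch sketch of a MixMatch-style objective. The function names and loss shapes are ours for illustration and are not taken from `train_mixmatchnl.py`:

```python
import torch
import torch.nn.functional as F

def guess_labels(model, aug_batches):
    """Average the model's predictions over the `num_aug` augmented
    copies of each unlabeled example, then sharpen the distribution
    (T = 0.5, i.e., squaring followed by renormalization)."""
    with torch.no_grad():
        p = torch.stack([F.softmax(model(b), dim=-1) for b in aug_batches]).mean(0)
    p = p ** 2
    return p / p.sum(-1, keepdim=True)

def mixup(h_a, y_a, h_b, y_b, alpha=0.2):
    """Interpolate two batches of encodings and (soft) labels with a
    coefficient drawn from Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)  # stay closer to the first batch
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b

def total_loss(logits_l, y_l, logits_u, q_u, u_lambda=50.0):
    """Supervised cross-entropy plus an L2 consistency term on the
    unlabeled predictions, weighted by --u_lambda."""
    loss_l = -(y_l * F.log_softmax(logits_l, dim=-1)).sum(-1).mean()
    loss_u = F.mse_loss(F.softmax(logits_u, dim=-1), q_u)
    return loss_l + u_lambda * loss_u
```

Here `--num_aug` controls how many augmented copies feed `guess_labels`, `--alpha` parameterizes the Beta distribution in `mixup`, and `--u_lambda` scales the unlabeled loss term.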