Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

The official evaluation script lives in this directory. We have provided sample output from the baseline model on the development data in ../baseline/. You may run the baseline as shown in the examples below. For the baseline, we use the pretrained BertTokenizer models from huggingface.

Word-level Task Evaluation:

python evaluate.py --guess ../baseline/eng.word.dev.bert.tsv --gold ../data/eng.word.dev.tsv --category

category: 000
distance        2.11
f_measure       2.96
precision       2.02
recall  5.55


category: 001
distance        1.42
f_measure       53.97
precision       46.60
recall  64.11


category: 010
distance        2.75
f_measure       28.97
precision       25.29
recall  33.90


category: 011
distance        2.96
f_measure       37.86
precision       36.85
recall  38.93


category: 100
distance        2.73
f_measure       14.87
precision       12.16
recall  19.14


category: 101
distance        1.51
f_measure       50.83
precision       51.57
recall  50.12


category: 110
distance        3.31
f_measure       25.53
precision       24.92
recall  26.17


category: 111
distance        3.22
f_measure       31.22
precision       34.81
recall  28.30


category: all
distance        2.69
f_measure       24.28
precision       20.99
recall  28.79

Sentence-level Task Evaluation:

python evaluate.py --guess ../baseline/eng.sentence.dev.bert.tsv --gold ../data/eng.sentence.dev.tsv

category: all
distance        5.50
f_measure       64.71
precision       63.68
recall  65.77
python evaluate.py --guess ../baseline/mon.sentence.dev.bert.tsv --gold ../data/mon.sentence.dev.tsv
category: all
distance        28.86
f_measure       23.99
precision       20.00
recall  29.95
python evaluate.py --guess ../baseline/ces.sentence.dev.bert.tsv --gold ../data/ces.sentence.dev.tsv
category: all
distance        21.01
f_measure       33.25
precision       36.76
recall  30.35