Code and data for the paper by Karolina Zaczynska, Nils Feldhus, Robert Schwarzenberg, Aleksandra Gabryszak and Sebastian Möller:
https://arxiv.org/abs/2007.03765
It originally appeared in the proceedings of the Swiss Text Analytics Conference & Conference on Natural Language Processing (KONVENS) 2020: http://ceur-ws.org/Vol-2624/paper7.pdf
We recommend referring to the more recent arXiv publication, as it includes minor adjustments.
See the data folder README for more information.
- jsonlines==1.2.0
- nltk==3.4.5
- overrides==2.8.0
- torch==1.4.0
- tqdm==4.43.0
- transformers==2.5.1
- Pattern==3.6
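Assuming the pinned versions above are the complete dependency set, one way to install them into a fresh virtual environment is:

```sh
pip install jsonlines==1.2.0 nltk==3.4.5 overrides==2.8.0 torch==1.4.0 tqdm==4.43.0 transformers==2.5.1 Pattern==3.6
```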
Execute `python run_probing_experiment.py` with the following flags (an example invocation is shown after the list):
- `--input`: [Required] Path to the input jsonl (directory or file). Please choose from the directories in `data/input`, e.g. `data/input/SimplSent` (whole directory) or `data/input/SimplSent/SimplSent_pl.jsonl` (single file).
- `--output_dir`: Path to the output of the experiment, by default `data/output/`. This will create a sub-folder inside the one set by `--output_dir`, named after the language model identifier set by `--lm`. In there, you will find another sub-folder named after the case. That folder contains the .jsonl file(s), e.g. `data/output/bert-base-german-dbmdz-cased/SimplSent/SimplSent_pl.jsonl`.
- `--lm`: [Required] Language model identifier, i.e. either `bert-base-german-dbmdz-cased` (our paper: gBERT_large) or `distilbert-base-german-cased` (our paper: gBERT_small).
- `--verbose`: If set, data processing steps and results are printed in detail.
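For example, a run over a single input file with verbose output could look like this (paths and model identifier taken from the descriptions above):

```sh
python run_probing_experiment.py \
    --input data/input/SimplSent/SimplSent_pl.jsonl \
    --output_dir data/output/ \
    --lm bert-base-german-dbmdz-cased \
    --verbose
```

The output should then appear under `data/output/bert-base-german-dbmdz-cased/SimplSent/`.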
Execute `python evaluation.py` with the following flags (an example call follows the list):
- `--path`: [Required] Path to the output directory with .jsonl files, e.g. `data/output/bert-base-german-dbmdz-cased/SimplSent/`.
- `--lm`: [Required] Language model identifier, i.e. either `bert-base-german-dbmdz-cased` (our paper: gBERT_large) or `distilbert-base-german-cased` (our paper: gBERT_small). This is used to load the correct tokenizer.
- `--sum_up_cases`: If set, all .jsonl files in the directory set by `--path` are taken and one result is displayed for all sub-cases instead of calculating them separately.
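Continuing the example above, an evaluation call that sums up all sub-cases in the output directory could look like this:

```sh
python evaluation.py \
    --path data/output/bert-base-german-dbmdz-cased/SimplSent/ \
    --lm bert-base-german-dbmdz-cased \
    --sum_up_cases
```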