This is the repo for "Two Counterexamples to Tokenization and the Noiseless Channel", presented at LREC-COLING 2024.
The poster can be found in this repo, and a recording of our presentation can be found here.
To run an experiment, follow these steps:
- Clone the repo
- Navigate to the `fairseq` directory
- Install `fairseq` as suggested in that README (that is, make sure you have all the dependencies and then run the `pip install --editable ./` command)
- Navigate to `examples/duplication_bpe` or `examples/random_drop_bpe`
- Run the `train-duplication-bpe.sh` or `train-random-drop.sh` script (follow the examples below)
- The results will appear in the `fairseq/experimental_outputs/<EXPERIMENT_NAME>` directory.
Examples for running the scripts:
bash train-duplication-bpe.sh --experiment-name "duplication_example" --src-bpe-tokens 4000 --tgt-bpe-tokens 4000 --duplication-n 100 --duplication-k 3 --seed 100 --device 0
bash train-random-drop.sh --experiment-name "random_drop_example" --src-bpe-tokens 6000 --tgt-bpe-tokens 6000 --random-drop-n 2000 --random-drop-k 1000 --seed 100 --bpe-seed 0 --device 0
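
To repeat an experiment over several seeds, a small wrapper along the following lines may help. This is only a sketch: the function name `run_duplication_sweep`, the `RUNNER` variable, and the `duplication_seed_<seed>` naming scheme are hypothetical and not part of the repo. It assumes it is run from `examples/duplication_bpe` and reuses only the flags from the example above. Each run is given its own experiment name on the assumption that outputs are keyed by `--experiment-name`.

```shell
# Hypothetical helper (not part of the repo): one duplication run per seed.
# By default it only prints the commands (dry run); set RUNNER="" to execute them.
run_duplication_sweep() {
  runner="${RUNNER-echo}"
  for seed in "$@"; do
    # Flags are copied from the README example; only the seed and name vary.
    $runner bash train-duplication-bpe.sh \
      --experiment-name "duplication_seed_${seed}" \
      --src-bpe-tokens 4000 --tgt-bpe-tokens 4000 \
      --duplication-n 100 --duplication-k 3 \
      --seed "${seed}" --device 0
  done
}

# Dry run: prints one command line per seed without training anything.
run_duplication_sweep 100 101 102
```

Once the printed commands look right, calling the function with `RUNNER=""` set in the environment runs them in sequence on device 0.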