University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages
This repository provides guidelines for using our submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African Languages, as well as between specific pairs of the African languages.
| Language | Code |
|---|---|
| English | eng |
| Afrikaans | afr |
| Northern Sotho | nso |
| Shona | sna |
| Swati | ssw |
| Tswana | tsn |
| Xhosa | xho |
| Xitsonga | tso |
| Zulu | zul |
The model translates in both directions between English and each of the 8 African languages listed above (Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa, Xitsonga, Zulu). It also translates in 8 additional directions between African languages: Xhosa to Zulu, Zulu to Shona, Shona to Afrikaans, Afrikaans to Swati, Swati to Tswana, Tswana to Xitsonga, Xitsonga to Northern Sotho, and Northern Sotho to Xhosa.
- English pairs: eng ⟷ afr, eng ⟷ nso, eng ⟷ sna, eng ⟷ ssw, eng ⟷ tsn, eng ⟷ xho, eng ⟷ tso, eng ⟷ zul
- Additional directions: xho ⟶ zul, zul ⟶ sna, sna ⟶ afr, afr ⟶ ssw, ssw ⟶ tsn, tsn ⟶ tso, tso ⟶ nso, nso ⟶ xho
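Since only the directions listed above are supported, it can be useful to check a language pair before running the model. The following is a minimal sketch and not part of this repository: the function name `is_supported_direction` is hypothetical, and it takes the source code first and the target code second, which is the reverse of the argument order used by run_inference.sh.

```bash
# Hypothetical helper (not part of this repository): succeeds if the
# direction <src> -> <tgt> is one of those listed in this README.
# Usage: is_supported_direction <src> <tgt>
is_supported_direction() {
    case "$1-$2" in
        # English into each of the 8 African languages.
        eng-afr|eng-nso|eng-sna|eng-ssw|eng-tsn|eng-xho|eng-tso|eng-zul) return 0 ;;
        # Each of the 8 African languages into English.
        afr-eng|nso-eng|sna-eng|ssw-eng|tsn-eng|xho-eng|tso-eng|zul-eng) return 0 ;;
        # The 8 additional directions between African languages.
        xho-zul|zul-sna|sna-afr|afr-ssw|ssw-tsn|tsn-tso|tso-nso|nso-xho) return 0 ;;
        # Anything else is not listed as supported.
        *) return 1 ;;
    esac
}
```

For instance, `is_supported_direction xho zul` succeeds, while `is_supported_direction zul xho` fails because only the Xhosa to Zulu direction is listed.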
This repository includes two bash scripts for installing the required packages and running the model:
- setup_conda_env.sh
This creates a new conda environment, installs the required packages, and downloads the model and the other files needed to run it.
bash setup_conda_env.sh
- run_inference.sh
Given source sentences in a plain text file (one sentence per line), this script runs our preprocessing pipeline and writes the model's translations of those sentences into the target language. Note that the target language code comes before the source language code. The script should be run with the following command line arguments:
bash run_inference.sh <tgt> <src> <batch_size> <input_path> <output_path>
For example, to translate the Afrikaans sentences in afr.test into English with a batch size of 8:
bash run_inference.sh eng afr 8 afr.test eng.out
The output file eng.out will be a plain text file with one translated sentence per line, in the same order as the source sentences in the input file afr.test.
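Because the output has one sentence per line in the same order as the input, a quick sanity check after a run is to compare the line counts of the two files. The following is a minimal sketch and not part of this repository; `same_line_count` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of this repository): succeeds only if
# both files contain the same number of lines.
# Note: wc -l counts newline characters, so a final line without a
# trailing newline is not counted.
# Usage: same_line_count <input_path> <output_path>
same_line_count() {
    # Return 2 if either file cannot be read.
    n_in=$(wc -l < "$1") || return 2
    n_out=$(wc -l < "$2") || return 2
    # Arithmetic expansion strips any padding that wc adds on some systems.
    [ $((n_in)) -eq $((n_out)) ]
}
```

After the example run above, `same_line_count afr.test eng.out` should succeed if every input sentence produced exactly one output line.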