# Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
This is the repository for our EMNLP 2025 paper. We propose that translationese should be treated as a graded phenomenon: one translation can contain more or less translationese than another. Instead of the traditional binary classification of translated vs. original text, we directly measure the degree of translationese in a text. In the paper, we compare several scoring functions for measuring translationese in search of a better Translationese-index, one that measures translationese in a graded and generalizable manner.
We primarily experiment on two settings:
- Multi-genre synthetic translations: methods should distinguish high-translationese data from low-translationese data across multiple genres (generalizable).
- In-the-wild translations with human annotations: methods should correlate well with human annotations, both for pointwise and pairwise annotations (graded).
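For the pairwise setting, agreement with annotators can be summarized as the fraction of human-annotated pairs that a scoring function orders the same way. A minimal sketch of that metric (function name and toy data are illustrative, not the repo's actual evaluation code):

```python
def pairwise_accuracy(scores, human_pairs):
    """Fraction of annotated pairs (i, j) -- where annotators judged text i
    to contain more translationese than text j -- on which the method's
    scores agree, i.e. scores[i] > scores[j]."""
    agree = sum(scores[i] > scores[j] for i, j in human_pairs)
    return agree / len(human_pairs)

# Hypothetical scores for four texts, and three annotated pairs in which
# the first index was judged to have more translationese.
scores = [0.9, 0.2, 0.7, 0.4]
pairs = [(0, 1), (2, 3), (1, 3)]
acc = pairwise_accuracy(scores, pairs)  # agrees on 2 of the 3 pairs
```

Pointwise annotations would instead be compared with a rank-correlation statistic (e.g. Spearman or Kendall) between scores and human ratings.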
All data used in our experiments can be found in the data/ folder. Implementations of all compared methods are in the src/ folder, and scripts for training and evaluating these methods are in the scripts/ folder.
Among all methods, we find that the best-performing one so far is the likelihood ratio of two contrastively fine-tuned LLMs (T-index): one fine-tuned on high-translationese data and the other on low-translationese data. The code for batch inference with this method is in src/t_index.py.
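The core of the T-index score can be sketched in a few lines: subtract the (length-normalized) log-likelihood under the low-translationese model from that under the high-translationese model. The sketch below uses toy per-token log-probabilities in place of the two fine-tuned LLMs; names and normalization choices are illustrative, not the repo's exact implementation:

```python
def sequence_logprob(token_logprobs):
    """Length-normalized log-likelihood of a token sequence."""
    return sum(token_logprobs) / len(token_logprobs)

def t_index(logprobs_high, logprobs_low):
    """Log-likelihood ratio between the model fine-tuned on
    high-translationese data and the one fine-tuned on
    low-translationese data. Higher = more translationese."""
    return sequence_logprob(logprobs_high) - sequence_logprob(logprobs_low)

# Toy per-token log-probs standing in for the two LLMs' outputs
# on the same text: a translated-sounding text scores high under
# the high-translationese model, and vice versa.
translated = t_index([-1.2, -0.8, -1.0], [-2.0, -1.9, -2.1])
original   = t_index([-2.5, -2.2, -2.4], [-0.9, -1.1, -1.0])
assert translated > original
```

In practice the per-token log-probabilities come from the two fine-tuned LLMs, and src/t_index.py computes them in batches.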
```bash
# Script for reproducing the results in the paper.

# Train on the oliver_twist_qwen setting with three seeds.
export CUDA_VISIBLE_DEVICES=0
for seed in 10 20 30; do
    bash scripts/train/sft.sh oliver_twist_qwen ${seed} 1000 1e-6 3 16
    bash scripts/train/dpo.sh oliver_twist_qwen ${seed} 1000 16
    bash scripts/train/rm.sh oliver_twist_qwen ${seed} 1000
    bash scripts/train/xlmr.sh oliver_twist_qwen ${seed}
done

# Train on the coca_blog_llama setting.
bash scripts/train/sft.sh coca_blog_llama 10 1000 1e-6 3 16

# Train SFT models on the genre mixture at different data scales.
export CUDA_VISIBLE_DEVICES=0,1
bash scripts/train/sft.sh mixture 10 5000 2.7e-5 1 32
bash scripts/train/sft.sh mixture 10 3000 2.7e-5 1 32
export CUDA_VISIBLE_DEVICES=0
bash scripts/train/sft.sh mixture 10 1000 1e-6 3 16

# Train the RM and DPO variants on the mixture.
for n_samples in 5000 3000 1000; do
    bash scripts/train/rm.sh mixture 10 ${n_samples}
done
export CUDA_VISIBLE_DEVICES=0,1
for n_samples in 5000 3000 1000; do
    bash scripts/train/dpo.sh mixture 10 ${n_samples} 8
done

# Evaluate on the synthetic and in-the-wild settings.
bash scripts/run/synthtic.sh
bash scripts/run/wild.sh
```