There are two Python programs here:
- ./default chooses one of the sentences translated by Turkers. Specifically, the default method always chooses the sentence translated by the first Turker.
- ./grade calculates the sentence-level BLEU score of the generated output, using 'hidden' reference sentences that are available only to the grader.
The commands are designed to work in a pipeline. For instance, this is a valid invocation:
./default | ./grade
To use this, you may do the following:
1. Develop your own model and generate output.
2. Use the ./grade program to evaluate your output.
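Since the grader's references are hidden, it can help to score output locally against reference sentences you do have. The following is a minimal sketch of a smoothed sentence-level BLEU, not the implementation used by ./grade. The function names, whitespace tokenization, and add-one smoothing are all assumptions made for illustration, so absolute scores may differ from the grader's.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, references, max_n=4):
    """Smoothed sentence-level BLEU of one hypothesis against several references.

    Assumptions (not taken from ./grade): whitespace tokenization and
    add-one smoothing of each n-gram precision, so that short sentences
    with no 4-gram match do not score exactly zero.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0

    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matched + 1) / (total + 1)) / max_n

    # Brevity penalty: compare against the reference length closest to the
    # hypothesis length (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_precision)
```

Calling `sentence_bleu(candidate, [ref1, ref2, ref3, ref4])` for each candidate gives a rough local signal for comparing models. Treat it as a relative measure only.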
The data-train/ and data-test/ directories contain a training set and a test set, respectively.
- data-train contains the following:
  - lm: language model probabilities for English (n-grams)
  - surveys.tsv: information about each Turker
  - train_postedited_translations.tsv: the translations generated by each Turker, edited by other Turkers who reside in the U.S. A total of 10 edits are available for each source sentence. You may use this data to generate additional features.
  - train_translations.tsv: 20% of the original data, with four reference sentences for each source sentence.
- data-test contains the following:
  - test_translations.tsv: the original data, without reference sentences. Given a source sentence, the worker IDs, and four candidate sentences, your goal is to choose the best of the candidate sentences generated by the Turkers. Note that the first 358 sentences are from train_translations.tsv.
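As one possible starting point beyond the first-Turker default, the sketch below picks the candidate that agrees most with the other candidates for the same source sentence. This consensus heuristic is an illustrative assumption, not a method described in this README. `unigram_f1` and `pick_consensus` are hypothetical names, and reading the candidates out of the .tsv files is left out because the column layout is not specified here.

```python
from collections import Counter


def unigram_f1(a, b):
    """Token-level F1 overlap between two sentences (0.0 if nothing overlaps)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ta.values())
    recall = overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def pick_consensus(candidates):
    """Return the index of the candidate most similar, on average, to the others."""
    def mean_overlap(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(unigram_f1(candidates[i], o) for o in others) / len(others)

    # max() keeps the first maximal index, so ties fall back to the
    # earliest Turker, matching the behaviour of the default method.
    return max(range(len(candidates)), key=mean_overlap)
```

The returned index can be used to select which of the four candidate sentences to print for each source sentence. Features from surveys.tsv or the post-edited translations could later replace or supplement the overlap score.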