Artifacts for APSEC 2024 Submission: "Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?"
We construct the PolyHumanEval benchmark, a variant of OpenAI HumanEval that supports 14 programming languages: C++, C#, Dart, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Swift, and TypeScript.
The PolyHumanEval benchmark is described in TestDSL, a domain-specific language for testing.
For example, the has_close_elements problem in HumanEval:
def has_close_elements(numbers: List[float], threshold: float) -> bool:
...
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
...is described by the following TestDSL code:
code {
func has_close_elements(numbers: list<double>, threshold: double) -> bool
}
tests {
template nse {
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
...
}
}
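To make the format concrete, here is a minimal, illustrative Python sketch that extracts the `(args) -> expected` test-case lines from the fragment above. It is only a sketch: the real TestDSL grammar (and the parser shipped in the evaluation folder) is considerably richer than this.

```python
import re

# The TestDSL fragment from the example above (hard-coded for illustration).
SNIPPET = """
code {
func has_close_elements(numbers: list<double>, threshold: double) -> bool
}
tests {
template nse {
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
}
}
"""

# Matches a single test-case line of the form "(args) -> expected".
CASE_RE = re.compile(r"^\((?P<args>.*)\)\s*->\s*(?P<expected>\S+)$")

def parse_cases(text):
    """Collect (args, expected) pairs from test-case lines."""
    cases = []
    for line in text.splitlines():
        m = CASE_RE.match(line.strip())
        if m:
            cases.append((m.group("args"), m.group("expected")))
    return cases

cases = parse_cases(SNIPPET)
```

A generator would then render each extracted case as an assertion in the target language (e.g., a Go test function), which is the idea behind the tool described below.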
The full benchmark data is in benchmark/poly_humaneval.testdsl, which describes the function signatures and test cases for all 164 programming problems in HumanEval. Our handcrafted solutions (also used as the source code for translation) are in benchmark/poly_humaneval_sol.json.
We have developed a tool that parses the TestDSL data, generates test programs in all 14 programming languages, and executes them to collect the results. Its artifacts are placed in the evaluation folder; an example of its usage is shown in evaluation/code/example.py.
For easier reproduction, we use Nix to manage the dependencies in evaluation/shell.nix.
Follow these steps to reproduce the runtime environment:
> git clone https://github.com/AnonymousUser257/poly-humaneval
> cd evaluation
> nix-shell --pure
Once the environment is set up, you can evaluate all the gold solutions by running the Python script:
> python code/check_gold_solution_parallel.py
which should not print any failure messages.
Then, you can evaluate translation results generated by LLMs:
> mkdir ./output
> python code/check_generated_parallel.py --input data/RQ1-2_CodeLlama-13B.json --output output/evaluate_result.json
> python code/calculate_ca.py --input output/evaluate_result.json --output output/ca_result.json
All the LLM-generated results from our experiments are placed in llm_generated_translations.
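For intuition, the CA (computational accuracy) metric is essentially the fraction of translated programs that pass all of their test cases. The following is a hypothetical sketch assuming a simple problem-id-to-pass-flag mapping; the actual schema of evaluate_result.json and the logic in calculate_ca.py may differ.

```python
def computational_accuracy(results):
    """Fraction of problems whose translation passed all test cases.

    results: dict mapping a problem id to a boolean "all tests passed"
    flag (an assumed, simplified schema for illustration only).
    """
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Illustrative input: two of three translations pass their tests.
example = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": True}
ca = computational_accuracy(example)
```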
The artifacts for self-training are placed in self_training_data.
self_training_data/generated_py_codes.json is the Python code generated with CodeLlama-13B.
self_training_data/generated.testdsl is the generated Python test cases in TestDSL format.
self_training_data/fine-tune-prompts is the Python-Go parallel data for fine-tuning CodeLlama-13B.
self_training_data/fine-tune-lora-checkpoints is the LoRA adapter model checkpoints after self-training.
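Python-Go parallel data can be rendered into fine-tuning prompts in several ways. The sketch below shows one hypothetical template; make_prompt and its exact wording are illustrative assumptions, not the template actually stored in self_training_data/fine-tune-prompts.

```python
def make_prompt(py_code, go_code):
    """Pair a Python snippet with its Go translation as a
    prompt/completion record (hypothetical format for illustration)."""
    prompt = (
        "Translate the following Python code to Go.\n"
        "### Python\n" + py_code + "\n### Go\n"
    )
    return {"prompt": prompt, "completion": go_code}

record = make_prompt("def f():\n    pass", "func f() {}")
```

A fine-tuning pipeline would typically serialize such records to JSON Lines and feed them to the trainer alongside the LoRA configuration.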