Artifacts for APSEC 2024 Submission: "Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?"
We construct the PolyHumanEval benchmark, a variant of OpenAI HumanEval that supports 14 programming languages: C++, C#, Dart, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Swift, and TypeScript.
The PolyHumanEval benchmark is described in TestDSL, a domain-specific language for testing.
For example, the has_close_elements problem in HumanEval:
def has_close_elements(numbers: List[float], threshold: float) -> bool:
...
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
...is described by the following TestDSL code:
code {
func has_close_elements(numbers: list<double>, threshold: double) -> bool
}
tests {
template nse {
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
...
}
}
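To make the format concrete, here is a minimal, illustrative Python sketch that extracts the `(args) -> expected` test-case lines from the fragment above. It is only a sketch: the real TestDSL grammar (and the parser shipped in the evaluation folder) is considerably richer than this.

```python
import re

# The TestDSL fragment from the example above (hard-coded for illustration).
SNIPPET = """
code {
func has_close_elements(numbers: list<double>, threshold: double) -> bool
}
tests {
template nse {
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
}
}
"""

# Matches a single test-case line of the form "(args) -> expected".
CASE_RE = re.compile(r"^\((?P<args>.*)\)\s*->\s*(?P<expected>\S+)$")

def parse_cases(text):
    """Collect (args, expected) pairs from test-case lines."""
    cases = []
    for line in text.splitlines():
        m = CASE_RE.match(line.strip())
        if m:
            cases.append((m.group("args"), m.group("expected")))
    return cases

cases = parse_cases(SNIPPET)
```

A generator would then render each extracted case as an assertion in the target language (e.g., a Go test function), which is the idea behind the tool described below.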
The full benchmark data is in benchmark/poly_humaneval.testdsl, which describes the function signatures and test cases for all 164 programming problems in HumanEval. Our handcrafted solutions (also used as the source code for translation) are in benchmark/poly_humaneval_sol.json.
We have developed a tool that parses the TestDSL data, generates test programs in all 14 programming languages, and executes them to collect the results. Its artifacts are placed in the evaluation folder; an example of its usage is shown in evaluation/code/example.py.
For easier reproduction, we use Nix to manage the dependencies in evaluation/shell.nix.
Follow these steps to reproduce the runtime environment:
> git clone https://github.com/AnonymousUser257/poly-humaneval
> cd evaluation
> nix-shell --pure
Once the environment is set up, you can evaluate all the gold solutions by running the Python script:
> python code/check_gold_solution_parallel.py
which should not print any failure messages.
Then, you can evaluate translation results generated by LLMs:
> mkdir ./output
> python code/check_generated_parallel.py --input data/RQ1-2_CodeLlama-13B.json --output output/evaluate_result.json
> python code/calculate_ca.py --input output/evaluate_result.json --output output/ca_result.json
All the LLM-generated results from our experiments are placed in llm_generated_translations.
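For intuition, the CA (computational accuracy) metric is essentially the fraction of translated programs that pass all of their test cases. The following is a hypothetical sketch assuming a simple problem-id-to-pass-flag mapping; the actual schema of evaluate_result.json and the logic in calculate_ca.py may differ.

```python
def computational_accuracy(results):
    """Fraction of problems whose translation passed all test cases.

    results: dict mapping a problem id to a boolean "all tests passed"
    flag (an assumed, simplified schema for illustration only).
    """
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Illustrative input: two of three translations pass their tests.
example = {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": True}
ca = computational_accuracy(example)
```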
The artifacts for self-training are placed in self_training_data.
self_training_data/generated_py_codes.json is the Python code generated with CodeLlama-13B.
self_training_data/generated.testdsl is the generated Python test cases in TestDSL format.
self_training_data/fine-tune-prompts is the Python-Go parallel data for fine-tuning CodeLlama-13B.
self_training_data/fine-tune-lora-checkpoints is the LoRA adapter model checkpoints after self-training.
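Python-Go parallel data can be rendered into fine-tuning prompts in several ways. The sketch below shows one hypothetical template; make_prompt and its exact wording are illustrative assumptions, not the template actually stored in self_training_data/fine-tune-prompts.

```python
def make_prompt(py_code, go_code):
    """Pair a Python snippet with its Go translation as a
    prompt/completion record (hypothetical format for illustration)."""
    prompt = (
        "Translate the following Python code to Go.\n"
        "### Python\n" + py_code + "\n### Go\n"
    )
    return {"prompt": prompt, "completion": go_code}

record = make_prompt("def f():\n    pass", "func f() {}")
```

A fine-tuning pipeline would typically serialize such records to JSON Lines and feed them to the trainer alongside the LoRA configuration.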