
Poly-HumanEval

Artifacts for APSEC 2024 Submission: "Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?"

Benchmark

We construct the PolyHumanEval benchmark, a variant of OpenAI HumanEval that supports 14 programming languages: C++, C#, Dart, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Swift, and TypeScript.

TestDSL

The PolyHumanEval benchmark is described in TestDSL, a domain-specific language for specifying tests.

For example, the has_close_element problem in HumanEval:

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    ...
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
...

is described by the following TestDSL code:

code {
    func has_close_elements(numbers: list<double>, threshold: double) -> bool
}
tests {
    template nse {
        ([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
        ([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
        ...
    }
}
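To illustrate the "(arguments) -> expected" case syntax above, here is a minimal Python sketch that extracts argument/expected-value pairs from a TestDSL-style test template. This is a hypothetical illustration, not the repository's actual parser, and it only handles literal arguments and the lowercase true/false booleans shown in the example.

```python
import ast
import re

# Matches one TestDSL-style case line: "(args...) -> expected"
CASE_RE = re.compile(r"^\((?P<args>.*)\)\s*->\s*(?P<expected>.+)$")

def parse_cases(template_body: str):
    """Parse lines like '([1.0, 2.0], 0.3) -> true' into (args, expected) pairs.

    Hypothetical sketch: only literal arguments are supported.
    """
    cases = []
    for line in template_body.strip().splitlines():
        m = CASE_RE.match(line.strip())
        if not m:
            continue  # skip non-case lines (e.g. "..." placeholders)
        # TestDSL writes booleans as true/false; map to Python literals.
        norm = lambda s: s.replace("true", "True").replace("false", "False")
        # Trailing comma forces a tuple even for single-argument cases.
        args = ast.literal_eval(norm("(" + m.group("args") + ",)"))
        expected = ast.literal_eval(norm(m.group("expected")))
        cases.append((args, expected))
    return cases

template = """
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) -> true
([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) -> false
"""
print(parse_cases(template))
```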

The whole benchmark data is provided in benchmark/poly_humaneval.testdsl, which describes the function signatures and test cases for all 164 programming problems in HumanEval. Our handcrafted solutions (also used as the source code for translation) are provided in benchmark/poly_humaneval_sol.json.

Test Generation & Evaluation Tool

We have developed a tool that parses the TestDSL data, generates test programs in all 14 programming languages, and executes them to collect the results. Its artifacts are placed in the evaluation folder, and an example of its usage is shown in evaluation/code/example.py.

For easier reproduction, we use Nix (evaluation/shell.nix) to manage the dependencies.

Follow these steps to reproduce the runtime environment:

> git clone https://github.com/AnonymousUser257/poly-humaneval
> cd evaluation
> nix-shell --pure

Once the environment is set up, you can evaluate all the gold solutions by running the Python script:

> python code/check_gold_solution_parallel.py

which should not print any failure messages.

Then, you can evaluate translation results generated by LLMs:

> mkdir ./output
> python code/check_generated_parallel.py --input data/RQ1-2_CodeLlama-13B.json --output output/evaluate_result.json
> python code/calculate_ca.py --input output/evaluate_result.json --output output/ca_result.json
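Conceptually, the final step aggregates per-translation pass/fail results into computational accuracy (CA), i.e. the fraction of translations that pass all test cases. The sketch below is a hedged illustration of that aggregation, not the repository's calculate_ca.py; the record layout (a "pair" key naming the source->target language pair and a "passed" flag) is an assumption for this example.

```python
from collections import defaultdict

def calculate_ca(results):
    """Compute computational accuracy per translation pair.

    results: list of {"pair": "<src>-><tgt>", "passed": bool} records
    (hypothetical layout; the repository's JSON may differ).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r["pair"]] += 1
        passed[r["pair"]] += int(r["passed"])
    # CA = (# translations passing all tests) / (# translations attempted)
    return {pair: passed[pair] / totals[pair] for pair in totals}

sample = [
    {"pair": "python->java", "passed": True},
    {"pair": "python->java", "passed": False},
    {"pair": "java->go", "passed": True},
]
print(calculate_ca(sample))  # CA of 0.5 for python->java, 1.0 for java->go
```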

Other Artifacts

All the LLM-generated results in our experiments are placed in llm_generated_translations.

The artifacts for self-training are placed in self_training_data.
