Link: https://ieeexplore.ieee.org/document/10764959
This is the repository for our ASE 2024 paper "How Effective Do Code Language Models Understand Poor-Readability Code?". It includes the benchmark suite, results, methods for acquiring and preparing materials, and the source code of our automatic scoring tool. We hope this artifact can motivate and support future research on code summarization.
- Script to construct perturbed datasets from source data.
- Automatic inference scripts. Models: CodeBERT, CodeT5, CodeLlama. Programming languages: Go, Java, Python. Data types: source data and perturbation-generated data.
- Script for automatic scoring. Scoring targets: the inference results of CodeBERT, CodeT5, CodeLlama, and GPT-4o. Evaluation metrics: BLEU score, BERTScore, P-value.
Experiments are conducted using Python 3.9.7 on an Ubuntu 22.04.1 server.
To install all required packages, clone the repository and run the following from its root directory:

```shell
git clone https://github.com/ythere-y/PoorCodeSumEval.git
cd PoorCodeSumEval
pip install -r requirements.txt
```

Then download the datasets:

- Get the CodeXGLUE dataset from https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text
- Get TL-CodeSum from https://github.com/xing-hu/TL-CodeSum
- Get DeepCom from https://github.com/xing-hu/EMSE-DeepCom
- process_data/RobustCodeSum.py processes Python code.
- process_data/RobustCodeSumGo.py processes Go code.
- process_data/RobustCodeSumJava.py processes Java code.
Use Python & CodeXGLUE as an example to construct the IOE perturbation dataset.

Set DATASET_PATH = "path_to_code_x/code_x_glue_ct_code_to_text" in the main function:

```python
if __name__ == "__main__":
    robust = PythonRobustCodeSum()
    robust.gen_IOE()
```

Then run:

```shell
python process_data/RobustCodeSum.py
```

The resulting dataset will be saved to local_data/single/semantic/IOE/python/CSN.
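The exact IOE transformation is defined in the paper and implemented in process_data/RobustCodeSum.py. As a rough illustration of what an identifier-level, semantics-preserving perturbation looks like, here is a hypothetical sketch (the class name `RenameIdentifiers` and the `var_N` naming scheme are our own illustration, not the repository's code):

```python
# Hedged sketch of an identifier perturbation; NOT the repository's IOE code.
# A real transformation must keep builtins and API calls intact -- this naive
# version renames every Name node it sees.
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Replace each distinct identifier with an opaque placeholder."""

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if node.id not in self.mapping:
            self.mapping[node.id] = f"var_{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node


def perturb(source: str) -> str:
    """Parse, rename identifiers, and unparse back to source text."""
    tree = ast.parse(source)
    tree = RenameIdentifiers().visit(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


print(perturb("total = price * count"))
```

The perturbed code is behaviorally identical but strips the human-readable naming that a summarization model might rely on, which is the kind of readability degradation the benchmark probes.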
Run inference with CodeBERT, CodeT5, or CodeLlama-7b using the scripts in tasks.
Use Go & CodeLlama & FNE as an example to conduct inference.

Edit the main function like this to set the language and the dataset type:

```python
if __name__ == "__main__":
    lang_name = "go"
    limit = 2000
    single_dataset_gen(
        partition_name="single",
        type_name="semantic",
        mode_name="FNE",
        task_name="work",
        lang_name=lang_name,
        limit=limit,
    )
```

Then run:

```shell
python tasks/single_llama_task.py
```

The output includes the inference results of CodeLlama-7b on the Go dataset with FNE perturbation, together with the reference summaries.
The results will be saved to ref_and_gen/codellama-7b/single/semantic/FNE/go/work_gen_[0-2000].json and ref_and_gen/codellama-7b/single/semantic/FNE/go/work_ref_[0-2000].json.
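The generated and reference files come in aligned pairs. As a hedged sketch of how such a pair can be loaded and aligned for scoring (the flat list-of-strings JSON layout is an assumption here, not the repository's documented format):

```python
# Hedged sketch: the exact JSON layout of the work_gen_* / work_ref_* files
# is assumed to be a flat list of summary strings; this only illustrates
# pairing generated output with references before scoring.
import json
import os
import tempfile


def load_pairs(gen_path: str, ref_path: str):
    """Load generated and reference summaries and zip them into pairs."""
    with open(gen_path) as f:
        gen = json.load(f)
    with open(ref_path) as f:
        ref = json.load(f)
    if len(gen) != len(ref):
        raise ValueError("generated and reference files are misaligned")
    return list(zip(gen, ref))


# Demo with throwaway files standing in for the real output paths.
tmp = tempfile.mkdtemp()
gen_path = os.path.join(tmp, "work_gen_demo.json")
ref_path = os.path.join(tmp, "work_ref_demo.json")
with open(gen_path, "w") as f:
    json.dump(["opens a file"], f)
with open(ref_path, "w") as f:
    json.dump(["open the given file for reading"], f)
pairs = load_pairs(gen_path, ref_path)
```

Each (generated, reference) tuple is then what a metric such as BLEU or BERTScore consumes.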
The scripts in scores calculate BLEU score, BERTScore, and P-value.
```python
if __name__ == "__main__":
    reset_summary("CodeLlama-7b-hf")
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        print(f"start scoring model : {model_name}, lang : {lang_name}")
        AllBLEUScore(model_name, lang_name, task_name, start_point, limit)
        AllBERTScore(model_name, lang_name, task_name, start_point, limit)
        t1 = time.time()
```

Then run:

```shell
python scores/bleu_BERTScore.py
```
Description: This script will read the inference results from the default path of the model and calculate the BLEU and BERT scores.
The result details will be saved into scores/CodeLlama-7b-hf, and the summary of the scores will be saved into scores/CodeLlama-7b-hf/summary.json.
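The repository's BLEU implementation lives in scores/bleu_BERTScore.py and is not restated here. As a self-contained illustration of what a sentence-level BLEU with uniform weights and a brevity penalty computes (the add-one smoothing is our choice for the sketch; the actual script may smooth differently):

```python
# Hedged illustration of sentence-level BLEU; NOT the repository's exact code.
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """BLEU with uniform 1..max_n weights, brevity penalty, and add-one
    smoothing on the modified n-gram precisions to avoid log(0)."""
    ref, cand = reference.split(), candidate.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram matches.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 and disjoint sentences score near 0, which is the behavior the per-perturbation score comparisons rely on.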
```python
def analysis_and_log():
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    score_name = "BERTScore"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        ALLSignificant(model_name, lang_name, task_name, start_point, score_name, limit)
```

Then run:

```shell
python scores/significant.py
```

Description: This script reads the BERTScore results from the default path of the model and calculates the P-value.
The result details are printed directly.
Some explanations of common questions and experiments on the P-value can be found in appendix.pdf.
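The statistical test used by scores/significant.py is defined there and in the paper; one common choice for comparing per-example scores before and after perturbation is a paired permutation test, which can be sketched as (illustrative only, not the repository's implementation):

```python
# Hedged sketch of a two-sided paired permutation test on the mean
# per-example score difference; NOT the repository's significance code.
import random


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """P-value for the null hypothesis that the paired scores in
    scores_a and scores_b come from the same distribution."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each pair's labels are exchangeable: flip signs.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A small P-value indicates that the score drop under perturbation is unlikely to be noise, which is the question the significance analysis answers.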