This repository contains the code to replicate the experiments described in the paper "On LLMs’ Internal Representation of Code Correctness" submitted to ICSE'26.
You may want to set up your environment first (see Makefile):
make setup
The script run_eval.sh accepts a configuration file specifying evaluation parameters. Example config files are available in the eval_configs directory.
Usage:
./run_eval.sh <some_config.json>
Alternatively, you can run a single evaluation directly using the Python script:
Usage:
# Intrinsic (Standard) evaluation on BigCodeBench
python evaluate.py --code \
--std \
-m <model_name> \
--test <path/to/test.jsonl> \
--output-dir <output_dir>
# Reflective Regular (Heuristic) evaluation (with user/assistant token tags)
# For Reflective True/False, use --tf-heu instead of --heu.
python evaluate.py --code \
--heu \
-m <model_name> \
--test <path/to/test.jsonl> \
--concept correctness \
--user-start "<|user|>" \
--assistant-start "<|assistant|>" \
--output-dir <output_dir>
# LAT (representation engineering) evaluation
python evaluate.py --code \
--lat \
-m <model_name> \
--test <path/to/test.jsonl> \
--fit <path/to/fit.jsonl> \
--validate <path/to/validate.jsonl> \
--stimulus synthetic-code \
--concept correctness \
--user-start "<|user|>" \
--assistant-start "<|assistant|>" \
--output-dir <output_dir>
The following helper scripts can be used to aggregate and format results:
- mrr.sh: Compute Mean Reciprocal Rank (MRR) from RQ2's outputs.
  # LAT usage:
  ./mrr.sh <results_dir>
  # Non-LAT metrics only:
  ./mrr.sh --non-lat <non_lat_results_parent_dir>
- x_fold_calc.py: Aggregate cross-validation (x-fold) LAT metrics across folds.
  # Aggregate outer-fold LAT metrics:
  python x_fold_calc.py
  # Include inner-fold aggregation (for MBPP+ and Synthetic stimuli):
  python x_fold_calc.py --inner-folds
- selection_table.py: Generate a Markdown table of pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for HumanEval in RQ2.
  python selection_table.py
- selection_table_bigcodebench.py: Generate a Markdown table of pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for BigCodeBench in RQ2.
  python selection_table_bigcodebench.py
- correctness_table.py: Compile correctness accuracies (intrinsic, reflective, LAT) into a Markdown table for RQ1.
  python correctness_table.py <results_dir>
- compress.sh: Convenience script that heavily compresses results, using parallelization.
  ./compress.sh <directory_to_compress> <output_base_name>
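For readers less familiar with the RQ2 metrics reported by the helper scripts above, here is a minimal Python sketch of Mean Reciprocal Rank and of the widely used unbiased pass@k estimator. It is an illustration only: the function names and input shapes are assumptions, and the repository's scripts may define or compute these values differently (for instance, pass@k over ranked candidates rather than over random samples).

```python
from math import comb
from typing import Sequence


def reciprocal_rank(ranked_correct: Sequence[bool]) -> float:
    """Return 1/rank of the first correct candidate, or 0.0 if none is correct.

    Assumption: candidates are already sorted from best to worst score.
    """
    for rank, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(tasks: Sequence[Sequence[bool]]) -> float:
    """Average the reciprocal rank over all tasks."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(reciprocal_rank(task) for task in tasks) / len(tasks)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per task, c: samples that pass, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical rankings: True marks a candidate that passes its tests.
    rankings = [[False, True, False], [True, False], [False, False]]
    print(f"MRR: {mean_reciprocal_rank(rankings):.3f}")
    print(f"pass@1 with 3 of 10 passing: {pass_at_k(10, 3, 1):.3f}")
```

Under this definition, a task whose first correct candidate is ranked second contributes 0.5 to the mean, and a task with no correct candidate contributes 0.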
Note: In general, all data is already prepared. If you want to adapt the pipeline to your own data, or see how we organized and processed ours, keep reading.
Under data/ we organize prepared datasets and processing scripts. The contents in data/ are to be used as is:
- bigcodebench/
  - BigCodeBench.jsonl
    The full BigCodeBench dataset (tasks, prompts, reference solutions).
  - tested-*.json
    Model-specific evaluated generations (e.g. tested-gpt-4o.json, tested-gemini-2.0.json, tested-llama-3.3-70b.json). GPT-4o, Gemini-2.0, and Llama-3.3-70B are three of the top performers according to the leaderboard for which datasets were available (as of 24 February 2025).
    Note: The tested-* and eval-tested-* files were obtained from BigCodeBench's evaluator. Submit the respective *.jsonl to obtain the test results.
  - scripts/
    - prep_data.py
      Creates train/validate/test splits and assembles “plausible wrong” implementations.
    - x_fold/
      Cross-validation version: prep_data_x_fold.py produces k-fold fit.jsonl/validate.jsonl/test.jsonl.
- instruct-humaneval/
  - instructional-HumanEval.jsonl
    The full HumanEval dataset with docstrings in instructional format.
  - scripts/generate_raw/
    Shell scripts (gen_raw.sh, eval_gen.sh) to run bigcode-evaluation-harness and collect raw generations (see setup-harness-env and data in the Makefile).
  - prep_data.py and x_fold/prep_data.py
    Identify common failing tasks across models and produce fit/validate/test splits in JSONL.
  - raw/ (after running gen_raw.sh)
    Contains the raw model outputs obtained per task. Manually moved here after running bigcode-evaluation-harness.
  - detailed_results/ (after running eval_gen.sh)
    Evaluation logs and detailed per-sample results.
  - x_fold/
    Each outer-fold's prepared data.
- mbpp/
  - prep_data.py
    Converts MBPP+ into (prompt, correct implementation, incorrect implementation) triplets and splits them into fit/validate/test.
  - x_fold/
    Each inner-fold's prepared data.
- synthetic/
  - synthetic.jsonl
    Full synthetic data: 20 triplets.
  - x_fold/
    Each inner-fold's prepared data.
  - scripts/prep_data.py
    Script that prepared the data already present in x_fold/.
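If you want to prepare your own data in a similar layout, the sketch below shows one way to produce per-fold fit.jsonl/validate.jsonl/test.jsonl files. It is not the repository's prep_data.py: the function names, the fold_<i> directory naming, the validation fraction, and the assumption that every record is a JSON-serializable dict are all hypothetical choices made for illustration.

```python
import json
import random
from pathlib import Path
from typing import Iterator


def k_fold_splits(records: list, k: int, validate_fraction: float = 0.2,
                  seed: int = 0) -> Iterator[tuple]:
    """Yield (fit, validate, test) lists for each of the k folds.

    Each record lands in the test split of exactly one fold. The remaining
    records are divided into validate and fit (hypothetical 20/80 default).
    """
    if not 2 <= k <= len(records):
        raise ValueError("k must be between 2 and the number of records")
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        rest = [r for j, fold in enumerate(folds) if j != i for r in fold]
        n_validate = max(1, int(len(rest) * validate_fraction))
        yield rest[n_validate:], rest[:n_validate], test


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def write_folds(records: list, k: int, out_dir: str) -> None:
    """Write fit/validate/test files under out_dir/fold_<i>/ (assumed layout)."""
    for index, (fit, validate, test) in enumerate(k_fold_splits(records, k)):
        fold_dir = Path(out_dir) / f"fold_{index}"
        write_jsonl(fold_dir / "fit.jsonl", fit)
        write_jsonl(fold_dir / "validate.jsonl", validate)
        write_jsonl(fold_dir / "test.jsonl", test)


if __name__ == "__main__":
    # Hypothetical records standing in for real triplets.
    demo = [{"id": i} for i in range(20)]
    write_folds(demo, k=5, out_dir="x_fold_demo")
```

With 20 records and k=5, each fold in this sketch holds 4 test records, 3 validation records, and 13 fit records.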