Representation Engineering for Source Code

This repository contains the code to replicate the experiments described in the paper "On LLMs’ Internal Representation of Code Correctness" submitted to ICSE'26.

Quickstart

Set up your environment with the following command (see the Makefile for details):

make setup

Running Evaluations

The script run_eval.sh accepts a configuration file specifying evaluation parameters. Example config files are available in the eval_configs directory.

Usage:

./run_eval.sh <some_config.json>

Individual Evaluation with `evaluate.py`

Alternatively, run a single evaluation directly with the Python script:

Usage:

# Intrinsic (Standard) evaluation on BigCodeBench
python evaluate.py --code \
  --std \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --output-dir <output_dir>

# Reflective Regular (Heuristic) evaluation (with user/assistant token tags)
# Use --tf-heu instead of --heu for Reflective True/False
python evaluate.py --code \
  --heu \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --concept correctness \
  --user-start "<|user|>" \
  --assistant-start "<|assistant|>" \
  --output-dir <output_dir>

# LAT (representation engineering) evaluation
python evaluate.py --code \
  --lat \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --fit <path/to/fit.jsonl> \
  --validate <path/to/validate.jsonl> \
  --stimulus synthetic-code \
  --concept correctness \
  --user-start "<|user|>" \
  --assistant-start "<|assistant|>" \
  --output-dir <output_dir>
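
When the same LAT evaluation has to be repeated over several folds, it can help to assemble the command line programmatically. The following is a minimal sketch that only reuses the flags shown above. The function name `build_lat_command`, the model name, and the `folds/fold_<i>` layout are hypothetical and not part of this repository.

```python
import shlex


def build_lat_command(model, test, fit, validate, output_dir,
                      user_start="<|user|>", assistant_start="<|assistant|>"):
    """Assemble the LAT invocation shown above as an argv list.

    Nothing is executed here; pass the result to subprocess.run(argv, check=True)
    to actually launch the evaluation.
    """
    return [
        "python", "evaluate.py", "--code", "--lat",
        "-m", model,
        "--test", test,
        "--fit", fit,
        "--validate", validate,
        "--stimulus", "synthetic-code",
        "--concept", "correctness",
        "--user-start", user_start,
        "--assistant-start", assistant_start,
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    # Hypothetical model name and fold layout, for illustration only.
    for fold in range(3):
        base = f"folds/fold_{fold}"
        argv = build_lat_command(
            "my-model",
            f"{base}/test.jsonl",
            f"{base}/fit.jsonl",
            f"{base}/validate.jsonl",
            f"out/fold_{fold}",
        )
        # shlex.join quotes the token tags so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Building an argv list instead of a single string avoids quoting problems with the `<|user|>` and `<|assistant|>` tags, which contain shell metacharacters.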

Analyzing the Data

The following helper scripts can be used to aggregate and format results:

mrr.sh: Compute Mean Reciprocal Rank (MRR) from RQ2's outputs

# LAT usage:
./mrr.sh <results_dir>
# Non‐LAT metrics only:
./mrr.sh --non-lat <non_lat_results_parent_dir>
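
For reference, Mean Reciprocal Rank averages, over all tasks, the reciprocal of the rank at which the first correct candidate appears. The sketch below illustrates the metric itself and is not the code in mrr.sh or mrr.py. The input format (one list of booleans per task, best-ranked first) is an assumption, as is the choice to score tasks without any correct candidate as 0.

```python
def mean_reciprocal_rank(ranked_labels):
    """Compute MRR over tasks.

    ranked_labels: one list per task, ordered best-ranked first, where
    True marks a correct candidate (assumed input format).
    """
    if not ranked_labels:
        return 0.0
    total = 0.0
    for labels in ranked_labels:
        for rank, is_correct in enumerate(labels, start=1):
            if is_correct:
                total += 1.0 / rank  # only the first correct candidate counts
                break
        # Assumption: a task with no correct candidate contributes 0.
    return total / len(ranked_labels)


if __name__ == "__main__":
    example = [
        [False, True, False],  # first correct at rank 2 -> 1/2
        [True, False],         # first correct at rank 1 -> 1
        [False, False],        # no correct candidate  -> 0
    ]
    print(mean_reciprocal_rank(example))  # (0.5 + 1 + 0) / 3 = 0.5
```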

x_fold_calc.py: Aggregate cross‐validation (x‐fold) LAT metrics across folds

# Aggregate outer‐fold LAT metrics:
python x_fold_calc.py
# Include inner‐fold aggregation (for MBPP+ and Synthetic stimuli):
python x_fold_calc.py --inner-folds
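
Aggregating across folds typically means reporting the mean and spread of each metric over the folds. The following sketch shows that idea under stated assumptions: the per-fold input format (one dict of metric name to value per fold), the function name `aggregate_folds`, and the use of the sample standard deviation are illustrative choices, not necessarily what x_fold_calc.py does.

```python
from statistics import mean, stdev


def aggregate_folds(fold_metrics):
    """Summarize each metric across folds.

    fold_metrics: list with one dict per fold, mapping a metric name to its
    value on that fold (assumed format). All folds must share the same keys.
    """
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        summary[name] = {
            "mean": mean(values),
            # The sample standard deviation needs at least two folds.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    folds = [{"accuracy": 0.70}, {"accuracy": 0.80}, {"accuracy": 0.90}]
    print(aggregate_folds(folds))
```

With nested cross-validation, the same function could be applied first to the inner folds of each outer fold and then to the resulting outer-fold means.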

selection_table.py: Generate a Markdown table covering pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for HumanEval in RQ2:
```
python selection_table.py
```
selection_table_bigcodebench.py: Generate a Markdown table covering pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for BigCodeBench in RQ2:
```
python selection_table_bigcodebench.py
```
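
As background for the pass@k columns, the widely used unbiased pass@k estimator gives the probability that at least one of k samples, drawn without replacement from n generations of which c pass, is correct. The sketch below implements that standard estimator. Whether the selection table scripts compute pass@k this way or through a ranking-based top-k selection is not stated here, so treat it as an illustration. The choice of n = 10 in the example is likewise an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of generated samples for a task
    c: number of those samples that pass the tests
    k: number of samples drawn without replacement
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failing samples: any draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Assumed setting: 10 samples per task, 3 of which pass.
    for k in (1, 5, 10):
        print(k, pass_at_k(10, 3, k))
```

With n = 10, pass@1 reduces to c / 10 (the expected outcome of picking one sample at random) and pass@10 is 1 whenever at least one sample passes, which matches the baseline and ceiling roles named above.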
correctness_table.py: Compile correctness accuracies (intrinsic, reflective, LAT) into a Markdown table for RQ1:
```
python correctness_table.py <results_dir>
```
compress.sh: Convenience helper that heavily compresses results, using parallelization:
```
./compress.sh <directory_to_compress> <output_base_name>
```

Data Directory Structure

Note: All data is already prepared. Keep reading if you want to adapt the pipeline to your own data, or to see how we organized and processed ours.

Under data/ we organize prepared datasets and processing scripts. The contents in data/ are to be used as is:

bigcodebench/
- BigCodeBench.jsonl
  The full BigCodeBench dataset (tasks, prompts, reference solutions).
- tested-*.json
  Model‐specific evaluated generations (e.g. tested-gpt-4o.json, tested-gemini-2.0.json, tested-llama-3.3-70b.json).
  
  GPT-4o, Gemini-2.0, and Llama-3.3-70B are 3 of the top performers according to the leaderboard for which there were available datasets (as of 24/02/2025).
  
  Note: tested-* and eval-tested-* were obtained from BigCodeBench's evaluator. Submit the respective *.jsonl to obtain the test results.
- scripts/
  - prep_data.py
    Creates train/validate/test splits and assembles “plausible wrong” implementations.
  - x_fold/
    Cross‐validation version: prep_data_x_fold.py produces k‐fold fit.jsonl / validate.jsonl / test.jsonl.
instruct-humaneval/
- instructional‐HumanEval.jsonl
  The full HumanEval dataset with docstrings in instructional-format.
- scripts/
  - generate_raw/
    Shell scripts (gen_raw.sh, eval_gen.sh) that run bigcode‐evaluation‐harness and collect raw generations (see Makefile's setup-harness-env and data).
  - prep_data.py and x_fold/prep_data.py
    Identify common failing tasks across models and produce fit/validate/test splits in JSONL.
- raw/ (after running gen_raw.sh)
  Contains the raw model outputs per task, moved here manually after running bigcode-evaluation-harness.
- detailed_results/ (after running eval_gen.sh)
  Evaluation logs and detailed per‐sample results.
- x_fold/
  Each outer-fold's prepared data.
mbpp/
- prep_data.py
  Converts MBPP+ into triplets of (prompt, correct/incorrect implementations) and splits into fit/validate/test.
- x_fold/
  Each inner-fold's prepared data.
synthetic/
- synthetic.jsonl
  Full synthetic data: 20 triplets
- x_fold/
  Each inner-fold's prepared data.
- scripts/prep_data.py
  Script that prepared the data already present in x_fold/.
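
To adapt the layout above to your own data, the k-fold preparation can be sketched as follows. This is a minimal illustration, not the logic of the prep_data scripts: the triplet field names (`prompt`, `correct`, `incorrect`), the rotation scheme that picks the validation fold, and both function names are assumptions.

```python
import json
import random


def k_fold_splits(records, k=5, seed=0):
    """Yield (fit, validate, test) lists for each of k folds.

    Fold i serves as the test set, fold (i + 1) % k as the validation set,
    and the remaining folds are merged into the fit set (assumed scheme).
    """
    if k < 3:
        raise ValueError("need at least 3 folds for fit/validate/test")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        j = (i + 1) % k
        fit = [r for idx, fold in enumerate(folds) if idx not in (i, j) for r in fold]
        yield fit, folds[j], folds[i]


def write_jsonl(path, records):
    """Write one JSON object per line, as in fit.jsonl / validate.jsonl / test.jsonl."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # 20 placeholder triplets with hypothetical field names.
    triplets = [
        {"prompt": f"task {i}", "correct": "...", "incorrect": "..."}
        for i in range(20)
    ]
    for n, (fit, validate, test) in enumerate(k_fold_splits(triplets, k=5)):
        print(n, len(fit), len(validate), len(test))
        # write_jsonl(f"fold_{n}/fit.jsonl", fit)  # hypothetical output path
```

Each record appears in exactly one test set across the k folds, so per-fold test metrics can be aggregated without double counting.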

