sanadlab/code-repe

Representation Engineering for Source Code

This repository contains the code to replicate the experiments described in the paper "On LLMs’ Internal Representation of Code Correctness" submitted to ICSE'26.

Quickstart

Set up your environment with the following command (see the Makefile for details):

make setup

Running Evaluations

The script run_eval.sh accepts a configuration file specifying evaluation parameters. Example config files are available in the eval_configs directory.

Usage:

./run_eval.sh <some_config.json>

Individual Evaluation with evaluate.py

Alternatively, you can run a single evaluation directly with the Python script:

Usage:

# Intrinsic (Standard) evaluation on BigCodeBench
python evaluate.py --code \
  --std \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --output-dir <output_dir>

# Reflective Regular (Heuristic) evaluation (with user/assistant token tags)
# Use --tf-heu instead of --heu for Reflective True/False
python evaluate.py --code \
  --heu \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --concept correctness \
  --user-start "<|user|>" \
  --assistant-start "<|assistant|>" \
  --output-dir <output_dir>

# LAT (representation engineering) evaluation
python evaluate.py --code \
  --lat \
  -m <model_name> \
  --test <path/to/test.jsonl> \
  --fit <path/to/fit.jsonl> \
  --validate <path/to/validate.jsonl> \
  --stimulus synthetic-code \
  --concept correctness \
  --user-start "<|user|>" \
  --assistant-start "<|assistant|>" \
  --output-dir <output_dir>
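
When the LAT evaluation has to be launched repeatedly (for example, once per fold), it can help to assemble the command in Python. The following is a minimal sketch that uses only the flags shown above; the helper name `build_lat_command`, the model name, and the paths are hypothetical and not part of this repository.

```python
import shlex
from typing import List


def build_lat_command(
    model: str,
    test: str,
    fit: str,
    validate: str,
    output_dir: str,
    stimulus: str = "synthetic-code",
    concept: str = "correctness",
    user_start: str = "<|user|>",
    assistant_start: str = "<|assistant|>",
) -> List[str]:
    """Assemble the LAT invocation of evaluate.py as an argument list."""
    return [
        "python", "evaluate.py", "--code", "--lat",
        "-m", model,
        "--test", test,
        "--fit", fit,
        "--validate", validate,
        "--stimulus", stimulus,
        "--concept", concept,
        "--user-start", user_start,
        "--assistant-start", assistant_start,
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    # Hypothetical model name and paths, for illustration only.
    cmd = build_lat_command(
        model="my-model",
        test="folds/0/test.jsonl",
        fit="folds/0/fit.jsonl",
        validate="folds/0/validate.jsonl",
        output_dir="results/fold-0",
    )
    # Print a shell-safe rendering; the token tags are quoted automatically.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems, since the `|` characters in the token tags would otherwise be interpreted as pipes.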

Analyzing the Data

The following helper scripts can be used to aggregate and format results:

  • mrr.sh: Compute Mean Reciprocal Rank (MRR) from RQ2's outputs

    # LAT usage:
    ./mrr.sh <results_dir>
    # Non‐LAT metrics only:
    ./mrr.sh --non-lat <non_lat_results_parent_dir>
  • x_fold_calc.py: Aggregate cross‐validation (x‐fold) LAT metrics across folds

    # Aggregate outer‐fold LAT metrics:
    python x_fold_calc.py
    # Include inner‐fold aggregation (for MBPP+ and Synthetic stimuli):
    python x_fold_calc.py --inner-folds
  • selection_table.py: Generate a Markdown table covering pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for HumanEval in RQ2:

    python selection_table.py
  • selection_table_bigcodebench.py: Generate a Markdown table covering pass@1 (baseline), pass@k (k=1..5), and pass@10 (ceiling) for BigCodeBench in RQ2:

    python selection_table_bigcodebench.py
  • correctness_table.py: Compile correctness accuracies (intrinsic, reflective, LAT) into a Markdown table for RQ1:

    python correctness_table.py <results_dir>
  • compress.sh: Convenience script that heavily compresses results, using parallelization:

    ./compress.sh <directory_to_compress> <output_base_name>
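
For readers unfamiliar with the metric reported by mrr.sh, here is a minimal sketch of the standard Mean Reciprocal Rank definition. It does not reproduce the script's internals, which are not shown here. The input format (one list of correctness flags per task, ordered best-ranked first) is an assumption, as is the convention that a task with no correct candidate contributes 0.

```python
from typing import Iterable, Sequence


def mean_reciprocal_rank(ranked_labels: Iterable[Sequence[bool]]) -> float:
    """Average of 1/rank of the first correct candidate per task.

    Each inner sequence holds correctness flags for one task's candidates,
    ordered from best-ranked to worst-ranked (assumed input format).
    """
    total = 0.0
    count = 0
    for labels in ranked_labels:
        count += 1
        for rank, is_correct in enumerate(labels, start=1):
            if is_correct:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / count if count else 0.0


if __name__ == "__main__":
    example = [
        [False, True, False],  # first correct at rank 2 -> 1/2
        [True, False],         # first correct at rank 1 -> 1
        [False, False],        # no correct candidate   -> 0
    ]
    print(mean_reciprocal_rank(example))  # prints 0.5
```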

Data Directory Structure

Note: All data is already prepared. Keep reading if you want to adapt the pipeline to your own data, or to see how we organized and processed ours.

Under data/ we organize prepared datasets and processing scripts. The contents in data/ are to be used as is:

  • bigcodebench/

    • BigCodeBench.jsonl
      The full BigCodeBench dataset (tasks, prompts, reference solutions).

    • tested-*.json
      Model‐specific evaluated generations (e.g. tested-gpt-4o.json, tested-gemini-2.0.json, tested-llama-3.3-70b.json).

      GPT-4o, Gemini-2.0, and Llama-3.3-70B are three of the top performers on the leaderboard for which datasets were available (as of 24/02/2025).

      Note: The tested-* and eval-tested-* files were obtained from BigCodeBench's evaluator. Submit the respective *.jsonl file to the evaluator to obtain the test results.

    • scripts/

      • prep_data.py
        Creates train/validate/test splits and assembles “plausible wrong” implementations.
      • x_fold/
        Cross‐validation version: prep_data_x_fold.py produces k‐fold fit.jsonl / validate.jsonl / test.jsonl.
  • instruct-humaneval/

    • instructional‐HumanEval.jsonl
      The full HumanEval dataset with docstrings in instructional format.
    • scripts/
      • generate_raw/
        Shell scripts (gen_raw.sh, eval_gen.sh) that run bigcode‐evaluation‐harness and collect raw generations (see setup-harness-env and data in the Makefile).
      • prep_data.py and x_fold/prep_data.py
        Identify common failing tasks across models and produce fit/validate/test splits in JSONL.
    • raw/ (after running gen_raw.sh)
      Contains the raw model outputs per task, moved here manually after running bigcode-evaluation-harness.
    • detailed_results/ (after running eval_gen.sh)
      Evaluation logs and detailed per‐sample results.
    • x_fold/
      Each outer-fold's prepared data.
  • mbpp/

    • prep_data.py
      Converts MBPP+ into triplets of (prompt, correct implementation, incorrect implementation) and splits them into fit/validate/test.
    • x_fold/
      Each inner-fold's prepared data.
  • synthetic/

    • synthetic.jsonl
      Full synthetic data: 20 triplets.
    • x_fold/
      Each inner-fold's prepared data.
    • scripts/prep_data.py
      Script that prepared the data already present in x_fold/.
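
If you want to adapt the fold layout described above to your own data, the following sketch shows one possible way to produce k-fold fit/validate/test splits and write them as JSONL. It is an illustration under stated assumptions, not the logic of the repository's prep_data scripts: the rotation scheme (one fold for testing, the next for validation, the rest for fitting), the function names, and the `id` field are all hypothetical.

```python
import json
import random
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Record = Dict[str, object]


def k_fold_splits(
    records: Sequence[Record], k: int, seed: int = 0
) -> Iterator[Tuple[List[Record], List[Record], List[Record]]]:
    """Yield (fit, validate, test) for each of k folds.

    Assumed scheme: fold i is the test split, fold (i + 1) % k is the
    validation split, and all remaining folds are used for fitting.
    """
    if k < 3:
        raise ValueError("k must be at least 3 to get three disjoint splits")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        val_idx = (i + 1) % k
        fit = [
            record
            for j, fold in enumerate(folds)
            if j not in (i, val_idx)
            for record in fold
        ]
        yield fit, folds[val_idx], folds[i]


def write_jsonl(path: Path, records: Iterable[Record]) -> None:
    """Write one JSON object per line, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Hypothetical records; real entries would carry prompts and implementations.
    data = [{"id": i} for i in range(20)]
    for fold, (fit, validate, test) in enumerate(k_fold_splits(data, k=5)):
        out = Path("x_fold_example") / str(fold)
        write_jsonl(out / "fit.jsonl", fit)
        write_jsonl(out / "validate.jsonl", validate)
        write_jsonl(out / "test.jsonl", test)
```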
