Forks & Locks

Empirically testing whether per-token entropy-adaptive decoding can match SSD (Simple Self-Distillation) gains on code generation — and investigating what happens when you replicate SSD with different training data quality.

Key Findings

  1. Decode-time ceiling confirmed. Entropy-adaptive decoding (temperature, min-p, prob-gap blending) hits a structural ceiling: gains on hard problems come at the cost of easy problems. Net effect: -6 correct samples vs baseline, matching the SSD paper's theoretical prediction (Section B.5, Figure 4). A sketch of the mechanism appears after this list.

  2. Public vs private test gap. 93% of solutions that pass public tests fail private tests. SSD trained on public-test-filtered data shows +4.6% on public tests but -1.2% on private tests — the model learns to satisfy weak test cases rather than genuinely improving.

  3. Private-test SSD works. Retraining with private-test-validated data (163 examples from hybrid filtering) recovers genuine gains: +2.98% pass@1 on official evaluation.

  4. Token-level calibration. ~75% of tokens have entropy < 0.23 nats, so the lock/fork ratio (low-entropy "lock" tokens to high-entropy "fork" tokens) for a well-calibrated code model is at least 3:1.

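To make the mechanism concrete, here is a minimal sketch of one entropy-adaptive scheme (the temperature variant), assuming a linear blend between a low "lock" temperature and a high "fork" temperature. T_LOCK and T_FORK are illustrative constants, not the values actually swept; the 0.23-nat threshold is the calibration figure from Finding 4. See evaluate.py for the variants actually tested.

import torch
import torch.nn.functional as F

LOCK_ENTROPY = 0.23        # nats; ~75th-percentile threshold from calibration
T_LOCK, T_FORK = 0.2, 1.0  # illustrative temperatures, not the swept values

def adaptive_sample(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, vocab) next-token logits from the model
    probs = F.softmax(logits, dim=-1)
    # Shannon entropy in nats at each position
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    # Low entropy -> "lock" (near-greedy); high entropy -> "fork" (explore)
    blend = (entropy / LOCK_ENTROPY).clamp(max=1.0)
    temperature = T_LOCK + blend * (T_FORK - T_LOCK)
    scaled = F.softmax(logits / temperature.unsqueeze(-1), dim=-1)
    return torch.multinomial(scaled, num_samples=1).squeeze(-1)
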
See FINDINGS.md for the full write-up with tables and analysis.

Models Tested

  • Qwen2.5-Coder-7B-Instruct (primary)
  • Qwen2.5-Coder-1.5B-Instruct (scaling experiments)
  • Qwen3-4B-Instruct-2507 (initial calibration)

Hardware

  • NVIDIA RTX 4090 (24GB VRAM), WSL2

Benchmark

  • LiveCodeBench v6 (175 problems, Jan–Mar 2025)
  • pass@1 and pass@5 (unbiased estimator)
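
Pass rates use the standard unbiased estimator over n samples with c correct, as introduced in the Codex paper. Below is a minimal reference implementation, presumably matching what analyse.py computes:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # 1 - C(n-c, k) / C(n, k): the chance that at least one of k draws
    # (without replacement) from n samples, c of them correct, passes.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

pass_at_k(10, 3, 5) gives the per-problem estimate; averaging across problems gives the reported pass@5.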

Setup

conda create -n adaptive-decode python=3.11 -y
conda activate adaptive-decode
pip install torch transformers accelerate datasets
pip install vllm  # for fast batch inference

Or use the full setup script:

bash setup.sh

Running

# 1. Calibrate entropy distributions
python calibrate.py

# 2. Evaluate
python evaluate.py --mode baseline --samples 10
python evaluate.py --mode adaptive --samples 10
python evaluate.py --mode sweep --samples 10

# 3. Analyse
python analyse.py

See run_experiment.sh for the full pipeline.
