Empirically testing whether per-token entropy-adaptive decoding can match SSD (Simple Self-Distillation) gains on code generation — and investigating what happens when you replicate SSD with different training data quality.
- **Decode-time ceiling confirmed.** Entropy-adaptive decoding (temperature, min-p, prob-gap blending) hits a structural ceiling: gains on hard problems come at the cost of easy problems. Net effect: -6 correct samples vs. baseline, matching the SSD paper's theoretical prediction (Section B.5, Figure 4).
- **Public vs. private test gap.** 93% of solutions that pass public tests fail private tests. SSD trained on public-test-filtered data shows +4.6% on public tests but -1.2% on private tests: the model learns to satisfy weak test cases rather than genuinely improving.
- **Private-test SSD works.** Retraining with private-test-validated data (163 examples from hybrid filtering) recovers genuine gains: +2.98% pass@1 on the official evaluation.
- **Token-level calibration.** ~75% of tokens have entropy below 0.23 nats, and a well-calibrated code model shows a lock/fork ratio of at least 3:1.
See FINDINGS.md for the full write-up with tables and analysis.
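The entropy-adaptive schemes tested here share one moving part: a per-token entropy estimate that switches the sampler between a low-temperature "lock" regime and a higher-temperature "fork" regime. A minimal sketch of that mechanism is below; the 0.23-nat threshold echoes the calibration finding above, but the two temperature values are illustrative placeholders, not the settings used in these experiments:

```python
import numpy as np

def entropy_nats(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax of a logit vector."""
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def adaptive_temperature(logits: np.ndarray,
                         threshold: float = 0.23,   # nats; from calibration
                         t_lock: float = 0.2,       # illustrative value
                         t_fork: float = 1.0) -> float:
    """Pick a per-token sampling temperature: near-greedy when the model
    is confident (low entropy), exploratory when it is uncertain."""
    return t_lock if entropy_nats(logits) < threshold else t_fork
```

A confident token distribution (one dominant logit) falls in the lock regime, while a near-uniform one falls in the fork regime; the ceiling result above says that no choice of these knobs helps hard problems without hurting easy ones.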
- Qwen2.5-Coder-7B-Instruct (primary)
- Qwen2.5-Coder-1.5B-Instruct (scaling experiments)
- Qwen3-4B-Instruct-2507 (initial calibration)
- NVIDIA RTX 4090 (24GB VRAM), WSL2
- LiveCodeBench v6 (175 problems, Jan–Mar 2025)
- pass@1 and pass@5 (unbiased estimator)
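The pass@k numbers use the standard unbiased estimator from Chen et al. (2021): given n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that pass all tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 1 passes, `pass_at_k(10, 1, 5)` gives 0.5: half of all 5-sample draws include the passing solution.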
conda create -n adaptive-decode python=3.11 -y
conda activate adaptive-decode
pip install torch transformers accelerate datasets
pip install vllm  # for fast batch inference

Or use the full setup script:

bash setup.sh

# 1. Calibrate entropy distributions
python calibrate.py
# 2. Evaluate
python evaluate.py --mode baseline --samples 10
python evaluate.py --mode adaptive --samples 10
python evaluate.py --mode sweep --samples 10
# 3. Analyse
python analyse.py

See run_experiment.sh for the full pipeline.