Empirically testing whether per-token entropy-adaptive decoding can match SSD (Simple Self-Distillation) gains on code generation — and investigating what happens when you replicate SSD with different training data quality.
- **Decode-time ceiling confirmed.** Entropy-adaptive decoding (temperature, min-p, prob-gap blending) hits a structural ceiling: gains on hard problems come at the cost of easy problems. Net effect: -6 correct samples vs. baseline, matching the SSD paper's theoretical prediction (Section B.5, Figure 4).
- **Public vs. private test gap.** 93% of solutions that pass public tests fail private tests. SSD trained on public-test-filtered data shows +4.6% on public tests but -1.2% on private tests: the model learns to satisfy weak test cases rather than genuinely improving.
- **Private-test SSD works.** Retraining with private-test-validated data (163 examples from hybrid filtering) recovers genuine gains: +2.98% pass@1 on the official evaluation.
- **Token-level calibration.** ~75% of tokens have entropy below 0.23 nats, and a well-calibrated code model shows a lock/fork ratio of at least 3:1.
See FINDINGS.md for the full write-up with tables and analysis.
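The entropy-adaptive schemes tested here share one moving part: a per-token entropy estimate that switches the sampler between a low-temperature "lock" regime and a higher-temperature "fork" regime. A minimal sketch of that mechanism is below; the 0.23-nat threshold echoes the calibration finding above, but the two temperature values are illustrative placeholders, not the settings used in these experiments:

```python
import numpy as np

def entropy_nats(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax of a logit vector."""
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def adaptive_temperature(logits: np.ndarray,
                         threshold: float = 0.23,   # nats; from calibration
                         t_lock: float = 0.2,       # illustrative value
                         t_fork: float = 1.0) -> float:
    """Pick a per-token sampling temperature: near-greedy when the model
    is confident (low entropy), exploratory when it is uncertain."""
    return t_lock if entropy_nats(logits) < threshold else t_fork
```

A confident token distribution (one dominant logit) falls in the lock regime, while a near-uniform one falls in the fork regime; the ceiling result above says that no choice of these knobs helps hard problems without hurting easy ones.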
- Qwen2.5-Coder-7B-Instruct (primary)
- Qwen2.5-Coder-1.5B-Instruct (scaling experiments)
- Qwen3-4B-Instruct-2507 (initial calibration)
- NVIDIA RTX 4090 (24GB VRAM), WSL2
- LiveCodeBench v6 (175 problems, Jan–Mar 2025)
- pass@1 and pass@5 (unbiased estimator)
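The pass@k numbers use the standard unbiased estimator from Chen et al. (2021): given n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that pass all tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 1 passes, `pass_at_k(10, 1, 5)` gives 0.5: half of all 5-sample draws include the passing solution.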
conda create -n adaptive-decode python=3.11 -y
conda activate adaptive-decode
pip install torch transformers accelerate datasets
pip install vllm  # for fast batch inference

Or use the full setup script:

bash setup.sh

# 1. Calibrate entropy distributions
python calibrate.py
# 2. Evaluate
python evaluate.py --mode baseline --samples 10
python evaluate.py --mode adaptive --samples 10
python evaluate.py --mode sweep --samples 10
# 3. Analyse
python analyse.py

See run_experiment.sh for the full pipeline.