Record: CROWN-Q + GPTQ + Legal TTT — val_bpb 1.1174 (3-seed mean)#1129
EthanYangTW wants to merge 2 commits into openai:main
Conversation
Pull request overview
Adds a new Track “10min/16MB” record bundle (V38) that captures a 3-seed run and the corresponding end-to-end training/quantization/eval script implementing sqrt warmdown, CROWN-Q, GPTQ, and post-quant score-first TTT.
Changes:
- Added V38 training script (`train_gpt.py`) implementing the sqrt warmdown schedule, CROWN-Q penalty, full Cholesky GPTQ quantization, and post-quant score-first TTT sliding-window eval.
- Added per-seed training logs and consolidated submission metadata for the 3-seed result.
- Added a README describing architecture, training, quantization/eval, and compliance claims for the record.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_gpt.py | New V38 end-to-end script (train → average → quantize → eval/TTT) used to generate the record. |
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed1337.log | Seed 1337 run log capturing training, quantization, and TTT eval outputs. |
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed42.log | Seed 42 run log capturing training, quantization, and TTT eval outputs. |
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed7.log | Seed 7 run log capturing training, quantization, and TTT eval outputs. |
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/submission.json | Submission metadata (aggregate + per-seed metrics and sizes). |
| records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/README.md | Human-readable explanation/results/compliance notes for the record. |
```python
for mod in base_model.modules():
    if isinstance(mod, CastedLinear) and mod.weight.ndim == 2:
        w = mod.weight.float()
        row_max = w.abs().amax(dim=1).detach()
```
The CROWN-Q penalty currently detaches `row_max` from `mod.weight`, which makes `delta` (and thus `crownq_penalty`) constant w.r.t. the model weights. As a result, adding `args.crownq_lambda * crownq_penalty` won't produce gradients and the penalty will have no training effect. Remove the `.detach()` (or use a differentiable surrogate for the quantization step estimate) so the penalty actually shapes the weights.
```diff
-        row_max = w.abs().amax(dim=1).detach()
+        row_max = w.abs().amax(dim=1)
```
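A minimal repro of the issue, independent of the PR's code: detaching the row max severs the autograd graph, so a penalty built from it contributes nothing to the weight gradients, while the undetached version does.

```python
import torch

w = torch.randn(4, 8, requires_grad=True)

# Detached version: the graph back to w is cut, so a penalty built on
# this tensor can never produce gradients for w.
row_max_detached = w.abs().amax(dim=1).detach()
assert not row_max_detached.requires_grad

# Differentiable version: keeps the graph, so backward() populates w.grad.
row_max = w.abs().amax(dim=1)
row_max.sum().backward()
assert w.grad is not None and w.grad.abs().sum() > 0
```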
```python
        result[name + ".scale"] = s
        meta[name] = {"type": "int6"}
    elif cat in int6_cats and t.ndim >= 1:
        q, s = quantize_int6_per_row(t)
```
`mixed_quantize_int6_gptq` accepts a `clip_range` argument, but the `elif cat in int6_cats and t.ndim >= 1:` branch calls `quantize_int6_per_row(t)` without passing `clip_range`. If `QUANT_CLIP_RANGE` is set to something other than 31 (e.g., int5), 1D tensors in int6 categories would be quantized with the wrong range.
```diff
-        q, s = quantize_int6_per_row(t)
+        q, s = quantize_int6_per_row(t, clip_range=clip_range)
```
```python
    if cat in int6_cats and t.ndim >= 1:
        q, s = quantize_int6_per_row(t)
        result[name + ".q"] = q
        result[name + ".scale"] = s
```
`mixed_quantize_int6` similarly calls `quantize_int6_per_row(t)` without threading through a configurable `clip_range` (unlike the GPTQ path). If this script is intended to support `QUANT_CLIP_RANGE != 31`, this path will silently quantize with the default range instead of the configured one.
```python
        result[name + ".q"] = q
        result[name + ".scale"] = s
        meta[name] = {"type": "int8"}
    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
The `print(f"gptq_quantize: ...")` is unconditional, so under `torchrun` it will be emitted by every rank (the provided logs show repeated lines). Consider guarding it behind a rank == 0 / `master_process` check (or routing it through `log0`) to avoid noisy logs and potential performance impact from multi-process stdout contention.
```diff
-    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
+    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
+    if rank == 0:
+        print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
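The suggested guard can also be factored into a small helper, a minimal sketch of the `log0`-style function the comment alludes to (the name and exact behavior of the PR's `log0` are assumptions): it prints only on rank 0 and degrades gracefully to a plain print when `torch.distributed` is not initialized.

```python
import torch.distributed as dist

def log0(msg: str) -> None:
    # Print only on rank 0 under torchrun; fall back to plain printing
    # when torch.distributed is unavailable or not initialized.
    if (not dist.is_available()) or (not dist.is_initialized()) or dist.get_rank() == 0:
        print(msg, flush=True)
```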
```python
    try:
        L = torch.linalg.cholesky(H)
        Hinv = torch.cholesky_inverse(L)
    except torch._C._LinAlgError:
```
Catching `torch._C._LinAlgError` relies on a private/internal exception type that can change across PyTorch versions. Prefer catching the public `torch.linalg.LinAlgError` (or a broader `RuntimeError` with a targeted message check) so the GPTQ fallback remains robust across environments.
```diff
-    except torch._C._LinAlgError:
+    except torch.linalg.LinAlgError:
```
```markdown
- Full Cholesky GPTQ with act-order (training data calibration only)
- int6 + zstd-22 compression
```
The README claims "training data calibration only" for GPTQ, but `train_gpt.py`'s `gptq_calibrate_selfgen()` explicitly calibrates on random/self-generated token sequences (not training shards). Please update the README wording to match the implementation (or change the code to actually calibrate on training data, if that's the intent).
```markdown
- Training: 600s wallclock (hard cap)
- GPTQ calibration: training data only
- TTT: legal score-first (each token scored before any gradient update using it)
```
The compliance section says "GPTQ calibration: training data only", but the code calibrates via `gptq_calibrate_selfgen()` using random tokens. This should be corrected so the compliance statement is accurate and auditable.
- SQRT_WARMDOWN=1: use sqrt(remaining/warmdown) instead of linear
- Holds LR higher for longer, decays faster at the end
- From PR openai#1129 optimization techniques

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
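The sqrt warmdown described in the commit message can be sketched as follows (function name and argument names are illustrative): a linear schedule would return `remaining / warmdown_steps` directly; taking the square root keeps the LR higher for most of the warmdown and drops it sharply only near the end.

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int) -> float:
    # Hypothetical sketch of the sqrt warmdown: full LR until the final
    # warmdown_steps, then sqrt(remaining / warmdown) instead of the
    # linear remaining / warmdown.
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return (remaining / warmdown_steps) ** 0.5
```

For example, halfway through the warmdown the linear schedule is at 0.5 while the sqrt schedule is still at about 0.707.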
Summary
11L GQA + XSA-all + full Cholesky GPTQ + score-first AdamW TTT. Sqrt cooldown schedule holds LR higher during warmdown, improving post-quantization TTT quality.
val_bpb: 1.1174 (3-seed mean, std 0.0004)
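The "score-first" TTT legality rule mentioned above (score each token before any gradient update that uses it) can be sketched like this; the function, loss, and optimizer choices here are illustrative assumptions, not the PR's actual eval loop.

```python
import torch

def score_first_ttt(model, windows, lr=1e-3):
    # Hypothetical sketch of score-first test-time training: each window
    # is scored with the CURRENT weights before the gradient step that
    # uses it, so the reported loss never benefits from having already
    # trained on the tokens being scored.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    for x, y in windows:
        loss = torch.nn.functional.mse_loss(model(x), y)
        losses.append(loss.item())  # record BEFORE the update
        opt.zero_grad()
        loss.backward()
        opt.step()                  # adaptation helps later windows only
    return sum(losses) / len(losses)
```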
Results
Key Techniques
- `sqrt(x)` schedule during warmdown instead of linear. Holds the LR higher for longer.

Architecture
Timing
Compliance