
Record: CROWN-Q + GPTQ + Legal TTT — val_bpb 1.1174 (3-seed mean)#1129

Open
EthanYangTW wants to merge 2 commits into openai:main from EthanYangTW:submission/v38-sqrt-cooldown-3seed

Conversation


@EthanYangTW EthanYangTW commented Mar 30, 2026

Summary

11L GQA + XSA-all + full Cholesky GPTQ + score-first AdamW TTT. Sqrt cooldown schedule holds LR higher during warmdown, improving post-quantization TTT quality.

val_bpb: 1.1174 (3-seed mean, std 0.0004)

Results

| Seed | TTT BPB | Artifact |
|------|---------|----------|
| 1337 | 1.1170 | 15,961,751 bytes |
| 42 | 1.1176 | 15,850,151 bytes |
| 7 | 1.1176 | 15,844,080 bytes |
| **Mean** | **1.1174** | |
| **Std** | **0.0004** | |

Key Techniques

  • Full Cholesky GPTQ: Hessian-aware quantization with act-order, self-generated calibration (no val data).
  • Score-first AdamW TTT: Each token scored before any gradient update using it. Last 2 blocks unfrozen (4.7M/27M params), 3 epochs.
  • Sqrt cooldown: sqrt(x) schedule during warmdown instead of linear. Holds LR higher longer.
  • CROWN-Q: Curvature-weighted quantization variance penalty during warmdown.
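
The sqrt-cooldown idea can be sketched in a few lines (a minimal illustration; the `lr_scale` name and the 0.4 warmdown fraction are assumptions, not the PR's actual code):

```python
def lr_scale(step, total_steps, warmdown_frac=0.4, sqrt_warmdown=True):
    """LR multiplier: flat, then a warmdown over the final warmdown_frac of
    training. warmdown_frac=0.4 is an illustrative assumption."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    # sqrt stays above the linear schedule throughout the warmdown
    # (sqrt(x) >= x for 0 <= x <= 1), then drops off steeply at the end
    return remaining ** 0.5 if sqrt_warmdown else remaining
```

Halfway through the warmdown, linear gives a multiplier of 0.5 while sqrt gives about 0.71, which is the "holds LR higher longer" effect claimed above.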

Architecture

  • 11L, 512d, GQA 8/4, MLP 3x relu²
  • XSA on all 11 layers, BigramHash 2048
  • Partial RoPE 16/64, SmearGate + OrthoInit
  • EMA 0.997, SWA, Late QAT at 50% warmdown
  • 26.99M params, int6 + zstd-22

Timing

  • Training: 600s wallclock (8xH100 SXM), 89ms/step (FA3 Hopper), ~6650 steps
  • Eval: sliding window stride=32 (~150s) + TTT 3 epochs (~460s) ≈ 610s
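
The stride-32 sliding-window evaluation amounts to scoring each token exactly once, as one of the last 32 targets of a long-context window. A minimal sketch (the `score_fn` interface and per-token normalization are assumptions; the real bpb divides by byte count rather than token count):

```python
import math

def sliding_window_nll(score_fn, tokens, window=1024, stride=32):
    """Score every token exactly once, re-running the model every `stride`
    tokens with up to `window` tokens of left context.
    `score_fn(ctx)` is a stand-in for the model: it returns per-token NLLs
    in nats for ctx[1:]."""
    nlls = []
    i = 0
    while i + 1 < len(tokens):
        end = min(i + stride, len(tokens) - 1)  # new targets: tokens[i+1..end]
        start = max(0, end + 1 - window)        # clip the left context window
        per_tok = score_fn(tokens[start:end + 1])
        nlls.extend(per_tok[-(end - i):])       # keep only the unscored targets
        i = end
    return sum(nlls) / (math.log(2) * len(nlls))  # nats -> bits per token
```

A smaller stride gives each target more context at the cost of more forward passes, which is why eval time scales with 1/stride.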

Compliance

  • Training ≤ 600s wallclock
  • GPTQ calibration: self-generated tokens (no validation data used)
  • TTT: legal score-first (every token scored before any gradient update)
  • All artifacts < 16,000,000 bytes
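
The score-first invariant reduces to a simple loop ordering (an illustrative sketch; `model.score` and `model.update` are hypothetical stand-ins, not the PR's API):

```python
def score_first_ttt(model, chunks):
    """Legal test-time training: chunk i is scored by a model that has only
    been updated on chunks 0..i-1; only afterwards may the model train on it."""
    scores = []
    for chunk in chunks:
        scores.append(model.score(chunk))  # score first: no gradient from this chunk yet
        model.update(chunk)                # then take test-time gradient step(s) on it
    return scores
```

The key point is that swapping the two lines inside the loop would let each chunk's score benefit from a gradient step on itself, which is what the compliance claim rules out.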

@EthanYangTW EthanYangTW marked this pull request as ready for review March 30, 2026 09:49
Copilot AI review requested due to automatic review settings March 30, 2026 09:49
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Track “10min/16MB” record bundle (V38) that captures a 3-seed run and the corresponding end-to-end training/quantization/eval script implementing sqrt warmdown, CROWN-Q, GPTQ, and post-quant score-first TTT.

Changes:

  • Added V38 training script (train_gpt.py) implementing sqrt cooldown warmdown, CROWN-Q penalty, full Cholesky GPTQ quantization, and post-quant score-first TTT sliding-window eval.
  • Added per-seed training logs and consolidated submission metadata for the 3-seed result.
  • Added a README describing architecture, training, quantization/eval, and compliance claims for the record.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_gpt.py` | New V38 end-to-end script (train → average → quantize → eval/TTT) used to generate the record. |
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed1337.log` | Seed 1337 run log capturing training, quantization, and TTT eval outputs. |
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed42.log` | Seed 42 run log capturing training, quantization, and TTT eval outputs. |
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/train_seed7.log` | Seed 7 run log capturing training, quantization, and TTT eval outputs. |
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/submission.json` | Submission metadata (aggregate + per-seed metrics and sizes). |
| `records/track_10min_16mb/2026-03-30_V38_SqrtCooldown_3seed/README.md` | Human-readable explanation/results/compliance notes for the record. |


```python
for mod in base_model.modules():
    if isinstance(mod, CastedLinear) and mod.weight.ndim == 2:
        w = mod.weight.float()
        row_max = w.abs().amax(dim=1).detach()
```

Copilot AI Mar 30, 2026


CROWN-Q penalty currently detaches row_max from mod.weight, which makes delta (and thus crownq_penalty) constant w.r.t. the model weights. As a result, adding args.crownq_lambda * crownq_penalty won’t produce gradients and the penalty will have no training effect. Remove the .detach() (or use a differentiable surrogate for the quantization step estimate) so the penalty actually shapes the weights.

Suggested change

```diff
-        row_max = w.abs().amax(dim=1).detach()
+        row_max = w.abs().amax(dim=1)
```
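
The effect is easy to verify in isolation (a toy repro, not the PR's code):

```python
import torch

w = torch.randn(4, 8, requires_grad=True)

# With .detach(), the penalty is a constant w.r.t. w: no autograd path exists,
# so adding it to the loss cannot influence the weights.
detached_penalty = (w.abs().amax(dim=1).detach() ** 2).sum()
print(detached_penalty.requires_grad)  # False

# Without .detach(), the penalty participates in autograd and shapes w.
attached_penalty = (w.abs().amax(dim=1) ** 2).sum()
attached_penalty.backward()
print(w.grad is not None)  # True
```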

```python
        result[name + ".scale"] = s
        meta[name] = {"type": "int6"}
    elif cat in int6_cats and t.ndim >= 1:
        q, s = quantize_int6_per_row(t)
```

Copilot AI Mar 30, 2026


mixed_quantize_int6_gptq accepts a clip_range argument, but the elif cat in int6_cats and t.ndim >= 1: branch calls quantize_int6_per_row(t) without passing clip_range. If QUANT_CLIP_RANGE is set to something other than 31 (e.g., int5), 1D tensors in int6 categories would be quantized with the wrong range.

Suggested change

```diff
-        q, s = quantize_int6_per_row(t)
+        q, s = quantize_int6_per_row(t, clip_range=clip_range)
```
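
`quantize_int6_per_row` itself is not shown in the diff; a plausible shape for such a helper with `clip_range` threaded through (a hypothetical reconstruction, not the PR's implementation):

```python
import torch

def quantize_int6_per_row(t, clip_range=31):
    # Symmetric per-row quantization: each row's absolute max maps to the
    # largest positive code (31 for int6). Passing clip_range=15 would
    # emulate int5 behaviour, which is why the argument must be threaded.
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / clip_range
    q = (t / scale).round().clamp(-clip_range - 1, clip_range).to(torch.int8)
    return q, scale  # dequantize via q.float() * scale
```

Under this shape, the per-element reconstruction error is bounded by half the row scale, which is what dropping `clip_range` would silently change.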

Comment on lines +1256 to +1259
```python
    if cat in int6_cats and t.ndim >= 1:
        q, s = quantize_int6_per_row(t)
        result[name + ".q"] = q
        result[name + ".scale"] = s
```

Copilot AI Mar 30, 2026


mixed_quantize_int6 similarly calls quantize_int6_per_row(t) without threading through a configurable clip_range (unlike the GPTQ path). If this script is intended to support QUANT_CLIP_RANGE!=31, this path will silently quantize with the default range instead of the configured one.

```python
        result[name + ".q"] = q
        result[name + ".scale"] = s
        meta[name] = {"type": "int8"}
    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```

Copilot AI Mar 30, 2026


print(f"gptq_quantize: ...") is unconditional, so under torchrun it will be emitted by every rank (the provided logs show repeated lines). Consider guarding this behind rank==0 / master_process (or routing through log0) to avoid noisy logs and potential performance impact from multi-process stdout contention.

Suggested change

```diff
-    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
+    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
+    if rank == 0:
+        print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```

```python
    try:
        L = torch.linalg.cholesky(H)
        Hinv = torch.cholesky_inverse(L)
    except torch._C._LinAlgError:
```

Copilot AI Mar 30, 2026


Catching torch._C._LinAlgError relies on a private/internal exception type that can change across PyTorch versions. Prefer catching the public torch.linalg.LinAlgError (or a broader RuntimeError with a targeted message check) so GPTQ fallback remains robust across environments.

Suggested change

```diff
-    except torch._C._LinAlgError:
+    except torch.linalg.LinAlgError:
```
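
One way to combine the public exception type with a robust fallback (a sketch; the damping constants, retry count, and `pinv` last resort are assumptions, not the PR's code):

```python
import torch

def robust_cholesky_inverse(H, damp=1e-4, tries=5):
    # Retry Cholesky with progressively stronger diagonal damping, falling
    # back to a pseudo-inverse if H never becomes positive definite.
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    mean_diag = H.diagonal().mean().clamp(min=1e-12)
    for _ in range(tries):
        try:
            L = torch.linalg.cholesky(H)
            return torch.cholesky_inverse(L)
        except torch.linalg.LinAlgError:  # public exception type
            H = H + damp * mean_diag * eye
            damp *= 10.0
    return torch.linalg.pinv(H)
```

Scaling the damping by the mean diagonal keeps the perturbation proportional to the Hessian's magnitude rather than an absolute constant.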

Comment on lines +35 to +36
```
- Full Cholesky GPTQ with act-order (training data calibration only)
- int6 + zstd-22 compression
```

Copilot AI Mar 30, 2026


README claims “training data calibration only” for GPTQ, but train_gpt.py’s gptq_calibrate_selfgen() explicitly calibrates on random/self-generated token sequences (not training shards). Please update the README wording to match the implementation (or change the code to actually calibrate on training data if that’s the intent).

Comment on lines +42 to +44
```
- Training: 600s wallclock (hard cap)
- GPTQ calibration: training data only
- TTT: legal score-first (each token scored before any gradient update using it)
```

Copilot AI Mar 30, 2026


Compliance section says “GPTQ calibration: training data only”, but the code calibrates via gptq_calibrate_selfgen() using random tokens. This should be corrected so the compliance statement is accurate and auditable.

demouo added a commit to demouo/parameter-golf that referenced this pull request Apr 1, 2026
```
- SQRT_WARMDOWN=1: use sqrt(remaining/warmdown) instead of linear
- Holds LR higher for longer, decays faster at the end
- From PR openai#1129 optimization techniques

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```
