Please delete this PR by JF10R · Pull Request #6 · openai/parameter-golf

JF10R · 2026-03-18T18:26:40Z

Opened by mistake by an automated agent. Please delete this PR and its branch. Apologies for the noise.

Training-only improvements on the 9x512 baseline: - Quantization-aware training (STE) during last 30% of wallclock recovers post-quant BPB degradation - Gradient clipping at 1.0 stabilizes training - No architectural changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new 10-minute/16MB record submission that aims to improve post-quantization BPB by enabling quantization-aware training (STE) late in training, and stabilizes training by enabling gradient clipping by default.

Changes:

Introduces QAT (STE-based) for CastedLinear weights and tied embedding projection, activated after a configurable fraction of wallclock time.
Changes default GRAD_CLIP_NORM from disabled (0.0) to 1.0.
Adds record metadata (submission.json) and an explanatory README for the submission.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/train_gpt.py	Adds QAT activation logic + STE quantization in forward pass; changes grad clipping default.
records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/submission.json	Adds submission metadata entry for the new record.
records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/README.md	Documents the training-only changes and new hyperparameters.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/README.md

+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `GRAD_CLIP_NORM` | 1.0 | Max gradient norm (0 = disabled) |
+| `QAT_START_FRAC` | 0.70 | Fraction of wallclock after which QAT activates |


records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/train_gpt.py

+    with torch.no_grad():
+        row_max = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
+        scale = row_max / 127.0
+        w_q = (w / scale).round().clamp(-128, 127) * scale


records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/train_gpt.py

+    with torch.no_grad():
+        row_max = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
+        scale = row_max / 127.0
+        w_q = (w / scale).round().clamp(-128, 127) * scale
+    return w + (w_q - w).detach()


records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/submission.json

+  "val_loss": 0.0,
+  "val_bpb": 0.0,
+  "bytes_total": 0,
+  "bytes_code": 0


records/track_10min_16mb/2026-03-18_QAT_GradClip_Conservative/README.md

+
+The baseline loses ~0.0072 BPB when post-training quantization converts fp32 weights to int8. QAT uses a straight-through estimator to simulate int8 per-row quantization during training: the forward pass sees quantized weights, but gradients flow through as if quantization didn't happen. This teaches the model to place its weights in regions that survive int8 rounding, recovering most of the quantization gap.
+
+QAT is applied to:


PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update: - Sets 1.0781 BPB (PR openai#672) as the new target to beat - Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant openai#5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8 - Deprioritizes TTT-related directions already exploited by PR openai#672 - Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary - Removes resolved blockers (flash_attn, SSH hangs, local runtime) - Adds fresh Round 1 section with 5 submitted experiments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update: - Sets 1.0781 BPB (PR openai#672) as the new target to beat - Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA openai#6, depth recurrence openai#7 with int6 risk warning, AdEMAMix openai#8 - Deprioritizes TTT-related directions already exploited by PR openai#672 - Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary - Removes resolved blockers (flash_attn, SSH hangs, local runtime) - Adds fresh Round 1 section with 5 submitted experiments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix openai#1: ternary roundtrip eval on ALL ranks with dist.broadcast (was: only rank 0 loaded weights → invalid eval results) - Fix openai#2: pass pre-computed scales to export (avoids double-quantization) - Fix openai#3: keep scales as float32 (was: lossy float16 cast) - Fix openai#4: import returns float32 (was: lossy bfloat16 cast) - Fix openai#5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion) - Fix openai#6: add dist.broadcast after int8 roundtrip load too - Fix openai#7: add weights_only=False to suppress FutureWarning Ternary roundtrip is now LOSSLESS (max error = 0.0). The previous val_bpb=0.9650 was an artifact of bug openai#1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 18, 2026 18:26

Copilot started reviewing on behalf of JF10R March 18, 2026 18:27 View session

Copilot AI reviewed Mar 18, 2026

View reviewed changes

JF10R closed this Mar 18, 2026

JF10R changed the title ~~feat: QAT + GradClip conservative submission~~ Please delete this PR Mar 18, 2026

JF10R deleted the worktree-agent-a5760f54 branch March 18, 2026 18:34

gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026

docs: add PR #6 summary placeholder

fd91b70

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Please delete this PR#6

Please delete this PR#6
JF10R wants to merge 1 commit intoopenai:mainfrom
JF10R:worktree-agent-a5760f54

JF10R commented Mar 18, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		The baseline loses ~0.0072 BPB when post-training quantization converts fp32 weights to int8. QAT uses a straight-through estimator to simulate int8 per-row quantization during training: the forward pass sees quantized weights, but gradients flow through as if quantization didn't happen. This teaches the model to place its weights in regions that survive int8 rounding, recovering most of the quantization gap.

		QAT is applied to:

Conversation

JF10R commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JF10R commented Mar 18, 2026 •

edited

Loading