int5 GPTQ + 33.6M model: 1.1179 BPB (3-seed mean) #544
EthanYangTW wants to merge 2 commits into openai:main
Conversation
33.6M params (MHA 8/8, BigramHash 8192, MLP 3.5x), quantized to int5 with GPTQ error compensation. The artifact fits under 16MB (15.3–15.5MB).

- Seed 1337: de843ef6 (TTT 1.1170)
- Seed 42: b6560b60 (TTT 1.1182)
- Seed 7: c1c18644 (TTT 1.1184)
Pull request overview
Updates train_gpt.py to implement and export a new 33.6M-parameter GPT variant targeting int5-style (clip_range=15) GPTQ-assisted quantization under the 16MB artifact limit, with added sliding-window evaluation and score-first TTT evaluation.
Changes:
- Expands the model architecture (11 layers, BigramHash embedding, optional XSA, optional value embeddings, RoPE tweaks, smear gating, optional DTG/ln scaling).
- Adds new evaluation modes (separate eval sequence length, sliding-window BPB eval, and “legal score-first” TTT eval).
- Replaces the prior int8 export path with a mixed int8/int5-like (named “int6” in code) quantization pipeline including GPTQ calibration, optional pruning, and zstd/zlib compression.
```python
Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
"""

"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
```
train_gpt_mlx.py’s module docstring states a hard stop that both train_gpt.py and train_gpt_mlx.py should not exceed 1500 lines for newcomer readability, but train_gpt.py is now 1580 lines. Please move record-specific / experimental code (e.g., GPTQ/TTT helpers) into /records or otherwise reduce the core script length to stay within that stated limit.
```python
rd = self.rope_dims
if seq_len > self.train_seq_len:
    scale = seq_len / self.train_seq_len
    new_base = self.base * (scale ** (rd / (rd - 2)))
    inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd))
else:
    inv_freq = self.inv_freq.to(device)
t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
```
Rotary.forward() computes new_base = self.base * (scale ** (rd / (rd - 2))), which will divide by zero when rope_dims is 2 (and behaves poorly when rope_dims <= 2). Add validation that rope_dims is either 0 (meaning full head_dim) or an even value >= 4 to avoid runtime errors.
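A minimal sketch of the validation this asks for, assuming a standalone helper (the name `validate_rope_dims` is hypothetical) called from `Rotary.__init__`:

```python
def validate_rope_dims(rope_dims: int) -> int:
    """Allow 0 (meaning full head_dim) or an even value >= 4.

    rope_dims == 2 makes the exponent rd / (rd - 2) divide by zero in
    Rotary.forward(); odd values break the arange(0, rd, 2) frequency table.
    """
    if rope_dims != 0 and (rope_dims < 4 or rope_dims % 2 != 0):
        raise ValueError(f"rope_dims must be 0 or an even value >= 4, got {rope_dims}")
    return rope_dims
```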
```python
if rope_dims > 0:
    head_dim = model_dim // num_heads
    for block in self.blocks:
        block.attn.rope_dims = rope_dims
        block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims)
self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else []
```
When rope_dims > 0, it’s applied without checking it’s even and <= head_dim. If rope_dims is odd or larger than head_dim, Rotary will build cos/sin tables with incompatible last-dimension sizes and apply_rotary_emb() can error at runtime. Consider validating 0 < rope_dims <= head_dim and rope_dims % 2 == 0 up front (e.g., in GPT.__init__).
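A sketch of the up-front check suggested here (the helper name `check_rope_dims` is hypothetical; in practice this would run in `GPT.__init__` before any `Rotary` is built):

```python
def check_rope_dims(rope_dims: int, model_dim: int, num_heads: int) -> None:
    """Reject rope_dims values that would produce incompatible cos/sin tables."""
    head_dim = model_dim // num_heads
    if not (0 < rope_dims <= head_dim) or rope_dims % 2 != 0:
        raise ValueError(f"rope_dims={rope_dims} must be even and in (0, {head_dim}]")
```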
```python
self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
self.rotary = Rotary(self.head_dim, base=rope_base)

def forward(self, x: Tensor) -> Tensor:

self.rope_dims = 0
self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024)
self.use_xsa = False
```
CausalSelfAttention hard-codes train_seq_len=1024 when constructing Rotary, but the default Hyperparameters.train_seq_len is now 2048. This means RoPE will always take the “seq_len > train_seq_len” scaling branch during training/eval at 2048, which is easy to do unintentionally. If the scaling should be tied to the actual training context length, thread args.train_seq_len (or a dedicated hyperparameter) into Rotary construction.
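A dependency-free sketch of the threading this suggests. The `Rotary` stub and the `build_attention_rotary` helper are hypothetical stand-ins; the point is only that the construction site receives the real training context length instead of a hard-coded 1024:

```python
class Rotary:
    """Stub that only records the extrapolation threshold (the real module also builds cos/sin tables)."""
    def __init__(self, head_dim: int, base: float = 10000.0, train_seq_len: int = 1024):
        self.head_dim = head_dim
        self.base = base
        self.train_seq_len = train_seq_len

def build_attention_rotary(head_dim: int, rope_base: float, train_seq_len: int) -> Rotary:
    # Pass args.train_seq_len through instead of hard-coding 1024, so that
    # training/eval at seq_len == train_seq_len never enters the
    # "seq_len > train_seq_len" NTK scaling branch by accident.
    return Rotary(head_dim, base=rope_base, train_seq_len=train_seq_len)
```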
```python
def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```
BigramHashEmbedding.bigram_hash() sets mod = bigram_vocab_size - 1 and then does % mod, which will raise a division-by-zero error if BIGRAM_VOCAB_SIZE is 1 (and produces negative indices if it’s 0 but still instantiated). Add input validation in BigramHashEmbedding.__init__ (or where it’s constructed) to require bigram_vocab_size >= 2 when enabled.
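A minimal sketch of the requested constructor check (the helper name `validate_bigram_vocab_size` is hypothetical):

```python
def validate_bigram_vocab_size(bigram_vocab_size: int) -> int:
    """Require >= 2 so mod = bigram_vocab_size - 1 is a valid modulus.

    bigram_hash() computes `% (bigram_vocab_size - 1)`, which raises a
    division-by-zero error for size 1 and misbehaves for size 0.
    """
    if bigram_vocab_size < 2:
        raise ValueError(f"bigram_vocab_size must be >= 2 when enabled, got {bigram_vocab_size}")
    return bigram_vocab_size
```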
```python
if CastedLinear._qat_enabled and self.training and w.ndim == 2:
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
```
The QAT path in CastedLinear.forward() computes torch.quantile(..., dim=1) for every 2D weight on every forward pass when enabled. This is extremely expensive (quantile involves sorting / heavy reductions) and is likely to blow the 10-minute training budget once late-QAT turns on. Consider precomputing per-row clip values periodically, using a cheaper statistic (e.g., amax / EMA), or restricting QAT to a small subset of layers.
Suggested change:

```diff
-row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
+# Use a cheaper per-row amax-based EMA for clipping instead of per-forward quantile.
+if not hasattr(self, "_qat_row_clip_ema"):
+    # Initialize EMA with current per-row maximum absolute values.
+    self._qat_row_clip_ema = w32.abs().amax(dim=1)
+else:
+    current = w32.abs().amax(dim=1)
+    ema_decay = 0.9
+    self._qat_row_clip_ema.mul_(ema_decay).add_(current, alpha=1.0 - ema_decay)
+row_clip = self._qat_row_clip_ema
```
```python
with torch.no_grad():
    for name, t in base_model.state_dict().items():
        ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
```
ema_state is initialized and updated as GPU float32 copies of the full state_dict() each step. This can significantly increase GPU memory usage (potentially causing OOM) and adds per-step overhead. Consider keeping EMA on CPU (and updating from .detach().cpu()), or limiting EMA to a subset of parameters.
Suggested change:

```diff
-ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+# Keep EMA state on CPU to avoid maintaining a full float32 copy on GPU.
+ema_t = ema_state[name]
+if ema_t.device != torch.device("cpu"):
+    ema_t = ema_t.detach().cpu()
+    ema_state[name] = ema_t
+t_cpu = t.detach().float().cpu()
+ema_t.mul_(ema_decay).add_(t_cpu, alpha=1.0 - ema_decay)
```
```python
if not t.is_floating_point() or t.numel() <= 65536:
    result[name] = t.to(torch.float16) if t.is_floating_point() else t
    meta[name] = "passthrough"
    continue
if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
    result[name] = t.float()
    meta[name] = "passthrough_ctrl"
    continue
```
In mixed_quantize_int6_gptq, the early branch if not t.is_floating_point() or t.numel() <= 65536 runs before the control-tensor check, so small control tensors (matching CONTROL_TENSOR_NAME_PATTERNS) will be downcast to fp16 and never get the intended passthrough_ctrl handling. If control tensors are meant to stay fp32 for stability/accuracy, check CONTROL_TENSOR_NAME_PATTERNS before the numel() <= 65536 passthrough/downcast logic.
Suggested change:

```diff
-if not t.is_floating_point() or t.numel() <= 65536:
-    result[name] = t.to(torch.float16) if t.is_floating_point() else t
-    meta[name] = "passthrough"
-    continue
-if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
-    result[name] = t.float()
-    meta[name] = "passthrough_ctrl"
-    continue
+# Non-floating tensors are always passed through as-is.
+if not t.is_floating_point():
+    result[name] = t
+    meta[name] = "passthrough"
+    continue
+# Control tensors are kept in float32 for stability/accuracy, regardless of size.
+if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
+    result[name] = t.float()
+    meta[name] = "passthrough_ctrl"
+    continue
+# Small floating-point tensors are passed through in float16.
+if t.numel() <= 65536:
+    result[name] = t.to(torch.float16)
+    meta[name] = "passthrough"
+    continue
```
```python
def quantize_int6_per_row(t: Tensor, clip_range: int = 15) -> tuple[Tensor, Tensor]:
    t32 = t.float()
    if t32.ndim == 2:
        best_q, best_s, best_err = None, None, float('inf')
        for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]:
            if pct < 1.0:
                row_clip = torch.quantile(t32.abs(), pct, dim=1)
            else:
                row_clip = t32.abs().amax(dim=1)
            s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16)
            q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8)
            recon = q.float() * s.float()[:, None]
            err = (t32 - recon).pow(2).mean().item()
            if err < best_err:
                best_q, best_s, best_err = q, s, err
        return best_q, best_s
    amax = t32.abs().max().item()
    scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16)
    q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8)
    return q, scale
```
Several helpers and artifacts are named int6 (e.g., quantize_int6_per_row, final_model.int6.ptz), but the PR description/docstring calls this “int5 GPTQ”. Since clip_range=15 yields 31 signed levels (an int5-like scheme), please rename the functions/files/metadata to match the actual quantization format to avoid confusion for readers and future tooling.
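The mismatch is easy to verify: a symmetric range [-clip_range, clip_range] has 2*clip_range + 1 levels, so clip_range=15 gives 31 levels and needs 5 bits, not 6. A small helper (hypothetical name, not part of the PR) makes this concrete:

```python
import math

def signed_bits_needed(clip_range: int) -> int:
    """Bits required for the symmetric integer range [-clip_range, clip_range]."""
    levels = 2 * clip_range + 1  # 31 levels for clip_range=15
    return math.ceil(math.log2(levels))
```

By this count, clip_range=15 is a 5-bit scheme, so the `int6` names in the code understate the compression and mislabel the format.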
Summary
33.6M parameter model quantized to int5 with GPTQ error compensation, fitting under 16MB. First submission to achieve int5 quantization on a 33.6M model within the artifact size limit.
Architecture: 11L, 512d, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA all layers
Quantization: int5 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997
TTT: Legal score-first AdamW, chunk=131072, last 2 blocks unfrozen
Results
Logs
Seed 1337
Seed 42
Seed 7
Reproduction