Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179) #585
EthanYangTW wants to merge 4 commits into openai:main from
Conversation
33.6M params (MHA 8/8, BigramHash 8192, MLP 3.5x) quantized to int5 with GPTQ error compensation. Artifact fits under 16MB (15.3-15.5MB). Seeds: 1337 (1.1170), 42 (1.1182), 7 (1.1184)
- Seed 1337: de843ef6 (TTT 1.1170)
- Seed 42: b6560b60 (TTT 1.1182)
- Seed 7: c1c18644 (TTT 1.1184)

Logs:
- seed1337.log - TTT 1.1170
- seed42.log - TTT 1.1182
- seed7.log - TTT 1.1184
Add proper /records submission with submission.json, README, train_gpt.py, and 3-seed logs.
Pull request overview
This PR introduces a new record submission (“int5 GPTQ + 33.6M + legal score-first TTT”) and updates the root training script to match that record configuration.
Changes:
- Replace/supersede the root `train_gpt.py` with a much more complex record-style script (FlashAttention v3 integration, sliding eval + legal TTT, EMA/SWA, GPTQ calibration + mixed quant export, etc.).
- Add a new record folder under `records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/` with the exact script snapshot, README, submission metadata, and 3-seed logs.
- Add additional seed logs under the repository-level `logs/` directory.
Reviewed changes
Copilot reviewed 4 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| train_gpt.py | Root script rewritten to the record configuration, adding GPTQ + TTT + sliding eval + many new model/optimizer features. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | Record snapshot of the training script used to generate the submission artifacts. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Record writeup describing the innovations, architecture, quantization pipeline, and reproduction steps. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata for the leaderboard entry (scores, sizes, method summary). |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log. |
| logs/seed1337.log | Extra copy of the seed 1337 log at repo root. |
| logs/seed42.log | Extra copy of the seed 42 log at repo root. |
| logs/seed7.log | Extra copy of the seed 7 log at repo root. |
```python
result[name + ".q"] = q
result[name + ".scale"] = s
meta[name] = {"type": "int8"}
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
mixed_quantize_int6_gptq() prints its summary unconditionally. In distributed runs this produces duplicated lines (once per rank), as visible in the provided logs. Please guard this print behind rank == 0 / master_process (pass rank into the function or log at the call site) to keep logs deterministic and easier to parse.
Suggested change:

```diff
-print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
+distributed = dist.is_available() and dist.is_initialized()
+rank = dist.get_rank() if distributed else 0
+if rank == 0:
+    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
```diff
 if master_process:
-    with open("final_model.int8.ptz", "wb") as f:
+    with open("final_model.int6.ptz", "wb") as f:
         f.write(quant_blob)
-    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+    quant_file_bytes = len(quant_blob)
     code_bytes = len(code.encode("utf-8"))
-    ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
-    log0(
-        f"Serialized model int8+zlib: {quant_file_bytes} bytes "
-        f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
-    )
-    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
+    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
```
The artifact filename/logging uses final_model.int6.ptz and messages like int6+{_COMPRESSOR}, but the PR description/record metadata claim an int5 quantization scheme ([-15,15], 31 levels). This mismatch is confusing for reviewers and for any downstream tooling that expects naming to reflect the actual quantization format. Please align naming (filenames, log strings, helper function names) with the true quantization level being produced.
```python
quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw) if _COMPRESSOR == "zstd" else zlib.compress(quant_raw, 9)
if master_process:
    with open("final_model.int6.ptz", "wb") as f:
        f.write(quant_blob)
    quant_file_bytes = len(quant_blob)
    code_bytes = len(code.encode("utf-8"))
    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
```
This record script writes final_model.int6.ptz and logs int6+{_COMPRESSOR}, but the submission/README claim an int5 scheme ([-15,15], 31 levels). Please align filenames/log strings and helper names with the actual quantization format to avoid confusion during verification.
```python
q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
if _HAS_FA3:
    y = flash_attn_3_func(q, k, v, causal=True).contiguous()
else:
    y = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        attn_mask=None, is_causal=True,
        enable_gqa=(self.num_kv_heads != self.num_heads),
    ).transpose(1, 2)
```
When FlashAttention is available, the code calls flash_attn_3_func(q, k, v, ...) even if num_kv_heads != num_heads. Unlike the SDPA fallback (which enables GQA), FlashAttention may not accept mismatched head counts and can error if someone overrides NUM_KV_HEADS. Consider gating the FlashAttention path to num_kv_heads == num_heads (or using the FlashAttention API that explicitly supports GQA/MQA) and otherwise fall back to scaled_dot_product_attention(..., enable_gqa=True).
```python
if CastedLinear._qat_enabled and self.training and w.ndim == 2:
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / 15.0).clamp_min(1.0 / 15.0)
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -16, 15) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
```
The QAT “int5” STE path uses scale = row_clip / 15.0 but clamps to [-16, 15], which is 32 levels and doesn’t match the stated [-15, 15] (31 levels) quantization used elsewhere (clip_range=15). This inconsistency makes it hard to reason about what quantization the model is being trained for. Please make the clamp range consistent with the intended scheme (and update the README/submission metadata accordingly).
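The mismatch can be seen with a tiny pure-Python sketch (illustrative only; `quantize_levels` and `fake_quant` are hypothetical helpers, not the PR's code):

```python
# Illustrative sketch of the reviewer's point: with scale = row_clip / 15.0,
# a symmetric clamp to [-15, 15] gives 31 levels, while [-16, 15] gives 32
# and can emit values below -row_clip, which the stated int5 scheme never
# produces.
def quantize_levels(lo, hi):
    """Number of distinct integer codes in an inclusive clamp range."""
    return hi - lo + 1

def fake_quant(w, clip, lo=-15, hi=15):
    """Round-to-nearest symmetric fake-quantization of a single weight."""
    q = max(lo, min(hi, round(w * 15.0 / clip)))
    return q * clip / 15.0

assert quantize_levels(-15, 15) == 31   # claimed int5 scheme
assert quantize_levels(-16, 15) == 32   # what the [-16, 15] clamp trains for

# A large negative weight quantized with the [-16, 15] clamp lands outside
# the [-clip, clip] range the int5 export assumes:
print(fake_quant(-2.0, clip=1.0, lo=-16))  # -16/15, i.e. below -clip
print(fake_quant(-2.0, clip=1.0, lo=-15))  # -1.0, i.e. exactly -clip
```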
```python
for name, t in base_model.state_dict().items():
    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
```
The per-step EMA update iterates for name, t in base_model.state_dict().items() on every training step. Building a full state_dict each step can be a significant CPU overhead and allocates many temporary objects. Consider tracking EMA over named_parameters()/buffers directly (or caching the parameter list once) to avoid repeated state_dict() construction in the hot path.
Suggested change:

```diff
-for name, t in base_model.state_dict().items():
-    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+# Lazily cache parameter/buffer references to avoid rebuilding state_dict() every step
+if not hasattr(base_model, "_ema_source_tensors"):
+    src_tensors: dict[str, Tensor] = {}
+    for n, p in base_model.named_parameters():
+        src_tensors[n] = p
+    for n, b in base_model.named_buffers():
+        src_tensors[n] = b
+    base_model._ema_source_tensors = src_tensors
+for name, ema_t in ema_state.items():
+    src_t = base_model._ema_source_tensors.get(name)
+    if src_t is None:
+        continue
+    ema_t.mul_(ema_decay).add_(src_t.detach().float(), alpha=1.0 - ema_decay)
```
```python
seq_len = eval_seq_len or args.train_seq_len
total_tokens = val_tokens.numel() - 1
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]
```
eval_val_sliding() includes a final partial window whenever it has at least 1 token. When total_tokens is not an exact multiple of stride, that last window will typically re-score tokens that were already scored as the “last stride” of the previous window (double counting the tail). Consider using the same inclusion condition as the TTT evaluator (wlen >= stride or ws == 0) or otherwise adjusting the scoring slice for the final window to ensure each token is counted exactly once.
Suggested change:

```diff
-                 if min(ws + seq_len, total_tokens) - ws >= 1]
+                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
```
```diff
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
-"""
+"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
```
The root train_gpt.py no longer follows the repo’s stated goal of being a newcomer-friendly baseline (and it drops the existing guidance comment). The repository README explicitly says the best/SOTA configs should live under /records (README.md:204), and train_gpt_mlx.py documents a 1500-line hard stop. Please restore the baseline-oriented header and move record-specific/experimental code paths into the /records/.../train_gpt.py snapshot instead of upstreaming them into the root script.
Suggested change:

```diff
-"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
+"""Minimal GPT training script (baseline, V23 configuration).
+
+This file is intended to remain a reasonably small, newcomer-friendly reference
+implementation of GPT training: a single script you can read top-to-bottom to
+understand how data loading, the model, and the training loop fit together.
+
+For the latest record-setting / SOTA configurations and more experimental
+variants, please see the snapshots under `records/` (as described in the
+repository README). Those scripts may be longer and more specialized.
+
+Current baseline config: V23 — int5 GPTQ + 33.6M model
+(MHA 8/8, BigramHash 8192, MLP 3.5x).
+"""
```
```python
rd = self.rope_dims
if seq_len > self.train_seq_len:
    scale = seq_len / self.train_seq_len
    new_base = self.base * (scale ** (rd / (rd - 2)))
    inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd))
```
Rotary.forward() computes scale ** (rd / (rd - 2)) when seq_len > train_seq_len. If ROPE_DIMS is set to 2 via env (or any value <= 2), this will divide by zero (or become numerically unstable). Please validate rope_dims at initialization (e.g., require even and > 2 when extrapolation is enabled) or guard the extrapolation formula for small rd.
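One way to guard this, sketched in plain Python (the `extrapolated_base` helper is hypothetical; only the exponent formula mirrors the snippet above):

```python
# Sketch of validating rope_dims before applying the NTK-style base
# rescaling, where the exponent rd / (rd - 2) divides by zero for rd == 2.
def extrapolated_base(base, seq_len, train_seq_len, rope_dims):
    if seq_len <= train_seq_len:
        return base  # no extrapolation needed
    if rope_dims <= 2 or rope_dims % 2 != 0:
        raise ValueError(
            f"rope_dims must be even and > 2 for extrapolation, got {rope_dims}"
        )
    scale = seq_len / train_seq_len
    return base * scale ** (rope_dims / (rope_dims - 2))

print(extrapolated_base(10000.0, 2048, 1024, 64))  # base grows for longer context
try:
    extrapolated_base(10000.0, 2048, 1024, 2)
except ValueError as e:
    print("rejected:", e)
```

Validating once at `Rotary.__init__` (rather than inside `forward()`) keeps the hot path branch-free.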
```python
# --- Phase 2: TRAIN on this chunk (already scored = legal) ---
is_last_chunk = (ci == num_chunks - 1)
if not is_last_chunk and ttt_epochs > 0:
    chunk_start = ci * ttt_chunk_tokens
    chunk_end = min((ci + 1) * ttt_chunk_tokens, total_tokens)
    chunk_seqs = (chunk_end - chunk_start) // seq_len
    if chunk_seqs > 0:
        # Cosine LR across chunks
        cos_lr = ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
        for pg in optimizer.param_groups:
            pg["lr"] = cos_lr
        my_seq_s = (chunk_seqs * rank) // world_size
        my_seq_e = (chunk_seqs * (rank + 1)) // world_size
        my_chunk_seqs = my_seq_e - my_seq_s
        for _ep in range(ttt_epochs):
            for bs in range(0, my_chunk_seqs, batch_seqs):
                be = min(bs + batch_seqs, my_chunk_seqs)
                actual_bs = my_seq_s + bs
                start_tok = chunk_start + actual_bs * seq_len
                end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
                if end_tok > val_tokens.numel():
                    continue
                local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
                x = local[:-1].reshape(-1, seq_len)
                y = local[1:].reshape(-1, seq_len)
                optimizer.zero_grad(set_to_none=True)
                with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                    ttt_loss = base_model(x, y)
                ttt_loss.backward()
```
In eval_val_sliding_ttt(), the model is left in eval() mode during the adaptation/training phase (there’s no base_model.train() before computing ttt_loss/backward()). That can silently disable any training-mode behaviors (and also disables the QAT STE path which checks self.training). Please switch to train() for Phase 2 (and then back to eval() for scoring) so the adaptation step is actually run under training semantics.
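A minimal sketch of the suggested mode handling, using a stdlib context manager and a dummy stand-in for `torch.nn.Module` (both hypothetical, not the PR's code):

```python
from contextlib import contextmanager

class DummyModel:
    """Stand-in exposing the train()/eval()/.training API of torch.nn.Module."""
    def __init__(self):
        self.training = False
    def train(self, mode=True):
        self.training = mode
    def eval(self):
        self.train(False)

@contextmanager
def adaptation_mode(model):
    """Run the TTT adaptation phase under train(), restoring the previous
    mode afterwards even if the adaptation step raises."""
    was_training = model.training
    model.train()
    try:
        yield model
    finally:
        model.train(was_training)

m = DummyModel()
m.eval()                      # Phase 1: scoring under eval semantics
with adaptation_mode(m):
    assert m.training         # Phase 2: QAT STE path (checks self.training) active
assert not m.training         # back to eval before scoring the next chunk
```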
Closed because eval time exceeded 600 seconds.
Summary
Key Innovation
int5 GPTQ: fewer unique values (31 vs. 63) means lower entropy and better zstd compression (~0.46 bytes/param vs. int6's ~0.58). GPTQ's Hessian-aware error compensation keeps the quality loss minimal (a 0.001 BPB quant tax).
This unlocks a 33.6M param model (MHA 8/8, BigramHash 8192, MLP 3.5x) that previously could not fit under 16MB with int6.
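The entropy argument can be illustrated with a rough stdlib-only experiment (zlib standing in for zstd; the Gaussian weights, clip point, and resulting sizes are synthetic assumptions, not the record's actual numbers):

```python
import random
import zlib

# Quantize the same synthetic Gaussian "weights" to 31 levels (int5-style,
# [-15, 15]) and 63 levels (int6-style, [-31, 31]); fewer symbols means
# lower entropy, so the compressed payload shrinks.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(200_000)]
clip = 3.0  # assumed clipping point, analogous to the per-row clip

def quantize_bytes(ws, levels_half):
    scale = clip / levels_half
    # Shift codes by +128 so every symbol fits in an unsigned byte.
    return bytes(
        max(-levels_half, min(levels_half, round(w / scale))) + 128 for w in ws
    )

int5_blob = zlib.compress(quantize_bytes(weights, 15), 9)
int6_blob = zlib.compress(quantize_bytes(weights, 31), 9)
print(f"int5: {len(int5_blob) / len(weights):.3f} bytes/param")
print(f"int6: {len(int6_blob) / len(weights):.3f} bytes/param")
```

The exact bytes/param depend on the weight distribution and compressor, but the int5 payload should come out measurably smaller for the same data.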
Results
Test plan
records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/