Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179) #585
EthanYangTW wants to merge 4 commits into openai:main from
Conversation
33.6M params (MHA 8/8, BigramHash 8192, MLP 3.5x) quantized to int5 with GPTQ error compensation. Artifact fits under 16MB (15.3-15.5MB). Seeds: 1337 (1.1170), 42 (1.1182), 7 (1.1184)
- Seed 1337: de843ef6 (TTT 1.1170)
- Seed 42: b6560b60 (TTT 1.1182)
- Seed 7: c1c18644 (TTT 1.1184)

Logs:
- seed1337.log - TTT 1.1170
- seed42.log - TTT 1.1182
- seed7.log - TTT 1.1184
Add proper /records submission with submission.json, README, train_gpt.py, and 3-seed logs.
Pull request overview
This PR introduces a new record submission (“int5 GPTQ + 33.6M + legal score-first TTT”) and updates the root training script to match that record configuration.
Changes:
- Replace/supersede the root `train_gpt.py` with a much more complex record-style script (FlashAttention v3 integration, sliding eval + legal TTT, EMA/SWA, GPTQ calibration + mixed quant export, etc.).
- Add a new record folder under `records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/` with the exact script snapshot, README, submission metadata, and 3-seed logs.
- Add additional seed logs under the repository-level `logs/` directory.
Reviewed changes
Copilot reviewed 4 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| train_gpt.py | Root script rewritten to the record configuration, adding GPTQ + TTT + sliding eval + many new model/optimizer features. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | Record snapshot of the training script used to generate the submission artifacts. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Record writeup describing the innovations, architecture, quantization pipeline, and reproduction steps. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata for the leaderboard entry (scores, sizes, method summary). |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log. |
| logs/seed1337.log | Extra copy of the seed 1337 log at repo root. |
| logs/seed42.log | Extra copy of the seed 42 log at repo root. |
| logs/seed7.log | Extra copy of the seed 7 log at repo root. |
```python
result[name + ".q"] = q
result[name + ".scale"] = s
meta[name] = {"type": "int8"}
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
mixed_quantize_int6_gptq() prints its summary unconditionally. In distributed runs this produces duplicated lines (once per rank), as visible in the provided logs. Please guard this print behind rank == 0 / master_process (pass rank into the function or log at the call site) to keep logs deterministic and easier to parse.
Suggested change:

```diff
-print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
+distributed = dist.is_available() and dist.is_initialized()
+rank = dist.get_rank() if distributed else 0
+if rank == 0:
+    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
```
```diff
 if master_process:
-    with open("final_model.int8.ptz", "wb") as f:
+    with open("final_model.int6.ptz", "wb") as f:
         f.write(quant_blob)
-    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+    quant_file_bytes = len(quant_blob)
     code_bytes = len(code.encode("utf-8"))
-    ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
-    log0(
-        f"Serialized model int8+zlib: {quant_file_bytes} bytes "
-        f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
-    )
-    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
+    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
```
The artifact filename/logging uses final_model.int6.ptz and messages like int6+{_COMPRESSOR}, but the PR description/record metadata claim an int5 quantization scheme ([-15,15], 31 levels). This mismatch is confusing for reviewers and for any downstream tooling that expects naming to reflect the actual quantization format. Please align naming (filenames, log strings, helper function names) with the true quantization level being produced.
```python
quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw) if _COMPRESSOR == "zstd" else zlib.compress(quant_raw, 9)
if master_process:
    with open("final_model.int6.ptz", "wb") as f:
        f.write(quant_blob)
    quant_file_bytes = len(quant_blob)
    code_bytes = len(code.encode("utf-8"))
    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")
```
This record script writes final_model.int6.ptz and logs int6+{_COMPRESSOR}, but the submission/README claim an int5 scheme ([-15,15], 31 levels). Please align filenames/log strings and helper names with the actual quantization format to avoid confusion during verification.
```python
q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
if _HAS_FA3:
    y = flash_attn_3_func(q, k, v, causal=True).contiguous()
else:
    y = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        attn_mask=None, is_causal=True,
        enable_gqa=(self.num_kv_heads != self.num_heads),
    ).transpose(1, 2)
```
When FlashAttention is available, the code calls flash_attn_3_func(q, k, v, ...) even if num_kv_heads != num_heads. Unlike the SDPA fallback (which enables GQA), FlashAttention may not accept mismatched head counts and can error if someone overrides NUM_KV_HEADS. Consider gating the FlashAttention path to num_kv_heads == num_heads (or using the FlashAttention API that explicitly supports GQA/MQA) and otherwise fall back to scaled_dot_product_attention(..., enable_gqa=True).
```python
if CastedLinear._qat_enabled and self.training and w.ndim == 2:
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / 15.0).clamp_min(1.0 / 15.0)
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -16, 15) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
```
The QAT “int5” STE path uses scale = row_clip / 15.0 but clamps to [-16, 15], which is 32 levels and doesn’t match the stated [-15, 15] (31 levels) quantization used elsewhere (clip_range=15). This inconsistency makes it hard to reason about what quantization the model is being trained for. Please make the clamp range consistent with the intended scheme (and update the README/submission metadata accordingly).
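The mismatch can be seen with a tiny pure-Python sketch (illustrative only; `quantize_levels` and `fake_quant` are hypothetical helpers, not the PR's code):

```python
# Illustrative sketch of the reviewer's point: with scale = row_clip / 15.0,
# a symmetric clamp to [-15, 15] gives 31 levels, while [-16, 15] gives 32
# and can emit values below -row_clip, which the stated int5 scheme never
# produces.
def quantize_levels(lo, hi):
    """Number of distinct integer codes in an inclusive clamp range."""
    return hi - lo + 1

def fake_quant(w, clip, lo=-15, hi=15):
    """Round-to-nearest symmetric fake-quantization of a single weight."""
    q = max(lo, min(hi, round(w * 15.0 / clip)))
    return q * clip / 15.0

assert quantize_levels(-15, 15) == 31   # claimed int5 scheme
assert quantize_levels(-16, 15) == 32   # what the [-16, 15] clamp trains for

# A large negative weight quantized with the [-16, 15] clamp lands outside
# the [-clip, clip] range the int5 export assumes:
print(fake_quant(-2.0, clip=1.0, lo=-16))  # -16/15, i.e. below -clip
print(fake_quant(-2.0, clip=1.0, lo=-15))  # -1.0, i.e. exactly -clip
```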
```python
for name, t in base_model.state_dict().items():
    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
```
The per-step EMA update iterates for name, t in base_model.state_dict().items() on every training step. Building a full state_dict each step can be a significant CPU overhead and allocates many temporary objects. Consider tracking EMA over named_parameters()/buffers directly (or caching the parameter list once) to avoid repeated state_dict() construction in the hot path.
Suggested change:

```diff
-for name, t in base_model.state_dict().items():
-    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+# Lazily cache parameter/buffer references to avoid rebuilding state_dict() every step
+if not hasattr(base_model, "_ema_source_tensors"):
+    src_tensors: dict[str, Tensor] = {}
+    for n, p in base_model.named_parameters():
+        src_tensors[n] = p
+    for n, b in base_model.named_buffers():
+        src_tensors[n] = b
+    base_model._ema_source_tensors = src_tensors
+for name, ema_t in ema_state.items():
+    src_t = base_model._ema_source_tensors.get(name)
+    if src_t is None:
+        continue
+    ema_t.mul_(ema_decay).add_(src_t.detach().float(), alpha=1.0 - ema_decay)
```
```python
seq_len = eval_seq_len or args.train_seq_len
total_tokens = val_tokens.numel() - 1
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]
```
eval_val_sliding() includes a final partial window whenever it has at least 1 token. When total_tokens is not an exact multiple of stride, that last window will typically re-score tokens that were already scored as the “last stride” of the previous window (double counting the tail). Consider using the same inclusion condition as the TTT evaluator (wlen >= stride or ws == 0) or otherwise adjusting the scoring slice for the final window to ensure each token is counted exactly once.
Suggested change:

```diff
-                 if min(ws + seq_len, total_tokens) - ws >= 1]
+                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
```
```diff
-Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
-"""
+"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
```
The root train_gpt.py no longer follows the repo’s stated goal of being a newcomer-friendly baseline (and it drops the existing guidance comment). The repository README explicitly says the best/SOTA configs should live under /records (README.md:204), and train_gpt_mlx.py documents a 1500-line hard stop. Please restore the baseline-oriented header and move record-specific/experimental code paths into the /records/.../train_gpt.py snapshot instead of upstreaming them into the root script.
Suggested change:

```diff
-"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
+"""Minimal GPT training script (baseline, V23 configuration).
+
+This file is intended to remain a reasonably small, newcomer-friendly reference
+implementation of GPT training: a single script you can read top-to-bottom to
+understand how data loading, the model, and the training loop fit together.
+
+For the latest record-setting / SOTA configurations and more experimental
+variants, please see the snapshots under `records/` (as described in the
+repository README). Those scripts may be longer and more specialized.
+
+Current baseline config: V23 — int5 GPTQ + 33.6M model
+(MHA 8/8, BigramHash 8192, MLP 3.5x).
+"""
```
```python
rd = self.rope_dims
if seq_len > self.train_seq_len:
    scale = seq_len / self.train_seq_len
    new_base = self.base * (scale ** (rd / (rd - 2)))
    inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd))
```
Rotary.forward() computes scale ** (rd / (rd - 2)) when seq_len > train_seq_len. If ROPE_DIMS is set to 2 via env (or any value <= 2), this will divide by zero (or become numerically unstable). Please validate rope_dims at initialization (e.g., require even and > 2 when extrapolation is enabled) or guard the extrapolation formula for small rd.
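One way to guard this, sketched in plain Python (the `extrapolated_base` helper is hypothetical; only the exponent formula mirrors the snippet above):

```python
# Sketch of validating rope_dims before applying the NTK-style base
# rescaling, where the exponent rd / (rd - 2) divides by zero for rd == 2.
def extrapolated_base(base, seq_len, train_seq_len, rope_dims):
    if seq_len <= train_seq_len:
        return base  # no extrapolation needed
    if rope_dims <= 2 or rope_dims % 2 != 0:
        raise ValueError(
            f"rope_dims must be even and > 2 for extrapolation, got {rope_dims}"
        )
    scale = seq_len / train_seq_len
    return base * scale ** (rope_dims / (rope_dims - 2))

print(extrapolated_base(10000.0, 2048, 1024, 64))  # base grows for longer context
try:
    extrapolated_base(10000.0, 2048, 1024, 2)
except ValueError as e:
    print("rejected:", e)
```

Validating once at `Rotary.__init__` (rather than inside `forward()`) keeps the hot path branch-free.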
```python
# --- Phase 2: TRAIN on this chunk (already scored = legal) ---
is_last_chunk = (ci == num_chunks - 1)
if not is_last_chunk and ttt_epochs > 0:
    chunk_start = ci * ttt_chunk_tokens
    chunk_end = min((ci + 1) * ttt_chunk_tokens, total_tokens)
    chunk_seqs = (chunk_end - chunk_start) // seq_len
    if chunk_seqs > 0:
        # Cosine LR across chunks
        cos_lr = ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
        for pg in optimizer.param_groups:
            pg["lr"] = cos_lr
        my_seq_s = (chunk_seqs * rank) // world_size
        my_seq_e = (chunk_seqs * (rank + 1)) // world_size
        my_chunk_seqs = my_seq_e - my_seq_s
        for _ep in range(ttt_epochs):
            for bs in range(0, my_chunk_seqs, batch_seqs):
                be = min(bs + batch_seqs, my_chunk_seqs)
                actual_bs = my_seq_s + bs
                start_tok = chunk_start + actual_bs * seq_len
                end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
                if end_tok > val_tokens.numel():
                    continue
                local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
                x = local[:-1].reshape(-1, seq_len)
                y = local[1:].reshape(-1, seq_len)
                optimizer.zero_grad(set_to_none=True)
                with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                    ttt_loss = base_model(x, y)
                ttt_loss.backward()
```
In eval_val_sliding_ttt(), the model is left in eval() mode during the adaptation/training phase (there’s no base_model.train() before computing ttt_loss/backward()). That can silently disable any training-mode behaviors (and also disables the QAT STE path which checks self.training). Please switch to train() for Phase 2 (and then back to eval() for scoring) so the adaptation step is actually run under training semantics.
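A minimal sketch of the suggested mode handling, using a stdlib context manager and a dummy stand-in for `torch.nn.Module` (both hypothetical, not the PR's code):

```python
from contextlib import contextmanager

class DummyModel:
    """Stand-in exposing the train()/eval()/.training API of torch.nn.Module."""
    def __init__(self):
        self.training = False
    def train(self, mode=True):
        self.training = mode
    def eval(self):
        self.train(False)

@contextmanager
def adaptation_mode(model):
    """Run the TTT adaptation phase under train(), restoring the previous
    mode afterwards even if the adaptation step raises."""
    was_training = model.training
    model.train()
    try:
        yield model
    finally:
        model.train(was_training)

m = DummyModel()
m.eval()                      # Phase 1: scoring under eval semantics
with adaptation_mode(m):
    assert m.training         # Phase 2: QAT STE path (checks self.training) active
assert not m.training         # back to eval before scoring the next chunk
```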
Closed because eval time exceeded 600 seconds.
Summary
Key Innovation
int5 GPTQ: fewer unique values (31 vs. 63) means lower entropy and better zstd compression (~0.46 bytes/param vs. int6's ~0.58). GPTQ's Hessian-aware error compensation keeps the quality loss minimal (a 0.001 BPB quant tax).
This unlocks a 33.6M param model (MHA 8/8, BigramHash 8192, MLP 3.5x) that previously could not fit under 16MB with int6.
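The entropy argument can be illustrated with a rough stdlib-only experiment (zlib standing in for zstd; the Gaussian weights, clip point, and resulting sizes are synthetic assumptions, not the record's actual numbers):

```python
import random
import zlib

# Quantize the same synthetic Gaussian "weights" to 31 levels (int5-style,
# [-15, 15]) and 63 levels (int6-style, [-31, 31]); fewer symbols means
# lower entropy, so the compressed payload shrinks.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(200_000)]
clip = 3.0  # assumed clipping point, analogous to the per-row clip

def quantize_bytes(ws, levels_half):
    scale = clip / levels_half
    # Shift codes by +128 so every symbol fits in an unsigned byte.
    return bytes(
        max(-levels_half, min(levels_half, round(w / scale))) + 128 for w in ws
    )

int5_blob = zlib.compress(quantize_bytes(weights, 15), 9)
int6_blob = zlib.compress(quantize_bytes(weights, 31), 9)
print(f"int5: {len(int5_blob) / len(weights):.3f} bytes/param")
print(f"int6: {len(int6_blob) / len(weights):.3f} bytes/param")
```

The exact bytes/param depend on the weight distribution and compressor, but the int5 payload should come out measurably smaller for the same data.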
Results
Test plan
records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/