
Record: int5 GPTQ + 33.6M model (3-seed mean val_bpb=1.1179) #585

Closed

EthanYangTW wants to merge 4 commits into openai:main from EthanYangTW:submission/int5-gptq-33m

Conversation

@EthanYangTW EthanYangTW commented Mar 23, 2026

Summary

  • 3-seed mean val_bpb = 1.1179 (std 0.0008)
  • Breakthrough: int5 quantization ([-15,15], 31 levels) with GPTQ error compensation enables 33.6M params in 16MB
  • All 3 artifacts under 16MB (15.53, 15.36, 15.28 MB)
  • Legal score-first TTT (every token scored before any gradient update)

Key Innovation

int5 GPTQ: fewer unique values (31 vs 63) = lower entropy = better zstd compression (~0.46 bytes/param vs int6's ~0.58). GPTQ Hessian-aware error compensation makes the quality loss minimal (0.001 BPB quant tax).
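The entropy argument can be sanity-checked on synthetic weights. The sketch below is not the PR's pipeline: it uses stdlib `zlib` in place of zstd and Gaussian noise in place of trained weights, so the absolute bytes/param differ from the quoted ~0.46 / ~0.58, but the int5 < int6 size ordering should still show up.

```python
import random
import zlib

def compressed_size(values, clip):
    # Symmetric round-to-nearest quantization to [-clip, clip],
    # one code per byte (offset by +clip so codes are non-negative).
    scale = max(abs(v) for v in values) / clip
    codes = bytes(min(2 * clip, max(0, round(v / scale) + clip)) for v in values)
    return len(zlib.compress(codes, 9))

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

int5_bytes = compressed_size(weights, 15)  # 31 levels
int6_bytes = compressed_size(weights, 31)  # 63 levels
print(int5_bytes / len(weights), int6_bytes / len(weights))  # bytes per parameter
```

Fewer quantization levels mean lower per-symbol entropy, which the entropy coder converts directly into a smaller payload.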

This unlocks a 33.6M param model (MHA 8/8, BigramHash 8192, MLP 3.5x) that was previously impossible to fit under 16MB with int6.
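A quick feasibility check makes the claim concrete; the bytes/param rates below are the PR's own estimates taken as assumptions, not measured here.

```python
# Back-of-envelope artifact budget using the PR's quoted zstd rates.
PARAMS = 33.6e6
CAP_BYTES = 16_000_000

int6_artifact = PARAMS * 0.58  # ~19.5 MB: a 33.6M model blows the 16 MB cap with int6
int5_artifact = PARAMS * 0.46  # ~15.5 MB: fits under the cap with int5

print(int6_artifact > CAP_BYTES, int5_artifact < CAP_BYTES)  # True True
```

The ~15.5 MB figure is consistent with the reported artifact sizes (15.28 to 15.53 MB).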

Results

| Seed | Sliding BPB | TTT BPB | Artifact |
|------|-------------|---------|----------|
| 1337 | 1.1244 | 1.1170 | 15.53 MB |
| 42   | 1.1249 | 1.1182 | 15.36 MB |
| 7    | 1.1250 | 1.1184 | 15.28 MB |
| Mean |        | 1.1179 |          |

Test plan

  • Verify artifact sizes are all under 16,000,000 bytes
  • Verify 3-seed statistical significance (p < 0.01)
  • Verify training completes within 600s on 8xH100 SXM
  • Verify eval completes within 600s
  • Reproduce from records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/

33.6M params (MHA 8/8, BigramHash 8192, MLP 3.5x) quantized to int5
with GPTQ error compensation. Artifact fits under 16MB (15.3-15.5MB).

Seeds: 1337 (1.1170), 42 (1.1182), 7 (1.1184)
Seed 1337: de843ef6 (TTT 1.1170)
Seed 42: b6560b60 (TTT 1.1182)
Seed 7: c1c18644 (TTT 1.1184)
seed1337.log - TTT 1.1170
seed42.log - TTT 1.1182
seed7.log - TTT 1.1184
Add proper /records submission with submission.json, README, train_gpt.py, and 3-seed logs.
Copilot AI review requested due to automatic review settings March 23, 2026 23:07
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new record submission (“int5 GPTQ + 33.6M + legal score-first TTT”) and updates the root training script to match that record configuration.

Changes:

  • Replace/supersede the root train_gpt.py with a much more complex record-style script (FlashAttention v3 integration, sliding eval + legal TTT, EMA/SWA, GPTQ calibration + mixed quant export, etc.).
  • Add a new record folder under records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/ with the exact script snapshot, README, submission metadata, and 3-seed logs.
  • Add additional seed logs under the repository-level logs/ directory.

Reviewed changes

Copilot reviewed 4 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
|------|-------------|
| train_gpt.py | Root script rewritten to the record configuration, adding GPTQ + TTT + sliding eval + many new model/optimizer features. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_gpt.py | Record snapshot of the training script used to generate the submission artifacts. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md | Record writeup describing the innovations, architecture, quantization pipeline, and reproduction steps. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json | Submission metadata for the leaderboard entry (scores, sizes, method summary). |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed1337.log | Seed 1337 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed42.log | Seed 42 run log. |
| records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/train_seed7.log | Seed 7 run log. |
| logs/seed1337.log | Extra copy of the seed 1337 log at the repo root. |
| logs/seed42.log | Extra copy of the seed 42 log at the repo root. |
| logs/seed7.log | Extra copy of the seed 7 log at the repo root. |


result[name + ".q"] = q
result[name + ".scale"] = s
meta[name] = {"type": "int8"}
print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)

Copilot AI Mar 23, 2026


mixed_quantize_int6_gptq() prints its summary unconditionally. In distributed runs this produces duplicated lines (once per rank), as visible in the provided logs. Please guard this print behind rank == 0 / master_process (pass rank into the function or log at the call site) to keep logs deterministic and easier to parse.

Suggested change
-    print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)
+    distributed = dist.is_available() and dist.is_initialized()
+    rank = dist.get_rank() if distributed else 0
+    if rank == 0:
+        print(f"gptq_quantize: {gptq_count} GPTQ layers, {naive_count} naive layers", flush=True)

Comment on lines 1509 to +1515
 if master_process:
-    with open("final_model.int8.ptz", "wb") as f:
+    with open("final_model.int6.ptz", "wb") as f:
         f.write(quant_blob)
-    quant_file_bytes = os.path.getsize("final_model.int8.ptz")
+    quant_file_bytes = len(quant_blob)
     code_bytes = len(code.encode("utf-8"))
-    ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1)
-    log0(
-        f"Serialized model int8+zlib: {quant_file_bytes} bytes "
-        f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)"
-    )
-    log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
+    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
+    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")

Copilot AI Mar 23, 2026


The artifact filename/logging uses final_model.int6.ptz and messages like int6+{_COMPRESSOR}, but the PR description/record metadata claim an int5 quantization scheme ([-15,15], 31 levels). This mismatch is confusing for reviewers and for any downstream tooling that expects naming to reflect the actual quantization format. Please align naming (filenames, log strings, helper function names) with the true quantization level being produced.

Comment on lines +1508 to +1515
quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw) if _COMPRESSOR == "zstd" else zlib.compress(quant_raw, 9)
if master_process:
    with open("final_model.int6.ptz", "wb") as f:
        f.write(quant_blob)
    quant_file_bytes = len(quant_blob)
    code_bytes = len(code.encode("utf-8"))
    log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
    log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes")

Copilot AI Mar 23, 2026


This record script writes final_model.int6.ptz and logs int6+{_COMPRESSOR}, but the submission/README claim an int5 scheme ([-15,15], 31 levels). Please align filenames/log strings and helper names with the actual quantization format to avoid confusion during verification.

Comment on lines +459 to +467
q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
if _HAS_FA3:
    y = flash_attn_3_func(q, k, v, causal=True).contiguous()
else:
    y = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        attn_mask=None, is_causal=True,
        enable_gqa=(self.num_kv_heads != self.num_heads),
    ).transpose(1, 2)

Copilot AI Mar 23, 2026


When FlashAttention is available, the code calls flash_attn_3_func(q, k, v, ...) even if num_kv_heads != num_heads. Unlike the SDPA fallback (which enables GQA), FlashAttention may not accept mismatched head counts and can error if someone overrides NUM_KV_HEADS. Consider gating the FlashAttention path to num_kv_heads == num_heads (or using the FlashAttention API that explicitly supports GQA/MQA) and otherwise fall back to scaled_dot_product_attention(..., enable_gqa=True).
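The gating the review asks for can be isolated into a tiny backend selector; this is a hypothetical helper for illustration, not code from the PR.

```python
def pick_attention_backend(has_fa3: bool, num_heads: int, num_kv_heads: int) -> str:
    """Only take the FlashAttention fast path when the head layout is one
    it is known to accept; otherwise fall back to SDPA (with GQA enabled
    when head counts differ)."""
    if has_fa3 and num_kv_heads == num_heads:
        return "fa3"
    return "sdpa(enable_gqa=True)" if num_kv_heads != num_heads else "sdpa"

print(pick_attention_backend(True, 8, 8))   # fa3
print(pick_attention_backend(True, 8, 2))   # sdpa(enable_gqa=True)
```

With `NUM_KV_HEADS` overridden to 2, the selector routes around the FA3 call instead of erroring inside it.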

Comment on lines +353 to +359
if CastedLinear._qat_enabled and self.training and w.ndim == 2:
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
        scale = (row_clip / 15.0).clamp_min(1.0 / 15.0)
        w_q = (torch.clamp(torch.round(w32 / scale[:, None]), -16, 15) * scale[:, None]).to(x.dtype)
    w = w + (w_q - w).detach()
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The QAT “int5” STE path uses scale = row_clip / 15.0 but clamps to [-16, 15], which is 32 levels and doesn’t match the stated [-15, 15] (31 levels) quantization used elsewhere (clip_range=15). This inconsistency makes it hard to reason about what quantization the model is being trained for. Please make the clamp range consistent with the intended scheme (and update the README/submission metadata accordingly).
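The difference is easy to state concretely. This hypothetical helper clamps symmetrically to [-15, 15] (31 levels), matching the stated export scheme, unlike the STE path's [-16, 15] clamp (32 levels):

```python
def quantize_int5_symmetric(values, clip=15):
    """2*clip + 1 = 31 integer levels in [-clip, clip]; note the lower
    bound is -clip, not -(clip + 1) as in the STE path under review."""
    scale = max(abs(v) for v in values) / clip
    return [max(-clip, min(clip, round(v / scale))) for v in values], scale

q, scale = quantize_int5_symmetric([0.5, -1.2, 3.0, -3.0, 0.0])
assert min(q) >= -15 and max(q) <= 15  # never emits -16
```

Training against 32 levels but exporting 31 means the value -16 seen during QAT can never be represented at export time, which is exactly the inconsistency flagged above.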

Comment on lines +1403 to +1404
for name, t in base_model.state_dict().items():
    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)

Copilot AI Mar 23, 2026


The per-step EMA update iterates for name, t in base_model.state_dict().items() on every training step. Building a full state_dict each step can be a significant CPU overhead and allocates many temporary objects. Consider tracking EMA over named_parameters()/buffers directly (or caching the parameter list once) to avoid repeated state_dict() construction in the hot path.

Suggested change
-for name, t in base_model.state_dict().items():
-    ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+# Lazily cache parameter/buffer references to avoid rebuilding state_dict() every step
+if not hasattr(base_model, "_ema_source_tensors"):
+    src_tensors: dict[str, Tensor] = {}
+    for n, p in base_model.named_parameters():
+        src_tensors[n] = p
+    for n, b in base_model.named_buffers():
+        src_tensors[n] = b
+    base_model._ema_source_tensors = src_tensors
+for name, ema_t in ema_state.items():
+    src_t = base_model._ema_source_tensors.get(name)
+    if src_t is None:
+        continue
+    ema_t.mul_(ema_decay).add_(src_t.detach().float(), alpha=1.0 - ema_decay)

seq_len = eval_seq_len or args.train_seq_len
total_tokens = val_tokens.numel() - 1
window_starts = [ws for ws in range(0, total_tokens, stride)
                 if min(ws + seq_len, total_tokens) - ws >= 1]

Copilot AI Mar 23, 2026


eval_val_sliding() includes a final partial window whenever it has at least 1 token. When total_tokens is not an exact multiple of stride, that last window will typically re-score tokens that were already scored as the “last stride” of the previous window (double counting the tail). Consider using the same inclusion condition as the TTT evaluator (wlen >= stride or ws == 0) or otherwise adjusting the scoring slice for the final window to ensure each token is counted exactly once.

Suggested change
-                 if min(ws + seq_len, total_tokens) - ws >= 1]
+                 if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0]
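The effect of the two inclusion conditions can be checked in isolation. This standalone sketch only compares which window starts survive; the helper name and the token counts are illustrative, not from the script.

```python
def window_starts(total_tokens, seq_len, stride, allow_partial_tail):
    """List sliding-window start offsets under one of the two inclusion rules."""
    starts = []
    for ws in range(0, total_tokens, stride):
        wlen = min(ws + seq_len, total_tokens) - ws
        keep = wlen >= 1 if allow_partial_tail else (wlen >= stride or ws == 0)
        if keep:
            starts.append(ws)
    return starts

# 1001 tokens, 128-token windows, stride 64: the tail window at ws=960 has
# only 41 tokens (< stride), so its scored slice overlaps the previous window.
old = window_starts(1001, 128, 64, allow_partial_tail=True)
new = window_starts(1001, 128, 64, allow_partial_tail=False)
print(960 in old, 960 in new)  # True False
```

The stricter condition drops exactly that short tail window, so no token's loss is counted twice.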

Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
"""

"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""

Copilot AI Mar 23, 2026


The root train_gpt.py no longer follows the repo’s stated goal of being a newcomer-friendly baseline (and it drops the existing guidance comment). The repository README explicitly says the best/SOTA configs should live under /records (README.md:204), and train_gpt_mlx.py documents a 1500-line hard stop. Please restore the baseline-oriented header and move record-specific/experimental code paths into the /records/.../train_gpt.py snapshot instead of upstreaming them into the root script.

Suggested change
-"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""
+"""Minimal GPT training script (baseline, V23 configuration).
+
+This file is intended to remain a reasonably small, newcomer-friendly reference
+implementation of GPT training: a single script you can read top-to-bottom to
+understand how data loading, the model, and the training loop fit together.
+
+For the latest record-setting / SOTA configurations and more experimental
+variants, please see the snapshots under `records/` (as described in the
+repository README). Those scripts may be longer and more specialized.
+
+Current baseline config: V23 (int5 GPTQ + 33.6M model,
+MHA 8/8, BigramHash 8192, MLP 3.5x).
+"""

Comment on lines +389 to +393
rd = self.rope_dims
if seq_len > self.train_seq_len:
scale = seq_len / self.train_seq_len
new_base = self.base * (scale ** (rd / (rd - 2)))
inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd))
Copy link

Copilot AI Mar 23, 2026


Rotary.forward() computes scale ** (rd / (rd - 2)) when seq_len > train_seq_len. If ROPE_DIMS is set to 2 via env (or any value <= 2), this will divide by zero (or become numerically unstable). Please validate rope_dims at initialization (e.g., require even and > 2 when extrapolation is enabled) or guard the extrapolation formula for small rd.
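A guarded version of the extrapolation formula is a few lines; this is a sketch with illustrative names, not the script's actual API.

```python
def rope_extrapolation_base(base, seq_len, train_seq_len, rope_dims):
    """NTK-style base rescaling for long-context extrapolation, with the
    rope_dims validation the review asks for."""
    if rope_dims <= 2 or rope_dims % 2 != 0:
        raise ValueError("rope_dims must be even and > 2 for extrapolation")
    if seq_len <= train_seq_len:
        return base
    scale = seq_len / train_seq_len
    return base * scale ** (rope_dims / (rope_dims - 2))

print(rope_extrapolation_base(10000.0, 8192, 4096, 64))  # base grows for longer contexts
```

With `rope_dims=2` the unguarded exponent `rd / (rd - 2)` divides by zero, so rejecting such configs at init (or at first forward) fails fast instead of producing NaN frequencies mid-run.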

Comment on lines +869 to +897
# --- Phase 2: TRAIN on this chunk (already scored = legal) ---
is_last_chunk = (ci == num_chunks - 1)
if not is_last_chunk and ttt_epochs > 0:
    chunk_start = ci * ttt_chunk_tokens
    chunk_end = min((ci + 1) * ttt_chunk_tokens, total_tokens)
    chunk_seqs = (chunk_end - chunk_start) // seq_len
    if chunk_seqs > 0:
        # Cosine LR across chunks
        cos_lr = ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))
        for pg in optimizer.param_groups:
            pg["lr"] = cos_lr
        my_seq_s = (chunk_seqs * rank) // world_size
        my_seq_e = (chunk_seqs * (rank + 1)) // world_size
        my_chunk_seqs = my_seq_e - my_seq_s
        for _ep in range(ttt_epochs):
            for bs in range(0, my_chunk_seqs, batch_seqs):
                be = min(bs + batch_seqs, my_chunk_seqs)
                actual_bs = my_seq_s + bs
                start_tok = chunk_start + actual_bs * seq_len
                end_tok = chunk_start + (my_seq_s + be) * seq_len + 1
                if end_tok > val_tokens.numel():
                    continue
                local = val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64)
                x = local[:-1].reshape(-1, seq_len)
                y = local[1:].reshape(-1, seq_len)
                optimizer.zero_grad(set_to_none=True)
                with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                    ttt_loss = base_model(x, y)
                ttt_loss.backward()
Copy link

Copilot AI Mar 23, 2026


In eval_val_sliding_ttt(), the model is left in eval() mode during the adaptation/training phase (there’s no base_model.train() before computing ttt_loss/backward()). That can silently disable any training-mode behaviors (and also disables the QAT STE path which checks self.training). Please switch to train() for Phase 2 (and then back to eval() for scoring) so the adaptation step is actually run under training semantics.
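The intended phase structure looks like this; a stub stands in for the torch Module so the mode flips are visible (a sketch of the pattern, not the script's actual code).

```python
class StubModel:
    """Minimal stand-in for an nn.Module's train()/eval() flag."""
    def __init__(self):
        self.training = False
        self.calls = []

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, phase):
        # Record which mode each phase actually ran in.
        self.calls.append((phase, self.training))

model = StubModel()
for chunk in range(3):
    model.eval()       # Phase 1: score the chunk before any gradient update
    model("score")
    model.train()      # Phase 2: adapt on the chunk that was just scored
    model("adapt")
model.eval()           # leave the model in eval mode for final scoring
```

Without the explicit `train()` call before Phase 2, anything keyed on `self.training` (dropout-style behaviors, the QAT STE branch) silently stays in its eval form during adaptation.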

@EthanYangTW (Author)

Closed because the eval time exceeded 600 seconds.
