
int5 GPTQ + 33.6M model: 1.1179 BPB (3-seed mean) #544

Closed
EthanYangTW wants to merge 2 commits into openai:main from EthanYangTW:submission/int5-gptq-33m

Conversation


@EthanYangTW EthanYangTW commented Mar 23, 2026

Summary

33.6M parameter model quantized to int5 with GPTQ error compensation, fitting under 16MB. First submission to achieve int5 quantization on a 33.6M model within the artifact size limit.

Architecture: 11L, 512d, MHA 8/8, MLP 3.5x (1792), BigramHash 8192, XSA all layers
Quantization: int5 per-row GPTQ (clip_range=15) + Early QAT (threshold 0.5) + EMA 0.997
TTT: Legal score-first AdamW, chunk=131072, last 2 blocks unfrozen
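For scale, the 16 MB budget can be sanity-checked with back-of-envelope arithmetic (assuming one int8 code per weight before zstd, which matches the "int6+zstd" serialization lines in the logs; the exact export layout may differ):

```python
params = 33_580_124  # model_params from the logs

raw_int8_mib = params / 2**20            # one int8 code per weight
floor_5bit_mib = params * 5 / 8 / 2**20  # naive floor at 5 bits per weight

print(f"{raw_int8_mib:.1f} MiB as raw int8 codes")       # ~32.0 MiB
print(f"{floor_5bit_mib:.1f} MiB at 5 bits per weight")  # ~20.0 MiB
```

The reported 15.3-15.5 MB artifacts land below even the naive 5-bit floor, which is plausible because the clipped, rounded codes are far from uniformly distributed, so zstd's entropy coding beats fixed-width packing.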

Results

| Seed | Sliding BPB | TTT BPB | Artifact |
| ---- | ----------- | ------- | -------- |
| 1337 | 1.1244 | 1.1170 | 15.53 MB |
| 42 | 1.1249 | 1.1182 | 15.36 MB |
| 7 | 1.1250 | 1.1184 | 15.28 MB |
| Mean | 1.1248 | 1.1179 | |
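The 3-seed means can be reproduced directly from the per-seed rows:

```python
# Per-seed results copied from the results table (seeds 1337, 42, 7).
sliding_bpb = [1.1244, 1.1249, 1.1250]
ttt_bpb = [1.1170, 1.1182, 1.1184]

print(round(sum(sliding_bpb) / 3, 4))  # 1.1248
print(round(sum(ttt_bpb) / 3, 4))      # 1.1179
```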

Logs

Seed 1337

model_params:33580124
XSA:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ws:8 gqa:8/8
lr:embed=0.035 matrix=0.025 scalar=0.025 batch:786432 wall:600s seed:1337
step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms
step:500/20000 train_loss:2.3701 train_time:48612ms step_avg:97.22ms
step:1000/20000 train_loss:2.2455 train_time:97399ms step_avg:97.40ms
step:1500/20000 train_loss:2.1909 train_time:146193ms step_avg:97.46ms
step:2000/20000 train_loss:2.0314 train_time:195030ms step_avg:97.52ms
step:2500/20000 train_loss:2.1357 train_time:243879ms step_avg:97.55ms
step:3000/20000 train_loss:2.1196 train_time:292698ms step_avg:97.57ms
step:3500/20000 train_loss:2.1280 train_time:341516ms step_avg:97.58ms
step:4000/20000 val_loss:2.0061 val_bpb:1.1881 train_time:390334ms step_avg:97.58ms
late_qat:enabled step:4399 scale:0.4999
swa:start step:5450
step:6142/20000 val_loss:1.9016 val_bpb:1.1262 train_time:600020ms step_avg:97.69ms
best_averaging:ema val_bpb:1.1253
pruning:2.0% magnitude pruning applied
gptq:calibrated 68 layers in 2.6s
Serialized model int6+zstd: 15450968 bytes
Total submission size int6+zstd: 15526951 bytes
final_int6_sliding_window val_loss:1.8985 val_bpb:1.1244 stride:32 eval_time:164429ms
final_int6_sliding_window_exact val_loss:1.89850579 val_bpb:1.12440579
ttt:start chunks=474 chunk_tokens=131072 stride=32 lr=0.0001 epochs=3 opt=adamw freeze_first=2
  ttt_chunk [1/474] bpb=1.205191 time=1.2s
  ttt_chunk [101/474] bpb=1.125280 time=112.0s
  ttt_chunk [201/474] bpb=1.125951 time=222.8s
  ttt_chunk [301/474] bpb=1.121763 time=333.6s
  ttt_chunk [401/474] bpb=1.118378 time=444.4s
  ttt_chunk [474/474] bpb=1.117889 time=524.3s
final_int6_ttt val_loss:1.8860 val_bpb:1.1170 stride:32 eval_time:525132ms
final_int6_ttt_exact val_loss:1.88598432 val_bpb:1.11698985

Seed 42

model_params:33580124
seed:42
step:4000/20000 val_loss:2.0087 val_bpb:1.1897
late_qat:enabled step:4393 scale:0.4999
swa:start step:5450
step:6138/20000 val_loss:1.9040 val_bpb:1.1276 train_time:600007ms step_avg:97.75ms
best_averaging:ema val_bpb:1.1267
pruning:2.0% magnitude pruning applied
gptq:calibrated 68 layers in 2.5s
Serialized model int6+zstd: 15284494 bytes
Total submission size int6+zstd: 15360477 bytes
final_int6_sliding_window_exact val_loss:1.89928270 val_bpb:1.12486593
ttt:start chunks=474 chunk_tokens=131072 stride=32 lr=0.0001 epochs=3 opt=adamw freeze_first=2
  ttt_chunk [1/474] bpb=1.199439 time=1.2s
  ttt_chunk [101/474] bpb=1.125688 time=112.3s
  ttt_chunk [474/474] bpb=1.119019 time=525.9s
ttt:done val_loss=1.888037 val_bpb=1.118206 elapsed=525.9s
final_int6_ttt_exact val_loss:1.88803710 val_bpb:1.11820563

Seed 7

model_params:33580124
seed:7
best_averaging:ema val_bpb:1.1264
Serialized model int6+zstd: 15206146 bytes
Total submission size int6+zstd: 15282129 bytes
final_int6_sliding_window_exact val_loss:1.89958170 val_bpb:1.12504301
ttt:start chunks=474 chunk_tokens=131072 stride=32 lr=0.0001 epochs=2 opt=adamw freeze_first=2
ttt:done val_loss=1.888400 val_bpb=1.118421 elapsed=503.5s
final_int6_ttt_exact val_loss:1.88840031 val_bpb:1.11842074

Reproduction

```bash
pip install --break-system-packages zstandard
pip install --break-system-packages flash-attn --no-build-isolation
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80

SEED=1337 PRUNE_PCT=0.02 TTT_EPOCHS=3 TTT_LR=0.0001 \
TTT_OPTIMIZER=adamw TTT_FREEZE_BLOCKS=2 TTT_CHUNK_TOKENS=131072 \
EVAL_STRIDE=32 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

33.6M params (MHA 8/8, BigramHash 8192, MLP 3.5x) quantized to int5
with GPTQ error compensation. Artifact fits under 16MB (15.3-15.5MB).

Seeds: 1337 (1.1170), 42 (1.1182), 7 (1.1184)
Seed 1337: de843ef6 (TTT 1.1170)
Seed 42: b6560b60 (TTT 1.1182)
Seed 7: c1c18644 (TTT 1.1184)
Contributor

Copilot AI left a comment


Pull request overview

Updates train_gpt.py to implement and export a new 33.6M-parameter GPT variant targeting int5-style (clip_range=15) GPTQ-assisted quantization under the 16MB artifact limit, with added sliding-window evaluation and score-first TTT evaluation.

Changes:

  • Expands the model architecture (11 layers, BigramHash embedding, optional XSA, optional value embeddings, RoPE tweaks, smear gating, optional DTG/ln scaling).
  • Adds new evaluation modes (separate eval sequence length, sliding-window BPB eval, and “legal score-first” TTT eval).
  • Replaces the prior int8 export path with a mixed int8/int5-like (named “int6” in code) quantization pipeline including GPTQ calibration, optional pruning, and zstd/zlib compression.


Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines.
"""

"""V23: int5 GPTQ + 33.6M model (MHA 8/8, BigramHash 8192, MLP 3.5x)."""

Copilot AI Mar 23, 2026


train_gpt_mlx.py’s module docstring states a hard stop that both train_gpt.py and train_gpt_mlx.py should not exceed 1500 lines for newcomer readability, but train_gpt.py is now 1580 lines. Please move record-specific / experimental code (e.g., GPTQ/TTT helpers) into /records or otherwise reduce the core script length to stay within that stated limit.

Comment on lines +389 to +396
```python
rd = self.rope_dims
if seq_len > self.train_seq_len:
    scale = seq_len / self.train_seq_len
    new_base = self.base * (scale ** (rd / (rd - 2)))
    inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd))
else:
    inv_freq = self.inv_freq.to(device)
t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
```

Copilot AI Mar 23, 2026


Rotary.forward() computes new_base = self.base * (scale ** (rd / (rd - 2))), which will divide by zero when rope_dims is 2 (and behaves poorly when rope_dims <= 2). Add validation that rope_dims is either 0 (meaning full head_dim) or an even value >= 4 to avoid runtime errors.
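A minimal guard along these lines (hypothetical helper name; in practice the check would live in `GPT.__init__` or `Rotary.__init__`), which also covers the even/`<= head_dim` constraints raised in the next comment:

```python
def validate_rope_dims(rope_dims: int, head_dim: int) -> None:
    """Reject rope_dims values that break the rd / (rd - 2) rescaling exponent."""
    if rope_dims == 0:
        return  # 0 is the sentinel for "rotate the full head_dim"
    if rope_dims % 2 != 0:
        raise ValueError(f"rope_dims must be even, got {rope_dims}")
    if rope_dims < 4:
        raise ValueError(f"rope_dims must be >= 4 so rd / (rd - 2) is finite, got {rope_dims}")
    if rope_dims > head_dim:
        raise ValueError(f"rope_dims ({rope_dims}) must not exceed head_dim ({head_dim})")

validate_rope_dims(32, 64)  # passes silently
```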

Comment on lines +592 to +597
```python
if rope_dims > 0:
    head_dim = model_dim // num_heads
    for block in self.blocks:
        block.attn.rope_dims = rope_dims
        block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims)
self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else []
```

Copilot AI Mar 23, 2026


When rope_dims > 0, it’s applied without checking it’s even and <= head_dim. If rope_dims is odd or larger than head_dim, Rotary will build cos/sin tables with incompatible last-dimension sizes and apply_rotary_emb() can error at runtime. Consider validating 0 < rope_dims <= head_dim and rope_dims % 2 == 0 up front (e.g., in GPT.__init__).

Comment on lines 433 to +436
```diff
         self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32))
-        self.rotary = Rotary(self.head_dim, base=rope_base)
-
-    def forward(self, x: Tensor) -> Tensor:
+        self.rope_dims = 0
+        self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024)
+        self.use_xsa = False
```

Copilot AI Mar 23, 2026


CausalSelfAttention hard-codes train_seq_len=1024 when constructing Rotary, but the default Hyperparameters.train_seq_len is now 2048. This means RoPE will always take the “seq_len > train_seq_len” scaling branch during training/eval at 2048, which is easy to do unintentionally. If the scaling should be tied to the actual training context length, thread args.train_seq_len (or a dedicated hyperparameter) into Rotary construction.

Comment on lines +494 to +500
```python
def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```

Copilot AI Mar 23, 2026


BigramHashEmbedding.bigram_hash() sets mod = bigram_vocab_size - 1 and then does % mod, which will raise a division-by-zero error if BIGRAM_VOCAB_SIZE is 1 (and produces negative indices if it’s 0 but still instantiated). Add input validation in BigramHashEmbedding.__init__ (or where it’s constructed) to require bigram_vocab_size >= 2 when enabled.
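A pure-Python sketch of the same hash for a single (prev, curr) token pair, with the guard this comment asks for (the function name is illustrative; the tensor version in the hunk above is the real implementation):

```python
def bigram_hash_index(prev: int, curr: int, bigram_vocab_size: int = 8192) -> int:
    # Guard from the review: mod = bigram_vocab_size - 1 must be >= 1,
    # otherwise the % below divides by zero.
    if bigram_vocab_size < 2:
        raise ValueError(f"bigram_vocab_size must be >= 2, got {bigram_vocab_size}")
    mod = bigram_vocab_size - 1
    # Python's % with a positive modulus is non-negative (matching
    # torch.remainder semantics), so the result is a valid index in [0, mod).
    return ((36313 * curr) ^ (27191 * prev)) % mod

idx = bigram_hash_index(7, 42)
assert 0 <= idx < 8191
```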

```python
if CastedLinear._qat_enabled and self.training and w.ndim == 2:
    with torch.no_grad():
        w32 = self.weight.float()
        row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
```

Copilot AI Mar 23, 2026


The QAT path in CastedLinear.forward() computes torch.quantile(..., dim=1) for every 2D weight on every forward pass when enabled. This is extremely expensive (quantile involves sorting / heavy reductions) and is likely to blow the 10-minute training budget once late-QAT turns on. Consider precomputing per-row clip values periodically, using a cheaper statistic (e.g., amax / EMA), or restricting QAT to a small subset of layers.

Suggested change

```diff
-row_clip = torch.quantile(w32.abs(), 0.9995, dim=1)
+# Use a cheaper per-row amax-based EMA for clipping instead of per-forward quantile.
+if not hasattr(self, "_qat_row_clip_ema"):
+    # Initialize EMA with current per-row maximum absolute values.
+    self._qat_row_clip_ema = w32.abs().amax(dim=1)
+else:
+    current = w32.abs().amax(dim=1)
+    ema_decay = 0.9
+    self._qat_row_clip_ema.mul_(ema_decay).add_(current, alpha=1.0 - ema_decay)
+row_clip = self._qat_row_clip_ema
```


```python
with torch.no_grad():
    for name, t in base_model.state_dict().items():
        ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
```

Copilot AI Mar 23, 2026


ema_state is initialized and updated as GPU float32 copies of the full state_dict() each step. This can significantly increase GPU memory usage (potentially causing OOM) and adds per-step overhead. Consider keeping EMA on CPU (and updating from .detach().cpu()), or limiting EMA to a subset of parameters.

Suggested change

```diff
-ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+# Keep EMA state on CPU to avoid maintaining a full float32 copy on GPU.
+ema_t = ema_state[name]
+if ema_t.device != torch.device("cpu"):
+    ema_t = ema_t.detach().cpu()
+    ema_state[name] = ema_t
+t_cpu = t.detach().float().cpu()
+ema_t.mul_(ema_decay).add_(t_cpu, alpha=1.0 - ema_decay)
```

Comment on lines +1061 to +1068
```python
if not t.is_floating_point() or t.numel() <= 65536:
    result[name] = t.to(torch.float16) if t.is_floating_point() else t
    meta[name] = "passthrough"
    continue
if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
    result[name] = t.float()
    meta[name] = "passthrough_ctrl"
    continue
```

Copilot AI Mar 23, 2026


In mixed_quantize_int6_gptq, the early branch if not t.is_floating_point() or t.numel() <= 65536 runs before the control-tensor check, so small control tensors (matching CONTROL_TENSOR_NAME_PATTERNS) will be downcast to fp16 and never get the intended passthrough_ctrl handling. If control tensors are meant to stay fp32 for stability/accuracy, check CONTROL_TENSOR_NAME_PATTERNS before the numel() <= 65536 passthrough/downcast logic.

Suggested change

```diff
-if not t.is_floating_point() or t.numel() <= 65536:
-    result[name] = t.to(torch.float16) if t.is_floating_point() else t
-    meta[name] = "passthrough"
-    continue
-if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
-    result[name] = t.float()
-    meta[name] = "passthrough_ctrl"
-    continue
+# Non-floating tensors are always passed through as-is.
+if not t.is_floating_point():
+    result[name] = t
+    meta[name] = "passthrough"
+    continue
+# Control tensors are kept in float32 for stability/accuracy, regardless of size.
+if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
+    result[name] = t.float()
+    meta[name] = "passthrough_ctrl"
+    continue
+# Small floating-point tensors are passed through in float16.
+if t.numel() <= 65536:
+    result[name] = t.to(torch.float16)
+    meta[name] = "passthrough"
+    continue
```

Comment on lines +937 to +956
```python
def quantize_int6_per_row(t: Tensor, clip_range: int = 15) -> tuple[Tensor, Tensor]:
    t32 = t.float()
    if t32.ndim == 2:
        best_q, best_s, best_err = None, None, float('inf')
        for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]:
            if pct < 1.0:
                row_clip = torch.quantile(t32.abs(), pct, dim=1)
            else:
                row_clip = t32.abs().amax(dim=1)
            s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16)
            q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8)
            recon = q.float() * s.float()[:, None]
            err = (t32 - recon).pow(2).mean().item()
            if err < best_err:
                best_q, best_s, best_err = q, s, err
        return best_q, best_s
    amax = t32.abs().max().item()
    scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16)
    q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8)
    return q, scale
```

Copilot AI Mar 23, 2026


Several helpers and artifacts are named int6 (e.g., quantize_int6_per_row, final_model.int6.ptz), but the PR description/docstring calls this “int5 GPTQ”. Since clip_range=15 yields 31 signed levels (an int5-like scheme), please rename the functions/files/metadata to match the actual quantization format to avoid confusion for readers and future tooling.
