
GPTQ Int6 + SGD Test-Time Training — A800 1.1190 bpb#610

Open
ChaosCodes wants to merge 1 commit into openai:main from ChaosCodes:submission/gptq-ttt-1119

Conversation

@ChaosCodes

Summary

  • 11-layer 512d GPT with PR#414's 10-technique stack + LeakyReLU(0.5)² activation
  • GPTQ int6 quantization: Hessian-guided column-wise quantization replacing naive per-row rounding, reducing quantization error by 33.6% (Hessian-weighted MSE)
  • SGD test-time training (TTT): Continues training on validation data in a causal (score-first) manner with cosine LR decay, adapting last 9/11 layers
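A minimal NumPy sketch of the Hessian-guided column-wise quantization the summary describes. This is not the PR's code: the function names, the damping term, and the symmetric int6 grid are illustrative assumptions. The key idea is that each column's rounding error is propagated into the not-yet-quantized columns via the inverse Hessian, rather than rounding every column independently:

```python
import numpy as np

def quantize_col(col, n_bits=6):
    """Symmetric round-to-nearest quantization of one column to an int grid."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(col).max() / qmax + 1e-12
    return np.round(col / scale).clip(-qmax - 1, qmax) * scale

def gptq_quantize(W, X, n_bits=6, damp=0.01):
    """GPTQ-style column-wise quantization (sketch).

    W: (out, in) weight matrix; X: (samples, in) calibration activations.
    Columns are quantized left to right; each column's rounding error is
    compensated in the remaining columns using the inverse Hessian.
    """
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(d)  # dampening for invertibility
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = quantize_col(W[:, j], n_bits)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # push this column's rounding error into the unquantized columns
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

The PR's variant additionally uses block-128 updates and 256 calibration samples; this sketch applies the update column by column for clarity.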

Key Results

| Metric | Value |
| --- | --- |
| A800 bpb (GPTQ + TTT) | 1.1190 |
| A800 bpb (GPTQ only) | 1.1214 |
| A800 bpb (sliding window, no TTT) | 1.1243 |
| Estimated H100 bpb | ~1.122 |
| Artifact size | 15,750,888 bytes (98.4% of 16 MB) |
| Training time | 1200 s on 8×A800-SXM4-80GB |

Techniques

Architecture (PR#414 stack): XSA4, EMA, U-Net skip, SmearGate, BigramHash, PartialRoPE, LNScale, ValueEmbed, LateQAT, SWA

Novel contributions:

  1. LeakyReLU(0.5)² replacing GELU² — saves 0.0026 bpb by improving gradient flow
  2. GPTQ int6 — data-dependent quantization using 256 calibration samples, block-128 updates
  3. SGD TTT — Simple SGD (lr=0.002, momentum=0.9) with cosine schedule over 900 chunks of 32K tokens, 3 epochs/chunk
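The score-first TTT loop can be sketched on a toy linear model (the model, the chunk construction, and the helper names here are illustrative assumptions; the PR applies this procedure to the last 9 of 11 GPT layers). The causal ordering matters: each chunk is scored with the current weights *before* any gradient step touches it, so no label information leaks into its own score:

```python
import math
import numpy as np

def cosine_lr(step, total, lr_max=0.002, lr_min=0.0):
    """Cosine decay from lr_max at step 0 toward lr_min at the last step."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total))

def ttt_eval(chunks, w, lr_max=0.002, momentum=0.9, epochs=3):
    """Score-first test-time training over a stream of (X, y) chunks.

    Returns the per-chunk losses (each computed before adapting on that
    chunk) and the final weights.
    """
    v = np.zeros_like(w)
    losses = []
    n = len(chunks)
    for i, (X, y) in enumerate(chunks):
        # 1) score this chunk with the current weights (causal)
        losses.append(float(np.mean((X @ w - y) ** 2)))
        # 2) adapt: SGD with momentum, cosine-decayed LR over the stream
        lr = cosine_lr(i, n, lr_max)
        for _ in range(epochs):
            grad = 2.0 * X.T @ (X @ w - y) / len(y)
            v = momentum * v - lr * grad
            w = w + v
    return losses, w
```

The reported bpb is the average of the pre-adaptation scores, which is why TTT can only help relative to the frozen model if later chunks resemble earlier ones.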

Compression

zstd level-21 with long-distance matching (LDM) for model artifact compression.
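Assuming the stock zstd CLI, the compression step might look like the following (file names are placeholders, not the PR's paths; levels above 19 require `--ultra`, and `--long` enables long-distance matching):

```shell
# Stand-in artifact (in practice this is the ~15.75 MB model file).
head -c 65536 /dev/urandom > model_artifact.bin

# Level 21 with a 2^27-byte LDM window; decompress with the same --long flag.
zstd --ultra -21 --long=27 -f model_artifact.bin -o model_artifact.bin.zst

# Round-trip check
zstd -d --long=27 -f model_artifact.bin.zst -o model_artifact.check.bin
cmp model_artifact.bin model_artifact.check.bin
```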

Files

  • train_gpt.py — Full training + GPTQ + TTT evaluation pipeline
  • eval_gptq.py — Standalone GPTQ evaluation script
  • eval_ttt.py — Standalone TTT evaluation script
  • submission.json — Structured results metadata
  • train.log — Complete training log
  • README.md — Detailed writeup with technique descriptions and ablations

See records/track_10min_16mb/2026-03-24_GPTQ_TTT/README.md for full details.

11-layer 512d GPT with PR#414 10-technique stack + LeakyReLU² activation,
post-training GPTQ int6 quantization, and SGD test-time training with
cosine LR decay. Artifact size: 15.75MB (under 16MB limit).

Techniques: XSA4, EMA, U-Net skip, SmearGate, BigramHash, PartialRoPE,
LNScale, ValueEmbed, LateQAT, SWA, LeakyReLU², GPTQ int6, SGD TTT.
