
Record: Int6 + MLP 3x + STE QAT + NorMuon + sliding window (val_bpb 1.1666)#137

Open
abhishekgahlot2 wants to merge 2 commits into openai:main from abhishekgahlot2:record/int6-qat-normuon


Int6 mixed quantization with STE fake-int6 QAT, 3x MLP expansion, NorMuon optimizer, SWA checkpoint averaging, and sliding window eval.

what changed

MLP 3x expansion (hidden=1536): 21.8M params. The extra capacity is paid for by int6 quantization.

STE fake-int6 QAT: weights fake-quantized to int6 via straight-through estimator throughout training. Reduces quantization penalty from ~0.008 to ~0.001 BPB.
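A minimal sketch of the STE fake-quantization step, assuming per-row symmetric int6 (31 levels each side); the function name is my own, not from this PR:

```python
import torch

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    # Per-row symmetric int6 fake quantization: one scale per row,
    # codes clamped to [-31, 31].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 31.0
    q = (w / scale).round().clamp(-31, 31) * scale
    # Straight-through estimator: forward pass sees quantized weights,
    # backward pass treats quantization as the identity.
    return w + (q - w).detach()
```

Training against the quantized forward pass is what shrinks the post-quantization penalty, since the optimizer learns weights that survive rounding.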

NorMuon optimizer: per-neuron row-wise RMS normalization after Newton-Schulz orthogonalization.
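A sketch of the NorMuon-style update direction, assuming the quintic Newton-Schulz iteration used by Muon followed by per-row RMS normalization; coefficients and function names are illustrative, not taken from this PR's code:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating the nearest
    # semi-orthogonal matrix to G (Muon's orthogonalization step).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def normuon_update(G: torch.Tensor) -> torch.Tensor:
    # NorMuon-style step: orthogonalize, then rescale each output row
    # to unit RMS so every neuron receives an equal-magnitude update.
    O = newton_schulz_orthogonalize(G)
    rms = O.pow(2).mean(dim=-1, keepdim=True).sqrt().clamp(min=1e-8)
    return O / rms
```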

SWA checkpoint averaging: collects checkpoints every 200 steps during warmdown and averages them.
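The averaging itself is a uniform mean over the collected checkpoints; a minimal sketch (helper name is mine):

```python
def average_checkpoints(state_dicts):
    # Simple SWA: uniform average of each parameter across checkpoints.
    # Works for any values supporting + and / (floats, numpy, torch).
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}
```

Collecting every 200 steps during warmdown means only late, low-LR checkpoints contribute, which is the standard SWA recipe.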

Mixed quantization: int6 per-row on MLP and attention weights, fp16 passthrough for tied embedding, zstd-22 compression.
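A sketch of the per-row int6 round trip used for the quantized weights (the fp16 embedding bypasses this path, and zstd compression is applied afterward); function names are illustrative:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    # Symmetric per-row int6: one float scale per row, codes in [-31, 31].
    scale = np.maximum(np.abs(w).max(axis=-1, keepdims=True) / 31.0, 1e-8)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int6_per_row(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

The 6-bit codes (stored here in int8 for simplicity; a real artifact would bit-pack them) plus one scale per row are what keep the checkpoint near 15 MB.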

Sliding window eval (stride=64): each token scored with nearly full context.
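The strided evaluation loop can be sketched as follows, with `nll_fn` standing in for a model forward pass that returns the summed negative log-likelihood of the last `n_new` tokens of a chunk given the earlier tokens (both names are placeholders):

```python
def sliding_window_nll(ids, nll_fn, window=2048, stride=64):
    # Advance the context window by `stride` tokens at a time; only the
    # newly covered tokens are scored, so each one (past the first
    # window) is conditioned on nearly `window` tokens of context.
    total_nll, total_tokens, prev_end, begin = 0.0, 0, 0, 0
    while prev_end < len(ids):
        end = min(begin + window, len(ids))
        n_new = end - prev_end
        total_nll += nll_fn(ids[begin:end], n_new)
        total_tokens += n_new
        prev_end = end
        begin += stride
    return total_nll / total_tokens  # mean NLL; bpb follows from byte scaling
```

With stride=64 this costs roughly window/stride = 32 forward passes per window of text, which is where the 156 s eval time comes from.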

seq_len=2048, batch=786K, grad_clip=0.3, matrix_lr=0.02, Muon momentum=0.99, Muon WD=0.01, warmdown=3000 iters, logit softcap=15.

results

8xH100 80GB HBM3 (Modal, 10 min wallclock, seed 1337):

| metric | val_loss | val_bpb | artifact |
| --- | --- | --- | --- |
| pre-quant | 2.007 | 1.1887 | n/a |
| post-quant (standard) | 2.0055 | 1.1877 | 15.22 MB |
| post-quant (sliding window) | 1.9697 | 1.1666 | 15.22 MB |

6,065 steps at 98.9 ms/step. Quantization penalty: 0.001 BPB. Sliding window eval: 156 s.

test plan

  • 8xH100 SXM 80GB, 10 min wallclock
  • final_mixed_roundtrip_exact val_bpb:1.18774689
  • final_sliding_window_exact val_bpb:1.16658140
  • Artifact: 15,216,221 bytes (under 16MB)
  • Full training log included
  • Additional seed runs for p<0.01
