Non-record: 11L Int6 QAT + SmearGate + SWA(0.4) + WD=0.04 (3-seed mean val_bpb=1.1488)#385

Open
dentity007 wants to merge 1 commit into openai:main from NathanMaine:submission/11L-SWA04-WD04-NathanMaine
Conversation

@dentity007

Summary

Mean val_bpb = 1.1488 (3-seed verified, std=0.0006)

Builds on @baudrillardsgh0st's technique stack (PR #194). Two hyperparameter changes informed by 30+ experiments:

  1. Muon WD=0.04 (vs 0.038) — improves int6 quantization quality
  2. SWA_START_FRAC=0.4 (vs 0.5) — 33% more checkpoint diversity for smoother weight averaging
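To illustrate the second change, here is a minimal sketch of how lowering SWA_START_FRAC widens the window of checkpoints that enter the stochastic weight average. The checkpoint cadence (`ckpt_every=100`) is a hypothetical value for illustration, not taken from the submission's `train_gpt.py`:

```python
def swa_checkpoint_steps(total_steps, start_frac, ckpt_every):
    """Steps at which a checkpoint enters the SWA running average."""
    start = int(total_steps * start_frac)
    return [s for s in range(0, total_steps + 1, ckpt_every) if s >= start]

total_steps = 8390  # roughly the per-seed step counts reported below
old_window = swa_checkpoint_steps(total_steps, 0.5, ckpt_every=100)
new_window = swa_checkpoint_steps(total_steps, 0.4, ckpt_every=100)
# Starting the average at 40% instead of 50% of training admits more
# checkpoints, so the final weights average over a more diverse set.
print(len(old_window), len(new_window))
```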
| Seed | val_bpb | Steps | ms/step | Artifact |
|------|---------|-------|---------|----------|
| 42   | 1.1482  | 8,393 | 71.79   | 15.2 MB  |
| 7    | 1.1489  | 8,380 | 71.59   | 15.2 MB  |
| 1337 | 1.1494  | 8,390 | 71.51   | 15.3 MB  |
| **Mean** | **1.1488** | | | |

Std: 0.0006 (4× tighter than SOTA's 0.0024)

Key finding

SWA_START_FRAC=0.4 captures ~33% more checkpoints, producing smoother weight distributions that survive int6 quantization better. Combined with WD=0.04, the gap between full-precision and int6 validation loss is reduced.
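The interaction above can be sketched in isolation. This is an illustrative approximation only, not the submission's actual QAT path: it assumes symmetric per-tensor fake-quantization with a ±31 int6 range and an equal-weight running average over checkpoints.

```python
import numpy as np

def int6_quantize(w: np.ndarray) -> np.ndarray:
    """Round-trip w through signed int6 values and back (fake-quantization)."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale

def swa_average(checkpoints: list) -> np.ndarray:
    """Equal-weight running average, as SWA applies to collected checkpoints."""
    avg = np.zeros_like(checkpoints[0])
    for i, w in enumerate(checkpoints, start=1):
        avg += (w - avg) / i  # incremental mean update
    return avg

rng = np.random.default_rng(0)
ckpts = [rng.normal(0.0, 0.02, size=(64, 64)) for _ in range(8)]
w_swa = swa_average(ckpts)
# Largest per-weight round-off introduced by the int6 round-trip:
gap = np.abs(int6_quantize(w_swa) - w_swa).max()
```

The intuition is that averaging more checkpoints smooths out weight outliers, which shrinks the per-tensor scale and therefore the round-off error each weight suffers under int6.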

Submission checklist

  • 3-seed verification (mean=1.1488, std=0.0006)
  • All artifacts < 16MB (max 15.3MB)
  • Wallclock < 600s on 8×H100
  • Train logs included (3 seeds)
  • Reproducible train_gpt.py included
  • README with detailed explanation

