
Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) #312

Open

chanwoo-park-official wants to merge 3 commits into openai:main from chanwoo-park-official:feat/canon-fastconv-acd-report


Conversation

@chanwoo-park-official

Summary

This PR reports a standalone run with Canon ACD (CANON_SET=ACD, CANON_KERNEL=3) plus mixed int6 quantization (INT6_CATEGORIES=mlp,attn).

Approach

  • Model: 9-layer decoder-only Transformer, model_dim=512, num_heads=8, num_kv_heads=4, mlp_mult=3.0
  • MLP: ReLU-squared style MLP (repo default)
  • Context extras: Bigram hash embedding (bigram_vocab_size=2048, bigram_dim=128) + SmearGate
  • Quantization: mixed PTQ, mlp/attn=int6, other large tensors int8
  • Optimizer: mixed Muon + Adam
  • Schedule: momentum warmup (0.92 -> 0.99), warmdown (WARMDOWN_ITERS=3000), SWA near end
  • Eval: both roundtrip and sliding-window (EVAL_STRIDE=64); sliding bpb is main comparison
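The mixed PTQ step above can be illustrated with a minimal symmetric per-tensor quantizer. This is a sketch of the general int6 recipe, not the repo's actual code: the clipping range [-31, 31], per-tensor scaling, and the function names are assumptions.

```python
def quantize_symmetric(weights, bits=6):
    """Symmetric per-tensor PTQ: scale so the largest |w| maps to the
    signed extreme, then round and clip to [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = (1 << (bits - 1)) - 1  # 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_symmetric(w)                      # ints in [-31, 31]
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))   # bounded by scale / 2
```

Under this kind of scheme an int6 tensor costs 6 bits per weight plus one scale, which is where the savings over int8 in the serialized model come from.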

Canon Placement

  • A: before attention
  • B: on concatenated QKV (most expensive)
  • C: before MLP
  • D: in widened MLP hidden stream
  • This run uses ACD, keeping the Canon effect while avoiding the cost of placement B
  • Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
    Zeyuan Allen-Zhu (2025), full version: https://ssrn.com/abstract=5240330
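A Canon layer is, in essence, a short depthwise causal convolution over the last few tokens, added back into the stream. Below is a minimal single-channel sketch consistent with CANON_KERNEL=3, CANON_RESIDUAL=1, and CANON_ACTIVATION=0 (linear, with residual); the function name and list-of-scalars interface are illustrative, not the repo's API.

```python
def canon_channel(x, w, residual=True):
    """Causal conv over one channel: y[t] = x[t] + sum_k w[k] * x[t-k],
    with implicit zero padding on the left so no future token leaks in.
    A real Canon layer applies this depthwise, one filter per channel."""
    out = []
    for t in range(len(x)):
        acc = sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        out.append(x[t] + acc if residual else acc)
    return out

x = [1.0, 2.0, 3.0, 4.0]
y = canon_channel(x, w=[0.5, 0.25, 0.125])  # kernel size 3
```

Causality is the key property: perturbing a later token leaves earlier outputs unchanged, which is what makes the layer safe at every placement in an autoregressive decoder.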

Config Highlights

  • torchrun --nproc_per_node=8
  • TRAIN_BATCH_TOKENS=524288, TRAIN_SEQ_LEN=2048
  • EVAL_SEQ_LEN=2048, EVAL_STRIDE=64, EVAL_BATCH_SEQS=32
  • MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
  • MUON_WEIGHT_DECAY=0.04, ADAM_WEIGHT_DECAY=0.04
  • SWA_ENABLED=1, SWA_EVERY=200, SWA_START_LRMUL=0.5
  • ITERATIONS=7200, WARMUP_STEPS=20, WARMDOWN_ITERS=3000, MAX_WALLCLOCK_SECONDS=600
  • VOCAB_SIZE=1024, SEED=1337
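How EVAL_SEQ_LEN=2048 combines with EVAL_STRIDE=64 can be sketched as a scoring plan. This is the standard strided-perplexity scheme (each token scored exactly once, with up to a full window of left context); the function name and tuple layout are assumptions, not the repo's internals.

```python
def sliding_eval_plan(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, ctx_end, n_scored) per forward pass: the window
    advances by `stride`, the model reads up to `seq_len` tokens, and only
    tokens not already scored by a previous window contribute to the loss."""
    plan, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        plan.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return plan

plan = sliding_eval_plan(10_000)
total_scored = sum(n for _, _, n in plan)  # every token scored exactly once
```

Each 2048-token forward pass scores only ~64 new tokens, so sliding eval costs far more compute per scored token than a roundtrip pass, but gives every token near-full context; that is why the sliding bpb is the headline comparison.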

Results

  • final_int6_sliding_window val_bpb (stride=64): 1.16682362
  • Serialized int6 model: 13,196,032 bytes
  • Code size (train_gpt.py): 71,315 bytes
  • Total submission size: 13,267,347 bytes (<16MB)
  • SWA checkpoints averaged: 8
  • Data loading overhead: data_loading_step_avg=0.64ms
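The "SWA checkpoints averaged: 8" figure corresponds to the usual equal-weight running average over snapshots (here taken every SWA_EVERY=200 steps near the end of training). A minimal sketch, with made-up flat parameter lists standing in for checkpoints:

```python
def swa_update(swa, ckpt, n_averaged):
    """Fold one more checkpoint into the running mean: after the update,
    `swa` is the elementwise arithmetic mean of n_averaged + 1 checkpoints."""
    return [s + (c - s) / (n_averaged + 1) for s, c in zip(swa, ckpt)]

# eight snapshots of a 2-parameter "model" (hypothetical values)
checkpoints = [[float(i), 2.0 * i] for i in range(8)]
swa = list(checkpoints[0])
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)
# swa now equals the elementwise mean of all eight checkpoints
```

The incremental form avoids keeping all snapshots in memory: only the running average and a count are stored between SWA updates.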

Repro

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_acd_k3_8gpu \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1337 \
  TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7200 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  SWA_ENABLED=1 SWA_EVERY=200 SWA_START_LRMUL=0.5 \
  INT6_CATEGORIES=mlp,attn \
  CANON_SET=ACD CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

@chanwoo-park-official changed the title from "Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)" to "Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)" on Mar 21, 2026