
Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) #312

Open

chanwoo-park-official wants to merge 3 commits into openai:main from chanwoo-park-official:feat/canon-fastconv-acd-report


Conversation

@chanwoo-park-official

Summary

This PR reports a standalone run with Canon ACD (CANON_SET=ACD, CANON_KERNEL=3) plus mixed int6 quantization (INT6_CATEGORIES=mlp,attn).

Approach

  • Model: 9-layer decoder-only Transformer, model_dim=512, num_heads=8, num_kv_heads=4, mlp_mult=3.0
  • MLP: ReLU-squared style MLP (repo default)
  • Context extras: Bigram hash embedding (bigram_vocab_size=2048, bigram_dim=128) + SmearGate
  • Quantization: mixed PTQ, mlp/attn=int6, other large tensors int8
  • Optimizer: mixed Muon + Adam
  • Schedule: momentum warmup (0.92 -> 0.99), warmdown (WARMDOWN_ITERS=3000), SWA near end
  • Eval: both roundtrip and sliding-window (EVAL_STRIDE=64); sliding bpb is main comparison
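The mixed PTQ step above can be illustrated with a minimal symmetric per-tensor quantizer. This is a sketch of the general int6 recipe, not the repo's actual code: the clipping range [-31, 31], per-tensor scaling, and the function names are assumptions.

```python
def quantize_symmetric(weights, bits=6):
    """Symmetric per-tensor PTQ: scale so the largest |w| maps to the
    signed extreme, then round and clip to [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = (1 << (bits - 1)) - 1  # 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_symmetric(w)                      # ints in [-31, 31]
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))   # bounded by scale / 2
```

Under this kind of scheme an int6 tensor costs 6 bits per weight plus one scale, which is where the savings over int8 in the serialized model come from.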

Canon Placement

  • A: before attention
  • B: on concatenated QKV (most expensive)
  • C: before MLP
  • D: in widened MLP hidden stream
  • This run uses ACD, keeping the Canon effect while avoiding the cost of placement B
  • Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
    Zeyuan Allen-Zhu (2025), full version: https://ssrn.com/abstract=5240330
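A Canon layer is, in essence, a short depthwise causal convolution over the last few tokens, added back into the stream. Below is a minimal single-channel sketch consistent with CANON_KERNEL=3, CANON_RESIDUAL=1, and CANON_ACTIVATION=0 (linear, with residual); the function name and list-of-scalars interface are illustrative, not the repo's API.

```python
def canon_channel(x, w, residual=True):
    """Causal conv over one channel: y[t] = x[t] + sum_k w[k] * x[t-k],
    with implicit zero padding on the left so no future token leaks in.
    A real Canon layer applies this depthwise, one filter per channel."""
    out = []
    for t in range(len(x)):
        acc = sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        out.append(x[t] + acc if residual else acc)
    return out

x = [1.0, 2.0, 3.0, 4.0]
y = canon_channel(x, w=[0.5, 0.25, 0.125])  # kernel size 3
```

Causality is the key property: perturbing a later token leaves earlier outputs unchanged, which is what makes the layer safe at every placement in an autoregressive decoder.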

Config Highlights

  • torchrun --nproc_per_node=8
  • TRAIN_BATCH_TOKENS=524288, TRAIN_SEQ_LEN=2048
  • EVAL_SEQ_LEN=2048, EVAL_STRIDE=64, EVAL_BATCH_SEQS=32
  • MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
  • MUON_WEIGHT_DECAY=0.04, ADAM_WEIGHT_DECAY=0.04
  • SWA_ENABLED=1, SWA_EVERY=200, SWA_START_LRMUL=0.5
  • ITERATIONS=7200, WARMUP_STEPS=20, WARMDOWN_ITERS=3000, MAX_WALLCLOCK_SECONDS=600
  • VOCAB_SIZE=1024, SEED=1337
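How EVAL_SEQ_LEN=2048 combines with EVAL_STRIDE=64 can be sketched as a scoring plan. This is the standard strided-perplexity scheme (each token scored exactly once, with up to a full window of left context); the function name and tuple layout are assumptions, not the repo's internals.

```python
def sliding_eval_plan(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, ctx_end, n_scored) per forward pass: the window
    advances by `stride`, the model reads up to `seq_len` tokens, and only
    tokens not already scored by a previous window contribute to the loss."""
    plan, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        plan.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return plan

plan = sliding_eval_plan(10_000)
total_scored = sum(n for _, _, n in plan)  # every token scored exactly once
```

Each 2048-token forward pass scores only ~64 new tokens, so sliding eval costs far more compute per scored token than a roundtrip pass, but gives every token near-full context; that is why the sliding bpb is the headline comparison.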

Results

  • final_int6_sliding_window val_bpb (stride=64): 1.16682362
  • Serialized int6 model: 13,196,032 bytes
  • Code size (train_gpt.py): 71,315 bytes
  • Total submission size: 13,267,347 bytes (<16MB)
  • SWA checkpoints averaged: 8
  • Data loading overhead: data_loading_step_avg=0.64ms
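The "SWA checkpoints averaged: 8" figure corresponds to the usual equal-weight running average over snapshots (here taken every SWA_EVERY=200 steps near the end of training). A minimal sketch, with made-up flat parameter lists standing in for checkpoints:

```python
def swa_update(swa, ckpt, n_averaged):
    """Fold one more checkpoint into the running mean: after the update,
    `swa` is the elementwise arithmetic mean of n_averaged + 1 checkpoints."""
    return [s + (c - s) / (n_averaged + 1) for s, c in zip(swa, ckpt)]

# eight snapshots of a 2-parameter "model" (hypothetical values)
checkpoints = [[float(i), 2.0 * i] for i in range(8)]
swa = list(checkpoints[0])
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)
# swa now equals the elementwise mean of all eight checkpoints
```

The incremental form avoids keeping all snapshots in memory: only the running average and a count are stored between SWA updates.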

Repro

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_acd_k3_8gpu \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1337 \
  TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7200 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  SWA_ENABLED=1 SWA_EVERY=200 SWA_START_LRMUL=0.5 \
  INT6_CATEGORIES=mlp,attn \
  CANON_SET=ACD CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

@chanwoo-park-official changed the title from "Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)" to "Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)" on Mar 21, 2026