Record: 10L d=512 Int5-MLP Int6-Attn sp1024 (val_bpb=1.1508) #465
Open
LoquiAuris wants to merge 1 commit into openai:main from
Record Submission: 10L d=512 Int5-MLP + Int6-Attn + BigramHash + SmearGate
Author: Loqui Auris ([@LoquiAuris](https://github.com/loquiauris))
val_bpb: 1.1508 (mean of 3 seeds, std=0.00012)
Artifact size: 15,680,288 bytes (15.68 MB)
Training time: ~10 minutes on 8×H100
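The headline mean and spread can be re-derived from the three per-seed scores reported under Results. The sketch below assumes the quoted std is a sample standard deviation (n - 1 denominator); the source does not say which estimator was used, and the per-seed values are already rounded to five decimals, so the recomputed spread only approximately matches the quoted 0.00012.

```python
import statistics

# Post-quant val_bpb per seed, copied from the Results table.
seed_bpb = {42: 1.15097, 1337: 1.15077, 2024: 1.15074}
scores = list(seed_bpb.values())

mean_bpb = statistics.mean(scores)
# Sample standard deviation; the choice of estimator is an assumption.
std_bpb = statistics.stdev(scores)

print(f"mean val_bpb = {mean_bpb:.5f}")
print(f"sample std   = {std_bpb:.6f}")
```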
Results
| Seed | Post-quant val_bpb | Artifact (bytes) |
| --- | --- | --- |
| 42 | 1.15097 | 15,680,288 |
| 1337 | 1.15077 | 15,654,632 |
| 2024 | 1.15074 | 15,639,761 |
| Mean | 1.15083 ±0.00012 | - |
Approach
Architecture
Standard PR #162 transformer stack with the following configuration:
10 layers, d_model=512, 8 attention heads, 4 KV heads (GQA)
3× FFN expansion (hidden=1536) with ReLU² activation
SmearGate: learned blend with previous token representation
BigramHash: 4096 buckets, dim=128, projected to 512
U-Net skip connections between symmetric layer pairs
RMSNorm, logit softcap=30.0, orthogonal initialization
RoPE positional encoding (persistent=False)
Tied embeddings via F.linear(x, tok_emb.weight)
Vocabulary: sp1024 (1,024 BPE tokens)
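To make the size of this configuration concrete, here is a rough count of the weight-matrix parameters it implies. This is a sketch under stated assumptions, not a readout from train_gpt.py: it assumes no biases, head_dim = d_model / n_heads, an ungated two-matrix MLP, and tied embeddings counted once. It ignores norms, gates, and skip-connection weights.

```python
# Rough weight-matrix count for the configuration listed above.
# Assumptions (not confirmed by the source): no biases, head_dim = d_model / n_heads,
# an ungated two-matrix MLP, and tied input/output embeddings counted once.
d_model, n_layers = 512, 10
n_heads, n_kv_heads = 8, 4
head_dim = d_model // n_heads           # per-head width
kv_dim = n_kv_heads * head_dim          # GQA: K and V are narrower than Q
ffn_hidden = 3 * d_model                # 3x FFN expansion
vocab_size = 1024
bigram_buckets, bigram_dim = 4096, 128

attn_params = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, then K and V
mlp_params = 2 * d_model * ffn_hidden                       # up and down projections
block_params = n_layers * (attn_params + mlp_params)
embed_params = vocab_size * d_model
bigram_params = bigram_buckets * bigram_dim + bigram_dim * d_model  # table + projection
total_params = block_params + embed_params + bigram_params

print(f"attention per layer: {attn_params:,}")
print(f"MLP per layer:       {mlp_params:,}")
print(f"total (matrices):    {total_params:,}")
```

Under these assumptions the MLP matrices hold roughly twice as many parameters as the attention matrices, so the bit width chosen for the MLP has the larger effect on artifact size.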
Training
Optimizer: Muon (matrix_lr=0.02, momentum=0.99 with warmup from 0.92 over 1500 steps) + AdamW for embeddings and scalars
Weight decay: 0.04 (Muon), 0.01 (AdamW)
Gradient clipping: 0.3
Sequence length: 2048
Batch size: 786,432 tokens
Warmup: 20 steps
Warmdown: 3000 iterations (cosine schedule)
SWA: start_frac=0.5, checkpoint every 50 steps, 29 checkpoints averaged
Steps completed: ~7,600 in 10 minutes
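The momentum warmup and the token throughput implied by these settings can be sketched as follows. The function name `muon_momentum` is hypothetical, and the linear shape of the warmup is an assumption; the source only gives the start value, end value, and step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, assuming a linear ramp from start to end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


batch_tokens = 786_432
seq_len = 2048
steps_completed = 7_600  # approximate figure from the list above

sequences_per_batch = batch_tokens // seq_len
tokens_seen = steps_completed * batch_tokens

print(f"momentum at step 0/750/1500: "
      f"{muon_momentum(0):.3f} / {muon_momentum(750):.3f} / {muon_momentum(1500):.3f}")
print(f"sequences per batch: {sequences_per_batch}")
print(f"tokens seen in the run: {tokens_seen:,}")
```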
Quantization & Compression
MLP weights: Int5 per-row symmetric (clip=15)
Attention weights: Int6 per-row symmetric (clip=31)
Embeddings: FP16 passthrough
Norms, gates, control tensors: FP16 passthrough
Compression: zstd level 22
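A minimal pure-Python sketch of per-row symmetric quantization may help illustrate the scheme. It assumes `clip` is the largest integer magnitude, so values land in [-clip, clip], with one floating-point scale per row; the actual script may use a different range (for example [-16, 15] for Int5) or a different scale rule. The function names and the tiny weight matrix are hypothetical.

```python
def quantize_row(row, clip):
    """Quantize one row to integers in [-clip, clip] with a single scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / clip if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-clip, min(clip, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate weights from integers and the row scale."""
    return [v * scale for v in q]


def quantize_matrix(matrix, clip):
    """Apply quantize_row independently to every row."""
    return [quantize_row(row, clip) for row in matrix]


# Toy weights: the second row is much smaller, which is why per-row scales help.
weights = [[0.31, -0.12, 0.05, -0.44], [0.002, 0.001, -0.003, 0.0]]
int5 = quantize_matrix(weights, clip=15)  # MLP-style setting
int6 = quantize_matrix(weights, clip=31)  # attention-style setting

for (q, scale), row in zip(int5, weights):
    restored = dequantize_row(q, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(q, f"scale={scale:.6f}", f"max_err={worst:.6f}")
```

With round-to-nearest, the reconstruction error of any weight is at most half the row scale, so the wider Int6 grid roughly halves the worst-case error relative to Int5.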
Evaluation
Sliding window with stride=64, seq_len=2048
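One common way to run sliding-window evaluation is to score every token exactly once while giving most tokens nearly a full window of left context. The sketch below shows only the window bookkeeping, under the assumption that each new window scores just its final `stride` tokens; the helper `sliding_windows` is hypothetical, and it ignores the one-token shift between inputs and targets.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only tokens[score_from:end]
    contribute to the loss, so no token is counted twice.
    """
    end = min(seq_len, n_tokens)
    windows = [(0, end, 0)]  # the first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - seq_len)
        windows.append((start, new_end, end))
        end = new_end
    return windows


windows = sliding_windows(5000)
scored = sum(end - score_from for _, end, score_from in windows)
print(f"{len(windows)} windows, {scored} tokens scored")
```

The trade-off is cost: a smaller stride gives each scored token more context but requires more forward passes over the same data.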
Key Finding: Int6 Embedding Quantization
During development, we explored using sp8192 (8,192-token vocabulary) to improve tokenizer efficiency. The sp8192 tokenizer encodes 3.79 bytes/token versus 2.44 for sp1024, about 55% more bytes per token, which lowers bits-per-byte for any given per-token loss.
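The link between tokenizer efficiency and the metric follows from the standard conversion from per-token cross-entropy to bits per byte. The sketch below uses a made-up per-token loss purely for illustration; in practice a larger vocabulary usually raises the per-token loss, so the full 55% is not realized.

```python
import math


def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert mean cross-entropy (nats per token) into bits per byte."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token


bytes_per_token = {"sp1024": 2.44, "sp8192": 3.79}
efficiency_gain = bytes_per_token["sp8192"] / bytes_per_token["sp1024"] - 1.0

hypothetical_loss = 2.0  # nats per token, illustrative only
for name, bpt in bytes_per_token.items():
    print(name, f"{bits_per_byte(hypothetical_loss, bpt):.4f} bpb at equal per-token loss")
print(f"bytes-per-token gain: {efficiency_gain:.1%}")
```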
The challenge: sp8192's embedding table at d=512 costs 8.39 MB in FP16, consuming over half the 16 MB budget and limiting the model to 6-8 layers.
We discovered that embedding tables can be quantized to Int6 (6-bit per-row symmetric) with negligible quality loss:
| Embed quantization | val_bpb | Penalty vs FP16 |
| --- | --- | --- |
| FP16 (baseline) | 2.2352 | - |
| Int8 | 2.2354 | +0.0002 |
| Int6 | 2.2357 | +0.0005 |
A penalty of +0.0005 bpb is within noise. This enabled sp8192 at d=512 — a combination previously considered impossible under the 16 MB constraint.
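The size figures behind this finding can be checked with simple arithmetic. The sketch assumes dense bit-packing and one FP16 scale per row, and it reports raw sizes before zstd; the real artifact may store values differently (for example in 8-bit containers that compress well), so treat the Int6 number as an estimate. The helper `table_bytes` is hypothetical.

```python
def table_bytes(vocab, d_model, bits, scale_bytes_per_row=0):
    """Raw size of an embedding table at a given bit width, plus row scales."""
    return vocab * d_model * bits // 8 + vocab * scale_bytes_per_row


sp8192_fp16 = table_bytes(8192, 512, bits=16)
sp8192_int6 = table_bytes(8192, 512, bits=6, scale_bytes_per_row=2)
sp1024_fp16 = table_bytes(1024, 512, bits=16)

for label, size in [("sp8192 FP16", sp8192_fp16),
                    ("sp8192 Int6", sp8192_int6),
                    ("sp1024 FP16", sp1024_fp16)]:
    print(f"{label}: {size:,} bytes ({size / 1e6:.2f} MB)")
```

Under these assumptions the sp8192 table shrinks from about 8.39 MB to about 3.16 MB, which is the space that makes additional layers possible.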
sp8192 + Int6 Embed Results (H100)
| Config | Post-quant bpb | Artifact | Headroom |
| --- | --- | --- | --- |
| sp8192 d=512 6L Int6-embed | 1.2010 | 11.97 MB | 4.0 MB |
| sp8192 d=512 7L Int6-embed | 1.1863 | 13.57 MB | 2.4 MB |
| sp8192 d=512 8L Int6-embed | 1.1794 | 14.99 MB | 1.0 MB |
| sp8192 d=384 9L FP16-embed | 1.1889 | 12.63 MB | 3.4 MB |
| sp1024 d=512 10L (this submission) | 1.1510 | 15.68 MB | 0.3 MB |
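The headroom column is the budget minus the artifact size. The short check below assumes the 16 MB limit is measured in decimal megabytes, which matches how 15,680,288 bytes is reported as 15.68 MB; it also picks out the configuration with the lowest post-quant bpb.

```python
BUDGET_MB = 16.0  # assumed to be decimal megabytes

# (post-quant bpb, artifact size in MB), copied from the table above.
configs = {
    "sp8192 d=512 6L Int6-embed": (1.2010, 11.97),
    "sp8192 d=512 7L Int6-embed": (1.1863, 13.57),
    "sp8192 d=512 8L Int6-embed": (1.1794, 14.99),
    "sp8192 d=384 9L FP16-embed": (1.1889, 12.63),
    "sp1024 d=512 10L": (1.1510, 15.68),
}

headroom = {name: round(BUDGET_MB - size, 1) for name, (_, size) in configs.items()}
best = min(configs, key=lambda name: configs[name][0])

for name, room in headroom.items():
    print(f"{name}: {room} MB headroom")
print(f"lowest bpb: {best}")
```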
Despite the tokenizer efficiency advantage, sp1024 with 10 layers at full d=512 width outperformed all sp8192 configurations. The layer count advantage (10L vs 6-8L) at d=512 exceeds the tokenizer efficiency gain on H100 with full training.
However, the Int6 embedding finding remains significant: it enables large-vocabulary models within severe artifact constraints and may prove valuable as quantization techniques improve and more layers become feasible at larger vocab sizes.
Development Process
This submission was developed through systematic architecture search:
Tokenizer exploration: Tested sp1024, sp2048, sp4096, sp8192 — identified the embedding size vs model capacity trade-off as the key constraint
Width vs depth analysis: Confirmed d=512 (width) > d=384/448 (narrower + deeper) across all tokenizer sizes at this parameter budget
Int6 embedding discovery: Found that embedding quantization to 6-bit has negligible quality impact (+0.0005 bpb), unlocking large vocabularies at full model width
H100 validation: 8 configurations tested across 2 pod sessions, plus extensive local testing on Apple Silicon (500-step ablations)
Final result: sp1024 d=512 10L produces the best bpb by maximizing layer count at full width within the 16 MB budget
Local Testing Methodology
All architecture decisions were validated through 500-step local runs on Apple Silicon (MPS backend) using AdamW, then confirmed on 8×H100 with the full Muon + SWA + PR #162 stack. Local-to-H100 scaling ratio was approximately 1.85-1.95×.
Hardware & Cost
Training: 8×H100 SXM (RunPod)
Local testing: Apple Silicon (MPS)
Total H100 time: ~2.5 hours across 2 pod sessions
Estimated cost: ~$65 in RunPod credits
Files
train_gpt.py — Complete training script with environment variable configuration
train.log — Training log from seed 42 (primary submission)
submission.json — Submission metadata
README.md — This file