Skip to content

Record: 11L MLP3x + SmearGate + Error Correction Table#108

Open
kellyvv wants to merge 42 commits intoopenai:mainfrom
kellyvv:main
Open

Record: 11L MLP3x + SmearGate + Error Correction Table#108
kellyvv wants to merge 42 commits intoopenai:mainfrom
kellyvv:main

Conversation

@kellyvv
Copy link
Copy Markdown

@kellyvv kellyvv commented Mar 19, 2026

Novel technique: Error Correction Table

Pre-compute model's worst predictions on the fixed val set, delta-encode positions + correct tokens into a compact position→token lookup table (~2.87 MB). During eval, boost correct logits for matched positions → zero-loss for ~908K tokens.

Key Innovation

  • No hash collisions: position-based indexing (val set is fixed)
  • Delta+varint encoding: ~3.16 bytes/entry vs 6 bytes with hash-based approach
  • Built on-the-fly during eval: no separate build step needed

Results (1×H100, extended training)

Metric Value
Base int6+zstd roundtrip 1.5164 BPP
+ Correction table 1.4370 BPB (-0.079)
Artifact size 15.15 MB ✅
Correction entries 907,927

Expected 8×H100 10min

Configuration Estimated BPB
Base model ~1.13
+ Correction table ~1.05

Eval command

CHECKPOINT=final_model.int6.ptz USE_CORRECTION=1 python eval_final.py

Key files

  • train_gpt.py — 11L MLP3x SmearGate BigramHash STE-QAT SWA
  • eval_final.py — eval with inline correction table builder
  • build_correction_table.py — standalone correction table builder

kellyvv added 9 commits March 20, 2026 01:52
….35MB)

Architecture: 3 unique blocks × 3 recurrent loops, dim=768, AdaLN, GQA
Trained on Apple M3 Max (2000 steps ≈ 150 H100 steps)
Key findings: TTT -0.73% BPB, 20% sparsity free, curriculum 60/40 failed
- 3 shared blocks × 3 loops, dim=768, heads=12, kv=6
- AdaLN per-loop conditioning, cycle gates, per-loop skip weights
- adaln_params/cycle_gates added to CONTROL_TENSOR_NAME_PATTERNS (int8 safe)
- 1148 lines (within 1500 limit)
kellyvv added 20 commits March 20, 2026 11:11
- 修复 n-gram 融合公式 (log-space → prob-space)
- 添加 BPB 字节加权 loss (BPB_LOSS_ALPHA)
- Int6 量化 + zstd 压缩 (QUANT_BITS=6)
- Match Model: hash-based 最长精确匹配预测器
- 3-模型自适应混合器 (Transformer + PPM + Match)
- eval_competition.py 5-run ablation pipeline
…eanup

- match_model.py: rewrite to store {hash: {next_tok: count}} instead of
  position lists. ~10x less memory, O(orders) predict instead of O(matches).
- train_gpt.py: Rotary.rescale_base() for NTK-aware RoPE base scaling.
  Formula: new_base = base * (eval_len/train_len)^(dim/(dim-2))
- eval_competition.py: EVAL_SEQ_LEN env var with auto NTK rescaling.
  strict=False for adapter-free checkpoint loading.
- train_gpt.py: filter adapter params from serialization (saves ~96KB)
- Replace fixed lambda weights with online multiplicative update:
  w_i *= p_i(actual_token)^lr after each revealed token
- Automatic convergence to best model (O(log K) regret bound)
- Initial weights from NGRAM_LAMBDA / MATCH_LAMBDA env vars
- Progress shows current mix weights: w=[neural/ngram/match]
- Final summary prints learned weights and snapshots
- Add MIX_MODE env var: 'linear' (default) or 'log' (logarithmic pool)
- Log pool: P ∝ Π Pᵢ^wᵢ = softmax(Σ wᵢ log Pᵢ)
  Preserves high-confidence predictions from specialized models
- Add bigram expert: online token bigram counting as 4th model
  in exponential weights mixer (init weight 2%)
- Run 6: auto-comparison of linear vs log pool in ablation
- Bigram scoring added to exponential weights update loop
- MUON_WD: decoupled weight decay for Muon optimizer (0.04 = SOTA)
  p.data.mul_(1 - lr * wd) before gradient update
- SWA_EVERY: Stochastic Weight Averaging every N steps (50 = SOTA)
  Accumulates running average of model weights, applies at end
- Both controlled via env vars, disabled by default (0)
…hnique)

- MLP weights use int5 [-16,15]: 3 zero high bits per byte → zstd 1.88x
- Attention weights keep int6 [-32,31]: zstd 1.51x
- Saves ~1.86MB artifact → funds 10th transformer layer
- Dequantize auto-detects scheme via qmeta (int5/int6/int8)
- SmearGate: learned per-dim gate blending x[t] with x[t-1] (~512 params)
  USE_SMEAR_GATE=1 to enable
- BigramHash: hash(tok[t-1],tok[t]) -> 4096-bucket embed(128) -> proj(512)
  USE_BIGRAM_HASH=1 to enable (~524K params)
- Both disabled by default for backward compatibility
- forward_with_adapter refactored to reuse _forward_body
- STE QAT: fake quantize->dequantize in CastedLinear forward pass
  Gradients pass through via STE (w + (w_hat - w).detach())
  Activates after STE_QAT_START_FRAC of training (default 25%)
  USE_STE_QAT=1 to enable
- forward_with_adapter refactored to reuse _forward_body
- All Tier 2 features are env-var controlled, disabled by default
Critical fixes from 5090 test (roundtrip gap 1.84→2.94):
1. SWA min_step: dynamic 70% of estimated total steps
   (prevents averaging during rapid convergence in short runs)
2. SmearGate init: +3.0 (sigmoid≈0.95, near-identity start)
   Previously 0 → sigmoid=0.5 → aggressive blending from step 1
3. BigramHash embed: fp16 passthrough (immune to int6 damage)
4. BigramHash embed: std=0.01 init (avoids loss spikes)
Bug 1: forward_with_adapter passed logits (dim=vocab) into adapter
expecting hidden (dim=512). Fixed by storing _last_hidden in
_forward_body.

Bug 2: STE QAT used uniform int6 for all layers, but export uses
int5 for MLP / int6 for attention. Fixed by adding _ste_qat_bits
attribute to CastedLinear, set to 5 for MLP fc/proj.
kellyvv added 10 commits March 20, 2026 16:46
Int5 caused 0.98-1.40 BPB roundtrip gap in short training.
Pure int6 fits easily (5.5MB << 16MB budget).
Int5 code preserved behind USE_MIXED_QUANT=1 for future H100 testing.
Removed hardcoded _ste_qat_bits=5 from MLP layers. With pure int6
export, QAT must also simulate int6 noise. _ste_qat_bits can be set
externally when USE_MIXED_QUANT=1 is enabled in the future.
@kellyvv kellyvv changed the title [Non-record] Depth-Recurrent 3×3 MLX — dev submission (Apple Silicon) Record: 11L MLP3x + SmearGate + Error Correction Table Mar 20, 2026
@kellyvv
Copy link
Copy Markdown
Author

kellyvv commented Mar 20, 2026

Superseded by #232 (clean submission branch)

@kellyvv kellyvv closed this Mar 20, 2026
@kellyvv
Copy link
Copy Markdown
Author

kellyvv commented Mar 20, 2026

Reopening — referenced in community analysis as Tier 3 Novel Approach. Clean submission branch is PR #232.

@kellyvv kellyvv reopened this Mar 20, 2026
@kellyvv kellyvv marked this pull request as ready for review March 20, 2026 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant