Add opt-in MoD routing, SquareGLU MLP, EMA warmdown distillation, and Grokfast #1
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c577b1cf00
```python
squareglu=args.squareglu_enabled,
mod_enabled=args.mod_enabled,
mod_attn_keep_ratio=args.mod_attn_keep_ratio,
mod_mlp_keep_ratio=args.mod_mlp_keep_ratio,
mod_min_keep_tokens=args.mod_min_keep_tokens,
```
Pass optional model flags into eval_model before strict load
These new constructor flags add extra parameters to base_model (for example mlp_gate_bank and blocks.*.mod_*_router.*), but the later eval_model = GPT(...) path still uses default squareglu=False/mod_enabled=False before load_state_dict(..., strict=True). In runs where any of these opt-in features are enabled, final int6 roundtrip evaluation will fail with missing/unexpected keys, so the training run cannot complete its export/eval pipeline.
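A minimal sketch of the fix, with a stand-in `GPT` class (the real constructor takes many more arguments): build the eval model with the same opt-in flags as the training model, so `strict=True` loading sees an identical parameter set.

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    # Hypothetical stand-in: only models the opt-in flag behavior.
    def __init__(self, squareglu=False, mod_enabled=False):
        super().__init__()
        self.core = nn.Linear(4, 4)
        if squareglu:
            # Extra parameter that exists only when the feature is enabled.
            self.mlp_gate_bank = nn.Parameter(torch.zeros(4))

base_model = GPT(squareglu=True)

# Mirror the training-time flags; with the defaults (squareglu=False),
# strict loading would raise on the unexpected mlp_gate_bank key.
eval_model = GPT(squareglu=True)
eval_model.load_state_dict(base_model.state_dict(), strict=True)
```

In the PR's training script this would mean forwarding `args.squareglu_enabled`, `args.mod_enabled`, and the `mod_*` ratios into the `eval_model = GPT(...)` call, just as the training-model construction already does.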
```python
topv = torch.topk(scores, k=keep_k, dim=1).values[:, -1:]
mask = (scores >= topv).to(dtype=x.dtype).unsqueeze(-1)
```
Make MoD mask differentiable to let router learn
The routing mask is built from a hard threshold (scores >= topv), which is non-differentiable; as a result the router receives no gradient through the mask. With the current zero-initialized router weights/biases, scores start tied and this comparison keeps all tokens, so MOD_* routing is effectively inert even when keep_ratio < 1.
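One common way to address this, sketched below under assumed shapes (`scores` is `(batch, seq_len)` router logits; function name is hypothetical): keep the hard top-k mask in the forward pass but route gradients to the router through a straight-through estimator.

```python
import torch

def make_mod_mask(scores: torch.Tensor, keep_k: int) -> torch.Tensor:
    # Hard top-k mask, as in the original snippet (non-differentiable).
    topv = torch.topk(scores, k=keep_k, dim=1).values[:, -1:]
    hard = (scores >= topv).to(scores.dtype)
    # Differentiable surrogate centered on the top-k threshold.
    soft = torch.sigmoid(scores - topv)
    # Straight-through: forward value equals `hard`, but the backward
    # pass flows through `soft`, so the router receives gradient.
    mask = hard + soft - soft.detach()
    return mask.unsqueeze(-1)

scores = torch.randn(2, 8, requires_grad=True)
mask = make_mod_mask(scores, keep_k=4)
mask.sum().backward()
```

This keeps eval behavior identical (the emitted mask values are still exactly 0/1) while giving the zero-initialized router a learning signal.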
Motivation
Description
- Added hyperparameters for toggling and configuring the features (`MOD_*`, `SQUAREGLU_ENABLED`, `EMA_DISTILL_*`, and `GROKFAST_*`) and documented them in the submission README and `submission.json`.
- Extended `Block` with per-block routers and a `_make_mod_mask` helper that produces top-k token masks, used to selectively scale attention and MLP outputs while remaining legal/score-first during eval.
- Added a SquareGLU-style gated MLP as an opt-in path in `MLP`, added an `mlp_gate_bank` parameter bank, wired gate-bank initialization, included it in bank/optimizer grouping, and passed gate slices into block forward calls.
- Wired `mlp_gate_bank`, EMA teacher creation, distillation accumulation/logging, Grokfast gradient augmentation before optimizer phases, and small logging additions for the distillation metric.

Testing
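The SquareGLU-style gated MLP mentioned above can be sketched roughly as follows (class and weight names are assumptions, not the PR's identifiers): a squared-ReLU gate branch multiplies a linear value branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquareGLUMLP(nn.Module):
    """Hypothetical SquareGLU-style MLP: gated linear unit with a
    squared-ReLU gate activation."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_val = nn.Linear(dim, hidden, bias=False)
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.relu(self.w_gate(x)) ** 2  # squared-ReLU gate
        return self.w_out(gate * self.w_val(x))

x = torch.randn(2, 16, 64)
y = SquareGLUMLP(64, 256)(x)
```

In the PR the gate weights live in a shared `mlp_gate_bank` and slices are passed into each block's forward call, rather than being owned per-module as shown here.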
- Ran `python -m py_compile records/track_10min_16mb/2026-03-31_LeakyReLU2_LegalTTT_NGramCache_XSA/train_gpt.py`, which completed successfully and reported no syntax errors.
- Updated `submission.json` to document the new opt-in flags and keep baseline-reported metrics unchanged; the documentation files were validated for consistency via simple file checks.
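For reference, the Grokfast gradient augmentation applied before the optimizer phases follows this general shape (a sketch of the Grokfast-EMA algorithm; the PR's actual function names and hyperparameter values are not shown here): maintain an EMA of each parameter's gradient and add the amplified slow component back onto the raw gradient.

```python
import torch

def grokfast_ema(params, ema_state, alpha=0.98, lamb=2.0):
    """Hypothetical sketch: amplify the slow (low-frequency) gradient
    component via an exponential moving average, in place."""
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        if i not in ema_state:
            ema_state[i] = torch.zeros_like(p.grad)
        # Update the gradient EMA, then add the amplified slow component.
        ema_state[i].mul_(alpha).add_(p.grad, alpha=1 - alpha)
        p.grad.add_(ema_state[i], alpha=lamb)

w = torch.randn(4, requires_grad=True)
(w ** 2).sum().backward()
state = {}
grokfast_ema([w], state)  # call once per step, before optimizer.step()
```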