Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B #5716
Conversation
Add three pure helpers to `trl/trainer/utils.py`:
- `compute_flops_per_token(config, seq_len)`: training FLOPs per token for a causal LM. Handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2). Uses the non-causal attention convention (PaLM / Megatron / nanoGPT).
- `compute_mfu(flops_per_token, tps, world_size, peak_flops)`: MFU as a percentage. Caller is responsible for correcting `tps` for cp/sp/tp over-counting.
- `adjusted_mfu(mfu, config, seq_len)`: convert non-causal MFU to causal-corrected MFU (Llama / DS Ulysses convention).

No integration with `SFTTrainer` in this PR; these are standalone helpers usable from any training loop. A follow-up PR can wire them into `SFTTrainer.log`.
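As a rough illustration of the arithmetic these helpers describe, here is a minimal sketch for a dense model only. It is not the TRL implementation: the `sketch_*` names are hypothetical, the FLOPs formula is the PaLM-style `6N + 12·L·H·Q·T` non-causal estimate, and it assumes `tps` is the global tokens-per-second figure while `peak_flops` is per device.

```python
def sketch_flops_per_token(n_params, n_layers, n_heads, head_dim, seq_len):
    """Training FLOPs per token, dense model, non-causal attention convention."""
    # 6 * N: 2 FLOPs per parameter in the forward pass, 4 in the backward pass.
    weight_flops = 6 * n_params
    # 12 * L * H * Q * T: attention score and value matmuls, forward + backward,
    # counted as if every token attended to all T positions (non-causal).
    attention_flops = 12 * n_layers * n_heads * head_dim * seq_len
    return weight_flops + attention_flops


def sketch_mfu(flops_per_token, tps, world_size, peak_flops):
    """MFU as a percentage. Assumes global tps and per-device peak_flops."""
    achieved = flops_per_token * tps      # FLOPs per second across all devices
    available = world_size * peak_flops   # theoretical peak across all devices
    return 100.0 * achieved / available


# Illustrative numbers only, not measurements from any real run.
flops = sketch_flops_per_token(
    n_params=1_000_000_000, n_layers=16, n_heads=16, head_dim=128, seq_len=2048
)
mfu = sketch_mfu(flops, tps=100_000, world_size=8, peak_flops=1e15)
```

The causal correction performed by `adjusted_mfu` is left out here because the description does not state its formula.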
💡 Codex Review
The new generator imports …
Line 40 in f1a5f81
The comment above this map documents that PR revisions are only for testing and should be removed after the tiny-model Hub PR is merged, but this change commits …
    bos_token_id=151643,
    eos_token_id=151645,
    # Forwarded via kwargs (not Qwen3MoeConfig fields, but PretrainedConfig accepts arbitrary kwargs):
    head_dim=128,
that will be useful for MFU utils. We can land this and I'll rebase 👍🏼 @qgallouedec
Reduce the config diff between `tiny-Qwen3MoeForCausalLM` and the reference `Qwen/Qwen3-30B-A3B` by mirroring eight non-size config fields:
- `vocab_size=151936` (ref padded vocab; previously `len(tokenizer.vocab) = 151669`)
- `max_position_embeddings=40960` (ref override; default is `32768`)
- `rope_theta=1000000.0` (ref override; default is `10000.0`)
- `norm_topk_prob=True` (ref override; default is `False`)
- `bos_token_id=151643`, `eos_token_id=151645` (ref overrides; previously `None`)
- `head_dim=128`, `max_window_layers=48` (ref-only fields, forwarded via kwargs)

Remaining diffs are intentional size reductions (`hidden_size`, `intermediate_size`, `num_attention_heads`, `num_experts`, `num_experts_per_tok`, `num_hidden_layers`, `num_key_value_heads`).

Before
After
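The field lists in the description can be checked mechanically. The sketch below is one hedged way to do it with plain dictionaries: it flags any config difference that is not one of the listed size reductions. The `unexpected_diffs` helper is hypothetical, and the `hidden_size` / `num_hidden_layers` values are placeholders rather than the real tiny or reference values.

```python
# Non-size fields the description says are mirrored from the reference.
MIRRORED = {
    "vocab_size": 151936,
    "max_position_embeddings": 40960,
    "rope_theta": 1000000.0,
    "norm_topk_prob": True,
    "bos_token_id": 151643,
    "eos_token_id": 151645,
    "head_dim": 128,
    "max_window_layers": 48,
}

# Fields the description lists as intentional size reductions.
SIZE_FIELDS = {
    "hidden_size", "intermediate_size", "num_attention_heads", "num_experts",
    "num_experts_per_tok", "num_hidden_layers", "num_key_value_heads",
}


def unexpected_diffs(tiny, reference):
    """Return {field: (tiny_value, reference_value)} for non-size mismatches."""
    keys = set(tiny) | set(reference)
    return {
        key: (tiny.get(key), reference.get(key))
        for key in keys
        if key not in SIZE_FIELDS and tiny.get(key) != reference.get(key)
    }


# Placeholder size values, for illustration only.
reference = {**MIRRORED, "hidden_size": 2048, "num_hidden_layers": 48}
tiny_after = {**MIRRORED, "hidden_size": 16, "num_hidden_layers": 2}
# Two of the pre-PR values quoted in the description.
tiny_before = {**tiny_after, "vocab_size": 151669, "bos_token_id": None}

diffs_before = unexpected_diffs(tiny_before, reference)
diffs_after = unexpected_diffs(tiny_after, reference)
```

With real configs, the same comparison could be run on the output of each config's `to_dict()`, assuming both are loaded through `transformers`.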
Note
Low Risk
Low risk: changes are limited to tiny-model generation defaults and a test-only `from_pretrained` monkeypatch that is a no-op unless `MODEL_REVISIONS` is populated.

Overview
Updates the tiny Qwen3 MoE model generation script to mirror upstream `Qwen/Qwen3-30B-A3B` config values (e.g., padded `vocab_size`, RoPE/positioning settings, BOS/EOS IDs, and forwarded kwargs like `head_dim` and `max_window_layers`) to reduce config drift.

Expands the `tests/conftest.py` revision-injection monkeypatch to also wrap `AutoConfig`, `AutoModelForCausalLM`, and `AutoModelForSequenceClassification` `from_pretrained` calls, ensuring test runs consistently use specified PR revisions when `MODEL_REVISIONS` is set.

Reviewed by Cursor Bugbot for commit 7132985.
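To make the revision-injection idea concrete, here is a minimal sketch of such a wrapper. It is not the code in `tests/conftest.py`: `FakeAutoModel` and `patch_from_pretrained` are hypothetical stand-ins, and the model id and revision string are made up. It only shows how a wrapper can stay a no-op while `MODEL_REVISIONS` is empty.

```python
# Hypothetical map: Hub model id -> PR revision. Empty means the patch does nothing.
MODEL_REVISIONS = {}


class FakeAutoModel:
    """Stand-in for any class exposing a `from_pretrained` classmethod."""

    @classmethod
    def from_pretrained(cls, model_id, **kwargs):
        return {"model_id": model_id, "revision": kwargs.get("revision", "main")}


def patch_from_pretrained(cls, revisions):
    """Wrap cls.from_pretrained so mapped model ids get a default revision."""
    original = cls.from_pretrained  # bound classmethod, captured before patching

    def patched(model_id, *args, **kwargs):
        revision = revisions.get(model_id)
        if revision is not None:
            # setdefault keeps an explicit revision passed by the caller.
            kwargs.setdefault("revision", revision)
        return original(model_id, *args, **kwargs)

    cls.from_pretrained = staticmethod(patched)


patch_from_pretrained(FakeAutoModel, MODEL_REVISIONS)

# Empty map: behaves exactly like the unpatched method.
unpinned = FakeAutoModel.from_pretrained("org/tiny-model")

# Populated map: the mapped revision is injected.
MODEL_REVISIONS["org/tiny-model"] = "refs/pr/1"
pinned = FakeAutoModel.from_pretrained("org/tiny-model")
```

In a pytest suite the assignment would more likely go through the `monkeypatch` fixture so the original method is restored after the session; the direct assignment here just keeps the sketch self-contained.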