Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B by qgallouedec · Pull Request #5716 · huggingface/trl

qgallouedec · 2026-05-06T18:45:15Z

Reduce the config diff between tiny-Qwen3MoeForCausalLM and the reference Qwen/Qwen3-30B-A3B by mirroring eight non-size config fields:

vocab_size=151936 (ref padded vocab; previously len(tokenizer.vocab) = 151669)
max_position_embeddings=40960 (ref override; default is 32768)
rope_theta=1000000.0 (ref override; default is 10000.0)
norm_topk_prob=True (ref override; default is False)
bos_token_id=151643, eos_token_id=151645 (ref overrides; previously None)
head_dim=128, max_window_layers=48 (ref-only fields, forwarded via kwargs)

Remaining diffs are intentional size reductions (hidden_size, intermediate_size, num_attention_heads, num_experts, num_experts_per_tok, num_hidden_layers, num_key_value_heads).

Before

$ python -m scripts.generate_tiny_models.for_causal_lm.qwen3_moe_for_causal_lm
tokenizer_config.json: 9.73kB [00:00, 47.1MB/s]
vocab.json: 2.78MB [00:00, 10.1MB/s]
merges.txt: 1.67MB [00:00, 9.17MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 27.1MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 4.04MB/s]
[smoke_test] Qwen3MoeForCausalLM: OK (output shape (1, 4, 151669))
model.safetensors.index.json: 1.70MB [00:00, 43.5MB/s]
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00,  3.04it/s]
[dtype_check] Qwen/Qwen3-30B-A3B: all matched tensors have the reference dtype
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 963/963 [00:00<00:00, 16.5MB/s]
[config_diff] Qwen/Qwen3-30B-A3B vs tiny (15 differences)
  bos_token_id                                     151643                             → None
  eos_token_id                                     151645                             → None
  head_dim                                         128                                → <missing>
  hidden_size                                      2048                               → 8
  intermediate_size                                6144                               → 32
  max_position_embeddings                          40960                              → 32768
  max_window_layers                                48                                 → <missing>
  norm_topk_prob                                   True                               → False
  num_attention_heads                              32                                 → 4
  num_experts                                      128                                → 4
  num_experts_per_tok                              8                                  → 2
  num_hidden_layers                                48                                 → 2
  num_key_value_heads                              4                                  → 2
  rope_theta                                       1000000.0                          → 10000.0
  vocab_size                                       151936                             → 151669

After

$ python -m scripts.generate_tiny_models.for_causal_lm.qwen3_moe_for_causal_lm --create-pr
[smoke_test] Qwen3MoeForCausalLM: OK (output shape (1, 4, 151936))
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 58.88it/s]
[dtype_check] Qwen/Qwen3-30B-A3B: all matched tensors have the reference dtype
[config_diff] Qwen/Qwen3-30B-A3B vs tiny (7 differences)
  hidden_size                                      2048                               → 8
  intermediate_size                                6144                               → 32
  num_attention_heads                              32                                 → 4
  num_experts                                      128                                → 4
  num_experts_per_tok                              8                                  → 2
  num_hidden_layers                                48                                 → 2
  num_key_value_heads                              4                                  → 2
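
The two [config_diff] reports above can be reproduced with a small standalone comparison. This is a minimal sketch, not the print_config_diff helper the generator imports (its implementation is not shown in this thread). The function name config_diff and the plain-dict inputs are assumptions made for illustration; the values are copied from the logs.

```python
MISSING = "<missing>"  # placeholder shown when a key is absent on one side

def config_diff(reference: dict, tiny: dict) -> dict:
    """Return {key: (reference_value, tiny_value)} for every key whose values differ."""
    diffs = {}
    for key in sorted(reference.keys() | tiny.keys()):
        ref_value = reference.get(key, MISSING)
        tiny_value = tiny.get(key, MISSING)
        if ref_value != tiny_value:
            diffs[key] = (ref_value, tiny_value)
    return diffs

# Reference values for Qwen/Qwen3-30B-A3B, taken from the "Before" log.
reference = {
    "bos_token_id": 151643, "eos_token_id": 151645, "head_dim": 128,
    "hidden_size": 2048, "intermediate_size": 6144,
    "max_position_embeddings": 40960, "max_window_layers": 48,
    "norm_topk_prob": True, "num_attention_heads": 32, "num_experts": 128,
    "num_experts_per_tok": 8, "num_hidden_layers": 48,
    "num_key_value_heads": 4, "rope_theta": 1000000.0, "vocab_size": 151936,
}

# Tiny config before this PR (head_dim and max_window_layers were missing).
tiny_before = {
    "bos_token_id": None, "eos_token_id": None,
    "hidden_size": 8, "intermediate_size": 32,
    "max_position_embeddings": 32768, "norm_topk_prob": False,
    "num_attention_heads": 4, "num_experts": 4, "num_experts_per_tok": 2,
    "num_hidden_layers": 2, "num_key_value_heads": 2,
    "rope_theta": 10000.0, "vocab_size": 151669,
}

# The eight non-size fields mirrored from the reference by this PR.
overrides = {
    "vocab_size": 151936, "max_position_embeddings": 40960,
    "rope_theta": 1000000.0, "norm_topk_prob": True,
    "bos_token_id": 151643, "eos_token_id": 151645,
    "head_dim": 128, "max_window_layers": 48,
}
tiny_after = {**tiny_before, **overrides}

before = config_diff(reference, tiny_before)
after = config_diff(reference, tiny_after)
for key, (ref_value, tiny_value) in after.items():
    print(f"  {key:<28} {ref_value!s:<10} → {tiny_value}")
```

With these inputs, only the seven size-related keys remain in the second diff, which matches the "After" report.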

Note

Low Risk
Low risk: changes are limited to tiny-model generation defaults and a test-only from_pretrained monkeypatch that is a no-op unless MODEL_REVISIONS is populated.

Overview
Updates the tiny Qwen3 MoE model generation script to mirror upstream Qwen/Qwen3-30B-A3B config values (e.g., padded vocab_size, RoPE/positioning settings, BOS/EOS IDs, and forwarded kwargs like head_dim and max_window_layers) to reduce config drift.

Expands the tests/conftest.py revision-injection monkeypatch to also wrap AutoConfig, AutoModelForCausalLM, and AutoModelForSequenceClassification from_pretrained calls, ensuring test runs consistently use specified PR revisions when MODEL_REVISIONS is set.
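
The revision-injection idea described above can be sketched as follows. This is an illustration under stated assumptions, not the code in tests/conftest.py: FakeAutoModel is a hypothetical stand-in for the transformers Auto* classes so the example runs without network access, and inject_revision is a hypothetical helper name. Only the MODEL_REVISIONS entry is taken from this thread.

```python
import functools

# Entry quoted in the review below; an empty dict would make the wrapper a no-op.
MODEL_REVISIONS = {"trl-internal-testing/tiny-Qwen3MoeForCausalLM": "refs/pr/1"}

class FakeAutoModel:
    """Hypothetical stand-in for an Auto* class; returns the arguments it received."""

    @classmethod
    def from_pretrained(cls, model_id, **kwargs):
        return {"model_id": model_id, **kwargs}

def inject_revision(auto_cls, revisions):
    """Wrap auto_cls.from_pretrained so listed model ids load a pinned revision."""
    original = auto_cls.from_pretrained.__func__  # underlying function of the classmethod

    @functools.wraps(original)
    def wrapper(cls, model_id, *args, **kwargs):
        if model_id in revisions:
            # setdefault keeps an explicit revision passed by the caller.
            kwargs.setdefault("revision", revisions[model_id])
        return original(cls, model_id, *args, **kwargs)

    auto_cls.from_pretrained = classmethod(wrapper)

inject_revision(FakeAutoModel, MODEL_REVISIONS)
pinned = FakeAutoModel.from_pretrained("trl-internal-testing/tiny-Qwen3MoeForCausalLM")
untouched = FakeAutoModel.from_pretrained("some-org/other-model")
```

A real test suite would more likely apply the wrapper through a fixture that restores the original method afterwards; this sketch patches the class permanently for brevity.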

Reviewed by Cursor Bugbot for commit 7132985. Bugbot is set up for automated code reviews on this repo.

Add three pure helpers to trl/trainer/utils.py:

- compute_flops_per_token(config, seq_len): training FLOPs per token for a causal LM. Handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2). Uses the non-causal attention convention (PaLM / Megatron / nanoGPT).
- compute_mfu(flops_per_token, tps, world_size, peak_flops): MFU as a percentage. Caller is responsible for correcting tps for cp/sp/tp over-counting.
- adjusted_mfu(mfu, config, seq_len): convert non-causal MFU to causal-corrected MFU (Llama / DS Ulysses convention).

No integration with SFTTrainer in this PR; these are standalone helpers usable from any training loop. A follow-up PR can wire them into SFTTrainer.log.
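
To make the quantities concrete, here is a rough dense-model sketch of the arithmetic. It is not the TRL implementation: the function names are hypothetical, MoE handling is omitted, and several conventions are assumptions. These are the common 6N + 12·L·H·Q·T per-token estimate for the non-causal case, halving the attention term for the causal case, tokens_per_sec being the aggregate throughput over all ranks, and peak_flops_per_device being a per-accelerator figure.

```python
def flops_per_token_dense(n_params, n_layers, n_heads, head_dim, seq_len, causal=False):
    """Approximate training FLOPs per token for a dense decoder (forward + backward)."""
    # 6 FLOPs per parameter per token: 2 for the forward pass, 4 for the backward pass.
    matmul_flops = 6 * n_params
    # Attention scores and weighted values computed over the full sequence.
    attention_flops = 12 * n_layers * n_heads * head_dim * seq_len
    if causal:
        # Assumption: a causal mask skips about half of the attention work.
        attention_flops //= 2
    return matmul_flops + attention_flops

def mfu_percent(flops_per_token, tokens_per_sec, world_size, peak_flops_per_device):
    """Model FLOPs utilization as a percentage of the aggregate peak throughput."""
    if world_size <= 0 or peak_flops_per_device <= 0:
        raise ValueError("world_size and peak_flops_per_device must be positive")
    achieved_flops_per_sec = flops_per_token * tokens_per_sec
    return 100.0 * achieved_flops_per_sec / (world_size * peak_flops_per_device)

# Toy numbers chosen only to keep the arithmetic easy to follow.
non_causal = flops_per_token_dense(1000, n_layers=2, n_heads=4, head_dim=8, seq_len=16)
causal = flops_per_token_dense(1000, n_layers=2, n_heads=4, head_dim=8, seq_len=16, causal=True)
# Under these assumptions, a causal-corrected MFU is the non-causal MFU scaled by this ratio.
causal_ratio = causal / non_causal
```

The explicit head_dim argument is one reason a config field like head_dim matters for such helpers: when it is missing, it has to be inferred from hidden_size and the head count, which is not always correct.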

chatgpt-codex-connector · 2026-05-06T18:47:31Z

💡 Codex Review

trl/scripts/generate_tiny_models/for_causal_lm/qwen3_moe_for_causal_lm.py

Lines 18 to 25 in f1a5f81

    
    from .._common import (
        check_dtype_pattern,
        check_transformers_version,
        init_weights_tiny_model,
        print_config_diff,
        push_to_hub,
        smoke_test,
    )

Make the new generator importable

The new generator imports .._common, but this commit does not add scripts/generate_tiny_models/_common.py or any package module providing those helpers, so running this file cannot reach the config update logic. In this checkout, invoking the command from the commit message also resolves scripts.generate_tiny_models to the existing scripts/generate_tiny_models.py module before this package path, which confirms the new per-model generator is not currently executable; please add the missing package/helper structure or keep this logic in the existing generator.

trl/tests/conftest.py

Line 40 in f1a5f81

"trl-internal-testing/tiny-Qwen3MoeForCausalLM": "refs/pr/1",

Remove the temporary Hub PR override

The comment above this map documents that PR revisions are only for testing and should be removed after the tiny-model Hub PR is merged, but this change commits refs/pr/1 as the default for every test run. That means CI and local tests will continue exercising a Hub PR ref rather than the published main revision of trl-internal-testing/tiny-Qwen3MoeForCausalLM, which can mask regressions in the real fixture or fail if the PR ref is unavailable; the repository should not keep this override once the model update is ready to merge.


HuggingFaceDocBuilderDev · 2026-05-06T18:50:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AmineDiro · 2026-05-07T14:08:20Z

+    bos_token_id=151643,
+    eos_token_id=151645,
+    # Forwarded via kwargs (not Qwen3MoeConfig fields, but PretrainedConfig accepts arbitrary kwargs):
+    head_dim=128,


That will be useful for the MFU utils. We can land this and I'll rebase 👍🏼 @qgallouedec

AmineDiro and others added 9 commits May 4, 2026 08:53

added unit tests

f4e1491

remove tautological test_dimensional_correctness

33b5234

tmp

93cdda1

Fix FLOPs computation for experts based on transformers version

c944cd6

revert conftest

98c4d7c

Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B

f1a5f81

Merge branch 'main' into mfu-utils

5027b01

Merge branch 'mfu-utils' into align-qwen3-moe

8cbd755

qgallouedec requested a review from AmineDiro May 6, 2026 18:48

qgallouedec mentioned this pull request May 6, 2026

Add MFU helpers #5698

Merged

1 task

qgallouedec added 3 commits May 6, 2026 19:57

Update model_id for tiny-Qwen3 MoE to use SequenceClassification

6f7d989

Revert "Update model_id for tiny-Qwen3 MoE to use SequenceClassificat…

a2b2027

…ion" This reverts commit 6f7d989.

fix

eeb46bc

AmineDiro reviewed May 7, 2026

View reviewed changes

AmineDiro approved these changes May 7, 2026

View reviewed changes

qgallouedec changed the base branch from mfu-utils to main May 7, 2026 15:32

qgallouedec added 2 commits May 7, 2026 15:47

rebase on main

50c0f74

style

7132985

qgallouedec merged commit 4601166 into main May 7, 2026
13 checks passed

qgallouedec deleted the align-qwen3-moe branch May 7, 2026 15:48

qgallouedec merged 14 commits into main from align-qwen3-moe
