Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B #5716

Merged
qgallouedec merged 14 commits into main from align-qwen3-moe on May 7, 2026

qgallouedec (Member) commented May 6, 2026

Reduce the config diff between tiny-Qwen3MoeForCausalLM and the reference Qwen/Qwen3-30B-A3B by mirroring eight non-size config fields:

  • vocab_size=151936 (ref padded vocab; previously len(tokenizer.vocab) = 151669)
  • max_position_embeddings=40960 (ref override; default is 32768)
  • rope_theta=1000000.0 (ref override; default is 10000.0)
  • norm_topk_prob=True (ref override; default is False)
  • bos_token_id=151643, eos_token_id=151645 (ref overrides; previously None)
  • head_dim=128, max_window_layers=48 (ref-only fields, forwarded via kwargs)

Remaining diffs are intentional size reductions (hidden_size, intermediate_size, num_attention_heads, num_experts, num_experts_per_tok, num_hidden_layers, num_key_value_heads).

Before

$ python -m scripts.generate_tiny_models.for_causal_lm.qwen3_moe_for_causal_lm
tokenizer_config.json: 9.73kB [00:00, 47.1MB/s]
vocab.json: 2.78MB [00:00, 10.1MB/s]
merges.txt: 1.67MB [00:00, 9.17MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 27.1MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 4.04MB/s]
[smoke_test] Qwen3MoeForCausalLM: OK (output shape (1, 4, 151669))
model.safetensors.index.json: 1.70MB [00:00, 43.5MB/s]
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:05<00:00,  3.04it/s]
[dtype_check] Qwen/Qwen3-30B-A3B: all matched tensors have the reference dtype
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 963/963 [00:00<00:00, 16.5MB/s]
[config_diff] Qwen/Qwen3-30B-A3B vs tiny (15 differences)
  bos_token_id                                     151643                             → None
  eos_token_id                                     151645                             → None
  head_dim                                         128                                → <missing>
  hidden_size                                      2048                               → 8
  intermediate_size                                6144                               → 32
  max_position_embeddings                          40960                              → 32768
  max_window_layers                                48                                 → <missing>
  norm_topk_prob                                   True                               → False
  num_attention_heads                              32                                 → 4
  num_experts                                      128                                → 4
  num_experts_per_tok                              8                                  → 2
  num_hidden_layers                                48                                 → 2
  num_key_value_heads                              4                                  → 2
  rope_theta                                       1000000.0                          → 10000.0
  vocab_size                                       151936                             → 151669

After

$ python -m scripts.generate_tiny_models.for_causal_lm.qwen3_moe_for_causal_lm --create-pr
[smoke_test] Qwen3MoeForCausalLM: OK (output shape (1, 4, 151936))
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 58.88it/s]
[dtype_check] Qwen/Qwen3-30B-A3B: all matched tensors have the reference dtype
[config_diff] Qwen/Qwen3-30B-A3B vs tiny (7 differences)
  hidden_size                                      2048                               → 8
  intermediate_size                                6144                               → 32
  num_attention_heads                              32                                 → 4
  num_experts                                      128                                → 4
  num_experts_per_tok                              8                                  → 2
  num_hidden_layers                                48                                 → 2
  num_key_value_heads                              4                                  → 2
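
The `[config_diff]` listings above can be approximated with a few lines of plain Python. This is a minimal sketch, assuming both configs are available as flat dictionaries; `config_diff` and `MISSING` are illustrative names, not the helpers used by the generation script (the implementation of `print_config_diff` is not shown in this PR). The values are a subset of those in the "Before" listing.

```python
# Illustrative sketch: compare two flat config dicts key by key.
MISSING = "<missing>"  # sentinel for keys absent on one side (hypothetical name)


def config_diff(reference: dict, tiny: dict) -> list[tuple[str, object, object]]:
    """Return (key, reference_value, tiny_value) for every differing key, sorted by key."""
    diffs = []
    for key in sorted(reference.keys() | tiny.keys()):
        ref_value = reference.get(key, MISSING)
        tiny_value = tiny.get(key, MISSING)
        if ref_value != tiny_value:
            diffs.append((key, ref_value, tiny_value))
    return diffs


# A subset of the values from the "Before" listing.
reference = {"head_dim": 128, "hidden_size": 2048, "rope_theta": 1000000.0, "vocab_size": 151936}
tiny_before = {"hidden_size": 8, "rope_theta": 10000.0, "vocab_size": 151669}
# After the PR, the non-size fields mirror the reference; hidden_size stays small on purpose.
tiny_after = {"head_dim": 128, "hidden_size": 8, "rope_theta": 1000000.0, "vocab_size": 151936}

for key, ref_value, tiny_value in config_diff(reference, tiny_before):
    print(f"  {key:<30} {ref_value!s:<12} → {tiny_value}")
```

With `tiny_after`, only the intentional size reduction (`hidden_size`) remains in the diff.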

Note

Low Risk
Low risk: changes are limited to tiny-model generation defaults and a test-only from_pretrained monkeypatch that is a no-op unless MODEL_REVISIONS is populated.

Overview
Updates the tiny Qwen3 MoE model generation script to mirror upstream Qwen/Qwen3-30B-A3B config values (e.g., padded vocab_size, RoPE/positioning settings, BOS/EOS IDs, and forwarded kwargs like head_dim and max_window_layers) to reduce config drift.

Expands the tests/conftest.py revision-injection monkeypatch to also wrap AutoConfig, AutoModelForCausalLM, and AutoModelForSequenceClassification from_pretrained calls, ensuring test runs consistently use specified PR revisions when MODEL_REVISIONS is set.
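
The revision-injection idea described above can be sketched as follows. This is a minimal illustration, not the actual code in `tests/conftest.py`: `FakeAutoModel` stands in for a `transformers` auto class, `patch_from_pretrained` is a hypothetical helper name, and the sketch assumes `MODEL_REVISIONS` maps a Hub model id to a revision string.

```python
import functools

# Assumed shape: Hub model id -> revision to test against (an empty dict means no-op).
MODEL_REVISIONS = {"trl-internal-testing/tiny-Qwen3MoeForCausalLM": "refs/pr/1"}


class FakeAutoModel:
    """Stand-in for an auto class; it only records what it was asked to load."""

    @classmethod
    def from_pretrained(cls, model_id, **kwargs):
        return {"model_id": model_id, "revision": kwargs.get("revision", "main")}


def patch_from_pretrained(auto_cls, revisions):
    """Wrap auto_cls.from_pretrained so that mapped model ids get a default revision."""
    original = auto_cls.from_pretrained  # bound classmethod of auto_cls

    @functools.wraps(original)
    def wrapper(model_id, *args, **kwargs):
        if model_id in revisions:
            # An explicit revision passed by the caller still wins.
            kwargs.setdefault("revision", revisions[model_id])
        return original(model_id, *args, **kwargs)

    # Simplification: the wrapper is bound to auto_cls itself, so subclasses are not handled.
    auto_cls.from_pretrained = staticmethod(wrapper)


patch_from_pretrained(FakeAutoModel, MODEL_REVISIONS)
print(FakeAutoModel.from_pretrained("trl-internal-testing/tiny-Qwen3MoeForCausalLM"))
print(FakeAutoModel.from_pretrained("some/other-model"))
```

When the mapping is empty, the wrapper passes every call through unchanged, which matches the "no-op unless MODEL_REVISIONS is populated" remark in the risk note.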

Reviewed by Cursor Bugbot for commit 7132985.

AmineDiro and others added 9 commits May 4, 2026 08:53
Add three pure helpers to trl/trainer/utils.py:

- compute_flops_per_token(config, seq_len): training FLOPs per token
  for a causal LM. Handles dense and MoE (Mixtral, Qwen3-MoE,
  DeepSeek-V2). Uses the non-causal attention convention (PaLM /
  Megatron / nanoGPT).

- compute_mfu(flops_per_token, tps, world_size, peak_flops): MFU as a
  percentage. Caller is responsible for correcting tps for cp/sp/tp
  over-counting.

- adjusted_mfu(mfu, config, seq_len): convert non-causal MFU to
  causal-corrected MFU (Llama / DS Ulysses convention).

No integration with SFTTrainer in this PR — these are standalone
helpers usable from any training loop. A follow-up PR can wire them
into SFTTrainer.log.
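
For orientation, the MFU ratio described for `compute_mfu` can be sketched as below. This is a minimal illustration, not the code in `trl/trainer/utils.py`: the function name `mfu_percent` is hypothetical, and the sketch assumes that `tps` is the aggregate tokens per second across all ranks and that `peak_flops` is the per-device peak in FLOP/s.

```python
def mfu_percent(flops_per_token: float, tps: float, world_size: int, peak_flops: float) -> float:
    """Model FLOPs utilization as a percentage.

    Assumptions (not confirmed by the commit message): `tps` is the global
    tokens/second over all ranks, and `peak_flops` is the per-device peak.
    """
    if world_size <= 0 or peak_flops <= 0:
        raise ValueError("world_size and peak_flops must be positive")
    achieved = flops_per_token * tps      # FLOP/s actually performed
    available = world_size * peak_flops   # FLOP/s the hardware could deliver
    return 100.0 * achieved / available


# Made-up numbers, chosen only to make the arithmetic easy to follow.
print(mfu_percent(flops_per_token=6e9, tps=1_000, world_size=1, peak_flops=1e13))
```

As the commit message notes, any over-counting of `tps` under context, sequence, or tensor parallelism has to be corrected by the caller before computing this ratio.
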
@chatgpt-codex-connector

💡 Codex Review

from .._common import (
    check_dtype_pattern,
    check_transformers_version,
    init_weights_tiny_model,
    print_config_diff,
    push_to_hub,
    smoke_test,
)

P2: Make the new generator importable

The new generator imports .._common, but this commit does not add scripts/generate_tiny_models/_common.py or any package module providing those helpers, so running this file cannot reach the config update logic. In this checkout, invoking the command from the commit message also resolves scripts.generate_tiny_models to the existing scripts/generate_tiny_models.py module before this package path, which confirms the new per-model generator is not currently executable; please add the missing package/helper structure or keep this logic in the existing generator.


"trl-internal-testing/tiny-Qwen3MoeForCausalLM": "refs/pr/1",

P2: Remove the temporary Hub PR override

The comment above this map documents that PR revisions are only for testing and should be removed after the tiny-model Hub PR is merged, but this change commits refs/pr/1 as the default for every test run. That means CI and local tests will continue exercising a Hub PR ref rather than the published main revision of trl-internal-testing/tiny-Qwen3MoeForCausalLM, which can mask regressions in the real fixture or fail if the PR ref is unavailable; the repository should not keep this override once the model update is ready to merge.

qgallouedec requested a review from AmineDiro May 6, 2026 18:48
qgallouedec mentioned this pull request May 6, 2026
1 task

bos_token_id=151643,
eos_token_id=151645,
# Forwarded via kwargs (not Qwen3MoeConfig fields, but PretrainedConfig accepts arbitrary kwargs):
head_dim=128,

That will be useful for the MFU utils. We can land this and I'll rebase 👍🏼 @qgallouedec
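
The code comment in the excerpt above relies on the config base class accepting arbitrary keyword arguments. A minimal sketch of that pattern follows, using stand-in classes rather than the real `transformers` ones: `BaseConfigSketch` and `MoeConfigSketch` are hypothetical names, and the real `PretrainedConfig` does considerably more than this.

```python
class BaseConfigSketch:
    """Stand-in for a config base class that keeps unknown kwargs as attributes."""

    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)

    def to_dict(self) -> dict:
        return dict(vars(self))


class MoeConfigSketch(BaseConfigSketch):
    """Stand-in for a model config that declares only some fields explicitly."""

    def __init__(self, hidden_size: int = 8, num_experts: int = 4, **kwargs):
        self.hidden_size = hidden_size
        self.num_experts = num_experts
        # Fields not declared above (e.g. head_dim) fall through to the base class.
        super().__init__(**kwargs)


config = MoeConfigSketch(hidden_size=8, head_dim=128, max_window_layers=48)
print(config.to_dict())
```

Because the extra fields end up as ordinary attributes, they appear in the serialized config and stop showing as `<missing>` in a config diff.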

qgallouedec changed the base branch from mfu-utils to main May 7, 2026 15:32
qgallouedec merged commit 4601166 into main May 7, 2026
13 checks passed
qgallouedec deleted the align-qwen3-moe branch May 7, 2026 15:48