Add MFU helpers by AmineDiro · Pull Request #5698 · huggingface/trl

AmineDiro · 2026-05-04T09:28:38Z

What does this PR do?

Add three pure helpers to trl/trainer/utils.py:

compute_flops_per_token(config, seq_len): training FLOPs per token for a causal LM. Handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2). Uses the non-causal attention convention (PaLM / Megatron / nanoGPT).
compute_mfu(flops_per_token, tps, world_size, peak_flops): MFU as a percentage, where tps is tokens per second. The caller is responsible for correcting tps for over-counting under context, sequence, or tensor parallelism (cp/sp/tp).
adjusted_mfu(mfu, config, seq_len): convert non-causal MFU to causal-corrected MFU (Llama / DS Ulysses convention).
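As a rough illustration of how the three helpers relate, the sketch below works from plain integers instead of a model config. It assumes the PaLM-style estimate of 6·N FLOPs per token for the weight matmuls plus 12·L·h·seq_len for non-causal attention, assumes tps is the aggregate token rate across all devices, and assumes the causal correction halves the attention term. The `*_sketch` names and their signatures are hypothetical and are not the functions added in this PR.

```python
def flops_per_token_sketch(num_params, num_layers, hidden_size, seq_len, causal=False):
    """Approximate training FLOPs per token for a dense decoder (assumed PaLM-style formula)."""
    # 6 * N: 2 FLOPs per multiply-add, times 3 for forward + backward.
    weight_flops = 6 * num_params
    # 12 * L * h * T: QK^T and attention-times-V, forward + backward, full (non-causal) attention.
    attn_flops = 12 * num_layers * hidden_size * seq_len
    if causal:
        # Assumption: a causal mask skips roughly half of the attention work.
        attn_flops //= 2
    return weight_flops + attn_flops


def mfu_percent_sketch(flops_per_token, tps, world_size, peak_flops):
    """Achieved FLOPs/s divided by the aggregate peak FLOPs/s, as a percentage."""
    return 100.0 * flops_per_token * tps / (world_size * peak_flops)


def adjusted_mfu_sketch(mfu, num_params, num_layers, hidden_size, seq_len):
    """Rescale a non-causal MFU by the causal / non-causal FLOPs ratio."""
    non_causal = flops_per_token_sketch(num_params, num_layers, hidden_size, seq_len)
    causal = flops_per_token_sketch(num_params, num_layers, hidden_size, seq_len, causal=True)
    return mfu * causal / non_causal
```

Under these assumptions the causal estimate is never larger than the non-causal one, so the adjusted figure is always at or below the original MFU, and the gap widens as seq_len grows relative to the parameter count.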

NOTE: for now this defaults to the cluster's H100 bf16 peak FLOPs for peak_flops_per_device. We'll probably add a lookup table of peak FLOPs keyed by hardware and dtype to cover the general case.
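One possible shape for such a lookup is sketched below. The table name, the key layout, and the `lookup_peak_flops` function are hypothetical. The two entries are the commonly quoted dense (no structured sparsity) bf16 peaks for the SXM variants of these GPUs; PCIe parts have lower peaks, so the values should be checked against the datasheet for the actual hardware.

```python
# Dense bf16 peak throughput in FLOPs/s (assumption: SXM variants, no sparsity).
PEAK_FLOPS = {
    ("H100", "bf16"): 989e12,
    ("A100", "bf16"): 312e12,
}


def lookup_peak_flops(hardware, dtype):
    """Return the peak FLOPs/s for a (hardware, dtype) pair, or fail loudly."""
    try:
        return PEAK_FLOPS[(hardware, dtype)]
    except KeyError:
        # Failing here avoids silently reporting MFU against the wrong peak.
        raise ValueError(f"No peak FLOPs entry for {hardware!r} with dtype {dtype!r}") from None
```

Raising on an unknown pair, instead of falling back to a default, keeps a wrong denominator from quietly producing a plausible-looking MFU.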

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.

Note

Low Risk
Adds new pure math helpers and tests without changing training control flow; main risk is incorrect FLOPs/MFU estimation due to assumptions about model config fields or Transformers version differences.

Overview
Adds three new utility functions in trl/trainer/utils.py to estimate training FLOPs per token (compute_flops_per_token) for dense and MoE causal LMs, compute Model FLOPs Utilization (compute_mfu), and apply a causal-attention correction (adjusted_mfu).

Extends tests/test_utils.py with focused unit tests that validate scaling behavior, tied vs untied embeddings, MoE expert-count deltas, and MFU formula correctness.

Reviewed by Cursor Bugbot for commit 676a378. Bugbot is set up for automated code reviews on this repo.

No integration with SFTTrainer in this PR: these are standalone helpers usable from any training loop. A follow-up PR can wire them into SFTTrainer.log.

HuggingFaceDocBuilderDev · 2026-05-04T09:31:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor · 2026-05-04T09:35:12Z

+        dense_mlp_flops = 2 * 3 * h * config.intermediate_size  # interspersed dense layers
+        sparse_step = config.decoder_sparse_step
+        total_layer_flops = sum(
+            attn_flops + (moe_mlp_flops if layer_idx % sparse_step == 0 else dense_mlp_flops) for layer_idx in range(L)


MoE branch crashes for Mixtral configs

High Severity

The MoE branch unconditionally accesses config.moe_intermediate_size and config.decoder_sparse_step, but Mixtral (explicitly listed as supported in the docstring) has neither attribute. Mixtral uses intermediate_size for its expert FFN dimension and has all-MoE layers with no dense/sparse interleaving. This causes an AttributeError at runtime for any Mixtral config.

Additional Locations (1)

trl/trainer/utils.py#L1354-L1357

Reviewed by Cursor Bugbot for commit 2b92bab.

qgallouedec · 2026-05-04

Is that correct, @AmineDiro?

I think this branch would benefit from some unit tests.

AmineDiro · 2026-05-05

Yes, that's true. I originally had a TODO for Mixtral models. This MFU computation was specific to the Qwen family of models.
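The config mismatch discussed in this thread can be made concrete with a small sketch. It uses `types.SimpleNamespace` objects as stand-ins for model configs, with field names taken from the review comment (`moe_intermediate_size`, `decoder_sparse_step`, `intermediate_size`, `num_experts_per_tok`) and made-up sizes. The helper `mlp_flops_per_layer_sketch` is hypothetical: it shows one possible way to tolerate both layouts, is not the fix that was merged, and ignores router and shared-expert costs.

```python
from types import SimpleNamespace


def mlp_flops_per_layer_sketch(config, layer_idx):
    """Forward MLP FLOPs per token for one layer (2 FLOPs per MAC, 3 projection matrices)."""
    h = config.hidden_size
    dense_flops = 2 * 3 * h * config.intermediate_size
    top_k = getattr(config, "num_experts_per_tok", None)
    if top_k is None:
        return dense_flops  # dense config: no MoE marker present
    # Qwen-MoE-style configs carry a separate expert width; Mixtral-style configs
    # reuse intermediate_size for the experts.
    expert_size = getattr(config, "moe_intermediate_size", config.intermediate_size)
    # Without decoder_sparse_step, treat every layer as MoE (step of 1).
    sparse_step = getattr(config, "decoder_sparse_step", 1)
    moe_flops = 2 * 3 * h * expert_size * top_k
    # Same layer-selection condition as the diff hunk quoted above.
    return moe_flops if layer_idx % sparse_step == 0 else dense_flops


dense_cfg = SimpleNamespace(hidden_size=4, intermediate_size=8)
mixtral_like = SimpleNamespace(hidden_size=4, intermediate_size=8, num_experts_per_tok=2)
qwen_moe_like = SimpleNamespace(
    hidden_size=4,
    intermediate_size=8,
    moe_intermediate_size=2,
    num_experts_per_tok=2,
    decoder_sparse_step=2,
)
```

Accessing `config.moe_intermediate_size` directly on `mixtral_like` would raise `AttributeError`, which is the crash the review comment describes; the fallbacks above avoid it at the cost of relying on `getattr`.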

cursor · 2026-05-04T09:35:12Z

+
+    # MoE dispatch: `num_experts_per_tok` is the canonical MoE marker — present on Mixtral,
+    # Qwen3-MoE, DeepSeek-V2, etc.; absent on dense configs.
+    num_experts_per_tok = getattr(config, "num_experts_per_tok", None)


Usage of getattr violates project rules

Low Severity

getattr(config, "num_experts_per_tok", None) violates the AGENTS.md rule that explicitly says to avoid hasattr and getattr, describing their use as "almost always a symptom of overly defensive programming." The rule recommends expressing checks explicitly or dropping the conditional entirely.

Triggered by project rule: ../.ai/AGENTS.md

Reviewed by Cursor Bugbot for commit 2b92bab.

qgallouedec · 2026-05-04

I think here it's the only way.

qgallouedec

lgtm!

qgallouedec · 2026-05-06T16:05:55Z

I can't comment, but FYI

qgallouedec · 2026-05-06T18:49:25Z

I can't comment, but FYI

check #5716

AmineDiro requested review from albertvillanova, kashif and qgallouedec May 4, 2026 09:28

cursor Bot reviewed May 4, 2026


This was referenced May 6, 2026

🛣️ Path to 30B MoE long-context SFT training #5712

Closed

🛣️ Path to 30B MoE long-context SFT training #5713

Open

AmineDiro added 2 commits May 6, 2026 15:39

added unit tests

f4e1491

remove tautological test_dimensional_correctness

33b5234

qgallouedec approved these changes May 6, 2026


qgallouedec and others added 4 commits May 6, 2026 18:20

tmp

93cdda1

Fix FLOPs computation for experts based on transformers version

c944cd6

revert conftest

98c4d7c

Merge branch 'main' into mfu-utils

5027b01

AmineDiro and others added 3 commits May 7, 2026 17:50

remove _load_moe_config

727caef

Merge branch 'main' into mfu-utils

670b7dd

style

676a378

qgallouedec merged commit 47b3778 into main May 7, 2026
13 checks passed

qgallouedec deleted the mfu-utils branch May 7, 2026 18:55

himanshushukla12 mentioned this pull request May 7, 2026

Sync fork with upstream huggingface/trl (1307 commits) himanshushukla12/trl#1

Merged

3 tasks

qgallouedec merged 10 commits into main from mfu-utils