batched and grouped experts implementations #42697

Merged — IlyasMoutawwakil merged 74 commits into main from moe-imp on Jan 5, 2026
Conversation

@IlyasMoutawwakil (Member) commented Dec 8, 2025

What does this PR do?

I have started experimenting with pure PyTorch MoE implementations, following the HF exporters PR, while trying to find a traceable/exportable variant for ONNX/OpenVINO.

In this PR I mirror the `attn_implementation` API with a similar `experts_implementation` API, and add two new implementations:

  • `batched_mm` (the exportable one), which uses `torch.bmm` and is fastest for single-batch / small inputs.
  • `grouped_mm` (the PyTorch custom-kernel one), inspired by torchtitan's MoE implementation (using `torch._grouped_mm`), which is generally fastest.
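To illustrate the difference between the two strategies, here is a minimal sketch (toy shapes and function names are my own; this is not the actual transformers code): the eager path loops over experts with data-dependent masking, while the batched path runs every expert on every token with `torch.bmm` and masks afterwards, keeping all shapes static.

```python
import torch

def eager_experts(x, w1, w2, top_idx):
    # x: (tokens, d), w1: (E, d, h), w2: (E, h, d), top_idx: (tokens,) chosen expert per token
    out = torch.zeros_like(x)
    for e in range(w1.shape[0]):
        mask = top_idx == e          # data-dependent masking: hard to trace/export
        if mask.any():
            h = torch.relu(x[mask] @ w1[e])
            out[mask] = h @ w2[e]
    return out

def batched_mm_experts(x, w1, w2, top_idx):
    E = w1.shape[0]
    xb = x.unsqueeze(0).expand(E, -1, -1)   # (E, tokens, d): all experts see all tokens
    h = torch.relu(torch.bmm(xb, w1))       # static shapes -> exportable, but O(E) extra compute
    y = torch.bmm(h, w2)                    # (E, tokens, d)
    onehot = torch.nn.functional.one_hot(top_idx, E).T.to(x.dtype)  # (E, tokens)
    return (y * onehot.unsqueeze(-1)).sum(0)
```

The static-shape batched variant pays for exportability with redundant compute, which is consistent with the benchmark below: it wins on tiny inputs but loses badly at larger batch/sequence sizes.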

Benchmark

An initial benchmark on an A100 shows promising results. Note that `torch._grouped_mm` appears to use bfloat16 (or something similar) under the hood, so these numbers might not be apples to apples (I'm still looking for more references on this function and how to use it "equivalently").

MoE Implementations Benchmark

Benchmark script: bench.py

It uses qwen2_moe (`Qwen/Qwen1.5-MoE-A2.7B`, bfloat16); latency and memory are measured for the forward pass / prefill.

TL;DR: for very small inputs `batched_mm` can be extremely fast, and even faster with compilation; for bigger inputs `grouped_mm` is unbeatable, but it doesn't seem to get much faster with torch compilation.

| Batch Size | Seq Length | Torch Compile | Implementation | Mean Latency (ms) | Median Latency (ms) | P90 Latency (ms) | Peak Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 16 | False | eager | 271.80 | 272.94 | 295.34 | 27324.65 |
| 1 | 16 | True | eager | 351.86 | 351.64 | 384.64 | 27329.29 |
| 1 | 16 | max-autotune-no-cudagraphs | eager | 352.52 | 352.15 | 382.79 | 27329.29 |
| 1 | 16 | False | batched_mm | 52.03 | 52.07 | 52.67 | 28382.50 |
| 1 | 16 | True | batched_mm | 53.04 | 53.04 | 53.11 | 28029.63 |
| 1 | 16 | max-autotune-no-cudagraphs | batched_mm | 23.87 | 23.86 | 24.02 | 27329.29 |
| 1 | 16 | False | grouped_mm | 64.27 | 64.09 | 65.49 | 27329.29 |
| 1 | 16 | True | grouped_mm | 59.45 | 59.52 | 60.99 | 27329.29 |
| 1 | 16 | max-autotune-no-cudagraphs | grouped_mm | 59.61 | 59.55 | 60.89 | 27329.29 |
| 1 | 128 | False | eager | 471.73 | 472.65 | 487.97 | 27396.46 |
| 1 | 128 | True | eager | 637.32 | 613.70 | 845.01 | 27429.82 |
| 1 | 128 | max-autotune-no-cudagraphs | eager | 620.21 | 619.35 | 657.74 | 27429.82 |
| 1 | 128 | False | batched_mm | 316.67 | 316.94 | 317.92 | 35854.56 |
| 1 | 128 | True | batched_mm | 370.29 | 370.29 | 370.57 | 33031.64 |
| 1 | 128 | max-autotune-no-cudagraphs | batched_mm | 151.87 | 150.38 | 158.01 | 27429.82 |
| 1 | 128 | False | grouped_mm | 78.50 | 78.53 | 80.00 | 27429.82 |
| 1 | 128 | True | grouped_mm | 72.95 | 72.99 | 74.60 | 27429.82 |
| 1 | 128 | max-autotune-no-cudagraphs | grouped_mm | 72.71 | 72.89 | 73.55 | 27429.82 |
| 4 | 16 | False | eager | 431.87 | 433.38 | 448.01 | 27391.57 |
| 4 | 16 | True | eager | 566.63 | 569.74 | 598.98 | 27372.12 |
| 4 | 16 | max-autotune-no-cudagraphs | eager | 563.13 | 567.79 | 588.25 | 27372.12 |
| 4 | 16 | False | batched_mm | 163.41 | 163.38 | 164.84 | 31585.54 |
| 4 | 16 | True | batched_mm | 189.18 | 189.08 | 189.79 | 30173.45 |
| 4 | 16 | max-autotune-no-cudagraphs | batched_mm | 79.15 | 79.10 | 79.74 | 27372.11 |
| 4 | 16 | False | grouped_mm | 75.23 | 75.18 | 76.74 | 27372.11 |
| 4 | 16 | True | grouped_mm | 70.35 | 70.40 | 71.71 | 27372.12 |
| 4 | 16 | max-autotune-no-cudagraphs | grouped_mm | 70.26 | 70.43 | 71.32 | 27372.12 |
| 4 | 128 | False | eager | 526.88 | 522.75 | 570.01 | 27632.62 |
| 4 | 128 | True | eager | 678.18 | 677.54 | 690.97 | 27762.46 |
| 4 | 128 | max-autotune-no-cudagraphs | eager | 676.22 | 677.07 | 681.91 | 27762.45 |
| 4 | 128 | False | batched_mm | 1235.25 | 1235.33 | 1237.90 | 61465.85 |
| 4 | 128 | True | batched_mm | 1505.00 | 1503.31 | 1536.10 | 50174.26 |
| 4 | 128 | max-autotune-no-cudagraphs | batched_mm | 572.37 | 570.81 | 589.74 | 27762.45 |
| 4 | 128 | False | grouped_mm | 80.95 | 81.06 | 81.70 | 27762.45 |
| 4 | 128 | True | grouped_mm | 79.67 | 79.69 | 80.54 | 27762.45 |
| 4 | 128 | max-autotune-no-cudagraphs | grouped_mm | 83.29 | 79.83 | 111.83 | 27762.46 |
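The `grouped_mm` numbers above rely on arranging each expert's tokens contiguously before a single grouped matmul. A minimal sketch of that routing step (function name and the exact offsets convention are my own assumptions, not the PR's code; it also uses `histc` instead of `bincount`, as the commit log mentions, converting to float since CPU `histc` doesn't support long):

```python
import torch

def group_tokens_by_expert(top_idx, num_experts):
    # Stable-sort token indices so each expert's tokens form one contiguous group,
    # and compute cumulative group-end offsets of the kind a grouped matmul
    # (e.g. torch._grouped_mm) consumes.
    order = torch.argsort(top_idx, stable=True)
    counts = torch.histc(top_idx.float(), bins=num_experts, min=0, max=num_experts - 1)
    offsets = counts.long().cumsum(0)  # offsets[e] = end of expert e's token group
    return order, offsets
```

With tokens gathered via `order`, each expert's slice `[offsets[e-1]:offsets[e]]` can be fed to its weight matrix in one fused call, which avoids both the Python loop of `eager` and the O(num_experts) redundant compute of `batched_mm`.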

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil IlyasMoutawwakil changed the title BMM MoE implementation batched and grouped MoE implementations Dec 8, 2025

# --- Down projection per expert (grouped_mm) ---
mat_a_down = hidden_after_activation
mat_b_down = down_proj.transpose(-2, -1)
Collaborator:

Same here, the way we want v5 is to have "perfect" weights with the weight converter -> this can be done in the weight converter

Member Author:

I have a commit locally with the transposition removed from both eager (using matmul instead of linear) and grouped_mm. I reran the same benchmark as above, but can't say for sure whether it's faster.

Experts Implementations Benchmark Results

| Batch Size | Seq Length | Torch Compile | Implementation | Mean Latency (ms) | Median Latency (ms) | P90 Latency (ms) | Peak Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 16 | False | eager | 264.84 | 264.32 | 288.75 | 27324.65 |
| 1 | 16 | True | eager | 350.00 | 349.41 | 376.13 | 27329.29 |
| 1 | 16 | max-autotune-no-cudagraphs | eager | 341.95 | 342.61 | 374.87 | 27329.29 |
| 1 | 16 | False | batched_mm | 51.97 | 51.99 | 52.64 | 28382.50 |
| 1 | 16 | True | batched_mm | 53.10 | 53.07 | 53.39 | 28029.63 |
| 1 | 16 | max-autotune-no-cudagraphs | batched_mm | 23.48 | 23.51 | 23.54 | 27329.29 |
| 1 | 16 | False | grouped_mm | 63.18 | 63.08 | 64.82 | 27329.29 |
| 1 | 16 | True | grouped_mm | 59.43 | 59.41 | 60.98 | 27329.29 |
| 1 | 16 | max-autotune-no-cudagraphs | grouped_mm | 60.68 | 60.76 | 62.36 | 27329.29 |
| 1 | 128 | False | eager | 490.27 | 488.27 | 504.51 | 27396.46 |
| 1 | 128 | True | eager | 671.43 | 637.76 | 1008.31 | 27429.82 |
| 1 | 128 | max-autotune-no-cudagraphs | eager | 618.78 | 620.04 | 639.69 | 27429.82 |
| 1 | 128 | False | batched_mm | 316.88 | 317.25 | 317.86 | 35854.56 |
| 1 | 128 | True | batched_mm | 370.47 | 370.36 | 371.28 | 33031.64 |
| 1 | 128 | max-autotune-no-cudagraphs | batched_mm | 152.58 | 150.86 | 159.43 | 27429.82 |
| 1 | 128 | False | grouped_mm | 77.71 | 77.88 | 78.76 | 27429.82 |
| 1 | 128 | True | grouped_mm | 72.99 | 73.06 | 74.21 | 27429.82 |
| 1 | 128 | max-autotune-no-cudagraphs | grouped_mm | 72.67 | 72.94 | 73.60 | 27429.82 |
| 4 | 16 | False | eager | 433.87 | 431.03 | 455.82 | 27391.57 |
| 4 | 16 | True | eager | 569.67 | 571.82 | 586.81 | 27372.12 |
| 4 | 16 | max-autotune-no-cudagraphs | eager | 553.63 | 557.50 | 579.73 | 27372.12 |
| 4 | 16 | False | batched_mm | 164.00 | 164.10 | 164.77 | 31585.54 |
| 4 | 16 | True | batched_mm | 189.24 | 189.19 | 189.62 | 30173.45 |
| 4 | 16 | max-autotune-no-cudagraphs | batched_mm | 79.59 | 79.45 | 80.01 | 27372.11 |
| 4 | 16 | False | grouped_mm | 75.40 | 75.04 | 78.37 | 27372.11 |
| 4 | 16 | True | grouped_mm | 69.93 | 70.10 | 71.27 | 27372.12 |
| 4 | 16 | max-autotune-no-cudagraphs | grouped_mm | 69.79 | 69.97 | 71.58 | 27372.12 |
| 4 | 128 | False | eager | 524.48 | 520.53 | 561.31 | 27632.62 |
| 4 | 128 | True | eager | 702.86 | 702.79 | 716.28 | 27762.46 |
| 4 | 128 | max-autotune-no-cudagraphs | eager | 687.21 | 682.97 | 716.89 | 27762.45 |
| 4 | 128 | False | batched_mm | 1236.74 | 1236.75 | 1239.48 | 61465.86 |
| 4 | 128 | True | batched_mm | 1469.06 | 1468.82 | 1470.12 | 50174.26 |
| 4 | 128 | max-autotune-no-cudagraphs | batched_mm | 570.72 | 570.03 | 576.75 | 27762.45 |
| 4 | 128 | False | grouped_mm | 81.61 | 81.54 | 82.90 | 27762.45 |
| 4 | 128 | True | grouped_mm | 79.41 | 79.44 | 79.85 | 27762.46 |
| 4 | 128 | max-autotune-no-cudagraphs | grouped_mm | 79.44 | 79.50 | 80.12 | 27762.45 |

Collaborator:

Missing licence.

If we go down that road, which I like TBH, we should also add kernels support + try at least FP8 to see how this would work.

The other solution is to use the use_hf_hub_kernel decorator as well, but that looks more cumbersome. So following FA2 we want to support kernels from the hub here as well + quantization.

This does look nice; for the bench, can you add compile cases as well please?

Also we need to make sure this works with TP / EP.

@IlyasMoutawwakil IlyasMoutawwakil changed the title batched and grouped MoE implementations batched and grouped experts implementations Dec 15, 2025
Comment on lines +81 to +93
All three backends (`"eager"`, `"batched_mm"`, `"grouped_mm"`) are compatible with `torch.compile` to varying extents. The following table summarizes compatibility:

| Implementation | compilation modes | dtypes | `fullgraph=True` |
| -------------- | ------------------------------------ | -------------------------------- | ---------------- |
| `grouped_mm` | `None`, `max-autotune-no-cudagraphs` | `bfloat16` | Yes |
| `batched_mm` | all | `bfloat16`, `float16`, `float32` | Yes |
| `eager` | all | `bfloat16`, `float16`, `float32` | No |

Notes:

- The `grouped_mm` experts backend currently only supports `bfloat16` when compiled with `torch.compile`. Additionally, it is not compatible with CUDA graphs, so you must use `mode=None` or `mode="max-autotune-no-cudagraphs"` when compiling.
- The `eager` experts backend uses a data-dependent operation to find which experts are used in a forward pass. This operation is not compatible with full graph compilation (`fullgraph=True`).
- When using `float16` or `float32` with `grouped_mm`, the model will automatically fall back to `batched_mm` when compiled.

@github-actions

github-actions bot commented Jan 1, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, bamba, dbrx, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, ernie4_5_vl_moe, falcon_mamba, flex_olmo, glm4_moe, glm4v_moe, gpt_oss, granitemoe, granitemoehybrid

@github-actions

github-actions bot commented Jan 1, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42697&sha=2ebaba

Comment on lines 270 to +287
@@ -281,16 +281,18 @@ def lazy_initialization(self, key_states: torch.Tensor):
         i.e. `mode="reduce-overhead"` is known to fail). But it will in general work correctly, and prefill should
         not be compiled anyway for performances!
         """
-        self.max_batch_size, self.num_heads, _, self.head_dim = key_states.shape
         self.dtype, self.device = key_states.dtype, key_states.device
+        self.max_batch_size, self.num_heads = key_states.shape[:2]
+        self.v_head_dim = value_states.shape[-1]
+        self.k_head_dim = key_states.shape[-1]
Member Author:
Had to add `value_states` to the signature since some models like DeepSeek have different k/v `head_dim`.

)

# Finally: if we can compile, disable tokenizers parallelism and check for FA2 + static cache
if can_compile:
Member Author:

Seemed to me like this `if` was missing from the logic.

Comment on lines +2195 to +2202
# If we use grouped_mm with a dtype other than bfloat16, we fall back to batched_mm
if self.config._experts_implementation == "grouped_mm":
if self.dtype != torch.bfloat16:
logger.warning_once(
"torch._grouped_mm currently only supports bfloat16 when being compiled with torch.compile. "
"Falling back to batched_mm implementation for compilation."
)
self.set_experts_implementation("batched_mm")
@IlyasMoutawwakil (Member Author) commented Jan 5, 2026:

Preferred falling back to batched_mm here only when we are optimizing the model for compiled generation; falling back during the forward pass seemed like very implicit behavior that might go against user intention.

@IlyasMoutawwakil IlyasMoutawwakil merged commit 0642963 into main Jan 5, 2026
26 checks passed
@IlyasMoutawwakil IlyasMoutawwakil deleted the moe-imp branch January 5, 2026 09:53
@ArthurZucker (Collaborator):
Kudos!

sniper35 pushed a commit to sniper35/transformers that referenced this pull request Jan 5, 2026
* moe implementation

* support more MoEs

* tests

* add comments

* add grouped_mm support

* typing act_fn and adding stride 16 note

* style

* fix dbrx config

* fix config test

* add licence and better stride conditions

* comment

* no need to pad tensors to 16 byte strides if we made sure our tiny testing models have 16 byte aligned weights

* use a class decorator with a registration interface

* remove line

* remove unnecessary

* register config with the decorator

* fix redundant

* reduce changes some more

* fix

* fix

* import from integrations

* remove empty lines

* use histc instead of bincount

* fix cpu histc not supporting long

* docs

* added benchmark to docs

* add to from_pretrained's docstring

* make grouped_mm the default when possible

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/experts_interface.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestion from @stevhliu

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestion from @stevhliu

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* make qwen3 vl moe inherit its experts and sparse moe blocks from qwen3 moe, making it use experts implementation

* create _supports_grouped_mm flag and use it for testing

* fix copies

* better grouped mm checks

* fix model size failure

* better docs

* get rid of class property _supports_grouped_mm

* add method calling checks and fix models that didn't have experts

* fix copies

* fix

* fix

* more cleanup

* clean

* document compilation behaviour

* docs

* fix new moe after merge

* fix the new ernie 4.5 vl moe testing

* support fullgraph automatic compilation for MoEs

* fix lazy initialization

* disable fullgraph for granitemoe and jetmoe because of topk gating

* avoid implicit fallback in experts implementation and only do it when auto-compiling

* style

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@vasqu vasqu mentioned this pull request Jan 14, 2026
5 tasks
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
(same commit list as above)
@IlyasMoutawwakil IlyasMoutawwakil mentioned this pull request Mar 5, 2026
5 tasks


4 participants