
Enable renormalize(naive) routing for fp8 per-tensor #2030

Merged

IwakuraRein merged 2 commits into main from fp8-renormalize-routing on Nov 11, 2025

Conversation

IwakuraRein (Collaborator) commented Nov 3, 2025

📌 Description

Disable expert weights in FC1 except for Llama4 routing.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Made token_scales available for Llama4 routing.
    • Corrected GEMM1 input so the proper data is used during MoE processing.
  • Tests

    • Added FP8PerTensorMoe to test parameterization.
    • Expanded coverage for Renormalize, DeepSeekV3, Qwen3 and Llama4 routing configurations.

coderabbitai bot (Contributor) commented Nov 3, 2025

Walkthrough

Expose a new token_scales pointer in the MoE workspace, set it from expert_weights.data_ptr() in launcher paths when routing_method_type is Llama4, consume token_scales in the PermuteGemm1 call instead of expert_weights, and add FP8PerTensorMoe to MoE test parameterizations.

Changes

  • csrc/trtllm_fused_moe_kernel_launcher.cu (kernel launcher): in Fp8PerTensorLauncher::prepare_routing, when routing_method_type == Llama4, assign workspace.token_scales = expert_weights.data_ptr() so token_scales can be consumed by permuteGemm1; no other flow or error-handling changes.
  • csrc/trtllm_fused_moe_runner.cu (runner): in Runner::run, pass workspace.token_scales (instead of workspace.expert_weights) as the GEMM1 input to mPermuteGemm1.run and adjust the subsequent argument ordering accordingly.
  • include/flashinfer/trtllm/fused_moe/runner.h (workspace struct): add void* token_scales = nullptr; to MoEWorkspace, with a comment indicating it is consumed by the permuteGemm1 kernel; expert_weights remains used by finalize.
  • tests/moe/test_trtllm_gen_fused_moe.py (test coverage): add FP8PerTensorMoe (id FP8_Tensor) to the parameterized MoE tests and include it among compatible implementations for the Qwen3, Renormalize, Qwen3_next, DeepSeekV3, and Llama4 routing scenarios.

Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    participant Launcher as Kernel Launcher
    participant Workspace as MoE Workspace
    participant Runner as PermuteGemm1
    participant Finalize as Finalize Kernel

    Note over Launcher: check routing_method_type
    alt routing_method_type == Llama4
        Launcher->>Workspace: workspace.token_scales = expert_weights.data_ptr()
    else other routing
        Launcher->>Workspace: token_scales left nullptr/unmodified
    end

    Launcher->>Workspace: prepare workspace
    Workspace->>Runner: mPermuteGemm1.run(input = token_scales, ...)
    Runner->>Runner: permute / GEMM using token_scales
    Workspace->>Finalize: provide expert_weights for finalize
    Finalize->>Finalize: finalize outputs
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas:
    • Verify the Llama4 conditional is applied in all relevant launcher paths.
    • Confirm token_scales lifetime and memory layout from expert_weights.data_ptr() matches permuteGemm1 expectations.
    • Check mPermuteGemm1.run argument reordering for correct parameter mapping.
    • Review added test parametrizations for correctness and coverage.
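The added test parametrization might look roughly like the following sketch. The class names and ids mirror the summary above; the actual structure of tests/moe/test_trtllm_gen_fused_moe.py may differ.

```python
import pytest

# Illustrative stand-ins for the MoE implementation wrappers in the test file.
class FP8BlockScaleMoe:
    pass

class FP8PerTensorMoe:
    pass

MOE_IMPLS = [
    pytest.param(FP8BlockScaleMoe(), id="FP8_Block"),
    pytest.param(FP8PerTensorMoe(), id="FP8_Tensor"),  # added by this PR
]

ROUTING_METHODS = ["Renormalize", "DeepSeekV3", "Qwen3", "Llama4"]

@pytest.mark.parametrize("moe_impl", MOE_IMPLS)
@pytest.mark.parametrize("routing", ROUTING_METHODS)
def test_moe_routing(moe_impl, routing):
    # The real test would run the fused MoE kernel and compare against
    # a reference implementation; this sketch only shows the fan-out.
    assert routing in ROUTING_METHODS
```

Cross-parameterizing the implementation against every routing method is what produces the expanded Renormalize/DeepSeekV3/Qwen3/Llama4 coverage noted in the summary.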

Suggested reviewers

  • cyx-6
  • wenscarl
  • djmmoss
  • yzh119
  • joker-eph
  • aleozlx
  • jiahanc

Poem

🐇 I found a pointer, light and neat,
I set token_scales to guide the feat.
GEMM1 takes a hop, the kernels sing,
Tests applaud the tiny spring.
🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Title check (⚠️ Warning): The PR title 'Enable renormalize(naive) routing for fp8 per-tensor' does not match the description, which states 'Disable expert weights in the FC1 except for Llama routing.' Resolution: align the title and description — either update the title to reflect disabling expert weights in FC1, or update the description to clarify the renormalize routing enablement.
  • Docstring coverage (⚠️ Warning): Docstring coverage is 20.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check (❓ Inconclusive): The description is vague and generic; it lacks specifics about what changed, why, and how it relates to the code modifications. Resolution: explain the purpose of the changes, which files were modified, and how they enable FP8 per-tensor routing for the specified routing methods.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2b7bfc and b2d228c.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs


@IwakuraRein IwakuraRein marked this pull request as ready for review November 10, 2025 17:29
@IwakuraRein IwakuraRein force-pushed the fp8-renormalize-routing branch 2 times, most recently from 5a48ddf to 17fb87a on November 10, 2025 17:55
jiahanc (Collaborator) left a comment:

LGTM, thanks for the contribution!

@IwakuraRein IwakuraRein force-pushed the fp8-renormalize-routing branch from 17fb87a to d42fb90 on November 10, 2025 18:04
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

flashinfer-bot (Collaborator) commented:

GitLab MR !124 has been created, and the CI pipeline #38224308 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: siyuanf <siyuanf@nvidia.com>
@IwakuraRein IwakuraRein reopened this Nov 10, 2025
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

@IwakuraRein IwakuraRein enabled auto-merge (squash) November 10, 2025 18:09
flashinfer-bot (Collaborator) commented:

GitLab MR !125 has been created, and the CI pipeline #38224491 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

flashinfer-bot (Collaborator) commented:

GitLab MR !125 has been updated with latest changes, and the CI pipeline #38236290 is currently running. I'll report back once the pipeline job completes.

djmmoss (Collaborator) left a comment:

LGTM

@IwakuraRein IwakuraRein merged commit fbdb439 into main Nov 11, 2025
4 checks passed
@IwakuraRein IwakuraRein deleted the fp8-renormalize-routing branch November 11, 2025 02:19
coderabbitai bot mentioned this pull request Feb 23, 2026
BingooYang pushed a commit to BingooYang/flashinfer that referenced this pull request Mar 13, 2026