
[Test] Optimize test_trtllm_gen_fused_moe.py#2072

Merged
yzh119 merged 3 commits into flashinfer-ai:main from jiahanc:opt_test
Nov 12, 2025

Conversation

@jiahanc
Collaborator

@jiahanc jiahanc commented Nov 10, 2025

📌 Description

Currently test_trtllm_gen_fused_moe.py takes a long time to run; this PR makes a few optimizations to speed it up:

  • Add an autotuner option to turn the autotuner on/off in tests (see the sketch below)
  • Optimize token counts for better coverage
  • Optimize check_accuracy to speed it up
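
Roughly, the autotuner switch works like this (a sketch, not the exact test code; run_fused_moe is a stand-in for the real kernel entry point):

from flashinfer.autotuner import autotune

TUNE_MAX_NUM_TOKENS = 4096  # upper bound on tokens explored while tuning

def call_kernel(enable_autotune: bool, **kernel_kwargs):
    # autotune() is a context manager; tuning only happens when enabled.
    with autotune(enable_autotune):
        if enable_autotune:
            # Pass the tuning cap only on autotune-enabled paths.
            kernel_kwargs["tune_max_num_tokens"] = TUNE_MAX_NUM_TOKENS
        return run_fused_moe(**kernel_kwargs)  # stand-in for the real MoE kernel call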

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests
    • Add configurable autotune flag (enabled/disabled) across MoE test paths and parametrized cases; propagate it through runtime/test flows
    • Ensure autotune-aware kernel invocations respect a max-token tuning limit (4096)
    • Clear tuner state between runs to improve cache isolation
  • Bug Fixes
    • Tighten accuracy checks to enforce finiteness and use isclose-based matching with early exit when thresholds met

@coderabbitai
Contributor

coderabbitai Bot commented Nov 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This change threads a new enable_autotune flag through MoE test and runtime call paths, adds TUNE_MAX_NUM_TOKENS = 4096 for autotuned kernel invocations, tightens accuracy checks to require finiteness and isclose matching, and clears the AutoTuner cache between runs.

Changes

Cohort / Change Summary (all in tests/moe/test_trtllm_gen_fused_moe.py):

  • MoE autotune flag & propagation: Added enable_autotune (default from config, True) to CUDAGraphMoE and threaded it through call_moe, compute_production, _compute_moe_actual_unified, kernel kwargs, and runtime args.
  • Autotuner parameters & constant: Introduced TUNE_MAX_NUM_TOKENS = 4096; autotune calls now receive tune_max_num_tokens when enabled. Replaced autotune(True) with autotune(self.enable_autotune) in CUDA graph warmup.
  • Test parametrization & routing config: Extended routing/test parametrizations to include enable_autotune (True/False) and propagated that flag into production and reference paths.
  • AutoTuner cache management: Added AutoTuner.get().clear_cache() between runs to avoid cross-configuration reuse.
  • Accuracy validation: check_accuracy now enforces finiteness (isfinite) and uses isclose-based matching with early exit when the match-ratio threshold is reached.
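
A condensed sketch of the propagation pattern described above (class and method names follow the walkthrough; bodies are elided and the config default is assumed):

from flashinfer.autotuner import autotune

class CUDAGraphMoE:
    def __init__(self, config: dict, **kwargs):
        # Default of True keeps pre-existing call sites tuning as before.
        self.enable_autotune = config.get("enable_autotune", True)

    def warmup(self):
        # Was autotune(True); now honors the configured flag during graph capture.
        with autotune(self.enable_autotune):
            ...  # capture the CUDA graph

def call_moe(**kwargs):
    enable_autotune = kwargs.pop("enable_autotune", True)
    with autotune(enable_autotune):
        ...  # dispatch to the selected MoE implementation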

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant CUDAGraph as CUDAGraphMoE
    participant AutoTuner
    participant Kernel as MoE Kernel

    Test->>CUDAGraph: init(..., enable_autotune)
    CUDAGraph->>CUDAGraph: store flag

    alt enable_autotune == true
        Test->>CUDAGraph: call_moe(..., enable_autotune=True)
        CUDAGraph->>AutoTuner: autotune(enable=True, tune_max_num_tokens=4096)
        AutoTuner->>Kernel: probe variants
        Kernel-->>AutoTuner: metrics
        AutoTuner->>Kernel: launch chosen variant
    else enable_autotune == false
        Test->>CUDAGraph: call_moe(..., enable_autotune=False)
        CUDAGraph->>AutoTuner: autotune(enable=False)
        AutoTuner->>Kernel: launch default variant
    end

    Kernel-->>CUDAGraph: results
    CUDAGraph-->>Test: outputs
    Test->>Test: check_accuracy (isfinite + isclose)
    Test->>AutoTuner: clear_cache()

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Concentrated changes in a single test file with consistent flag propagation.
  • Areas to double-check:
    • All call sites and kernel kwargs consistently receive enable_autotune.
    • tune_max_num_tokens passed only on autotune-enabled paths.
    • Placement of AutoTuner.get().clear_cache() between cases.
    • check_accuracy finiteness and isclose thresholds.

Suggested reviewers

  • cyx-6
  • yzh119

Poem

"I hopped through code with cautious paws,
threaded tunes and checked the laws.
Cache wiped clean, numbers bright,
kernels tuned through day and night.
🐇 — a rabbit's testing cause."

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title '[Test] Optimize test_trtllm_gen_fused_moe.py' directly describes the main change: optimization of a specific test file. It is concise and clearly summarizes the primary objective of the PR.
  • Description check (✅ Passed): The description covers the main optimization goals (autotuner choice, token count optimization, check_accuracy speedup) and includes completed pre-commit checks. However, the test checklist items are not confirmed as complete, and the related issues section is empty.
  • Docstring Coverage (✅ Passed): Docstring coverage is 90.91%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56e959c1c34aa201d135f3711e1840750df21d50 and cca7e66.

📒 Files selected for processing (1)
  • tests/moe/test_trtllm_gen_fused_moe.py (29 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)
flashinfer/autotuner.py (2)
  • get (362-365)
  • autotune (251-262)
🪛 Ruff (0.14.4)
tests/moe/test_trtllm_gen_fused_moe.py

1426-1426: Create your own exception

(TRY002)


1426-1426: Avoid specifying long messages outside the exception class

(TRY003)


1428-1428: Create your own exception

(TRY002)


1428-1428: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (15)
tests/moe/test_trtllm_gen_fused_moe.py (15)

52-53: Good refactor following previous feedback.

The constant TUNE_MAX_NUM_TOKENS properly centralizes the magic number 4096, improving maintainability as suggested in the prior review.


83-83: LGTM: Autotune configuration properly initialized.

The enable_autotune field defaults to True and is correctly extracted from the config dictionary, consistent with other configuration parameters.


114-116: LGTM: Warmup respects autotune configuration.

The warmup phase now correctly uses self.enable_autotune instead of a hardcoded True, enabling conditional autotuning during CUDA graph capture.


215-215: LGTM: Tuning limit properly applied.

The tune_max_num_tokens parameter correctly uses the centralized TUNE_MAX_NUM_TOKENS constant, constraining autotuned kernel selection.


560-574: LGTM: Autotune flag properly threaded through FP4Moe.

The enable_autotune flag is correctly extracted from kwargs with a sensible default and propagated to the CUDA graph configuration.


772-809: LGTM: Autotune control integrated into FP8 block-scale path.

The enable_autotune flag is properly extracted, applied via context manager, and the tuning limit is correctly passed to the kernel.


950-986: LGTM: Autotune control integrated into FP8 per-tensor path.

The implementation follows the same pattern as other MoE variants, with proper flag extraction and propagation.


1116-1138: LGTM: Autotune control integrated into BF16 path.

All MoE implementations now consistently support the enable_autotune flag with proper defaults and tuning limits.


1423-1441: Excellent optimization of accuracy checking.

The refactored function improves performance in two ways:

  • Uses torch.isfinite() to check for both NaN and Inf in a single pass (as discussed in previous review)
  • Implements early exit when match_ratio >= percent, avoiding unnecessary error computation
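
Reconstructed from these notes, the optimized check looks roughly like this (the signature and the final error message are assumptions; the finiteness checks match the diff discussed later in this thread):

import torch

def check_accuracy(a: torch.Tensor, b: torch.Tensor, atol: float, rtol: float, percent: float) -> None:
    # One fused pass catches both NaN and Inf.
    if not torch.isfinite(a).all():
        raise Exception("Non-finite values in reference output")
    if not torch.isfinite(b).all():
        raise Exception("Non-finite values in actual output")
    # Element-wise tolerance check; pass once enough elements match.
    match_ratio = torch.isclose(a, b, atol=atol, rtol=rtol).float().mean().item()
    if match_ratio >= percent:
        return  # early exit: skip computing any further error statistics
    raise Exception(f"Only {match_ratio:.4%} of elements within tolerance (required {percent:.2%})")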

2014-2014: LGTM: Central propagation of autotune flag.

The enable_autotune flag is correctly added to kernel_kwargs in the unified compute path, ensuring it reaches all MoE implementations.


2254-2271: LGTM: Test framework properly propagates autotune configuration.

The enable_autotune flag is correctly extracted from routing_config and passed through to production execution, completing the configuration chain.


2286-2365: Test optimization aligns with stated goals.

The token count adjustments (testing 8, 768, 3072 instead of 1, 8, 1024, 3072) and selective autotuning configuration reduce runtime while maintaining kernel coverage, consistent with the PR objectives and past review discussions.
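
The shape of the change, roughly (the decorator and test name here are hypothetical; in the real file enable_autotune is driven selectively through the routing configs rather than cross-producted):

import pytest

@pytest.mark.parametrize("num_tokens", [8, 768, 3072])  # previously [1, 8, 1024, 3072]
def test_renormalize_routing(num_tokens: int):
    ...  # hypothetical test body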


2428-2490: Consistent test optimization for DeepSeekV3 routing.

Token counts and autotuning configuration follow the same optimization pattern as Renormalize tests, with DSLite (non-real-world config) having autotuning disabled.


2553-2581: TopK routing optimized appropriately.

Token counts remain limited for GeGlu compatibility, intermediate size coverage is adjusted, and autotuning is properly enabled.


2628-2654: Llama4 routing optimization complete.

Token count adjustments and autotuning configuration are consistent with other routing test optimizations.



@gemini-code-assist
Contributor

Summary of Changes

Hello @jiahanc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the test_trtllm_gen_fused_moe.py test suite. The changes introduce finer control over the autotuner's behavior during testing, refine the token counts used in parameterized tests for better coverage, and enhance the accuracy checking mechanism for improved reliability and performance. These modifications aim to make the test suite more efficient and robust.

Highlights

  • Autotuner Control: Introduced a mechanism to explicitly enable or disable the autotuner during tests, allowing for more controlled experimentation and performance analysis.
  • Test Coverage Optimization: Adjusted the num_tokens parameters in parameterized tests to improve test coverage and efficiency, focusing on more representative token counts.
  • Accuracy Check Enhancement: Refactored the check_accuracy function to be more robust and potentially faster by using torch.isclose and handling non-finite values more efficiently.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several valuable optimizations to the test_trtllm_gen_fused_moe.py test file. The changes include adding a configuration option to enable or disable the autotuner, which enhances test flexibility. The test parameterization has been refined by adjusting token counts for better coverage and speed. Furthermore, the check_accuracy function has been significantly optimized by leveraging torch.isclose, leading to faster test execution. Clearing the autotuner cache between test runs is also a great addition for ensuring test isolation. Overall, these are solid improvements. I have one minor suggestion to improve code maintainability by replacing a repeated magic number with a named constant.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)

211-211: Consider extracting magic number to a constant.

The value tune_max_num_tokens=4096 is hardcoded in 4 different locations. While acceptable for test code, defining it as a module-level constant would improve maintainability.

Example:

# Near the top of the file
TUNE_MAX_NUM_TOKENS = 4096

Then use tune_max_num_tokens=TUNE_MAX_NUM_TOKENS at each call site.

Also applies to: 803-803, 980-980, 1132-1132


1419-1437: Improved accuracy check with early exit, but consider using AssertionError.

The refactored check_accuracy function is more efficient with:

  • torch.isfinite for cleaner finite value validation
  • torch.isclose for better element-wise comparison
  • Early return when match_ratio >= percent

However, the exceptions raised at lines 1422 and 1424 use generic Exception. For test code, AssertionError would be more idiomatic.

Apply this diff:

-    if not torch.isfinite(a).all():
-        raise Exception("Non-finite values in reference output")
-    if not torch.isfinite(b).all():
-        raise Exception("Non-finite values in actual output")
+    if not torch.isfinite(a).all():
+        raise AssertionError("Non-finite values in reference output")
+    if not torch.isfinite(b).all():
+        raise AssertionError("Non-finite values in actual output")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d42fb90 and 7be4a657929b4d0d0ff4cf6a2d09915ff5a537bc.

📒 Files selected for processing (1)
  • tests/moe/test_trtllm_gen_fused_moe.py (30 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)
flashinfer/autotuner.py (4)
  • AutoTuner (335-784)
  • autotune (251-262)
  • get (362-365)
  • clear_cache (778-780)
🪛 Ruff (0.14.4)
tests/moe/test_trtllm_gen_fused_moe.py

1422-1422: Create your own exception

(TRY002)


1422-1422: Avoid specifying long messages outside the exception class

(TRY003)


1424-1424: Create your own exception

(TRY002)


1424-1424: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (11)
tests/moe/test_trtllm_gen_fused_moe.py (11)

35-35: LGTM: Import additions support the autotuner control feature.

The new imports for AutoTuner and autotune are properly used throughout the file to enable/disable autotuning during tests and clear the cache between runs.


79-79: LGTM: Enable autotune flag properly integrated.

The enable_autotune flag is correctly extracted from config with a sensible default of True, maintaining backward compatibility.


110-110: LGTM: Autotune context manager correctly applied during warmup.

The autotune(self.enable_autotune) context manager properly controls autotuning behavior during the warmup phase.


556-556: LGTM: Consistent enable_autotune propagation across MoE implementations.

The enable_autotune flag is consistently extracted from kwargs with a default of True across all MoE implementation types (FP4, FP8BlockScale, FP8PerTensor, BF16), maintaining backward compatibility.

Also applies to: 768-768, 946-946, 1112-1112


569-569: LGTM: Autotune context managers correctly applied.

The autotune context manager is properly used with the enable_autotune flag to control autotuning behavior during kernel execution across all MoE implementations.

Also applies to: 780-780, 954-954, 1115-1115


2112-2113: LGTM: AutoTuner cache clearing ensures test isolation.

Clearing the AutoTuner cache between test runs prevents cross-configuration tactic reuse, ensuring each test configuration gets fresh autotuning. This is a good practice for test isolation.
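
The isolation step amounts to the following (both calls are confirmed by the code-graph entries above):

from flashinfer.autotuner import AutoTuner

# Drop cached tactics so one configuration's tuning result
# cannot leak into the next parametrized run.
AutoTuner.get().clear_cache()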


2253-2254: LGTM: Enable autotune properly integrated into test harness.

The enable_autotune flag is correctly extracted from routing_config and passed through to compute_production, enabling per-test-case control of autotuning behavior.

Also applies to: 2270-2270


2010-2010: LGTM: Consistent enable_autotune default in unified computation.

The enable_autotune flag extraction with default True maintains consistency with other parts of the codebase.


2313-2313: LGTM: Good test coverage for enable_autotune flag.

The routing configurations include both enable_autotune: True and False values, ensuring test coverage for both autotuner-enabled and disabled code paths. This aligns well with the PR objectives.

Also applies to: 2329-2329, 2345-2345, 2438-2438, 2454-2454, 2470-2470, 2561-2561, 2635-2635
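
For context, a hypothetical shape for one such routing-config entry (field names other than enable_autotune are assumptions):

renormalize_config = {
    "routing_method": "renormalize",  # assumed field name
    "enable_autotune": True,          # a sibling config sets False to cover both paths
}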


2285-2285: LGTM: Token count adjustments improve test coverage.

The num_tokens parameter values have been adjusted to [8, 768, 3072] for most tests, improving coverage as stated in the PR objectives. The TopK test appropriately maintains [8, 128] due to GeGlu constraints.

Also applies to: 2411-2411, 2536-2536, 2611-2611


2560-2560: Verify intentional removal of intermediate_size=384 from TopK routing.

The compatible_intermediate_size list was changed from [384, 512, 768, 1024] to [512, 768, 1024], removing 384. While this may be intentional to improve test execution speed, please confirm:

  1. Is the removal of 384 intentional for TopK routing?
  2. Note that the test parametrization at line 2538 still includes 384, which will now be skipped by skip_checks logic.

@jiahanc
Collaborator Author

jiahanc commented Nov 10, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !126 has been created, and the CI pipeline #38230531 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #38230531: 13/17 passed

@jiahanc jiahanc requested review from bkryu and yzh119 November 11, 2025 06:23
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@jiahanc
Collaborator Author

jiahanc commented Nov 12, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !126 has been updated with latest changes, and the CI pipeline #38313925 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #38313925: 15/17 passed
