
[Test] Optimize test_trtllm_gen_fused_moe.py#2072

Merged
yzh119 merged 3 commits into flashinfer-ai:main from jiahanc:opt_test
Nov 12, 2025

Conversation

@jiahanc
Collaborator

@jiahanc jiahanc commented Nov 10, 2025

📌 Description

Currently test_trtllm_gen_fused_moe.py takes a long time to run; this PR makes a few optimizations to speed it up:

  • Add an autotuner option to turn the autotuner on/off in tests (see the sketch below)
  • Optimize token counts for better coverage
  • Optimize check_accuracy to speed it up
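
Roughly, the autotuner switch works like this (a sketch, not the exact test code; run_fused_moe is a stand-in for the real kernel entry point):

from flashinfer.autotuner import autotune

TUNE_MAX_NUM_TOKENS = 4096  # upper bound on tokens explored while tuning

def call_kernel(enable_autotune: bool, **kernel_kwargs):
    # autotune() is a context manager; tuning only happens when enabled.
    with autotune(enable_autotune):
        if enable_autotune:
            # Pass the tuning cap only on autotune-enabled paths.
            kernel_kwargs["tune_max_num_tokens"] = TUNE_MAX_NUM_TOKENS
        return run_fused_moe(**kernel_kwargs)  # stand-in for the real MoE kernel call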

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests
    • Add configurable autotune flag (enabled/disabled) across MoE test paths and parametrized cases; propagate it through runtime/test flows
    • Ensure autotune-aware kernel invocations respect a max-token tuning limit (4096)
    • Clear tuner state between runs to improve cache isolation
  • Bug Fixes
    • Tighten accuracy checks to enforce finiteness and use isclose-based matching with early exit when thresholds met

@coderabbitai
Contributor

coderabbitai Bot commented Nov 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This change threads a new enable_autotune flag through MoE test and runtime call paths, adds TUNE_MAX_NUM_TOKENS = 4096 for autotuned kernel invocations, tightens accuracy checks to require finiteness and isclose matching, and clears the AutoTuner cache between runs.

Changes

Cohort / Change Summary (all in tests/moe/test_trtllm_gen_fused_moe.py):

  • MoE autotune flag & propagation: Added enable_autotune (default from config, True) to CUDAGraphMoE and threaded it through call_moe, compute_production, _compute_moe_actual_unified, kernel kwargs, and runtime args.
  • Autotuner parameters & constant: Introduced TUNE_MAX_NUM_TOKENS = 4096; autotune calls now receive tune_max_num_tokens when enabled. Replaced autotune(True) with autotune(self.enable_autotune) in CUDA graph warmup.
  • Test parametrization & routing config: Extended routing/test parametrizations to include enable_autotune (True/False) and propagated that flag into production and reference paths.
  • AutoTuner cache management: Added AutoTuner.get().clear_cache() between runs to avoid cross-configuration reuse.
  • Accuracy validation: check_accuracy now enforces finiteness (isfinite) and uses isclose-based matching with early exit when the match-ratio threshold is reached.
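
A condensed sketch of the propagation pattern described above (class and method names follow the walkthrough; bodies are elided and the config default is assumed):

from flashinfer.autotuner import autotune

class CUDAGraphMoE:
    def __init__(self, config: dict, **kwargs):
        # Default of True keeps pre-existing call sites tuning as before.
        self.enable_autotune = config.get("enable_autotune", True)

    def warmup(self):
        # Was autotune(True); now honors the configured flag during graph capture.
        with autotune(self.enable_autotune):
            ...  # capture the CUDA graph

def call_moe(**kwargs):
    enable_autotune = kwargs.pop("enable_autotune", True)
    with autotune(enable_autotune):
        ...  # dispatch to the selected MoE implementation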

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant CUDAGraph as CUDAGraphMoE
    participant AutoTuner
    participant Kernel as MoE Kernel

    Test->>CUDAGraph: init(..., enable_autotune)
    CUDAGraph->>CUDAGraph: store flag

    alt enable_autotune == true
        Test->>CUDAGraph: call_moe(..., enable_autotune=True)
        CUDAGraph->>AutoTuner: autotune(enable=True, tune_max_num_tokens=4096)
        AutoTuner->>Kernel: probe variants
        Kernel-->>AutoTuner: metrics
        AutoTuner->>Kernel: launch chosen variant
    else enable_autotune == false
        Test->>CUDAGraph: call_moe(..., enable_autotune=False)
        CUDAGraph->>AutoTuner: autotune(enable=False)
        AutoTuner->>Kernel: launch default variant
    end

    Kernel-->>CUDAGraph: results
    CUDAGraph-->>Test: outputs
    Test->>Test: check_accuracy (isfinite + isclose)
    Test->>AutoTuner: clear_cache()

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Concentrated changes in a single test file with consistent flag propagation.
  • Areas to double-check:
    • All call sites and kernel kwargs consistently receive enable_autotune.
    • tune_max_num_tokens passed only on autotune-enabled paths.
    • Placement of AutoTuner.get().clear_cache() between cases.
    • check_accuracy finiteness and isclose thresholds.

Suggested reviewers

  • cyx-6
  • yzh119

Poem

"I hopped through code with cautious paws,
threaded tunes and checked the laws.
Cache wiped clean, numbers bright,
kernels tuned through day and night.
🐇 — a rabbit's testing cause."

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title '[Test] Optimize test_trtllm_gen_fused_moe.py' directly describes the main change: optimization of a specific test file. It is concise and clearly summarizes the primary objective of the PR.
  • Description check (✅ Passed): The description covers the main optimization goals (autotuner choice, token count optimization, check_accuracy speedup) and includes completed pre-commit checks. However, the test checklist items are not confirmed as complete, and the related issues section is empty.
  • Docstring Coverage (✅ Passed): Docstring coverage is 90.91%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56e959c1c34aa201d135f3711e1840750df21d50 and cca7e66.

📒 Files selected for processing (1)
  • tests/moe/test_trtllm_gen_fused_moe.py (29 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)
flashinfer/autotuner.py (2)
  • get (362-365)
  • autotune (251-262)
🪛 Ruff (0.14.4)
tests/moe/test_trtllm_gen_fused_moe.py

1426-1426: Create your own exception

(TRY002)


1426-1426: Avoid specifying long messages outside the exception class

(TRY003)


1428-1428: Create your own exception

(TRY002)


1428-1428: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (15)
tests/moe/test_trtllm_gen_fused_moe.py (15)

52-53: Good refactor following previous feedback.

The constant TUNE_MAX_NUM_TOKENS properly centralizes the magic number 4096, improving maintainability as suggested in the prior review.


83-83: LGTM: Autotune configuration properly initialized.

The enable_autotune field defaults to True and is correctly extracted from the config dictionary, consistent with other configuration parameters.


114-116: LGTM: Warmup respects autotune configuration.

The warmup phase now correctly uses self.enable_autotune instead of a hardcoded True, enabling conditional autotuning during CUDA graph capture.


215-215: LGTM: Tuning limit properly applied.

The tune_max_num_tokens parameter correctly uses the centralized TUNE_MAX_NUM_TOKENS constant, constraining autotuned kernel selection.


560-574: LGTM: Autotune flag properly threaded through FP4Moe.

The enable_autotune flag is correctly extracted from kwargs with a sensible default and propagated to the CUDA graph configuration.


772-809: LGTM: Autotune control integrated into FP8 block-scale path.

The enable_autotune flag is properly extracted, applied via context manager, and the tuning limit is correctly passed to the kernel.


950-986: LGTM: Autotune control integrated into FP8 per-tensor path.

The implementation follows the same pattern as other MoE variants, with proper flag extraction and propagation.


1116-1138: LGTM: Autotune control integrated into BF16 path.

All MoE implementations now consistently support the enable_autotune flag with proper defaults and tuning limits.


1423-1441: Excellent optimization of accuracy checking.

The refactored function improves performance in two ways:

  • Uses torch.isfinite() to check for both NaN and Inf in a single pass (as discussed in previous review)
  • Implements early exit when match_ratio >= percent, avoiding unnecessary error computation
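
Reconstructed from these notes, the optimized check looks roughly like this (the signature and the final error message are assumptions; the finiteness checks match the diff discussed later in this thread):

import torch

def check_accuracy(a: torch.Tensor, b: torch.Tensor, atol: float, rtol: float, percent: float) -> None:
    # One fused pass catches both NaN and Inf.
    if not torch.isfinite(a).all():
        raise Exception("Non-finite values in reference output")
    if not torch.isfinite(b).all():
        raise Exception("Non-finite values in actual output")
    # Element-wise tolerance check; pass once enough elements match.
    match_ratio = torch.isclose(a, b, atol=atol, rtol=rtol).float().mean().item()
    if match_ratio >= percent:
        return  # early exit: skip computing any further error statistics
    raise Exception(f"Only {match_ratio:.4%} of elements within tolerance (required {percent:.2%})")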

2014-2014: LGTM: Central propagation of autotune flag.

The enable_autotune flag is correctly added to kernel_kwargs in the unified compute path, ensuring it reaches all MoE implementations.


2254-2271: LGTM: Test framework properly propagates autotune configuration.

The enable_autotune flag is correctly extracted from routing_config and passed through to production execution, completing the configuration chain.


2286-2365: Test optimization aligns with stated goals.

The token count adjustments (testing 8, 768, 3072 instead of 1, 8, 1024, 3072) and selective autotuning configuration reduce runtime while maintaining kernel coverage, consistent with the PR objectives and past review discussions.
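
The shape of the change, roughly (the decorator and test name here are hypothetical; in the real file enable_autotune is driven selectively through the routing configs rather than cross-producted):

import pytest

@pytest.mark.parametrize("num_tokens", [8, 768, 3072])  # previously [1, 8, 1024, 3072]
def test_renormalize_routing(num_tokens: int):
    ...  # hypothetical test body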


2428-2490: Consistent test optimization for DeepSeekV3 routing.

Token counts and autotuning configuration follow the same optimization pattern as Renormalize tests, with DSLite (non-real-world config) having autotuning disabled.


2553-2581: TopK routing optimized appropriately.

Token counts remain limited for GeGlu compatibility, intermediate size coverage is adjusted, and autotuning is properly enabled.


2628-2654: Llama4 routing optimization complete.

Token count adjustments and autotuning configuration are consistent with other routing test optimizations.



@gemini-code-assist
Contributor

Summary of Changes

Hello @jiahanc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the test_trtllm_gen_fused_moe.py test suite. The changes introduce finer control over the autotuner's behavior during testing, refine the token counts used in parameterized tests for better coverage, and enhance the accuracy checking mechanism for improved reliability and performance. These modifications aim to make the test suite more efficient and robust.

Highlights

  • Autotuner Control: Introduced a mechanism to explicitly enable or disable the autotuner during tests, allowing for more controlled experimentation and performance analysis.
  • Test Coverage Optimization: Adjusted the num_tokens parameters in parameterized tests to improve test coverage and efficiency, focusing on more representative token counts.
  • Accuracy Check Enhancement: Refactored the check_accuracy function to be more robust and potentially faster by using torch.isclose and handling non-finite values more efficiently.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several valuable optimizations to the test_trtllm_gen_fused_moe.py test file. The changes include adding a configuration option to enable or disable the autotuner, which enhances test flexibility. The test parameterization has been refined by adjusting token counts for better coverage and speed. Furthermore, the check_accuracy function has been significantly optimized by leveraging torch.isclose, leading to faster test execution. Clearing the autotuner cache between test runs is also a great addition for ensuring test isolation. Overall, these are solid improvements. I have one minor suggestion to improve code maintainability by replacing a repeated magic number with a named constant.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/moe/test_trtllm_gen_fused_moe.py (2)

211-211: Consider extracting magic number to a constant.

The value tune_max_num_tokens=4096 is hardcoded in 4 different locations. While acceptable for test code, defining it as a module-level constant would improve maintainability.

Example:

# Near the top of the file
TUNE_MAX_NUM_TOKENS = 4096

Then use tune_max_num_tokens=TUNE_MAX_NUM_TOKENS at each call site.

Also applies to: 803-803, 980-980, 1132-1132


1419-1437: Improved accuracy check with early exit, but consider using AssertionError.

The refactored check_accuracy function is more efficient with:

  • torch.isfinite for cleaner finite value validation
  • torch.isclose for better element-wise comparison
  • Early return when match_ratio >= percent

However, the exceptions raised at lines 1422 and 1424 use generic Exception. For test code, AssertionError would be more idiomatic.

Apply this diff:

-    if not torch.isfinite(a).all():
-        raise Exception("Non-finite values in reference output")
-    if not torch.isfinite(b).all():
-        raise Exception("Non-finite values in actual output")
+    if not torch.isfinite(a).all():
+        raise AssertionError("Non-finite values in reference output")
+    if not torch.isfinite(b).all():
+        raise AssertionError("Non-finite values in actual output")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d42fb90 and 7be4a657929b4d0d0ff4cf6a2d09915ff5a537bc.

📒 Files selected for processing (1)
  • tests/moe/test_trtllm_gen_fused_moe.py (30 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (1)
flashinfer/autotuner.py (4)
  • AutoTuner (335-784)
  • autotune (251-262)
  • get (362-365)
  • clear_cache (778-780)
🪛 Ruff (0.14.4)
tests/moe/test_trtllm_gen_fused_moe.py

1422-1422: Create your own exception

(TRY002)


1422-1422: Avoid specifying long messages outside the exception class

(TRY003)


1424-1424: Create your own exception

(TRY002)


1424-1424: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (11)
tests/moe/test_trtllm_gen_fused_moe.py (11)

35-35: LGTM: Import additions support the autotuner control feature.

The new imports for AutoTuner and autotune are properly used throughout the file to enable/disable autotuning during tests and clear the cache between runs.


79-79: LGTM: Enable autotune flag properly integrated.

The enable_autotune flag is correctly extracted from config with a sensible default of True, maintaining backward compatibility.


110-110: LGTM: Autotune context manager correctly applied during warmup.

The autotune(self.enable_autotune) context manager properly controls autotuning behavior during the warmup phase.


556-556: LGTM: Consistent enable_autotune propagation across MoE implementations.

The enable_autotune flag is consistently extracted from kwargs with a default of True across all MoE implementation types (FP4, FP8BlockScale, FP8PerTensor, BF16), maintaining backward compatibility.

Also applies to: 768-768, 946-946, 1112-1112


569-569: LGTM: Autotune context managers correctly applied.

The autotune context manager is properly used with the enable_autotune flag to control autotuning behavior during kernel execution across all MoE implementations.

Also applies to: 780-780, 954-954, 1115-1115


2112-2113: LGTM: AutoTuner cache clearing ensures test isolation.

Clearing the AutoTuner cache between test runs prevents cross-configuration tactic reuse, ensuring each test configuration gets fresh autotuning. This is a good practice for test isolation.
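
The isolation step amounts to the following (both calls are confirmed by the code-graph entries above):

from flashinfer.autotuner import AutoTuner

# Drop cached tactics so one configuration's tuning result
# cannot leak into the next parametrized run.
AutoTuner.get().clear_cache()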


2253-2254: LGTM: Enable autotune properly integrated into test harness.

The enable_autotune flag is correctly extracted from routing_config and passed through to compute_production, enabling per-test-case control of autotuning behavior.

Also applies to: 2270-2270


2010-2010: LGTM: Consistent enable_autotune default in unified computation.

The enable_autotune flag extraction with default True maintains consistency with other parts of the codebase.


2313-2313: LGTM: Good test coverage for enable_autotune flag.

The routing configurations include both enable_autotune: True and False values, ensuring test coverage for both autotuner-enabled and disabled code paths. This aligns well with the PR objectives.

Also applies to: 2329-2329, 2345-2345, 2438-2438, 2454-2454, 2470-2470, 2561-2561, 2635-2635
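
For context, a hypothetical shape for one such routing-config entry (field names other than enable_autotune are assumptions):

renormalize_config = {
    "routing_method": "renormalize",  # assumed field name
    "enable_autotune": True,          # a sibling config sets False to cover both paths
}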


2285-2285: LGTM: Token count adjustments improve test coverage.

The num_tokens parameter values have been adjusted to [8, 768, 3072] for most tests, improving coverage as stated in the PR objectives. The TopK test appropriately maintains [8, 128] due to GeGlu constraints.

Also applies to: 2411-2411, 2536-2536, 2611-2611


2560-2560: Verify intentional removal of intermediate_size=384 from TopK routing.

The compatible_intermediate_size list was changed from [384, 512, 768, 1024] to [512, 768, 1024], removing 384. While this may be intentional to improve test execution speed, please confirm:

  1. Is the removal of 384 intentional for TopK routing?
  2. Note that the test parametrization at line 2538 still includes 384, which will now be skipped by skip_checks logic.

@jiahanc
Collaborator Author

jiahanc commented Nov 10, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !126 has been created, and the CI pipeline #38230531 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #38230531: 13/17 passed

@jiahanc jiahanc requested review from bkryu and yzh119 November 11, 2025 06:23
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@jiahanc
Collaborator Author

jiahanc commented Nov 12, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !126 has been updated with latest changes, and the CI pipeline #38313925 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #38313925: 15/17 passed
