
feat: Autotuner support CUDA graph and cold L2 cache #2663

Merged
yzh119 merged 6 commits into flashinfer-ai:main from amitz-nv:autotuner-improvements-from-trtllm on Mar 3, 2026

Conversation

@amitz-nv (Contributor) commented Mar 1, 2026

📌 Description

Adds support for CUDA graph and cold L2 cache in the autotuner.
Mostly copied from TRTLLM, see https://github.com/NVIDIA/TensorRT-LLM/blob/63c33c7c9a705e6d194a53b7ed54bbaa11494f7d/tensorrt_llm/_torch/autotuner.py#L1134

Currently:

  • Both are disabled by default
  • Both are enabled specifically in trtllm_fp4_block_scale_moe_op
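
For illustration, here is a minimal sketch of how a caller could opt in. The flag names come from this PR's diff; the import path and the remaining TuningConfig fields (left at their defaults) are assumptions:

# Hypothetical usage sketch; only the two new flags are shown.
from flashinfer.autotuner import TuningConfig  # assumed import path

tuning_config = TuningConfig(
    use_cuda_graph=True,     # capture the profiling repeat loop in a CUDA graph and replay it
    use_cold_l2_cache=True,  # rotate through extra input copies so each timed run sees a cold L2
)
# Both flags default to False, so existing callers keep the old eager, single-buffer profiling.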

🔍 Related Issues

Not aware of any.

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • CUDA-graph profiling with batched replay to speed and stabilize kernel profiling.
    • Cold L2-cache profiling mode and automatic L2 cache-size detection for more realistic tuning.
    • Improved CUDA timing/delay handling and clearer CUDA error diagnostics.
  • Chores

    • The tuning configuration gains new flags (use_cuda_graph, use_cold_l2_cache) and propagates them through autotuning.
    • MoE autotuning now accepts extra tuning hints; tests updated to match new profiling signatures.

…t are enabled for trtllm_fp4_block_scale_moe_op

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai Bot (Contributor) commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds two tuning flags (use_cuda_graph, use_cold_l2_cache) and implements batched profiling plus an optional CUDA-graph capture/replay path in the autotuner; MoE runner paths propagate these flags into tuning-config refinement; small test updates to accept the new tuning_config parameter.

Changes

  • Autotuner core (flashinfer/autotuner.py): Added use_cuda_graph and use_cold_l2_cache to TuningConfig. _profile_single_kernel signature extended to accept tuning_config and looser tactic typing. Implemented batched-input preparation (_prepare_input_tensors_with_batches), L2-cache sizing helper (_get_l2_cache_size_in_bytes), CUDA-graph capture/replay path (with _CUDA_GRAPH_DELAY_MICRO_SECS delay handling), and updated choose_one to forward tuning_config.
  • MoE tuning callers (flashinfer/fused_moe/core.py): MoERunner.refine_tuning_config now accepts **kwargs. Call sites updated to pass use_cold_l2_cache=True and use_cuda_graph=True into tuning-config refinement for fused MoE operator variants.
  • Tests (tests/autotuner/test_autotuner_core.py): Test stubs updated to accept an optional tuning_config parameter in fake profiling helpers to match the new profiling signature.
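
A rough sketch of the MoE kwargs plumbing described above, with placeholder structure where the real code in flashinfer/fused_moe/core.py is not shown here (assumes the remaining TuningConfig fields have defaults):

import functools

from flashinfer.autotuner import TuningConfig  # assumed import path


class MoERunner:
    tuning_config = None

    @classmethod
    @functools.lru_cache(maxsize=None)
    def refine_tuning_config(cls, **kwargs) -> None:
        # Extra hints such as use_cuda_graph=True / use_cold_l2_cache=True are forwarded
        # verbatim into TuningConfig; lru_cache requires every forwarded value to be hashable.
        cls.tuning_config = TuningConfig(**kwargs)


# Call-site shape used for the fused MoE op in this PR:
MoERunner.refine_tuning_config(use_cuda_graph=True, use_cold_l2_cache=True)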

Sequence Diagram(s)

sequenceDiagram
participant Caller as Caller
participant AutoTuner as AutoTuner
participant Runner as TunableRunner
participant CUDA as CUDA

Caller->>AutoTuner: choose_one(op, runners, tuning_config, inputs)
AutoTuner->>AutoTuner: _prepare_input_tensors_with_batches(inputs, tuning_config)
AutoTuner->>Runner: _profile_single_kernel(runner, batched_inputs, tactic, tuning_config)
alt tuning_config.use_cuda_graph
AutoTuner->>CUDA: beginGraphCapture() (host stream)
Runner->>CUDA: execute kernels (warmup + timed)
CUDA-->>Runner: endGraphCapture()/createGraph()
AutoTuner->>CUDA: replayGraph() (with graph delay)
else not use_cuda_graph
Runner->>CUDA: execute kernels directly (warmup + timed) on stream
end
Runner-->>AutoTuner: timing result
AutoTuner-->>Caller: chosen runner + tactic
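
As a standalone illustration of the capture-and-replay timing pattern in the diagram (plain PyTorch; this is not the autotuner's actual pure_profile helper, which is quoted later in the review thread):

import torch

assert torch.cuda.is_available()
stream = torch.cuda.Stream()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
out = torch.empty_like(a)
repeat = 8

with torch.cuda.stream(stream):
    # Warm up once before capture (one-time cuBLAS setup is not graph-capturable).
    torch.matmul(a, b, out=out)
    stream.synchronize()

    # Capture the repeated launches once ...
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        for _ in range(repeat):
            torch.matmul(a, b, out=out)

    # ... then time a single replay, which re-issues all `repeat` launches
    # without per-launch host overhead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    stream.synchronize()
    start.record()
    graph.replay()
    end.record()
    stream.synchronize()

print(f"avg kernel time: {start.elapsed_time(end) / repeat:.3f} ms")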

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

op: moe

Suggested reviewers

  • IwakuraRein
  • bkryu
  • nvmbreughe
  • kahyunnam
  • jimmyzho
  • nv-yunzheq

Poem

🐰 I hopped through streams to catch a spark,
Captured a graph before it grew dark,
Bunched my tensors in tidy stacks,
Timed the kernels, then hopped back.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately and concisely describes the main feature addition: CUDA graph and cold L2 cache support in the autotuner.
  • Description check (✅ Passed): The description covers what the PR does, references the source (TRTLLM), and documents the feature scope, but the Pre-commit Checks and Tests sections show incomplete items.


@gemini-code-assist Bot (Contributor) commented:
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the autotuner's capability to perform more realistic and accurate performance profiling. By incorporating support for CUDA graphs and simulating cold L2 cache conditions, the autotuner can now identify optimal kernel implementations that perform better in real-world scenarios where L2 cache might not be warm or operations are part of a larger CUDA graph. This leads to improved overall performance for operations leveraging these tuning advancements.

Highlights

  • CUDA Graph Support: Added functionality to the autotuner to profile kernels using CUDA graphs, which can provide more accurate performance measurements for graph-compatible operations.
  • Cold L2 Cache Simulation: Implemented a mechanism to simulate a 'cold' L2 cache state during profiling. This involves creating multiple input tensor batches to reduce L2 cache hits, leading to more realistic performance tuning results, especially for kernels sensitive to cache warm-up.
  • Configurable Tuning Options: Introduced use_cold_l2_cache and use_cuda_graph boolean flags to the TuningConfig dataclass, allowing these features to be enabled or disabled per tuning configuration.
  • Integration in MoE Operations: Explicitly enabled both CUDA graph and cold L2 cache simulation for the trtllm_fp4_block_scale_moe_op to optimize its performance under these specific conditions.
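
To make the cold-L2 simulation described in the highlights above concrete, here is a rough standalone sketch of the buffer-rotation idea; the PR's actual _prepare_input_tensors_with_batches may size and clone buffers differently:

import torch


def make_input_batches(inputs, l2_cache_bytes, repeat):
    """Clone the profiling inputs enough times that consecutive timed runs touch
    different memory, so a run rarely hits data left in L2 by the previous one."""
    one_batch_bytes = sum(
        t.numel() * t.element_size() for t in inputs if isinstance(t, torch.Tensor)
    )
    # Heuristic (mirroring the review discussion below): cycle through roughly 3x the
    # L2 capacity worth of distinct buffers, but never more copies than profiling repeats.
    num_batches = min(l2_cache_bytes * 3 // max(one_batch_bytes, 1) + 1, repeat + 1)
    return [
        [t.clone() if isinstance(t, torch.Tensor) else t for t in inputs]
        for _ in range(num_batches)
    ]


# During timing, iteration r then uses batches[r % len(batches)].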


Changelog
  • flashinfer/autotuner.py
    • Imported cuda.bindings.driver for CUDA API interactions.
    • Added use_cold_l2_cache and use_cuda_graph fields to TuningConfig.
    • Introduced _CUDA_GRAPH_DELAY_MICRO_SECS constant for graph profiling delays.
    • Modified _profile_single_kernel to accept tuning_config and incorporate CUDA graph capture/replay logic.
    • Implemented _prepare_input_tensors_with_batches to create multiple input buffers for cold L2 cache simulation.
    • Added _get_l2_cache_size_in_bytes, _checkCudaErrors, and _cudaGetErrorEnum helper methods for CUDA driver API interaction and error handling.
  • flashinfer/fused_moe/core.py
    • Updated MoERunner.refine_tuning_config to accept arbitrary keyword arguments (**kwargs).
    • Passed **kwargs to TuningConfig constructor within refine_tuning_config.
    • Enabled use_cold_l2_cache=True and use_cuda_graph=True for trtllm_fp4_block_scale_moe_op's tuning configuration.
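
For context, a rough sketch of the kind of (err, value) unpacking helper the changelog refers to; cuda-python driver calls return a tuple whose first element is a CUresult status. Note that a later commit in this thread removed these helpers in favor of a PyTorch-only L2 query:

from cuda.bindings import driver  # optional dependency; see the review discussion below


def check_cuda_errors(result):
    """Illustrative unpacking for cuda-python driver calls; raises on non-success status."""
    err, *values = result
    if err != driver.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver call failed: {err}")
    return values[0] if len(values) == 1 else values


# Example: query the L2 cache size of device 0 (cuInit must run before device queries).
check_cuda_errors(driver.cuInit(0))
device = check_cuda_errors(driver.cuDeviceGet(0))
l2_bytes = check_cuda_errors(
    driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, device
    )
)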

@amitz-nv amitz-nv changed the title feat: Support CUDA graph and cold L2 cache in autotuner feat: Autotuner support CUDA graph and cold L2 cache Mar 1, 2026
@gemini-code-assist Bot left a comment

Code Review

This pull request adds support for CUDA graph and cold L2 cache to the autotuner, which is a great feature for improving profiling accuracy. The implementation is mostly copied from TRT-LLM and looks solid. I have one suggestion to refactor a small piece of duplicated code to improve maintainability.

Comment thread flashinfer/autotuner.py
Comment on lines +601 to +641
        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            graph = torch.cuda.CUDAGraph()

            with torch.cuda.stream(stream):
                if tuning_config.use_cuda_graph:
                    with torch.cuda.graph(graph):
                        for r in range(repeat):
                            runner(
                                input_tensor_batches[r % len(input_tensor_batches)],
                                tactic=tactic,
                                **kwargs,
                            )

                stream.synchronize()

                # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
                delay_kernel_time_usec = (
                    self._CUDA_GRAPH_DELAY_MICRO_SECS
                    if tuning_config.use_cuda_graph
                    else self.stream_delay_micro_secs
                )
                delay_kernel(delay_kernel_time_usec)

                start.record()

                if tuning_config.use_cuda_graph:
                    graph.replay()
                else:
                    for r in range(repeat):
                        runner(
                            input_tensor_batches[r % len(input_tensor_batches)],
                            tactic=tactic,
                            **kwargs,
                        )

                end.record()
                stream.synchronize()

-        # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
-        # TODO: This is build time sensitive, O(tactic_num * impl_num * num_profile * tunable_ops)
-        # Consider apply a preprofiling to estimate the kernel execution time, then decide the necessity.
-        if self.stream_delay_micro_secs > 0:
-            delay_kernel(self.stream_delay_micro_secs)
-        start = torch.cuda.Event(enable_timing=True)
-        end = torch.cuda.Event(enable_timing=True)

                return start.elapsed_time(end) / repeat
gemini-code-assist Bot commented (severity: medium):
To improve code clarity and reduce duplication, you can extract the repeated kernel execution loop into a nested helper function within pure_profile. This makes the logic for both CUDA graph and regular execution paths cleaner and easier to maintain.

        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            graph = torch.cuda.CUDAGraph()

            def _run_kernels():
                for r in range(repeat):
                    runner(
                        input_tensor_batches[r % len(input_tensor_batches)],
                        tactic=tactic,
                        **kwargs,
                    )

            with torch.cuda.stream(stream):
                if tuning_config.use_cuda_graph:
                    with torch.cuda.graph(graph):
                        _run_kernels()

                stream.synchronize()

                # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
                delay_kernel_time_usec = (
                    self._CUDA_GRAPH_DELAY_MICRO_SECS
                    if tuning_config.use_cuda_graph
                    else self.stream_delay_micro_secs
                )
                delay_kernel(delay_kernel_time_usec)

                start.record()

                if tuning_config.use_cuda_graph:
                    graph.replay()
                else:
                    _run_kernels()

                end.record()
                stream.synchronize()

                return start.elapsed_time(end) / repeat

@coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)

1284-1310: ⚠️ Potential issue | 🟡 Minor

Consider documenting or enforcing hashable **kwargs in refine_tuning_config.

The @functools.lru_cache decorator requires all function arguments to be hashable. While current callers at line 1893 only pass booleans (use_cold_l2_cache=True, use_cuda_graph=True), which are safely hashable, the method signature accepts arbitrary **kwargs without constraints. Passing unhashable values (lists, dicts, etc.) through **kwargs would cause a TypeError at runtime. Add type hints restricting kwargs to hashable types or document the limitation to prevent future misuse.
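
A small self-contained illustration of the hashability constraint, plus one possible normalization helper (illustrative only, not the project's code):

import functools


@functools.lru_cache(maxsize=None)
def refine(**kwargs):
    # lru_cache builds its cache key from the argument values, so every kwarg must be hashable.
    return tuple(sorted(kwargs.items()))


refine(use_cuda_graph=True, use_cold_l2_cache=True)  # fine: bools are hashable
# refine(dynamic_shapes=[1, 2, 3])                   # TypeError: unhashable type: 'list'


def freeze(value):
    """One way to normalize nested containers into hashable equivalents before caching."""
    if isinstance(value, dict):
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value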

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 1284 - 1310, The cached
classmethod refine_tuning_config is decorated with functools.lru_cache but
accepts arbitrary **kwargs which may be unhashable and will raise TypeError;
update the signature and implementation to enforce hashable kwargs (e.g., change
to **kwargs: Mapping[str, Hashable] or accept a single frozen mapping) or
normalize/validate incoming kwargs into a hashable form before using them in the
cached call (for example convert nested dicts/lists to frozensets/tuples), and
update callers or document the constraint; reference refine_tuning_config, the
functools.lru_cache decorator, and the class attributes
tuning_config_with_hidden_states_scales / tuning_config_no_hidden_states_scales
when making the change.
🧹 Nitpick comments (3)
flashinfer/autotuner.py (3)

601-647: Consider moving CUDAGraph instantiation inside the conditional.

The torch.cuda.CUDAGraph() object is created unconditionally on line 604 but only used when use_cuda_graph is True. This creates unnecessary overhead when CUDA graphs are disabled.

♻️ Proposed refactor
         def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
             start = torch.cuda.Event(enable_timing=True)
             end = torch.cuda.Event(enable_timing=True)
-            graph = torch.cuda.CUDAGraph()

             with torch.cuda.stream(stream):
                 if tuning_config.use_cuda_graph:
+                    graph = torch.cuda.CUDAGraph()
                     with torch.cuda.graph(graph):
                         for r in range(repeat):
                             runner(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 601 - 647, The CUDAGraph object is
always instantiated in pure_profile even though it's only used when
tuning_config.use_cuda_graph is True; move the torch.cuda.CUDAGraph() creation
into the branch guarded by tuning_config.use_cuda_graph (the block that
currently does "with torch.cuda.graph(graph):") so graph is only created when
needed, ensure any references (graph.replay()) remain valid by keeping graph in
scope of the branch where it's used, and remove the unconditional graph variable
to avoid overhead when use_cuda_graph is False.

923-924: Use TypeError for invalid type detection.

Per static analysis, TypeError is more appropriate than RuntimeError when the error condition is about an unexpected type.

♻️ Proposed fix
         else:
-            raise RuntimeError("Unknown error type: {}".format(error))
+            raise TypeError(f"Unknown error type: {type(error).__name__}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 923 - 924, The code currently raises
RuntimeError("Unknown error type: {}".format(error)); replace that with raising
a TypeError instead to reflect an invalid/unexpected type detection (e.g., raise
TypeError("Unknown error type: {}".format(error))). Locate the branch that
raises RuntimeError (the else that handles unknown error types) and swap
RuntimeError for TypeError, keeping the existing message or slightly clarifying
it as needed.

869-870: Add documentation for the 3x L2 cache size multiplier heuristic.

The * 3 factor appears to be a heuristic to ensure sufficient data is cycled through to evict L2 cache entries. Consider adding a brief comment explaining this design choice for maintainability.

📝 Proposed documentation
+        # Use 3x L2 cache size to ensure cold cache simulation by evicting
+        # previously cached data through sufficient memory pressure
         num_buffers = self._get_l2_cache_size_in_bytes() * 3 // one_buffer_bytes + 1
         num_buffers = min(num_buffers, self.repeat + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 869 - 870, Add a short inline comment
next to the num_buffers computation explaining the "* 3" heuristic: in the line
where num_buffers is computed using self._get_l2_cache_size_in_bytes(), note
that multiplying by 3 aims to cycle roughly three times the L2 cache capacity
(to increase likelihood of evicting L2-resident data) before dividing by
one_buffer_bytes, and that the result is clamped by min(..., self.repeat + 1);
reference the symbols num_buffers, _get_l2_cache_size_in_bytes,
one_buffer_bytes, and repeat so the rationale is clear for future maintainers.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f521fe1 and 6134222.

📒 Files selected for processing (2)
  • flashinfer/autotuner.py
  • flashinfer/fused_moe/core.py

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6134222 and ceabfe0.

📒 Files selected for processing (1)
  • flashinfer/autotuner.py

Comment thread flashinfer/autotuner.py Outdated
Comment on lines 13 to 14
from cuda.bindings import driver


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Step 1: Check dependency declarations for cuda-python/cuda-bindings
echo "=== Checking dependency declarations ==="
fd -HI 'pyproject.toml|setup.py|setup.cfg|requirements.*\.txt|environment.*\.y[a]?ml' \
  | xargs -r rg -nP 'cuda-python|cuda-bindings|nvidia-cuda-python|cuda\.bindings' \
  | head -50

echo -e "\n=== Checking flashinfer/autotuner.py line 13 and surrounding context ==="
head -30 flashinfer/autotuner.py | cat -n

echo -e "\n=== Checking lines 889-922 in flashinfer/autotuner.py ==="
sed -n '880,925p' flashinfer/autotuner.py | cat -n

echo -e "\n=== Checking all driver API usage in autotuner.py ==="
rg -nP 'cuInit|cuDeviceGet|cuDeviceGetAttribute|from cuda' flashinfer/autotuner.py -B2 -A2

Repository: flashinfer-ai/flashinfer

Length of output: 4066


🏁 Script executed:

#!/bin/bash
echo "=== Checking pyproject.toml full content for dependencies ==="
cat pyproject.toml | head -100

echo -e "\n=== Checking setup.py if it exists ==="
[ -f setup.py ] && head -50 setup.py || echo "No setup.py found"

echo -e "\n=== Checking setup.cfg if it exists ==="
[ -f setup.cfg ] && cat setup.cfg || echo "No setup.cfg found"

echo -e "\n=== Searching for any mention of cuda in pyproject.toml ==="
rg -i 'cuda|torch' pyproject.toml

echo -e "\n=== Checking if there are any other imports from cuda.bindings in the codebase ==="
rg 'from cuda\.bindings import|import cuda\.bindings' --type=py | head -20

echo -e "\n=== Checking if cuInit is called anywhere in flashinfer ==="
rg 'cuInit' --type=py

Repository: flashinfer-ai/flashinfer

Length of output: 4889


🏁 Script executed:

#!/bin/bash
echo "=== Checking requirements.txt for cuda-python ==="
cat requirements.txt

echo -e "\n=== Checking how green_ctx.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/green_ctx.py -B2 -A2

echo -e "\n=== Checking how cuda_utils.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/cuda_utils.py -B3 -A3

echo -e "\n=== Checking how gdn_decode.py handles cuda imports ==="
grep -n 'cuda.bindings\|import cuda' flashinfer/gdn_decode.py -B2 -A2

echo -e "\n=== Check if _get_l2_cache_size_in_bytes is called in the codebase ==="
rg '_get_l2_cache_size_in_bytes' --type=py -B2 -A2

Repository: flashinfer-ai/flashinfer

Length of output: 2127


🏁 Script executed:

#!/bin/bash
echo "=== Checking where _get_l2_cache_size_in_bytes is called ==="
rg '_get_l2_cache_size_in_bytes' flashinfer/autotuner.py -B10 -A3

echo -e "\n=== Checking if cuda-python is a transitive dependency ==="
rg -i 'cuda-python|cuda.bindings' pyproject.toml

echo -e "\n=== Checking build_utils.py and setup process for cuda dependencies ==="
head -50 build_utils.py 2>/dev/null || echo "No build_utils.py in root"

echo -e "\n=== Check if there's optional dependency on cuda-python ==="
rg -i 'extras_require|optional' pyproject.toml

echo -e "\n=== Check what torch depends on in terms of CUDA bindings ==="
# Check if torch typically includes cuda-python
python3 -c "import sys; print('Python available')" 2>/dev/null && echo "Python available for querying" || echo "Python not available"

echo -e "\n=== Check line context where _get_l2_cache_size_in_bytes is called (broader context) ==="
sed -n '865,905p' flashinfer/autotuner.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 4655


Harden CUDA driver integration: add error handling for missing dependency, initialize driver, and respect current device context.

The cuda.bindings import at line 13 lacks error handling and cuda-python is not declared in requirements.txt, causing import failures when the package is absent. At lines 889–895, the driver API calls should initialize the driver with cuInit(0) before querying devices, and should not default to device 0 since this breaks multi-GPU contexts—use torch.cuda.current_device() instead.

Other files like green_ctx.py and cuda_utils.py properly wrap cuda.bindings imports in try-except blocks; follow that pattern here.

Suggested fixes
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None
-    def _get_l2_cache_size_in_bytes(self, device_id: int = 0) -> int:
-        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
-        return self._checkCudaErrors(
-            driver.cuDeviceGetAttribute(
-                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
-                device,
-            )
-        )
+    def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
+        if device_id is None:
+            device_id = torch.cuda.current_device()
+
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
+        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
+        return self._checkCudaErrors(
+            driver.cuDeviceGetAttribute(
+                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
+                device,
+            )
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 13 - 14, Wrap the top-level "from
cuda.bindings import driver" in a try/except to gracefully handle missing
cuda-python (set driver = None or re-raise a clear ImportError), and add
cuda-python to requirements.txt; in the code path that queries CUDA devices (the
block that calls driver.cuDeviceGet/driver.cuDeviceGetName etc.) ensure you call
driver.cuInit(0) before any device queries and replace the hardcoded device 0
with torch.cuda.current_device() (use that index when calling driver.cuDeviceGet
and related APIs), and guard these calls to raise a helpful error or skip CUDA
logic if driver is None.

Comment thread flashinfer/autotuner.py
        avg_time = float("inf")

        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
Collaborator:
Why don't we reuse the functions available in https://github.com/flashinfer-ai/flashinfer/blob/f521fe19ac387e8baffd7b5c925ef59d9f2ecc0c/flashinfer/testing/utils.py

AFAIK, bench_gpu_time_with_cupti should provide the most precise measurement at the moment.

cc @nv-yunzheq @bkryu for confirmation.

Contributor:

In my experience with the flashinfer benchmark script, combining CUPTI-based timing with nsys results in CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED.

If a user wants to profile flashinfer kernels in e2e runs, this would prevent them from doing so.

Contributor (PR author):

Following what @rosenrodt replied, do you @yzh119 think it should be changed?

Collaborator:

I think this is true; nsys and CUPTI go through the same mechanism underneath, and it is critical that we unblock nsys analysis in library code paths.

Citing https://docs.nvidia.com/cupti/api/group__CUPTI__CALLBACK__API.html#_CPPv414cuptiSubscribeP22CUpti_SubscriberHandle18CUpti_CallbackFuncPv

I'm going to say let's approve this as-is, at least for now.

Collaborator:

Yes, nsys uses the lower-level CUPTI, and NVIDIA's cupti-python (obviously) also uses the lower-level CUPTI.

As a result, profiling with nsys produces CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED. I suggest that we do not use bench_gpu_time_with_cupti for the autotuner for this reason. Losing the ability to profile with nsys would be a bit too much.

Collaborator:

Thanks for explaining this to me; now I can understand why CUPTI isn't used here. At least we can consider reusing the CUDA-event benchmarking functions under flashinfer.testing (maybe in future PRs).

Let's merge this PR now.

@aleozlx aleozlx added the v0.6.5 release blocker label ("label for v0.6.5") Mar 2, 2026
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv requested a review from yzh119 March 2, 2026 18:19
@coderabbitai Bot left a comment

♻️ Duplicate comments (1)
flashinfer/autotuner.py (1)

13-13: ⚠️ Potential issue | 🟠 Major

Guard optional CUDA bindings and initialize the driver before device queries.

Line 13 makes cuda.bindings a hard import dependency, and Line 893 uses driver APIs without cuInit(0). This can fail at import/runtime even though these modes are optional/disabled by default.

Suggested patch
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None
@@
     def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
         if device_id is None:
             device_id = torch.cuda.current_device()
 
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
         device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
         return self._checkCudaErrors(
             driver.cuDeviceGetAttribute(
                 driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
                 device,
             )
         )
#!/bin/bash
set -euo pipefail

echo "== Verify CUDA Python dependency declaration =="
fd -HI 'pyproject.toml|requirements*.txt|setup.py|setup.cfg' \
  | xargs -r rg -n 'cuda-python|cuda-bindings|cuda\.bindings' || true

echo
echo "== Verify driver init/query ordering in autotuner =="
rg -n -C3 'cuInit\(|cuDeviceGet\(|cuDeviceGetAttribute\(' flashinfer/autotuner.py

Expected: either guarded import/fallback path exists, and cuInit(0) appears before device attribute queries in _get_l2_cache_size_in_bytes.

Also applies to: 889-899

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` at line 13, The hard import of cuda.bindings.driver
should be guarded and the CUDA driver must be initialized before any device
queries: change the module import to safely try/except ImportError (or set
driver = None) for "from cuda.bindings import driver" and add an initialization
check that calls driver.cuInit(0) before any driver.* queries; specifically,
update _get_l2_cache_size_in_bytes (and neighboring code that calls cuDeviceGet
/ cuDeviceGetAttribute) to first ensure the driver is present and initialized
(call cuInit(0) once, handle and log errors) and return a safe fallback when the
CUDA bindings are unavailable instead of crashing.
🧹 Nitpick comments (1)
flashinfer/autotuner.py (1)

925-925: Use TypeError for invalid error-type input.

Line 925 handles an invalid input type in _cudaGetErrorEnum; TypeError is the more precise exception class.

Suggested patch
-            raise RuntimeError("Unknown error type: {}".format(error))
+            raise TypeError("Unknown error type: {}".format(error))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` at line 925, The code in _cudaGetErrorEnum currently
raises RuntimeError for an invalid error-type input; change this to raise a
TypeError instead so the exception accurately reflects an incorrect input type.
Update the raise statement in _cudaGetErrorEnum (the line that currently does
raise RuntimeError("Unknown error type: {}".format(error))) to raise TypeError
with the same informative message, ensuring callers receive a type-specific
exception.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceabfe0 and e6d7e44.

📒 Files selected for processing (1)
  • flashinfer/autotuner.py

@aleozlx aleozlx added the run-ci label Mar 2, 2026
@aleozlx (Collaborator) commented Mar 2, 2026

/bot run

@flashinfer-bot (Collaborator):
GitLab MR !363 has been created, and the CI pipeline #45167053 is currently running. I'll report back once the pipeline job completes.

@aleozlx aleozlx added the ready label Mar 2, 2026
@flashinfer-bot (Collaborator):
[CANCELING] Pipeline #45167053: canceled


@yzh119 yzh119 enabled auto-merge (squash) March 3, 2026 01:24
Removed CUDA error handling methods and replaced L2 cache size retrieval with PyTorch's method. This is due to an AOT build error in CI
@aleozlx aleozlx removed the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

@flashinfer-bot run

@aleozlx (Collaborator) commented Mar 3, 2026

Removed CUDA error handling methods and replaced L2 cache size retrieval with PyTorch's method. This is due to an AOT build error in CI
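
For reference, a hedged sketch of what a PyTorch-only L2-cache-size query can look like (the exact attribute name on the device-properties object varies across PyTorch versions, hence the getattr fallback; this is not necessarily the code that landed):

import torch


def l2_cache_size_in_bytes(device_id=None) -> int:
    """Best-effort L2 cache size from PyTorch device properties; 0 if unavailable."""
    if device_id is None:
        device_id = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device_id)
    return int(getattr(props, "L2_cache_size", getattr(props, "l2_cache_size", 0)))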

@aleozlx aleozlx added the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

two failed tests are seen on the public CI

https://github.com/flashinfer-ai/flashinfer/actions/runs/22613536689/job/65520873046?pr=2663

tests/autotuner/test_autotuner_core.py:326: AssertionError
----------------------------- Captured stderr call -----------------------------
2026-03-03 10:59:00,394 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 0, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 1, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 2, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,395 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends

- generated xml file: /workspace/junit/tests_autotuner_test_autotuner_core.py.xml -
=========================== short test summary info ============================
FAILED tests/autotuner/test_autotuner_core.py::test_search_cache_preserving_leading_dims_hits_while_flattened_misses
FAILED tests/autotuner/test_autotuner_core.py::test_choose_one_tuning_selects_best_tactic_and_populates_cache
========================= 2 failed, 25 passed in 0.39s =========================
❌ FAILED: tests/autotuner/test_autotuner_core.py

@aleozlx (Collaborator) commented Mar 3, 2026

The failed tests are in a new file from #2617.

aleozlx and others added 2 commits March 3, 2026 04:14
…ature

The _profile_single_kernel method added a tuning_config parameter but
the fake_profile mocks in two tests were not updated, causing them to
fail with "takes 4 positional arguments but 5 were given".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aleozlx aleozlx requested a review from nvmbreughe as a code owner March 3, 2026 12:23
@aleozlx aleozlx removed the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

@flashinfer-bot run

@aleozlx aleozlx added the run-ci label Mar 3, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/autotuner/test_autotuner_core.py (1)

264-266: Silence intentional-unused-argument warnings in test mocks.

These mocks intentionally ignore most parameters; prefixing with _ avoids ARG001 noise while keeping compatibility.

Suggested cleanup
-    def fake_profile(
-        self, runner_obj, prof_inputs, tactic, tuning_config=None, **kwargs
-    ):
+    def fake_profile(
+        _self, _runner_obj, _prof_inputs, tactic, _tuning_config=None, **_kwargs
+    ):
         return {0: 5.0, 1: 1.0, 2: 3.0}[tactic]

Also applies to: 320-322

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/autotuner/test_autotuner_core.py` around lines 264 - 266, The test mock
function fake_profile (and the other similar mock at the later block)
intentionally ignores most parameters; rename unused parameters by prefixing
them with an underscore (e.g., change runner_obj, prof_inputs, tactic,
tuning_config to _runner_obj, _prof_inputs, _tactic, _tuning_config, or similar)
so ARG001 linter warnings are silenced while keeping the mock signature
compatible; update both fake_profile and the other mock function referenced in
the same test file.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17e8a10 and 766eabd.

📒 Files selected for processing (3)
  • flashinfer/autotuner.py
  • flashinfer/fused_moe/core.py
  • tests/autotuner/test_autotuner_core.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/fused_moe/core.py

Comment thread flashinfer/autotuner.py
Comment on lines +146 to +147
use_cold_l2_cache: bool = False
use_cuda_graph: bool = False

⚠️ Potential issue | 🟠 Major

Include tuning-mode flags in autotuner cache keys.

Line 146 and Line 147 add profiling-mode knobs, but cache keys still don’t distinguish them. This can reuse tactics tuned under different measurement modes (e.g., graph/cold-L2 vs default), leading to incorrect tactic selection.

Suggested fix
 def _get_cache_key(
     cls,
     custom_op: str,
     runner: TunableRunner,
     input_shapes: Tuple[torch.Size],
     tuning_config: TuningConfig,
 ) -> Tuple:
     return (
         custom_op,
         runner.__class__.__name__,
         hash(runner),
         cls._find_nearest_profile(input_shapes, tuning_config),
+        tuning_config.use_cold_l2_cache,
+        tuning_config.use_cuda_graph,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 146 - 147, The cache-key generation
must incorporate the new profiling-mode flags so tuned tactics for different
modes aren't reused; update the Autotuner cache-key builder (the function/method
that composes the cache key used by the autotuner) to include the boolean flags
use_cold_l2_cache and use_cuda_graph (or equivalent attributes) when
constructing the key so keys differ when these flags change, and ensure any
lookups/insertions of the cache use the updated key format.


Labels

ready, run-ci, v0.6.5 release blocker ("label for v0.6.5")


6 participants