
feat: Autotuner support CUDA graph and cold L2 cache #2663

Merged
yzh119 merged 6 commits into flashinfer-ai:main from amitz-nv:autotuner-improvements-from-trtllm on Mar 3, 2026

Conversation

@amitz-nv (Contributor) commented Mar 1, 2026

📌 Description

Adds support for CUDA graph and cold L2 cache in the autotuner.
Mostly copied from TRTLLM, see https://github.com/NVIDIA/TensorRT-LLM/blob/63c33c7c9a705e6d194a53b7ed54bbaa11494f7d/tensorrt_llm/_torch/autotuner.py#L1134

Currently:

  • Both are disabled by default
  • Both are enabled specifically in trtllm_fp4_block_scale_moe_op
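
For illustration, here is a minimal sketch of how a caller could opt in. The flag names come from this PR's diff; the import path and the remaining TuningConfig fields (left at their defaults) are assumptions:

# Hypothetical usage sketch; only the two new flags are shown.
from flashinfer.autotuner import TuningConfig  # assumed import path

tuning_config = TuningConfig(
    use_cuda_graph=True,     # capture the profiling repeat loop in a CUDA graph and replay it
    use_cold_l2_cache=True,  # rotate through extra input copies so each timed run sees a cold L2
)
# Both flags default to False, so existing callers keep the old eager, single-buffer profiling.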

🔍 Related Issues

Not aware of any.

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • CUDA-graph profiling with batched replay to speed and stabilize kernel profiling.
    • Cold L2-cache profiling mode and automatic L2 cache-size detection for more realistic tuning.
    • Improved CUDA timing/delay handling and clearer CUDA error diagnostics.
  • Chores

    • The tuning configuration gains new flags (use_cuda_graph, use_cold_l2_cache) and propagates them through autotuning.
    • MoE autotuning now accepts extra tuning hints; tests updated to match new profiling signatures.

…t are enabled for trtllm_fp4_block_scale_moe_op

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai Bot (Contributor) commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds two tuning flags (use_cuda_graph, use_cold_l2_cache) and implements batched profiling plus an optional CUDA-graph capture/replay path in the autotuner; MoE runner paths propagate these flags into tuning-config refinement; small test updates to accept the new tuning_config parameter.

Changes

  • Autotuner core (flashinfer/autotuner.py): Added use_cuda_graph and use_cold_l2_cache to TuningConfig. _profile_single_kernel signature extended to accept tuning_config and looser tactic typing. Implemented batched-input preparation (_prepare_input_tensors_with_batches), L2-cache sizing helper (_get_l2_cache_size_in_bytes), CUDA-graph capture/replay path (with _CUDA_GRAPH_DELAY_MICRO_SECS delay handling), and updated choose_one to forward tuning_config.
  • MoE tuning callers (flashinfer/fused_moe/core.py): MoERunner.refine_tuning_config now accepts **kwargs. Call sites updated to pass use_cold_l2_cache=True and use_cuda_graph=True into tuning-config refinement for fused MoE operator variants.
  • Tests (tests/autotuner/test_autotuner_core.py): Test stubs updated to accept an optional tuning_config parameter in fake profiling helpers to match the new profiling signature.
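
A rough sketch of the MoE kwargs plumbing described above, with placeholder structure where the real code in flashinfer/fused_moe/core.py is not shown here (assumes the remaining TuningConfig fields have defaults):

import functools

from flashinfer.autotuner import TuningConfig  # assumed import path


class MoERunner:
    tuning_config = None

    @classmethod
    @functools.lru_cache(maxsize=None)
    def refine_tuning_config(cls, **kwargs) -> None:
        # Extra hints such as use_cuda_graph=True / use_cold_l2_cache=True are forwarded
        # verbatim into TuningConfig; lru_cache requires every forwarded value to be hashable.
        cls.tuning_config = TuningConfig(**kwargs)


# Call-site shape used for the fused MoE op in this PR:
MoERunner.refine_tuning_config(use_cuda_graph=True, use_cold_l2_cache=True)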

Sequence Diagram(s)

sequenceDiagram
participant Caller as Caller
participant AutoTuner as AutoTuner
participant Runner as TunableRunner
participant CUDA as CUDA

Caller->>AutoTuner: choose_one(op, runners, tuning_config, inputs)
AutoTuner->>AutoTuner: _prepare_input_tensors_with_batches(inputs, tuning_config)
AutoTuner->>Runner: _profile_single_kernel(runner, batched_inputs, tactic, tuning_config)
alt tuning_config.use_cuda_graph
AutoTuner->>CUDA: beginGraphCapture() (host stream)
Runner->>CUDA: execute kernels (warmup + timed)
CUDA-->>Runner: endGraphCapture()/createGraph()
AutoTuner->>CUDA: replayGraph() (with graph delay)
else not use_cuda_graph
Runner->>CUDA: execute kernels directly (warmup + timed) on stream
end
Runner-->>AutoTuner: timing result
AutoTuner-->>Caller: chosen runner + tactic
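
As a standalone illustration of the capture-and-replay timing pattern in the diagram (plain PyTorch; this is not the autotuner's actual pure_profile helper, which is quoted later in the review thread):

import torch

assert torch.cuda.is_available()
stream = torch.cuda.Stream()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
out = torch.empty_like(a)
repeat = 8

with torch.cuda.stream(stream):
    # Warm up once before capture (one-time cuBLAS setup is not graph-capturable).
    torch.matmul(a, b, out=out)
    stream.synchronize()

    # Capture the repeated launches once ...
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        for _ in range(repeat):
            torch.matmul(a, b, out=out)

    # ... then time a single replay, which re-issues all `repeat` launches
    # without per-launch host overhead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    stream.synchronize()
    start.record()
    graph.replay()
    end.record()
    stream.synchronize()

print(f"avg kernel time: {start.elapsed_time(end) / repeat:.3f} ms")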

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested labels

op: moe

Suggested reviewers

  • IwakuraRein
  • bkryu
  • nvmbreughe
  • kahyunnam
  • jimmyzho
  • nv-yunzheq

Poem

🐰 I hopped through streams to catch a spark,
Captured a graph before it grew dark,
Bunched my tensors in tidy stacks,
Timed the kernels, then hopped back.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately and concisely describes the main feature addition: CUDA graph and cold L2 cache support in the autotuner.
  • Description check (✅ Passed): The description covers what the PR does, references the source (TRTLLM), and documents the feature scope, but the Pre-commit Checks and Tests sections show incomplete items.


@gemini-code-assist Bot (Contributor) commented:
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the autotuner's capability to perform more realistic and accurate performance profiling. By incorporating support for CUDA graphs and simulating cold L2 cache conditions, the autotuner can now identify optimal kernel implementations that perform better in real-world scenarios where L2 cache might not be warm or operations are part of a larger CUDA graph. This leads to improved overall performance for operations leveraging these tuning advancements.

Highlights

  • CUDA Graph Support: Added functionality to the autotuner to profile kernels using CUDA graphs, which can provide more accurate performance measurements for graph-compatible operations.
  • Cold L2 Cache Simulation: Implemented a mechanism to simulate a 'cold' L2 cache state during profiling. This involves creating multiple input tensor batches to reduce L2 cache hits, leading to more realistic performance tuning results, especially for kernels sensitive to cache warm-up.
  • Configurable Tuning Options: Introduced use_cold_l2_cache and use_cuda_graph boolean flags to the TuningConfig dataclass, allowing these features to be enabled or disabled per tuning configuration.
  • Integration in MoE Operations: Explicitly enabled both CUDA graph and cold L2 cache simulation for the trtllm_fp4_block_scale_moe_op to optimize its performance under these specific conditions.
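
To make the cold-L2 simulation described in the highlights above concrete, here is a rough standalone sketch of the buffer-rotation idea; the PR's actual _prepare_input_tensors_with_batches may size and clone buffers differently:

import torch


def make_input_batches(inputs, l2_cache_bytes, repeat):
    """Clone the profiling inputs enough times that consecutive timed runs touch
    different memory, so a run rarely hits data left in L2 by the previous one."""
    one_batch_bytes = sum(
        t.numel() * t.element_size() for t in inputs if isinstance(t, torch.Tensor)
    )
    # Heuristic (mirroring the review discussion below): cycle through roughly 3x the
    # L2 capacity worth of distinct buffers, but never more copies than profiling repeats.
    num_batches = min(l2_cache_bytes * 3 // max(one_batch_bytes, 1) + 1, repeat + 1)
    return [
        [t.clone() if isinstance(t, torch.Tensor) else t for t in inputs]
        for _ in range(num_batches)
    ]


# During timing, iteration r then uses batches[r % len(batches)].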


Changelog
  • flashinfer/autotuner.py
    • Imported cuda.bindings.driver for CUDA API interactions.
    • Added use_cold_l2_cache and use_cuda_graph fields to TuningConfig.
    • Introduced _CUDA_GRAPH_DELAY_MICRO_SECS constant for graph profiling delays.
    • Modified _profile_single_kernel to accept tuning_config and incorporate CUDA graph capture/replay logic.
    • Implemented _prepare_input_tensors_with_batches to create multiple input buffers for cold L2 cache simulation.
    • Added _get_l2_cache_size_in_bytes, _checkCudaErrors, and _cudaGetErrorEnum helper methods for CUDA driver API interaction and error handling.
  • flashinfer/fused_moe/core.py
    • Updated MoERunner.refine_tuning_config to accept arbitrary keyword arguments (**kwargs).
    • Passed **kwargs to TuningConfig constructor within refine_tuning_config.
    • Enabled use_cold_l2_cache=True and use_cuda_graph=True for trtllm_fp4_block_scale_moe_op's tuning configuration.
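
For context, a rough sketch of the kind of (err, value) unpacking helper the changelog refers to; cuda-python driver calls return a tuple whose first element is a CUresult status. Note that a later commit in this thread removed these helpers in favor of a PyTorch-only L2 query:

from cuda.bindings import driver  # optional dependency; see the review discussion below


def check_cuda_errors(result):
    """Illustrative unpacking for cuda-python driver calls; raises on non-success status."""
    err, *values = result
    if err != driver.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver call failed: {err}")
    return values[0] if len(values) == 1 else values


# Example: query the L2 cache size of device 0 (cuInit must run before device queries).
check_cuda_errors(driver.cuInit(0))
device = check_cuda_errors(driver.cuDeviceGet(0))
l2_bytes = check_cuda_errors(
    driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, device
    )
)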

@amitz-nv amitz-nv changed the title feat: Support CUDA graph and cold L2 cache in autotuner feat: Autotuner support CUDA graph and cold L2 cache Mar 1, 2026
@gemini-code-assist Bot left a comment

Code Review

This pull request adds support for CUDA graph and cold L2 cache to the autotuner, which is a great feature for improving profiling accuracy. The implementation is mostly copied from TRT-LLM and looks solid. I have one suggestion to refactor a small piece of duplicated code to improve maintainability.

Comment thread flashinfer/autotuner.py
Comment on lines +601 to +641
        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            graph = torch.cuda.CUDAGraph()

            with torch.cuda.stream(stream):
                if tuning_config.use_cuda_graph:
                    with torch.cuda.graph(graph):
                        for r in range(repeat):
                            runner(
                                input_tensor_batches[r % len(input_tensor_batches)],
                                tactic=tactic,
                                **kwargs,
                            )

                stream.synchronize()

                # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
                delay_kernel_time_usec = (
                    self._CUDA_GRAPH_DELAY_MICRO_SECS
                    if tuning_config.use_cuda_graph
                    else self.stream_delay_micro_secs
                )
                delay_kernel(delay_kernel_time_usec)

                start.record()

                if tuning_config.use_cuda_graph:
                    graph.replay()
                else:
                    for r in range(repeat):
                        runner(
                            input_tensor_batches[r % len(input_tensor_batches)],
                            tactic=tactic,
                            **kwargs,
                        )

                end.record()
                stream.synchronize()

-        # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
-        # TODO: This is build time sensitive, O(tactic_num * impl_num * num_profile * tunable_ops)
-        # Consider apply a preprofiling to estimate the kernel execution time, then decide the necessity.
-        if self.stream_delay_micro_secs > 0:
-            delay_kernel(self.stream_delay_micro_secs)
-        start = torch.cuda.Event(enable_timing=True)
-        end = torch.cuda.Event(enable_timing=True)

                return start.elapsed_time(end) / repeat
gemini-code-assist Bot commented (severity: medium):
To improve code clarity and reduce duplication, you can extract the repeated kernel execution loop into a nested helper function within pure_profile. This makes the logic for both CUDA graph and regular execution paths cleaner and easier to maintain.

        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            graph = torch.cuda.CUDAGraph()

            def _run_kernels():
                for r in range(repeat):
                    runner(
                        input_tensor_batches[r % len(input_tensor_batches)],
                        tactic=tactic,
                        **kwargs,
                    )

            with torch.cuda.stream(stream):
                if tuning_config.use_cuda_graph:
                    with torch.cuda.graph(graph):
                        _run_kernels()

                stream.synchronize()

                # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
                delay_kernel_time_usec = (
                    self._CUDA_GRAPH_DELAY_MICRO_SECS
                    if tuning_config.use_cuda_graph
                    else self.stream_delay_micro_secs
                )
                delay_kernel(delay_kernel_time_usec)

                start.record()

                if tuning_config.use_cuda_graph:
                    graph.replay()
                else:
                    _run_kernels()

                end.record()
                stream.synchronize()

                return start.elapsed_time(end) / repeat

@coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)

1284-1310: ⚠️ Potential issue | 🟡 Minor

Consider documenting or enforcing hashable **kwargs in refine_tuning_config.

The @functools.lru_cache decorator requires all function arguments to be hashable. While current callers at line 1893 only pass booleans (use_cold_l2_cache=True, use_cuda_graph=True), which are safely hashable, the method signature accepts arbitrary **kwargs without constraints. Passing unhashable values (lists, dicts, etc.) through **kwargs would cause a TypeError at runtime. Add type hints restricting kwargs to hashable types or document the limitation to prevent future misuse.
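
A small self-contained illustration of the hashability constraint, plus one possible normalization helper (illustrative only, not the project's code):

import functools


@functools.lru_cache(maxsize=None)
def refine(**kwargs):
    # lru_cache builds its cache key from the argument values, so every kwarg must be hashable.
    return tuple(sorted(kwargs.items()))


refine(use_cuda_graph=True, use_cold_l2_cache=True)  # fine: bools are hashable
# refine(dynamic_shapes=[1, 2, 3])                   # TypeError: unhashable type: 'list'


def freeze(value):
    """One way to normalize nested containers into hashable equivalents before caching."""
    if isinstance(value, dict):
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value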

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 1284 - 1310, The cached
classmethod refine_tuning_config is decorated with functools.lru_cache but
accepts arbitrary **kwargs which may be unhashable and will raise TypeError;
update the signature and implementation to enforce hashable kwargs (e.g., change
to **kwargs: Mapping[str, Hashable] or accept a single frozen mapping) or
normalize/validate incoming kwargs into a hashable form before using them in the
cached call (for example convert nested dicts/lists to frozensets/tuples), and
update callers or document the constraint; reference refine_tuning_config, the
functools.lru_cache decorator, and the class attributes
tuning_config_with_hidden_states_scales / tuning_config_no_hidden_states_scales
when making the change.
🧹 Nitpick comments (3)
flashinfer/autotuner.py (3)

601-647: Consider moving CUDAGraph instantiation inside the conditional.

The torch.cuda.CUDAGraph() object is created unconditionally on line 604 but only used when use_cuda_graph is True. This creates unnecessary overhead when CUDA graphs are disabled.

♻️ Proposed refactor
         def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
             start = torch.cuda.Event(enable_timing=True)
             end = torch.cuda.Event(enable_timing=True)
-            graph = torch.cuda.CUDAGraph()

             with torch.cuda.stream(stream):
                 if tuning_config.use_cuda_graph:
+                    graph = torch.cuda.CUDAGraph()
                     with torch.cuda.graph(graph):
                         for r in range(repeat):
                             runner(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 601 - 647, The CUDAGraph object is
always instantiated in pure_profile even though it's only used when
tuning_config.use_cuda_graph is True; move the torch.cuda.CUDAGraph() creation
into the branch guarded by tuning_config.use_cuda_graph (the block that
currently does "with torch.cuda.graph(graph):") so graph is only created when
needed, ensure any references (graph.replay()) remain valid by keeping graph in
scope of the branch where it's used, and remove the unconditional graph variable
to avoid overhead when use_cuda_graph is False.

923-924: Use TypeError for invalid type detection.

Per static analysis, TypeError is more appropriate than RuntimeError when the error condition is about an unexpected type.

♻️ Proposed fix
         else:
-            raise RuntimeError("Unknown error type: {}".format(error))
+            raise TypeError(f"Unknown error type: {type(error).__name__}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 923 - 924, The code currently raises
RuntimeError("Unknown error type: {}".format(error)); replace that with raising
a TypeError instead to reflect an invalid/unexpected type detection (e.g., raise
TypeError("Unknown error type: {}".format(error))). Locate the branch that
raises RuntimeError (the else that handles unknown error types) and swap
RuntimeError for TypeError, keeping the existing message or slightly clarifying
it as needed.

869-870: Add documentation for the 3x L2 cache size multiplier heuristic.

The * 3 factor appears to be a heuristic to ensure sufficient data is cycled through to evict L2 cache entries. Consider adding a brief comment explaining this design choice for maintainability.

📝 Proposed documentation
+        # Use 3x L2 cache size to ensure cold cache simulation by evicting
+        # previously cached data through sufficient memory pressure
         num_buffers = self._get_l2_cache_size_in_bytes() * 3 // one_buffer_bytes + 1
         num_buffers = min(num_buffers, self.repeat + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 869 - 870, Add a short inline comment
next to the num_buffers computation explaining the "* 3" heuristic: in the line
where num_buffers is computed using self._get_l2_cache_size_in_bytes(), note
that multiplying by 3 aims to cycle roughly three times the L2 cache capacity
(to increase likelihood of evicting L2-resident data) before dividing by
one_buffer_bytes, and that the result is clamped by min(..., self.repeat + 1);
reference the symbols num_buffers, _get_l2_cache_size_in_bytes,
one_buffer_bytes, and repeat so the rationale is clear for future maintainers.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f521fe1 and 6134222.

📒 Files selected for processing (2)
  • flashinfer/autotuner.py
  • flashinfer/fused_moe/core.py

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6134222 and ceabfe0.

📒 Files selected for processing (1)
  • flashinfer/autotuner.py

Comment thread flashinfer/autotuner.py Outdated
Comment on lines 13 to 14
from cuda.bindings import driver


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Step 1: Check dependency declarations for cuda-python/cuda-bindings
echo "=== Checking dependency declarations ==="
fd -HI 'pyproject.toml|setup.py|setup.cfg|requirements.*\.txt|environment.*\.y[a]?ml' \
  | xargs -r rg -nP 'cuda-python|cuda-bindings|nvidia-cuda-python|cuda\.bindings' \
  | head -50

echo -e "\n=== Checking flashinfer/autotuner.py line 13 and surrounding context ==="
head -30 flashinfer/autotuner.py | cat -n

echo -e "\n=== Checking lines 889-922 in flashinfer/autotuner.py ==="
sed -n '880,925p' flashinfer/autotuner.py | cat -n

echo -e "\n=== Checking all driver API usage in autotuner.py ==="
rg -nP 'cuInit|cuDeviceGet|cuDeviceGetAttribute|from cuda' flashinfer/autotuner.py -B2 -A2

Repository: flashinfer-ai/flashinfer

Length of output: 4066


🏁 Script executed:

#!/bin/bash
echo "=== Checking pyproject.toml full content for dependencies ==="
cat pyproject.toml | head -100

echo -e "\n=== Checking setup.py if it exists ==="
[ -f setup.py ] && head -50 setup.py || echo "No setup.py found"

echo -e "\n=== Checking setup.cfg if it exists ==="
[ -f setup.cfg ] && cat setup.cfg || echo "No setup.cfg found"

echo -e "\n=== Searching for any mention of cuda in pyproject.toml ==="
rg -i 'cuda|torch' pyproject.toml

echo -e "\n=== Checking if there are any other imports from cuda.bindings in the codebase ==="
rg 'from cuda\.bindings import|import cuda\.bindings' --type=py | head -20

echo -e "\n=== Checking if cuInit is called anywhere in flashinfer ==="
rg 'cuInit' --type=py

Repository: flashinfer-ai/flashinfer

Length of output: 4889


🏁 Script executed:

#!/bin/bash
echo "=== Checking requirements.txt for cuda-python ==="
cat requirements.txt

echo -e "\n=== Checking how green_ctx.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/green_ctx.py -B2 -A2

echo -e "\n=== Checking how cuda_utils.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/cuda_utils.py -B3 -A3

echo -e "\n=== Checking how gdn_decode.py handles cuda imports ==="
grep -n 'cuda.bindings\|import cuda' flashinfer/gdn_decode.py -B2 -A2

echo -e "\n=== Check if _get_l2_cache_size_in_bytes is called in the codebase ==="
rg '_get_l2_cache_size_in_bytes' --type=py -B2 -A2

Repository: flashinfer-ai/flashinfer

Length of output: 2127


🏁 Script executed:

#!/bin/bash
echo "=== Checking where _get_l2_cache_size_in_bytes is called ==="
rg '_get_l2_cache_size_in_bytes' flashinfer/autotuner.py -B10 -A3

echo -e "\n=== Checking if cuda-python is a transitive dependency ==="
rg -i 'cuda-python|cuda.bindings' pyproject.toml

echo -e "\n=== Checking build_utils.py and setup process for cuda dependencies ==="
head -50 build_utils.py 2>/dev/null || echo "No build_utils.py in root"

echo -e "\n=== Check if there's optional dependency on cuda-python ==="
rg -i 'extras_require|optional' pyproject.toml

echo -e "\n=== Check what torch depends on in terms of CUDA bindings ==="
# Check if torch typically includes cuda-python
python3 -c "import sys; print('Python available')" 2>/dev/null && echo "Python available for querying" || echo "Python not available"

echo -e "\n=== Check line context where _get_l2_cache_size_in_bytes is called (broader context) ==="
sed -n '865,905p' flashinfer/autotuner.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 4655


Harden CUDA driver integration: add error handling for missing dependency, initialize driver, and respect current device context.

The cuda.bindings import at line 13 lacks error handling and cuda-python is not declared in requirements.txt, causing import failures when the package is absent. At lines 889–895, the driver API calls should initialize the driver with cuInit(0) before querying devices, and should not default to device 0 since this breaks multi-GPU contexts—use torch.cuda.current_device() instead.

Other files like green_ctx.py and cuda_utils.py properly wrap cuda.bindings imports in try-except blocks; follow that pattern here.

Suggested fixes
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None
-    def _get_l2_cache_size_in_bytes(self, device_id: int = 0) -> int:
-        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
-        return self._checkCudaErrors(
-            driver.cuDeviceGetAttribute(
-                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
-                device,
-            )
-        )
+    def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
+        if device_id is None:
+            device_id = torch.cuda.current_device()
+
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
+        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
+        return self._checkCudaErrors(
+            driver.cuDeviceGetAttribute(
+                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
+                device,
+            )
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 13 - 14, Wrap the top-level "from
cuda.bindings import driver" in a try/except to gracefully handle missing
cuda-python (set driver = None or re-raise a clear ImportError), and add
cuda-python to requirements.txt; in the code path that queries CUDA devices (the
block that calls driver.cuDeviceGet/driver.cuDeviceGetName etc.) ensure you call
driver.cuInit(0) before any device queries and replace the hardcoded device 0
with torch.cuda.current_device() (use that index when calling driver.cuDeviceGet
and related APIs), and guard these calls to raise a helpful error or skip CUDA
logic if driver is None.

Comment thread flashinfer/autotuner.py
        avg_time = float("inf")

        def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
            start = torch.cuda.Event(enable_timing=True)
Collaborator:
Why don't we reuse the functions available in https://github.com/flashinfer-ai/flashinfer/blob/f521fe19ac387e8baffd7b5c925ef59d9f2ecc0c/flashinfer/testing/utils.py

AFAIK, bench_gpu_time_with_cupti should provide the most precise measurement at the moment.

cc @nv-yunzheq @bkryu for confirmation.

Contributor:

In my experience with the flashinfer benchmark script, combining CUPTI-based timing with nsys results in CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED.

If a user wants to profile flashinfer kernels in e2e runs, this would prevent them from doing so.

Contributor (PR author):

Following what @rosenrodt replied, do you @yzh119 think it should be changed?

Collaborator:

I think this is true; nsys and CUPTI go through the same mechanism underneath, and it is critical that we unblock nsys analysis in library code paths.

Citing https://docs.nvidia.com/cupti/api/group__CUPTI__CALLBACK__API.html#_CPPv414cuptiSubscribeP22CUpti_SubscriberHandle18CUpti_CallbackFuncPv

I'm going to say let's approve this as-is, at least for now.

Collaborator:

Yes, nsys uses the lower-level CUPTI, and NVIDIA's cupti-python (obviously) also uses the lower-level CUPTI.

As a result, profiling with nsys produces CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED. I suggest that we do not use bench_gpu_time_with_cupti for the autotuner for this reason. Losing the ability to profile with nsys would be a bit too much.

Collaborator:

Thanks for explaining this to me; now I can understand why CUPTI isn't used here. At least we can consider reusing the CUDA-event benchmarking functions under flashinfer.testing (maybe in future PRs).

Let's merge this PR now.

@aleozlx aleozlx added the v0.6.5 release blocker label ("label for v0.6.5") Mar 2, 2026
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv requested a review from yzh119 March 2, 2026 18:19
@coderabbitai Bot left a comment

♻️ Duplicate comments (1)
flashinfer/autotuner.py (1)

13-13: ⚠️ Potential issue | 🟠 Major

Guard optional CUDA bindings and initialize the driver before device queries.

Line 13 makes cuda.bindings a hard import dependency, and Line 893 uses driver APIs without cuInit(0). This can fail at import/runtime even though these modes are optional/disabled by default.

Suggested patch
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None
@@
     def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
         if device_id is None:
             device_id = torch.cuda.current_device()
 
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
         device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
         return self._checkCudaErrors(
             driver.cuDeviceGetAttribute(
                 driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
                 device,
             )
         )
#!/bin/bash
set -euo pipefail

echo "== Verify CUDA Python dependency declaration =="
fd -HI 'pyproject.toml|requirements*.txt|setup.py|setup.cfg' \
  | xargs -r rg -n 'cuda-python|cuda-bindings|cuda\.bindings' || true

echo
echo "== Verify driver init/query ordering in autotuner =="
rg -n -C3 'cuInit\(|cuDeviceGet\(|cuDeviceGetAttribute\(' flashinfer/autotuner.py

Expected: either guarded import/fallback path exists, and cuInit(0) appears before device attribute queries in _get_l2_cache_size_in_bytes.

Also applies to: 889-899

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` at line 13, The hard import of cuda.bindings.driver
should be guarded and the CUDA driver must be initialized before any device
queries: change the module import to safely try/except ImportError (or set
driver = None) for "from cuda.bindings import driver" and add an initialization
check that calls driver.cuInit(0) before any driver.* queries; specifically,
update _get_l2_cache_size_in_bytes (and neighboring code that calls cuDeviceGet
/ cuDeviceGetAttribute) to first ensure the driver is present and initialized
(call cuInit(0) once, handle and log errors) and return a safe fallback when the
CUDA bindings are unavailable instead of crashing.
🧹 Nitpick comments (1)
flashinfer/autotuner.py (1)

925-925: Use TypeError for invalid error-type input.

Line 925 handles an invalid input type in _cudaGetErrorEnum; TypeError is the more precise exception class.

Suggested patch
-            raise RuntimeError("Unknown error type: {}".format(error))
+            raise TypeError("Unknown error type: {}".format(error))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` at line 925, The code in _cudaGetErrorEnum currently
raises RuntimeError for an invalid error-type input; change this to raise a
TypeError instead so the exception accurately reflects an incorrect input type.
Update the raise statement in _cudaGetErrorEnum (the line that currently does
raise RuntimeError("Unknown error type: {}".format(error))) to raise TypeError
with the same informative message, ensuring callers receive a type-specific
exception.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceabfe0 and e6d7e44.

📒 Files selected for processing (1)
  • flashinfer/autotuner.py

@aleozlx aleozlx added the run-ci label Mar 2, 2026
@aleozlx (Collaborator) commented Mar 2, 2026

/bot run

@flashinfer-bot (Collaborator):
GitLab MR !363 has been created, and the CI pipeline #45167053 is currently running. I'll report back once the pipeline job completes.

@aleozlx aleozlx added the ready label Mar 2, 2026
@flashinfer-bot (Collaborator):
[CANCELING] Pipeline #45167053: canceled


@yzh119 yzh119 enabled auto-merge (squash) March 3, 2026 01:24
Removed CUDA error handling methods and replaced L2 cache size retrieval with PyTorch's method. This is due to an AOT build error in CI
@aleozlx aleozlx removed the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

@flashinfer-bot run

@aleozlx (Collaborator) commented Mar 3, 2026

Removed CUDA error handling methods and replaced L2 cache size retrieval with PyTorch's method. This is due to an AOT build error in CI
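
For reference, a hedged sketch of what a PyTorch-only L2-cache-size query can look like (the exact attribute name on the device-properties object varies across PyTorch versions, hence the getattr fallback; this is not necessarily the code that landed):

import torch


def l2_cache_size_in_bytes(device_id=None) -> int:
    """Best-effort L2 cache size from PyTorch device properties; 0 if unavailable."""
    if device_id is None:
        device_id = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device_id)
    return int(getattr(props, "L2_cache_size", getattr(props, "l2_cache_size", 0)))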

@aleozlx aleozlx added the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

two failed tests are seen on the public CI

https://github.com/flashinfer-ai/flashinfer/actions/runs/22613536689/job/65520873046?pr=2663

tests/autotuner/test_autotuner_core.py:326: AssertionError
----------------------------- Captured stderr call -----------------------------
2026-03-03 10:59:00,394 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 0, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 1, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,394 - WARNING - autotuner.py:503 - flashinfer.jit: [Autotuner]: Skipping tactic <tests.autotuner.test_autotuner_core.DummyRunner object at 0x7a72e7917e60> 2, due to failure while profiling: test_choose_one_tuning_selects_best_tactic_and_populates_cache..fake_profile() takes 4 positional arguments but 5 were given
2026-03-03 10:59:00,395 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends

- generated xml file: /workspace/junit/tests_autotuner_test_autotuner_core.py.xml -
=========================== short test summary info ============================
FAILED tests/autotuner/test_autotuner_core.py::test_search_cache_preserving_leading_dims_hits_while_flattened_misses
FAILED tests/autotuner/test_autotuner_core.py::test_choose_one_tuning_selects_best_tactic_and_populates_cache
========================= 2 failed, 25 passed in 0.39s =========================
❌ FAILED: tests/autotuner/test_autotuner_core.py

@aleozlx (Collaborator) commented Mar 3, 2026

The failed tests are in a new file from #2617.

aleozlx and others added 2 commits March 3, 2026 04:14
…ature

The _profile_single_kernel method added a tuning_config parameter but
the fake_profile mocks in two tests were not updated, causing them to
fail with "takes 4 positional arguments but 5 were given".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aleozlx aleozlx requested a review from nvmbreughe as a code owner March 3, 2026 12:23
@aleozlx aleozlx removed the run-ci label Mar 3, 2026
@aleozlx (Collaborator) commented Mar 3, 2026

@flashinfer-bot run

@aleozlx aleozlx added the run-ci label Mar 3, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/autotuner/test_autotuner_core.py (1)

264-266: Silence intentional-unused-argument warnings in test mocks.

These mocks intentionally ignore most parameters; prefixing with _ avoids ARG001 noise while keeping compatibility.

Suggested cleanup
-    def fake_profile(
-        self, runner_obj, prof_inputs, tactic, tuning_config=None, **kwargs
-    ):
+    def fake_profile(
+        _self, _runner_obj, _prof_inputs, tactic, _tuning_config=None, **_kwargs
+    ):
         return {0: 5.0, 1: 1.0, 2: 3.0}[tactic]

Also applies to: 320-322

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/autotuner/test_autotuner_core.py` around lines 264 - 266, The test mock
function fake_profile (and the other similar mock at the later block)
intentionally ignores most parameters; rename unused parameters by prefixing
them with an underscore (e.g., change runner_obj, prof_inputs, tactic,
tuning_config to _runner_obj, _prof_inputs, _tactic, _tuning_config, or similar)
so ARG001 linter warnings are silenced while keeping the mock signature
compatible; update both fake_profile and the other mock function referenced in
the same test file.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17e8a10 and 766eabd.

📒 Files selected for processing (3)
  • flashinfer/autotuner.py
  • flashinfer/fused_moe/core.py
  • tests/autotuner/test_autotuner_core.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/fused_moe/core.py

Comment thread flashinfer/autotuner.py
Comment on lines +146 to +147
use_cold_l2_cache: bool = False
use_cuda_graph: bool = False

⚠️ Potential issue | 🟠 Major

Include tuning-mode flags in autotuner cache keys.

Line 146 and Line 147 add profiling-mode knobs, but cache keys still don’t distinguish them. This can reuse tactics tuned under different measurement modes (e.g., graph/cold-L2 vs default), leading to incorrect tactic selection.

Suggested fix
 def _get_cache_key(
     cls,
     custom_op: str,
     runner: TunableRunner,
     input_shapes: Tuple[torch.Size],
     tuning_config: TuningConfig,
 ) -> Tuple:
     return (
         custom_op,
         runner.__class__.__name__,
         hash(runner),
         cls._find_nearest_profile(input_shapes, tuning_config),
+        tuning_config.use_cold_l2_cache,
+        tuning_config.use_cuda_graph,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/autotuner.py` around lines 146 - 147, The cache-key generation
must incorporate the new profiling-mode flags so tuned tactics for different
modes aren't reused; update the Autotuner cache-key builder (the function/method
that composes the cache key used by the autotuner) to include the boolean flags
use_cold_l2_cache and use_cuda_graph (or equivalent attributes) when
constructing the key so keys differ when these flags change, and ensure any
lookups/insertions of the cache use the updated key format.


Labels

ready, run-ci, v0.6.5 release blocker ("label for v0.6.5")


6 participants