feat: Autotuner support CUDA graph and cold L2 cache #2663
yzh119 merged 6 commits into flashinfer-ai:main from
Conversation
…t are enabled for trtllm_fp4_block_scale_moe_op Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds two tuning flags (use_cold_l2_cache and use_cuda_graph) to the autotuner.
Sequence Diagram(s)
sequenceDiagram
participant Caller as Caller
participant AutoTuner as AutoTuner
participant Runner as TunableRunner
participant CUDA as CUDA
Caller->>AutoTuner: choose_one(op, runners, tuning_config, inputs)
AutoTuner->>AutoTuner: _prepare_input_tensors_with_batches(inputs, tuning_config)
AutoTuner->>Runner: _profile_single_kernel(runner, batched_inputs, tactic, tuning_config)
alt tuning_config.use_cuda_graph
AutoTuner->>CUDA: beginGraphCapture() (host stream)
Runner->>CUDA: execute kernels (warmup + timed)
CUDA-->>Runner: endGraphCapture()/createGraph()
AutoTuner->>CUDA: replayGraph() (with graph delay)
else not use_cuda_graph
Runner->>CUDA: execute kernels directly (warmup + timed) on stream
end
Runner-->>AutoTuner: timing result
AutoTuner-->>Caller: chosen runner + tactic
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the autotuner's capability to perform more realistic and accurate performance profiling. By incorporating support for CUDA graphs and simulating cold L2 cache conditions, the autotuner can now identify optimal kernel implementations that perform better in real-world scenarios where L2 cache might not be warm or operations are part of a larger CUDA graph. This leads to improved overall performance for operations leveraging these tuning advancements.
Code Review
This pull request adds support for CUDA graph and cold L2 cache to the autotuner, which is a great feature for improving profiling accuracy. The implementation is mostly copied from TRT-LLM and looks solid. I have one suggestion to refactor a small piece of duplicated code to improve maintainability.
def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    graph = torch.cuda.CUDAGraph()

    with torch.cuda.stream(stream):
        if tuning_config.use_cuda_graph:
            with torch.cuda.graph(graph):
                for r in range(repeat):
                    runner(
                        input_tensor_batches[r % len(input_tensor_batches)],
                        tactic=tactic,
                        **kwargs,
                    )

        stream.synchronize()

        # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
        delay_kernel_time_usec = (
            self._CUDA_GRAPH_DELAY_MICRO_SECS
            if tuning_config.use_cuda_graph
            else self.stream_delay_micro_secs
        )
        delay_kernel(delay_kernel_time_usec)

        start.record()

        if tuning_config.use_cuda_graph:
            graph.replay()
        else:
            for r in range(repeat):
                runner(
                    input_tensor_batches[r % len(input_tensor_batches)],
                    tactic=tactic,
                    **kwargs,
                )

        end.record()
        stream.synchronize()

        # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
        # TODO: This is build time sensitive, O(tactic_num * impl_num * num_profile * tunable_ops)
        # Consider apply a preprofiling to estimate the kernel execution time, then decide the necessity.
        if self.stream_delay_micro_secs > 0:
            delay_kernel(self.stream_delay_micro_secs)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

    return start.elapsed_time(end) / repeat
To improve code clarity and reduce duplication, you can extract the repeated kernel execution loop into a nested helper function within pure_profile. This makes the logic for both CUDA graph and regular execution paths cleaner and easier to maintain.
def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    graph = torch.cuda.CUDAGraph()

    def _run_kernels():
        for r in range(repeat):
            runner(
                input_tensor_batches[r % len(input_tensor_batches)],
                tactic=tactic,
                **kwargs,
            )

    with torch.cuda.stream(stream):
        if tuning_config.use_cuda_graph:
            with torch.cuda.graph(graph):
                _run_kernels()

        stream.synchronize()

        # Delay the profiled kernel launch to eliminate affects of host time overhead in profiling.
        delay_kernel_time_usec = (
            self._CUDA_GRAPH_DELAY_MICRO_SECS
            if tuning_config.use_cuda_graph
            else self.stream_delay_micro_secs
        )
        delay_kernel(delay_kernel_time_usec)

        start.record()

        if tuning_config.use_cuda_graph:
            graph.replay()
        else:
            _run_kernels()

        end.record()
        stream.synchronize()

    return start.elapsed_time(end) / repeat
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/fused_moe/core.py (1)
1284-1310: ⚠️ Potential issue | 🟡 Minor
Consider documenting or enforcing hashable **kwargs in refine_tuning_config.
The @functools.lru_cache decorator requires all function arguments to be hashable. While current callers at line 1893 only pass booleans (use_cold_l2_cache=True, use_cuda_graph=True), which are safely hashable, the method signature accepts arbitrary **kwargs without constraints. Passing unhashable values (lists, dicts, etc.) through **kwargs would cause a TypeError at runtime. Add type hints restricting kwargs to hashable types or document the limitation to prevent future misuse.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/core.py` around lines 1284 - 1310, The cached classmethod refine_tuning_config is decorated with functools.lru_cache but accepts arbitrary **kwargs which may be unhashable and will raise TypeError; update the signature and implementation to enforce hashable kwargs (e.g., change to **kwargs: Mapping[str, Hashable] or accept a single frozen mapping) or normalize/validate incoming kwargs into a hashable form before using them in the cached call (for example convert nested dicts/lists to frozensets/tuples), and update callers or document the constraint; reference refine_tuning_config, the functools.lru_cache decorator, and the class attributes tuning_config_with_hidden_states_scales / tuning_config_no_hidden_states_scales when making the change.
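As an aside, here is a minimal standalone sketch of the normalization option mentioned in the prompt. The helper names (`_freeze`, `make_config`, `_make_config_cached`) are hypothetical and not part of this PR; the point is only that unhashable **kwargs can be converted into a hashable form before the lru_cache lookup.

```python
import functools
from typing import Any, Hashable


def _freeze(value: Any) -> Hashable:
    """Recursively convert dicts/lists/sets into hashable equivalents."""
    if isinstance(value, dict):
        return frozenset((k, _freeze(v)) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    if isinstance(value, set):
        return frozenset(_freeze(v) for v in value)
    return value  # assumed already hashable (bool, int, str, ...)


@functools.lru_cache(maxsize=None)
def _make_config_cached(frozen_kwargs: Hashable) -> dict:
    # Stand-in for building a TuningConfig from the (now hashable) kwargs.
    return dict(frozen_kwargs)


def make_config(**kwargs) -> dict:
    # Normalize before hitting the cache so unhashable containers don't raise TypeError.
    return _make_config_cached(_freeze(kwargs))


# Booleans, the only values current callers pass, are hashable either way:
print(make_config(use_cold_l2_cache=True, use_cuda_graph=True))
```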
🧹 Nitpick comments (3)
flashinfer/autotuner.py (3)
601-647: Consider moving CUDAGraph instantiation inside the conditional.
The torch.cuda.CUDAGraph() object is created unconditionally on line 604 but only used when use_cuda_graph is True. This creates unnecessary overhead when CUDA graphs are disabled.
♻️ Proposed refactor
 def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
     start = torch.cuda.Event(enable_timing=True)
     end = torch.cuda.Event(enable_timing=True)
-    graph = torch.cuda.CUDAGraph()
     with torch.cuda.stream(stream):
         if tuning_config.use_cuda_graph:
+            graph = torch.cuda.CUDAGraph()
             with torch.cuda.graph(graph):
                 for r in range(repeat):
                     runner(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/autotuner.py` around lines 601 - 647, The CUDAGraph object is always instantiated in pure_profile even though it's only used when tuning_config.use_cuda_graph is True; move the torch.cuda.CUDAGraph() creation into the branch guarded by tuning_config.use_cuda_graph (the block that currently does "with torch.cuda.graph(graph):") so graph is only created when needed, ensure any references (graph.replay()) remain valid by keeping graph in scope of the branch where it's used, and remove the unconditional graph variable to avoid overhead when use_cuda_graph is False.
923-924: Use TypeError for invalid type detection.
Per static analysis, TypeError is more appropriate than RuntimeError when the error condition is about an unexpected type.
♻️ Proposed fix
 else:
-    raise RuntimeError("Unknown error type: {}".format(error))
+    raise TypeError(f"Unknown error type: {type(error).__name__}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/autotuner.py` around lines 923 - 924, The code currently raises RuntimeError("Unknown error type: {}".format(error)); replace that with raising a TypeError instead to reflect an invalid/unexpected type detection (e.g., raise TypeError("Unknown error type: {}".format(error))). Locate the branch that raises RuntimeError (the else that handles unknown error types) and swap RuntimeError for TypeError, keeping the existing message or slightly clarifying it as needed.
869-870: Add documentation for the 3x L2 cache size multiplier heuristic.
The * 3 factor appears to be a heuristic to ensure sufficient data is cycled through to evict L2 cache entries. Consider adding a brief comment explaining this design choice for maintainability.
📝 Proposed documentation
+    # Use 3x L2 cache size to ensure cold cache simulation by evicting
+    # previously cached data through sufficient memory pressure
     num_buffers = self._get_l2_cache_size_in_bytes() * 3 // one_buffer_bytes + 1
     num_buffers = min(num_buffers, self.repeat + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/autotuner.py` around lines 869 - 870, Add a short inline comment next to the num_buffers computation explaining the "* 3" heuristic: in the line where num_buffers is computed using self._get_l2_cache_size_in_bytes(), note that multiplying by 3 aims to cycle roughly three times the L2 cache capacity (to increase likelihood of evicting L2-resident data) before dividing by one_buffer_bytes, and that the result is clamped by min(..., self.repeat + 1); reference the symbols num_buffers, _get_l2_cache_size_in_bytes, one_buffer_bytes, and repeat so the rationale is clear for future maintainers.
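For illustration, a rough standalone sketch of the rotation idea this heuristic serves (not the PR's code): `make_rotating_batches` is a hypothetical helper, and the 50 MB constant stands in for the L2 size the autotuner queries from the device. The indexing pattern at the end mirrors the `r % len(input_tensor_batches)` access in pure_profile above.

```python
import torch


def make_rotating_batches(inputs, l2_cache_bytes, repeat):
    one_batch_bytes = sum(t.numel() * t.element_size() for t in inputs)
    # 3x the L2 capacity is a heuristic: cycling that much data makes it unlikely
    # that any batch is still resident in L2 when its turn comes around again.
    num_buffers = l2_cache_bytes * 3 // max(one_batch_bytes, 1) + 1
    num_buffers = min(num_buffers, repeat + 1)  # never allocate more than we can use
    return [[t.clone() for t in inputs] for _ in range(num_buffers)]


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    batches = make_rotating_batches([x], l2_cache_bytes=50 * 1024 * 1024, repeat=10)
    for r in range(10):
        batch = batches[r % len(batches)]  # same indexing pattern as pure_profile
        batch[0].add_(1.0)  # stand-in for the kernel under test
```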
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 1
from cuda.bindings import driver
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Step 1: Check dependency declarations for cuda-python/cuda-bindings
echo "=== Checking dependency declarations ==="
fd -HI 'pyproject.toml|setup.py|setup.cfg|requirements.*\.txt|environment.*\.y[a]?ml' \
| xargs -r rg -nP 'cuda-python|cuda-bindings|nvidia-cuda-python|cuda\.bindings' \
| head -50
echo -e "\n=== Checking flashinfer/autotuner.py line 13 and surrounding context ==="
head -30 flashinfer/autotuner.py | cat -n
echo -e "\n=== Checking lines 889-922 in flashinfer/autotuner.py ==="
sed -n '880,925p' flashinfer/autotuner.py | cat -n
echo -e "\n=== Checking all driver API usage in autotuner.py ==="
rg -nP 'cuInit|cuDeviceGet|cuDeviceGetAttribute|from cuda' flashinfer/autotuner.py -B2 -A2
Repository: flashinfer-ai/flashinfer
Length of output: 4066
🏁 Script executed:
#!/bin/bash
echo "=== Checking pyproject.toml full content for dependencies ==="
cat pyproject.toml | head -100
echo -e "\n=== Checking setup.py if it exists ==="
[ -f setup.py ] && head -50 setup.py || echo "No setup.py found"
echo -e "\n=== Checking setup.cfg if it exists ==="
[ -f setup.cfg ] && cat setup.cfg || echo "No setup.cfg found"
echo -e "\n=== Searching for any mention of cuda in pyproject.toml ==="
rg -i 'cuda|torch' pyproject.toml
echo -e "\n=== Checking if there are any other imports from cuda.bindings in the codebase ==="
rg 'from cuda\.bindings import|import cuda\.bindings' --type=py | head -20
echo -e "\n=== Checking if cuInit is called anywhere in flashinfer ==="
rg 'cuInit' --type=py
Repository: flashinfer-ai/flashinfer
Length of output: 4889
🏁 Script executed:
#!/bin/bash
echo "=== Checking requirements.txt for cuda-python ==="
cat requirements.txt
echo -e "\n=== Checking how green_ctx.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/green_ctx.py -B2 -A2
echo -e "\n=== Checking how cuda_utils.py handles cuda imports ==="
grep -n 'cuda.bindings\|import driver\|import runtime' flashinfer/cuda_utils.py -B3 -A3
echo -e "\n=== Checking how gdn_decode.py handles cuda imports ==="
grep -n 'cuda.bindings\|import cuda' flashinfer/gdn_decode.py -B2 -A2
echo -e "\n=== Check if _get_l2_cache_size_in_bytes is called in the codebase ==="
rg '_get_l2_cache_size_in_bytes' --type=py -B2 -A2
Repository: flashinfer-ai/flashinfer
Length of output: 2127
🏁 Script executed:
#!/bin/bash
echo "=== Checking where _get_l2_cache_size_in_bytes is called ==="
rg '_get_l2_cache_size_in_bytes' flashinfer/autotuner.py -B10 -A3
echo -e "\n=== Checking if cuda-python is a transitive dependency ==="
rg -i 'cuda-python|cuda.bindings' pyproject.toml
echo -e "\n=== Checking build_utils.py and setup process for cuda dependencies ==="
head -50 build_utils.py 2>/dev/null || echo "No build_utils.py in root"
echo -e "\n=== Check if there's optional dependency on cuda-python ==="
rg -i 'extras_require|optional' pyproject.toml
echo -e "\n=== Check what torch depends on in terms of CUDA bindings ==="
# Check if torch typically includes cuda-python
python3 -c "import sys; print('Python available')" 2>/dev/null && echo "Python available for querying" || echo "Python not available"
echo -e "\n=== Check line context where _get_l2_cache_size_in_bytes is called (broader context) ==="
sed -n '865,905p' flashinfer/autotuner.py | cat -n
Repository: flashinfer-ai/flashinfer
Length of output: 4655
Harden CUDA driver integration: add error handling for missing dependency, initialize driver, and respect current device context.
The cuda.bindings import at line 13 lacks error handling and cuda-python is not declared in requirements.txt, causing import failures when the package is absent. At lines 889–895, the driver API calls should initialize the driver with cuInit(0) before querying devices, and should not default to device 0 since this breaks multi-GPU contexts—use torch.cuda.current_device() instead.
Other files like green_ctx.py and cuda_utils.py properly wrap cuda.bindings imports in try-except blocks; follow that pattern here.
Suggested fixes
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None

-    def _get_l2_cache_size_in_bytes(self, device_id: int = 0) -> int:
-        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
-        return self._checkCudaErrors(
-            driver.cuDeviceGetAttribute(
-                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
-                device,
-            )
-        )
+    def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
+        if device_id is None:
+            device_id = torch.cuda.current_device()
+
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
+        device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
+        return self._checkCudaErrors(
+            driver.cuDeviceGetAttribute(
+                driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
+                device,
+            )
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/autotuner.py` around lines 13 - 14, Wrap the top-level "from
cuda.bindings import driver" in a try/except to gracefully handle missing
cuda-python (set driver = None or re-raise a clear ImportError), and add
cuda-python to requirements.txt; in the code path that queries CUDA devices (the
block that calls driver.cuDeviceGet/driver.cuDeviceGetName etc.) ensure you call
driver.cuInit(0) before any device queries and replace the hardcoded device 0
with torch.cuda.current_device() (use that index when calling driver.cuDeviceGet
and related APIs), and guard these calls to raise a helpful error or skip CUDA
logic if driver is None.
avg_time = float("inf")

def pure_profile(stream: torch.cuda.Stream, repeat: int) -> float:
    start = torch.cuda.Event(enable_timing=True)
Why don't we reuse the functions available in https://github.com/flashinfer-ai/flashinfer/blob/f521fe19ac387e8baffd7b5c925ef59d9f2ecc0c/flashinfer/testing/utils.py
afaik the bench_gpu_time_with_cupti should provide the most precise measurement atm.
cc @nv-yunzheq @bkryu for confirmation.
In my experience with the flashinfer benchmark script, combining CUPTI-based timing with nsys would get CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED.
If a user wants to profile flashinfer kernels in e2e runs, this will prevent them from doing so.
Following what @rosenrodt replied, do you @yzh119 think it should be changed?
i think this is true, nsys and cupti go through the same mechanism underneath and it is critical we unblock nsys analysis in library code paths
i'm going to say let's approve this as is at least for now
Yes nsys uses the lower level CUPTI, and NVIDIA's cupti-python (obviously) also uses the lower level CUPTI.
As a result, profiling with nsys produces CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED. I suggest that we do not use bench_gpu_time_with_cupti for the autotuner for this reason. Losing the ability to nsys profile would be a bit too much.
Thanks for explaining this to me; now I understand why we're not using CUPTI here. At least we can consider reusing the CUDA event benchmarking functions under flashinfer.testing (maybe in future PRs).
Let's merge this PR now.
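For reference, a minimal sketch of plain CUDA-event timing of the kind kept by this PR. This is an illustration only, not the flashinfer.testing helpers discussed above; the point is that event-based timing registers no CUPTI subscriber, so nsys can still attach to the process.

```python
import torch


def time_with_cuda_events(fn, warmup=3, repeat=10):
    stream = torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):  # warm up caches, lazy inits, JIT, etc.
        fn()
    stream.synchronize()
    start.record(stream)
    for _ in range(repeat):
        fn()
    end.record(stream)
    stream.synchronize()
    return start.elapsed_time(end) / repeat  # milliseconds per call


if torch.cuda.is_available():
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    print(time_with_cuda_events(lambda: a @ b))
```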
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
♻️ Duplicate comments (1)
flashinfer/autotuner.py (1)
13-13: ⚠️ Potential issue | 🟠 Major
Guard optional CUDA bindings and initialize the driver before device queries.
Line 13 makes cuda.bindings a hard import dependency, and Line 893 uses driver APIs without cuInit(0). This can fail at import/runtime even though these modes are optional/disabled by default.
Suggested patch
-from cuda.bindings import driver
+try:
+    from cuda.bindings import driver
+except ImportError:
+    driver = None
@@
     def _get_l2_cache_size_in_bytes(self, device_id: Optional[int] = None) -> int:
         if device_id is None:
             device_id = torch.cuda.current_device()
+        if driver is None:
+            props = torch.cuda.get_device_properties(device_id)
+            return int(getattr(props, "l2_cache_size", 0))
+
+        self._checkCudaErrors(driver.cuInit(0))
         device = self._checkCudaErrors(driver.cuDeviceGet(device_id))
         return self._checkCudaErrors(
             driver.cuDeviceGetAttribute(
                 driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE,
                 device,
             )
         )
#!/bin/bash
set -euo pipefail

echo "== Verify CUDA Python dependency declaration =="
fd -HI 'pyproject.toml|requirements*.txt|setup.py|setup.cfg' \
  | xargs -r rg -n 'cuda-python|cuda-bindings|cuda\.bindings' || true

echo
echo "== Verify driver init/query ordering in autotuner =="
rg -n -C3 'cuInit\(|cuDeviceGet\(|cuDeviceGetAttribute\(' flashinfer/autotuner.py
Expected: either guarded import/fallback path exists, and cuInit(0) appears before device attribute queries in _get_l2_cache_size_in_bytes.
Also applies to: 889-899
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/autotuner.py` at line 13, The hard import of cuda.bindings.driver should be guarded and the CUDA driver must be initialized before any device queries: change the module import to safely try/except ImportError (or set driver = None) for "from cuda.bindings import driver" and add an initialization check that calls driver.cuInit(0) before any driver.* queries; specifically, update _get_l2_cache_size_in_bytes (and neighboring code that calls cuDeviceGet / cuDeviceGetAttribute) to first ensure the driver is present and initialized (call cuInit(0) once, handle and log errors) and return a safe fallback when the CUDA bindings are unavailable instead of crashing.
🧹 Nitpick comments (1)
flashinfer/autotuner.py (1)
925-925: Use TypeError for invalid error-type input.
Line 925 handles an invalid input type in _cudaGetErrorEnum; TypeError is the more precise exception class.
Suggested patch
-        raise RuntimeError("Unknown error type: {}".format(error))
+        raise TypeError("Unknown error type: {}".format(error))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/autotuner.py` at line 925, The code in _cudaGetErrorEnum currently raises RuntimeError for an invalid error-type input; change this to raise a TypeError instead so the exception accurately reflects an incorrect input type. Update the raise statement in _cudaGetErrorEnum (the line that currently does raise RuntimeError("Unknown error type: {}".format(error))) to raise TypeError with the same informative message, ensuring callers receive a type-specific exception.
/bot run
[CANCELING] Pipeline #45167053: canceled
Removed CUDA error handling methods and replaced L2 cache size retrieval with PyTorch's method. This is due to an AOT build error in CI
@flashinfer-bot run
Two failed tests are seen on the public CI: https://github.com/flashinfer-ai/flashinfer/actions/runs/22613536689/job/65520873046?pr=2663
tests/autotuner/test_autotuner_core.py:326: AssertionError
the failed test is a new file from
…ature The _profile_single_kernel method added a tuning_config parameter but the fake_profile mocks in two tests were not updated, causing them to fail with "takes 4 positional arguments but 5 were given". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
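For clarity, a sketch of the signature change this commit describes, mirroring the fake_profile mock quoted later in the review; the real code in tests/autotuner/test_autotuner_core.py may differ slightly.

```python
# Before: _profile_single_kernel now passes tuning_config, so this mock
# fails with "takes 4 positional arguments but 5 were given".
def fake_profile_old(self, runner_obj, prof_inputs, tactic):
    return {0: 5.0, 1: 1.0, 2: 3.0}[tactic]


# After: accept the new parameter (and trailing kwargs) even though the mock ignores it.
def fake_profile(self, runner_obj, prof_inputs, tactic, tuning_config=None, **kwargs):
    return {0: 5.0, 1: 1.0, 2: 3.0}[tactic]
```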
@flashinfer-bot run
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/autotuner/test_autotuner_core.py (1)
264-266: Silence intentional-unused-argument warnings in test mocks.
These mocks intentionally ignore most parameters; prefixing with _ avoids ARG001 noise while keeping compatibility.
Suggested cleanup
-    def fake_profile(
-        self, runner_obj, prof_inputs, tactic, tuning_config=None, **kwargs
-    ):
+    def fake_profile(
+        _self, _runner_obj, _prof_inputs, tactic, _tuning_config=None, **_kwargs
+    ):
         return {0: 5.0, 1: 1.0, 2: 3.0}[tactic]
Also applies to: 320-322
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/autotuner/test_autotuner_core.py` around lines 264 - 266, The test mock function fake_profile (and the other similar mock at the later block) intentionally ignores most parameters; rename unused parameters by prefixing them with an underscore (e.g., change runner_obj, prof_inputs, tactic, tuning_config to _runner_obj, _prof_inputs, _tactic, _tuning_config, or similar) so ARG001 linter warnings are silenced while keeping the mock signature compatible; update both fake_profile and the other mock function referenced in the same test file.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- flashinfer/autotuner.py
- flashinfer/fused_moe/core.py
- tests/autotuner/test_autotuner_core.py
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/fused_moe/core.py
    use_cold_l2_cache: bool = False
    use_cuda_graph: bool = False
Include tuning-mode flags in autotuner cache keys.
Line 146 and Line 147 add profiling-mode knobs, but cache keys still don’t distinguish them. This can reuse tactics tuned under different measurement modes (e.g., graph/cold-L2 vs default), leading to incorrect tactic selection.
Suggested fix
     def _get_cache_key(
         cls,
         custom_op: str,
         runner: TunableRunner,
         input_shapes: Tuple[torch.Size],
         tuning_config: TuningConfig,
     ) -> Tuple:
         return (
             custom_op,
             runner.__class__.__name__,
             hash(runner),
             cls._find_nearest_profile(input_shapes, tuning_config),
+            tuning_config.use_cold_l2_cache,
+            tuning_config.use_cuda_graph,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/autotuner.py` around lines 146 - 147, The cache-key generation
must incorporate the new profiling-mode flags so tuned tactics for different
modes aren't reused; update the Autotuner cache-key builder (the function/method
that composes the cache key used by the autotuner) to include the boolean flags
use_cold_l2_cache and use_cuda_graph (or equivalent attributes) when
constructing the key so keys differ when these flags change, and ensure any
lookups/insertions of the cache use the updated key format.
📌 Description
Adds support for CUDA graph and cold L2 cache in the autotuner.
Mostly copied from TRTLLM, see https://github.com/NVIDIA/TensorRT-LLM/blob/63c33c7c9a705e6d194a53b7ed54bbaa11494f7d/tensorrt_llm/_torch/autotuner.py#L1134
Currently:
- trtllm_fp4_block_scale_moe_op (a usage sketch of the new flags follows below)
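A rough, self-contained sketch of what the two flags control, using local stand-ins rather than the real flashinfer TuningConfig/AutoTuner API; constructor arguments and call signatures in the actual autotuner may differ.

```python
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class MiniTuningConfig:
    use_cold_l2_cache: bool = False  # rotate input copies so timed runs see a cold L2
    use_cuda_graph: bool = False     # capture launches into a CUDA graph, time the replay


def profile(run_batch, batches, cfg: MiniTuningConfig, repeat=8):
    stream = torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    rotate = batches if cfg.use_cold_l2_cache else batches[:1]

    run_batch(rotate[0])  # warm up lazy inits before any capture
    if cfg.use_cuda_graph:
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            for r in range(repeat):
                run_batch(rotate[r % len(rotate)])
    stream.synchronize()

    start.record(stream)
    if cfg.use_cuda_graph:
        graph.replay()
    else:
        for r in range(repeat):
            run_batch(rotate[r % len(rotate)])
    end.record(stream)
    stream.synchronize()
    return start.elapsed_time(end) / repeat  # ms per call


if torch.cuda.is_available():
    xs = [[torch.randn(2048, 2048, device="cuda")] for _ in range(4)]
    cfg = MiniTuningConfig(use_cold_l2_cache=True, use_cuda_graph=True)
    print(profile(lambda batch: torch.relu(batch[0]), xs, cfg))
```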
🔍 Related Issues
Not aware of any.
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.
🧪 Tests
- All tests are passing (unittest, etc.).
Reviewer Notes
Summary by CodeRabbit
New Features
- Autotuner profiling can now run inside a CUDA graph and simulate a cold L2 cache via the new use_cuda_graph and use_cold_l2_cache tuning flags.
Chores