autotuner: check cache before synthesizing profile input tensors (#3126)
Conversation
`AutoTuner.choose_one` iterates through every `OptimizationProfile` that `_generate_optimization_profiles` produces from the `tuning_config`'s bucket grid. For each profile the method called `_prepare_input_tensors(p, inputs)` *before* checking the cache. On a cache hit those freshly synthesized tensors were discarded and the cached (runner_id, tactic) pair was returned as-is. Under normal serving this waste was invisible: each `_prepare_input_tensors` call launches only a handful of small `torch.rand` / `torch.randint` / `torch.empty` kernels (one per `DynamicTensorSpec`) and produces the right answer. But any caller that runs `choose_one` repeatedly from inside an `autotune(True)` context pays the full synthesis cost on every forward, even once the cache is fully warm.

`benchmarks/bench_moe_deepseek.py` is the motivating example: its `run_benchmark` wraps both the pre-warm (which populates the cache) and the measured `bench_gpu_time` iters in a single `autotune(True)` scope (intentionally, to share the same call-site for tactic selection and timing). Every subsequent forward re-entered the tuning-mode branch of `choose_one`, re-synthesized tensors for all tuning buckets, and those synthesis kernels were captured by CUPTI/nsys tracing of the measurement region. On DeepSeek-V3 at bs=128 on B200 this added ~360 us/forward for the CuteDSL backend and ~470 us/forward for TRTLLM (the two backends' `dynamic_tensor_specs` differ). The result was both a ~3x inflation of per-forward timings and, because the tax is asymmetric across backends, a visible inversion of the reported ranking.

Fix: move the `search_cache` call above `_prepare_input_tensors` in the tuning-mode loop. On a cache hit (by far the common case once the first iteration has populated the cache) we skip synthesis entirely. On a cache miss we proceed exactly as before.

The lookup now passes the caller's `inputs` rather than the synthesized `tensors`; this is safe because `_get_cache_key` only uses `p.get_opt_shapes()` (the profile's bucket shapes, independent of any particular input tensor) plus `runner.get_cache_key_extras(inputs)`, and the documented contract for `get_cache_key_extras` is to return properties like dtype that are preserved by the synthesis initializers. The non-tuning-mode branch of `choose_one` and the final post-loop `search_cache` call already use the caller's `inputs`, so this aligns the in-loop lookup with both. No behavior change on cache miss. No change to how entries are stored in `profiling_cache` (still keyed on `get_cache_key_extras(tensors)`, because `tensors` only exists on the miss path, which matches the non-tuning-mode cache key under the contract described above).

Measured impact on `benchmarks/bench_moe_deepseek.py`, bs=128, EP=8, B200, CUPTI enabled:

- Before: CuteDSL 0.516 ms, TRTLLM 0.613 ms -> CuteDSL "wins" 1.19x
- After: CuteDSL 0.157 ms, TRTLLM 0.142 ms -> TRTLLM wins 1.11x

The post-fix ratio matches the ratio observed when timing via CUDA events (`--no-cupti`: 0.144 / 0.126 = 1.14x TRTLLM), which bypasses CUPTI and was therefore always clean of this pollution.
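The reordering is easiest to see as a standalone sketch. Everything below is a simplified stand-in for illustration (the dict cache, the dtype-only key, `profile_all_tactics`); it is not FlashInfer's actual API, only the shape of the change:

```python
import torch

def profile_all_tactics(profile, tensors):
    # Stand-in for the runner/tactic sweep; returns a (runner_id, tactic) pair.
    return (0, 0)

def choose_one(inputs, profiles, cache):
    for p in profiles:
        # The key depends only on the profile's bucket shapes plus
        # value-independent input properties (dtype here), never on tensor
        # values, so the caller's `inputs` suffice for the lookup.
        key = (p["opt_shapes"], tuple(t.dtype for t in inputs))
        if key in cache:                    # the fix: check the cache first
            continue                        # hit: no synthesis kernels launched
        tensors = [torch.empty(p["opt_shapes"], dtype=t.dtype, device=t.device)
                   for t in inputs]         # miss path: synthesize as before
        cache[key] = profile_all_tactics(p, tensors)
    return cache

# Warm-cache call: the second invocation finds every key and launches nothing.
inputs = [torch.randn(128, 64)]
profiles = [{"opt_shapes": (8, 64)}, {"opt_shapes": (128, 64)}]
cache = choose_one(inputs, profiles, {})
choose_one(inputs, profiles, cache)
```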
📝 Walkthrough

AutoTuner.choose_one now calls search_cache(...) with the caller's real inputs before synthesizing profile tensors; _prepare_input_tensors runs only on a cache miss.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant AutoTuner
    participant Cache
    participant Synth as _prepare_input_tensors
    participant Runner
    Caller->>AutoTuner: choose_one(inputs, profiles)
    AutoTuner->>Cache: search_cache(profile, inputs)
    alt cache hit
        Cache-->>AutoTuner: cached result
        AutoTuner-->>Caller: use cached profile
    else cache miss
        AutoTuner->>Synth: _prepare_input_tensors(profile, inputs)
        Synth-->>Runner: prepared tensors
        AutoTuner->>Runner: run/profile with tensors
        Runner-->>Cache: compute/store cache_key_extras
        Runner-->>AutoTuner: profile result
        AutoTuner-->>Caller: selected profile
    end
```
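For context, the call pattern that exposed the problem looks roughly like the following. The import path and the `forward`/`iters` names are assumptions standing in for the benchmark's actual helpers, not verbatim code:

```python
from flashinfer.autotuner import autotune  # assumed import path

def forward():
    ...  # placeholder for the benchmark's MoE forward pass

iters = 100
with autotune(True):          # one scope shared by warm-up and measurement
    forward()                 # first iteration populates profiling_cache
    for _ in range(iters):    # measured iterations stay in tuning mode:
        forward()             # pre-fix, each re-synthesized per-bucket tensors;
                              # post-fix, each hits the cache and launches nothing
```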
Code Review
This pull request optimizes the choose_one method in flashinfer/autotuner.py by deferring input tensor synthesis until after a cache miss is confirmed. By checking the cache first using the original inputs, the implementation avoids unnecessary GPU kernel launches that could otherwise distort performance measurements in profiling tools. I have no feedback to provide.
/bot run
Force-pushed from 12e5e30 to 4ad799e

Force-pushed from 4ad799e to c7100c7
Thanks for fixing this!
📌 Description
AutoTuner.choose_one's tuning-mode loop calls _prepare_input_tensors(p, inputs) before checking the cache. On a cache hit the synthesized tensors are thrown away, but their torch.rand / torch.randint kernel launches already happened on the device. For any caller that runs choose_one repeatedly inside autotune(True) (e.g. a benchmark sharing its tuning and measurement call-sites), those kernels recur on every warm-cache call and get attributed to the measured region by CUPTI / nsys.

Move search_cache above _prepare_input_tensors in the tuning-mode loop. Skip synthesis on cache hit; synthesize as before on cache miss. The lookup passes the caller's inputs (rather than synthesized tensors), which aligns it with the non-tuning branch and the post-loop search_cache. Safe under the existing get_cache_key_extras contract (dtype-like properties preserved by synthesis).
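As a concrete illustration of that contract, the sketch below shows why looking up with the caller's inputs and storing with synthesized tensors produce the same key. The function body is a dtype-only stand-in, not the exact implementation:

```python
import torch

def get_cache_key_extras(tensors):
    # Per the documented contract: depend only on properties such as dtype
    # that the synthesis initializers preserve, never on tensor values or on
    # exact runtime shapes (the profile's bucket shapes cover shape).
    return tuple(t.dtype for t in tensors)

caller_inputs = [torch.randn(7, 64, dtype=torch.float16)]
synthesized   = [torch.empty(128, 64, dtype=torch.float16)]  # bucket-shaped
assert get_cache_key_extras(caller_inputs) == get_cache_key_extras(synthesized)
```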
🔍 Related Issues
#2398
#2886
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (unittest, etc.).

Reviewer Notes