
autotuner: check cache before synthesizing profile input tensors #3126

Merged
nv-yunzheq merged 2 commits into flashinfer-ai:main from leejnau:autotuner-cache-first-in-tuning-loop on Apr 23, 2026

Conversation

@leejnau
Contributor

@leejnau leejnau commented Apr 20, 2026

📌 Description

AutoTuner.choose_one's tuning-mode loop calls _prepare_input_tensors(p, inputs) before checking the cache. On a cache hit the synthesized tensors are thrown away, but their torch.rand / torch.randint kernel launches have already happened on the device. For any caller that runs choose_one repeatedly inside autotune(True) (e.g. a benchmark sharing its tuning and measurement call-sites), those kernels recur on every warm-cache call and get attributed to the measured region by CUPTI/nsys.

Move search_cache above _prepare_input_tensors in the tuning-mode loop. Skip synthesis on a cache hit; synthesize as before on a cache miss. The lookup passes the caller's inputs (rather than synthesized tensors), which aligns it with the non-tuning branch and the post-loop search_cache. This is safe under the existing get_cache_key_extras contract (dtype-like properties are preserved by synthesis).
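
A minimal sketch of the reordered tuning-mode loop (helper names are taken from the description above; signatures and the surrounding bookkeeping in flashinfer/autotuner.py are simplified, and `_profile_and_cache` is a hypothetical stand-in for the unchanged miss path):

```python
for p in profiles:
    # New: consult the profiling cache with the caller's real `inputs` first.
    cached = self.search_cache(p, inputs)
    if cached is not None:
        # Cache hit: no torch.rand / torch.randint / torch.empty launches at all.
        continue
    # Cache miss: synthesize profile tensors exactly as before, then profile.
    tensors = self._prepare_input_tensors(p, inputs)
    self._profile_and_cache(runners, tensors, p)  # hypothetical name for the existing miss path
```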

🔍 Related Issues

#2398

#2886

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor
    • Reduced runtime overhead during tuning by checking the cache before preparing/synthesizing inputs, avoiding unnecessary, expensive tensor synthesis on cache hits.
  • Documentation
    • Clarified cache behavior and requirements: cache-key extras must be synthesis-invariant, and input shapes vs. actual inputs influence separate matching steps.

`AutoTuner.choose_one` iterates through every `OptimizationProfile` that
`_generate_optimization_profiles` produces from the `tuning_config`'s
bucket grid.  For each profile the method called
`_prepare_input_tensors(p, inputs)` *before* checking the cache.  On a
cache hit those freshly-synthesized tensors were discarded and the
cached (runner_id, tactic) pair was returned as-is.

Under normal serving this waste was invisible -- each `_prepare_input_tensors`
call launches only a handful of small `torch.rand` / `torch.randint` /
`torch.empty` kernels (one per `DynamicTensorSpec`) and produces the
right answer.  But any caller that runs `choose_one` repeatedly from
inside an `autotune(True)` context pays the full synthesis cost on
every forward, even once the cache is fully warm.

`benchmarks/bench_moe_deepseek.py` is the motivating example: its
`run_benchmark` wraps both the pre-warm (which populates the cache) and
the measured `bench_gpu_time` iters in a single `autotune(True)` scope
(intentionally, to share the same call-site for tactic selection and
timing).  Every subsequent forward re-entered the tuning-mode branch of
`choose_one`, re-synthesized tensors for all tuning buckets, and those
synthesis kernels were captured by CUPTI/nsys tracing of the
measurement region.  On DeepSeek-V3 at bs=128 on B200 this added
~360 us/forward for the CuteDSL backend and ~470 us/forward for
TRTLLM (the two backends' `dynamic_tensor_specs` differ).  The result
was both a ~3x inflation of per-forward timings and, because the tax
is asymmetric across backends, a visible inversion of the reported
ranking.
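
For reference, the call-site shape that triggers this is roughly the following
(an illustrative sketch, not the benchmark's actual code; `run_forward` and
`num_iters` are placeholder names, and the `autotune` import path is assumed
from the module this PR touches):

```python
from flashinfer.autotuner import autotune  # assumed location of the tuning context manager

with autotune(True):                 # one scope shared by warm-up and measurement
    run_forward()                    # first pass tunes tactics and warms the profiling cache
    for _ in range(num_iters):       # measured region, traced by CUPTI / nsys
        run_forward()                # before this change: each warm-cache call still
                                     # re-synthesized profile tensors in choose_one
```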

Fix: move the `search_cache` call above `_prepare_input_tensors` in the
tuning-mode loop.  On a cache hit (by far the common case once the
first iteration has populated the cache) we skip synthesis entirely.
On a cache miss we proceed exactly as before.  The lookup now passes
the caller's `inputs` rather than the synthesized `tensors`; this is
safe because `_get_cache_key` only uses `p.get_opt_shapes()` (the
profile's bucket shapes, independent of any particular input tensor)
plus `runner.get_cache_key_extras(inputs)`, and the documented
contract for `get_cache_key_extras` is to return properties like
dtype that are preserved by the synthesis initializers.  The
non-tuning-mode branch of `choose_one` and the final post-loop
`search_cache` call already use the caller's `inputs`, so this aligns
the in-loop lookup with both.
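
Sketched concretely (with `_get_cache_key`'s exact signature assumed), the
equivalence relied on here is:

```python
opt_shapes = p.get_opt_shapes()                              # bucket shapes, tensor-independent
extras_from_inputs  = runner.get_cache_key_extras(inputs)    # caller's real tensors
extras_from_tensors = runner.get_cache_key_extras(tensors)   # synthesized tensors
assert extras_from_inputs == extras_from_tensors             # the documented contract (dtype-like only)
# => _get_cache_key(opt_shapes, extras) is the same whether the in-loop lookup
#    passes `inputs` (new) or the synthesized `tensors` (old).
```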

No behavior change on cache miss.  No change to how entries are stored
in `profiling_cache` (still keyed on `get_cache_key_extras(tensors)`
because `tensors` only exists on the miss path, which matches the
non-tuning-mode cache key under the contract described above).

Measured impact on `benchmarks/bench_moe_deepseek.py`, bs=128, EP=8,
B200, CUPTI enabled:
  - Before: CuteDSL 0.516 ms, TRTLLM 0.613 ms -> CuteDSL "wins" 1.19x
  -  After: CuteDSL 0.157 ms, TRTLLM 0.142 ms -> TRTLLM wins 1.11x

The post-fix ratio matches the ratio observed when timing via CUDA
events (`--no-cupti`: 0.144 / 0.126 = 1.14x TRTLLM), which bypasses
CUPTI and was therefore always clean of this pollution.
@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

AutoTuner.choose_one now calls search_cache(...) with the caller’s real inputs before synthesizing profile tensors; _prepare_input_tensors(...) is invoked only on cache misses. TunableRunner.get_cache_key_extras is documented to require synthesis-invariant extras; search_cache docs clarify input_shapes vs inputs roles.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Autotuner core (`flashinfer/autotuner.py`) | Reordered AutoTuner.choose_one: call search_cache(..., inputs=inputs) before _prepare_input_tensors; skip synthesized profile tensors / GPU work on cache hits. |
| TunableRunner contract / docs (`flashinfer/.../tunable_runner.py`) | Documented that TunableRunner.get_cache_key_extras must return synthesis-invariant values; clarified that input_shapes affects profile-bucket selection while inputs only affects extras used for cache keys. |
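
An illustrative docstring for the documented contract (paraphrase only; the actual wording and class layout in flashinfer/.../tunable_runner.py may differ):

```python
class TunableRunner:
    """Sketch of the runner interface relevant to cache keys."""

    def get_cache_key_extras(self, inputs):
        """Return extra cache-key components derived from `inputs`.

        Contract: the returned values must be synthesis-invariant, i.e. the
        same for the caller's real inputs and for tensors produced by
        _prepare_input_tensors (dtype-like properties, not data or exact
        shapes). Profile-bucket selection is driven by input shapes
        separately; `inputs` only feeds these extras.
        """
        raise NotImplementedError
```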

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant AutoTuner
    participant Cache
    participant Synth as _prepare_input_tensors
    participant Runner

    Caller->>AutoTuner: choose_one(inputs, profiles)
    AutoTuner->>Cache: search_cache(profile, inputs)
    alt cache hit
        Cache-->>AutoTuner: cached result
        AutoTuner-->>Caller: use cached profile
    else cache miss
        AutoTuner->>Synth: _prepare_input_tensors(profile, inputs)
        Synth-->>Runner: prepared tensors
        AutoTuner->>Runner: run/profile with tensors
        Runner-->>Cache: compute/store cache_key_extras
        Runner-->>AutoTuner: profile result
        AutoTuner-->>Caller: selected profile
    end
```


🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: moving cache checking before tensor synthesis in the autotuner's tuning loop. |
| Description check | ✅ Passed | The PR description includes all required sections: a detailed description of the problem and solution, related issues, and a completed pre-commit and testing checklist. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the choose_one method in flashinfer/autotuner.py by deferring input tensor synthesis until after a cache miss is confirmed. By checking the cache first using the original inputs, the implementation avoids unnecessary GPU kernel launches that could otherwise distort performance measurements in profiling tools. I have no feedback to provide.

@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !573 has been created, and the CI pipeline #49033299 is currently running. I'll report back once the pipeline job completes.

@leejnau leejnau force-pushed the autotuner-cache-first-in-tuning-loop branch from 12e5e30 to 4ad799e on April 20, 2026, 22:39
@leejnau leejnau force-pushed the autotuner-cache-first-in-tuning-loop branch from 4ad799e to c7100c7 on April 20, 2026, 22:42
@nvpohanh
Contributor

Thanks for fixing this!

@nv-yunzheq nv-yunzheq merged commit b353fa3 into flashinfer-ai:main Apr 23, 2026
37 of 44 checks passed
@leejnau leejnau deleted the autotuner-cache-first-in-tuning-loop branch April 23, 2026 22:50