autotuner: check cache before synthesizing profile input tensors (#3126)
Conversation
`AutoTuner.choose_one` iterates through every `OptimizationProfile` that `_generate_optimization_profiles` produces from the `tuning_config`'s bucket grid. For each profile the method called `_prepare_input_tensors(p, inputs)` *before* checking the cache. On a cache hit those freshly synthesized tensors were discarded and the cached (runner_id, tactic) pair was returned as-is. Under normal serving this waste was invisible: each `_prepare_input_tensors` call launches only a handful of small `torch.rand` / `torch.randint` / `torch.empty` kernels (one per `DynamicTensorSpec`) and produces the right answer. But any caller that runs `choose_one` repeatedly from inside an `autotune(True)` context pays the full synthesis cost on every forward, even once the cache is fully warm.

`benchmarks/bench_moe_deepseek.py` is the motivating example: its `run_benchmark` wraps both the pre-warm (which populates the cache) and the measured `bench_gpu_time` iters in a single `autotune(True)` scope (intentionally, to share the same call-site for tactic selection and timing). Every subsequent forward re-entered the tuning-mode branch of `choose_one`, re-synthesized tensors for all tuning buckets, and those synthesis kernels were captured by CUPTI/nsys tracing of the measurement region. On DeepSeek-V3 at bs=128 on B200 this added ~360 us/forward for the CuteDSL backend and ~470 us/forward for TRTLLM (the two backends' `dynamic_tensor_specs` differ). The result was both a ~3x inflation of per-forward timings and, because the tax is asymmetric across backends, a visible inversion of the reported ranking.

Fix: move the `search_cache` call above `_prepare_input_tensors` in the tuning-mode loop. On a cache hit (by far the common case once the first iteration has populated the cache) we skip synthesis entirely. On a cache miss we proceed exactly as before.

The lookup now passes the caller's `inputs` rather than the synthesized `tensors`; this is safe because `_get_cache_key` only uses `p.get_opt_shapes()` (the profile's bucket shapes, independent of any particular input tensor) plus `runner.get_cache_key_extras(inputs)`, and the documented contract for `get_cache_key_extras` is to return properties like dtype that are preserved by the synthesis initializers. The non-tuning-mode branch of `choose_one` and the final post-loop `search_cache` call already use the caller's `inputs`, so this aligns the in-loop lookup with both. No behavior change on cache miss. No change to how entries are stored in `profiling_cache` (still keyed on `get_cache_key_extras(tensors)`, because `tensors` only exists on the miss path, which matches the non-tuning-mode cache key under the contract described above).

Measured impact on `benchmarks/bench_moe_deepseek.py`, bs=128, EP=8, B200, CUPTI enabled:

- Before: CuteDSL 0.516 ms, TRTLLM 0.613 ms -> CuteDSL "wins" 1.19x
- After: CuteDSL 0.157 ms, TRTLLM 0.142 ms -> TRTLLM wins 1.11x

The post-fix ratio matches the ratio observed when timing via CUDA events (`--no-cupti`: 0.144 / 0.126 = 1.14x TRTLLM), which bypasses CUPTI and was therefore always clean of this pollution.
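The reordering is easiest to see as a standalone sketch. Everything below is a simplified stand-in for illustration (the dict cache, the dtype-only key, `profile_all_tactics`); it is not FlashInfer's actual API, only the shape of the change:

```python
import torch

def profile_all_tactics(profile, tensors):
    # Stand-in for the runner/tactic sweep; returns a (runner_id, tactic) pair.
    return (0, 0)

def choose_one(inputs, profiles, cache):
    for p in profiles:
        # The key depends only on the profile's bucket shapes plus
        # value-independent input properties (dtype here), never on tensor
        # values, so the caller's `inputs` suffice for the lookup.
        key = (p["opt_shapes"], tuple(t.dtype for t in inputs))
        if key in cache:                    # the fix: check the cache first
            continue                        # hit: no synthesis kernels launched
        tensors = [torch.empty(p["opt_shapes"], dtype=t.dtype, device=t.device)
                   for t in inputs]         # miss path: synthesize as before
        cache[key] = profile_all_tactics(p, tensors)
    return cache

# Warm-cache call: the second invocation finds every key and launches nothing.
inputs = [torch.randn(128, 64)]
profiles = [{"opt_shapes": (8, 64)}, {"opt_shapes": (128, 64)}]
cache = choose_one(inputs, profiles, {})
choose_one(inputs, profiles, cache)
```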
📝 Walkthrough

AutoTuner.choose_one now calls search_cache(...) with the caller's real inputs before synthesizing profile tensors; _prepare_input_tensors runs only on a cache miss.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant AutoTuner
    participant Cache
    participant Synth as _prepare_input_tensors
    participant Runner
    Caller->>AutoTuner: choose_one(inputs, profiles)
    AutoTuner->>Cache: search_cache(profile, inputs)
    alt cache hit
        Cache-->>AutoTuner: cached result
        AutoTuner-->>Caller: use cached profile
    else cache miss
        AutoTuner->>Synth: _prepare_input_tensors(profile, inputs)
        Synth-->>Runner: prepared tensors
        AutoTuner->>Runner: run/profile with tensors
        Runner-->>Cache: compute/store cache_key_extras
        Runner-->>AutoTuner: profile result
        AutoTuner-->>Caller: selected profile
    end
```
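For context, the call pattern that exposed the problem looks roughly like the following. The import path and the `forward`/`iters` names are assumptions standing in for the benchmark's actual helpers, not verbatim code:

```python
from flashinfer.autotuner import autotune  # assumed import path

def forward():
    ...  # placeholder for the benchmark's MoE forward pass

iters = 100
with autotune(True):          # one scope shared by warm-up and measurement
    forward()                 # first iteration populates profiling_cache
    for _ in range(iters):    # measured iterations stay in tuning mode:
        forward()             # pre-fix, each re-synthesized per-bucket tensors;
                              # post-fix, each hits the cache and launches nothing
```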
Code Review
This pull request optimizes the choose_one method in flashinfer/autotuner.py by deferring input tensor synthesis until after a cache miss is confirmed. By checking the cache first using the original inputs, the implementation avoids unnecessary GPU kernel launches that could otherwise distort performance measurements in profiling tools. I have no feedback to provide.
/bot run
Force-pushed from 12e5e30 to 4ad799e

Force-pushed from 4ad799e to c7100c7
Thanks for fixing this!
📌 Description
AutoTuner.choose_one's tuning-mode loop calls _prepare_input_tensors(p, inputs) before checking the cache. On a cache hit the synthesized tensors are thrown away, but their torch.rand / torch.randint kernel launches already happened on the device. For any caller that runs choose_one repeatedly inside autotune(True) (e.g. a benchmark sharing its tuning and measurement call-sites), those kernels recur on every warm-cache call and get attributed to the measured region by CUPTI / nsys.

Move search_cache above _prepare_input_tensors in the tuning-mode loop. Skip synthesis on cache hit; synthesize as before on cache miss. The lookup passes the caller's inputs (rather than synthesized tensors), which aligns it with the non-tuning branch and the post-loop search_cache. Safe under the existing get_cache_key_extras contract (dtype-like properties preserved by synthesis).
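As a concrete illustration of that contract, the sketch below shows why looking up with the caller's inputs and storing with synthesized tensors produce the same key. The function body is a dtype-only stand-in, not the exact implementation:

```python
import torch

def get_cache_key_extras(tensors):
    # Per the documented contract: depend only on properties such as dtype
    # that the synthesis initializers preserve, never on tensor values or on
    # exact runtime shapes (the profile's bucket shapes cover shape).
    return tuple(t.dtype for t in tensors)

caller_inputs = [torch.randn(7, 64, dtype=torch.float16)]
synthesized   = [torch.empty(128, 64, dtype=torch.float16)]  # bucket-shaped
assert get_cache_key_extras(caller_inputs) == get_cache_key_extras(synthesized)
```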
🔍 Related Issues
#2398
#2886
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (unittest, etc.).

Reviewer Notes