[Bug] Fix torch Compilation Cache Hit Error#25093
simon-mo merged 3 commits into vllm-project:main
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request addresses a critical bug where an insufficient cache key for compiled CUDA graphs caused cache collisions and errors when using the deepep_high_throughput backend. The fix correctly disables CUDA graphs for this specific configuration, preventing the crash. This is a solid, pragmatic solution for the immediate problem. I've added one high-severity suggestion to add a TODO comment to track the technical debt of this temporary fix, ensuring the long-term goal of re-enabling this performance feature with a more robust caching mechanism is not lost.
ProExpertProg
left a comment
I don't understand, is this a torch compile caching issue or is it a CUDAGraph issue? These are (somewhat) orthogonal features. I don't know of any cudagraph caching. Also I think we should be able to disable CUDAGraphs but keep compilation (maybe it just can't be piecewise).
@ProExpertProg This is a compile caching issue. Yeah, we can keep compilation as it is and just disable CUDA graphs; fixed now.
ProExpertProg
left a comment
This seems fine, but if we want more performance we could also just disable inductor compile caching (increases startup time but would give the best performance).
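As a sketch of the alternative suggested above: vLLM exposes an environment variable to bypass the torch compile cache, trading longer startup time for a guaranteed fresh compile (the variable name `VLLM_DISABLE_COMPILE_CACHE` is my understanding of vLLM's env config; the model name is a placeholder).

```shell
# Disable the torch compile cache so every launch recompiles from scratch:
# slower startup, but no risk of a stale or colliding cache entry.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve <model>
```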
Yes, but it seems that piecewise CUDA graph for HT is mainly beneficial for decoding; for prefill we don't see much performance improvement.
# TODO: Piecewise Cuda graph might be enabled
# if torch compile cache key issue fixed
# See https://github.com/vllm-project/vllm/pull/25093
is this a bug? can you file an issue if so?
This is not a bug; it is just that the cache key is not strong enough to support splitting. I don't think the refactor is worth doing just to support HT piecewise CUDA graph, so let's leave it there.
Purpose
Fixes #24915
The root cause of this issue is that (runtime_shape, graph_index, backend_name) is not a strong enough cache key for compiled CUDA graphs. In a complicated situation like DeepEP HT, we split the piecewise graph into moe_forward and moe_forward_shared, which produces a wrong cache hit.

I am not sure it is a good idea to thoroughly refactor the cache key system just to support piecewise graphs for HT (perhaps not worth it), so this PR simply removes the support for the HT graph for now. If we encounter other scenarios where the cache key proves insufficient, we should revisit and redesign the cache system.
Note: a rough idea for refactoring the cache system is to add a fourth key component carrying the signature of the sub-graph. This works well locally.
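The collision and the proposed fix can be illustrated with a minimal, self-contained sketch (all names here are illustrative stand-ins, not vLLM's actual cache code):

```python
import hashlib

# A cache keyed only by (runtime_shape, graph_index, backend_name) cannot
# distinguish two different subgraphs that share those three values.
cache = {}

def get_or_compile(runtime_shape, graph_index, backend_name, subgraph_src):
    weak_key = (runtime_shape, graph_index, backend_name)
    if weak_key in cache:
        return cache[weak_key]              # may be the WRONG graph (collision)
    compiled = f"compiled<{subgraph_src}>"  # stand-in for real compilation
    cache[weak_key] = compiled
    return compiled

# Two distinct subgraphs (e.g. moe_forward vs moe_forward_shared) with the
# same shape/index/backend collide: the second lookup returns the first graph.
a = get_or_compile((8, 128), 0, "inductor", "moe_forward")
b = get_or_compile((8, 128), 0, "inductor", "moe_forward_shared")
assert a == b  # wrong cache hit: b is the compiled moe_forward graph

# Adding a fourth key component -- a hash of the subgraph's signature --
# keeps the two entries separate.
def strong_key(runtime_shape, graph_index, backend_name, subgraph_src):
    sig = hashlib.sha256(subgraph_src.encode()).hexdigest()[:16]
    return (runtime_shape, graph_index, backend_name, sig)

k1 = strong_key((8, 128), 0, "inductor", "moe_forward")
k2 = strong_key((8, 128), 0, "inductor", "moe_forward_shared")
assert k1 != k2  # distinct cache entries, no collision
```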
Test
Originally: Wrong graph cache hit in the second run
Now:
(APIServer pid=1527118) INFO: Started server process [1527118]
(APIServer pid=1527118) INFO: Waiting for application startup.
(APIServer pid=1527118) INFO: Application startup complete.