Background
Currently we have three cuda graph implementations: full cuda graph, torch-compile-based piecewise cuda graph (TCPiecewiseCudaGraph), and breakable cuda graph. There is a lot of duplicated code across their runners. We want to refactor and unify them so that the cuda graph code structure is easier for users to understand.
Goal
Support flexible cuda graph backends across the full (decode, prefill) x (full, breakable, torch-compile-based pcg) matrix, and enable the breakable cuda graph for prefill by default.
Proposed Design
Part 1: Cuda Graph Implementation
1. Runner
PrefillCudaGraphRunner: manages cuda graph execution for the prefill phase
DecodeCudaGraphRunner: manages cuda graph execution for the decode phase
Both inherit from BaseCudaGraphRunner.
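The runner split could be sketched as below. This is a minimal illustration of the proposed inheritance, not the actual implementation; the method names (capture, replay, run_eager) and the backend interface are assumptions for the sake of the example.

```python
from abc import ABC, abstractmethod

class BaseCudaGraphRunner(ABC):
    """Shared capture/replay bookkeeping for both phases (illustrative)."""

    def __init__(self, backend):
        self.backend = backend  # pluggable BaseCudaGraphBackend (hypothetical)
        self.graphs = {}        # padded batch size -> captured graph handle

    @abstractmethod
    def capture(self, batch_size):
        """Capture a graph for one padded batch size."""

    def replay(self, batch_size, inputs):
        # Fall back to eager execution when no graph was captured
        # for this batch size.
        graph = self.graphs.get(batch_size)
        if graph is None:
            return self.backend.run_eager(inputs)
        return self.backend.replay(graph, inputs)

class PrefillCudaGraphRunner(BaseCudaGraphRunner):
    def capture(self, batch_size):
        self.graphs[batch_size] = self.backend.capture_prefill(batch_size)

class DecodeCudaGraphRunner(BaseCudaGraphRunner):
    def capture(self, batch_size):
        self.graphs[batch_size] = self.backend.capture_decode(batch_size)
```

The point of the shared base is that padding, graph lookup, and the eager fallback live in one place, while each phase only specializes how capture is driven.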
2. Backend
FullCudaGraphBackend: captures the full model in a single cuda graph
BreakableCudaGraphBackend: supports breaking out of the cuda graph for dynamic ops
TCPiecewiseCudaGraphBackend: torch-compile-based piecewise cuda graph
Combination: each runner owns a pluggable BaseCudaGraphBackend, giving a clean cross product of (prefill, decode) x (full, breakable, tcpcg).
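The cross product could be wired up through a small lookup, as in the sketch below. All class and function names here are placeholders standing in for the real runners and backends; only the (phase, mode) composition idea comes from the design above.

```python
# Placeholder classes; the real backends/runners carry capture and replay logic.
class FullCudaGraphBackend: ...
class BreakableCudaGraphBackend: ...
class TCPiecewiseCudaGraphBackend: ...

class PrefillCudaGraphRunner:
    def __init__(self, backend):
        self.backend = backend

class DecodeCudaGraphRunner:
    def __init__(self, backend):
        self.backend = backend

# The (prefill, decode) x (full, breakable, tcpcg) cross product
# is resolved through two lookup tables instead of 6 bespoke classes.
BACKENDS = {
    "full": FullCudaGraphBackend,
    "breakable": BreakableCudaGraphBackend,
    "tcpcg": TCPiecewiseCudaGraphBackend,
}
RUNNERS = {"prefill": PrefillCudaGraphRunner, "decode": DecodeCudaGraphRunner}

def make_runner(phase: str, mode: str):
    """Build the runner for `phase` with the backend selected by `mode`."""
    return RUNNERS[phase](BACKENDS[mode]())
```

Because the backend is injected rather than inherited, adding a fourth backend only touches the BACKENDS table, not the runner classes.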
Part 2: Arguments
Two types of arguments are supported:
Config-based: full control via a JSON config per phase, e.g. --cuda-graph-mode {"decode": "full", "prefill": "breakable"}
Convenience flags: shorthand arguments that translate to the corresponding config, e.g.
--prefill-disable-cuda-graph → {"prefill": "disabled"}
--decode-disable-cuda-graph → {"decode": "disabled"}
--prefill-cuda-graph-bs → sets batch sizes for prefill
--decode-cuda-graph-bs → sets batch sizes for decode
Plan
Enable {"decode": "full", "prefill": "breakable"} by default.
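A sketch of how the JSON config and the convenience flags described in Part 2 could be merged is shown below. The function name and flag-merging semantics are assumptions for illustration; only the default of {"decode": "full", "prefill": "breakable"} and the flag-to-config translations come from the design above.

```python
import json

# Default from the plan: full cuda graph for decode,
# breakable cuda graph for prefill.
DEFAULT_MODE = {"decode": "full", "prefill": "breakable"}

def resolve_cuda_graph_mode(cuda_graph_mode=None,
                            prefill_disable=False,
                            decode_disable=False):
    """Merge the per-phase JSON config with convenience flags (hypothetical)."""
    mode = dict(DEFAULT_MODE)
    if cuda_graph_mode:
        # --cuda-graph-mode takes a JSON object keyed by phase.
        mode.update(json.loads(cuda_graph_mode))
    # Convenience flags translate to the corresponding config entries.
    if prefill_disable:
        mode["prefill"] = "disabled"
    if decode_disable:
        mode["decode"] = "disabled"
    return mode
```

For example, passing only '{"decode": "breakable"}' would keep the prefill default, while --prefill-disable-cuda-graph would override whatever the config says for prefill.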