build(cuda): enable GGML_CUDA_GRAPHS on CUDA builds #26

Merged
mudler merged 1 commit into master from feat/cuda-graphs on Jun 12, 2026

@localai-bot

Collaborator

What

Enable GGML_CUDA_GRAPHS whenever parakeet.cpp forwards CUDA (PARAKEET_GGML_CUDA=ON). ggml leaves this off by default.

Why

With CUDA graphs on, the CUDA backend captures and replays the compute graph, which gives a small but free speedup. Measured on a GB10 (interleaved runs, best-of timings, same 180 s clip):

| model | graphs ON | graphs OFF | gain |
| --- | --- | --- | --- |
| tdt-1.1b | ~1477 ms | ~1498 ms | +1.4% |
| tdt-0.6b-v3 | ~970 ms | ~974 ms | +0.4% |
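
For reference, the gain column can be reproduced from the two timing columns. This is a minimal sketch that assumes gain is defined as the OFF time divided by the ON time, minus one; the PR does not state the formula, and `gain_percent` is a name made up for this illustration.

```python
def gain_percent(on_ms, off_ms):
    """Speedup of graphs ON relative to graphs OFF, in percent.

    Assumed definition: (off / on - 1) * 100. The PR does not spell out
    which formula was used.
    """
    return (off_ms / on_ms - 1.0) * 100.0


# Timings copied from the table above: (graphs ON, graphs OFF) in milliseconds.
timings = {
    "tdt-1.1b": (1477.0, 1498.0),
    "tdt-0.6b-v3": (970.0, 974.0),
}

for model, (on_ms, off_ms) in timings.items():
    print(f"{model}: {gain_percent(on_ms, off_ms):+.1f}%")
```

Dividing by the OFF time instead gives the same values after rounding to one decimal place.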

The gain was never negative across runs. It is capped because parakeet rebuilds each graph in a fresh ggml_context on every call, which defeats ggml's cross-call graph replay (keyed on the first node pointer); the encoder also runs only once per request. Lifting that cap is a separate, larger change. This PR just takes the free win.
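
To make the cap concrete, here is a toy model of a replay cache keyed on the first node's address. It is not ggml code: `Context`, `ReplayCache`, and `simulate` are hypothetical names, and the fake allocator hands out ever-increasing addresses, which is a simplifying assumption (a real allocator could reuse an address).

```python
import itertools

# Fake allocator: every new node gets a fresh, ever-increasing address.
# Assumption for illustration only; real allocators may reuse addresses.
_addresses = itertools.count(0x1000, 0x40)


class Context:
    """Toy stand-in for a per-call allocation context (hypothetical)."""

    def build_graph(self, n_nodes):
        # A "graph" here is just the list of its node addresses.
        return [next(_addresses) for _ in range(n_nodes)]


class ReplayCache:
    """Replays only when the first node address matches the last capture."""

    def __init__(self):
        self.key = None
        self.captures = 0
        self.replays = 0

    def run(self, graph):
        if graph[0] == self.key:
            self.replays += 1
            return "replay"
        # Different first node: treat it as a new graph and capture again.
        self.key = graph[0]
        self.captures += 1
        return "capture"


def simulate(calls, reuse_graph):
    """Run `calls` requests, rebuilding the graph each time unless reuse_graph."""
    cache = ReplayCache()
    graph = Context().build_graph(4)
    for _ in range(calls):
        if not reuse_graph:
            graph = Context().build_graph(4)  # fresh context on every call
        cache.run(graph)
    return cache


rebuilt = simulate(5, reuse_graph=False)
reused = simulate(5, reuse_graph=True)
print("fresh context per call:", rebuilt.captures, "captures,", rebuilt.replays, "replays")
print("graph kept across calls:", reused.captures, "captures,", reused.replays, "replays")
```

In this model, rebuilding per call never hits the cache, while keeping the graph alive captures once and replays afterwards.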

Notes

  • Single point of control in CMakeLists.txt, so docker/release/local CUDA builds all inherit it.
  • Runtime kill-switch GGML_CUDA_DISABLE_GRAPHS=1 still works for A/B testing.
  • Proven by building parakeet.cpp with -DGGML_CUDA_GRAPHS=ON on the GB10 and benchmarking; this change just automates that flag.
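
For anyone who wants to repeat the A/B comparison with the runtime kill-switch, a minimal interleaved best-of harness could look like the sketch below. Only the GGML_CUDA_DISABLE_GRAPHS=1 variable comes from this PR; the function names are hypothetical, the command to time is whatever the caller supplies, and the harness measures whole-process wall time, which may differ from how the numbers above were collected.

```python
import os
import subprocess
import sys
import time


def time_once(cmd, disable_graphs):
    """Run cmd once and return wall-clock milliseconds.

    When disable_graphs is true, GGML_CUDA_DISABLE_GRAPHS=1 is set in the
    child environment; otherwise the variable is removed.
    """
    env = dict(os.environ)
    if disable_graphs:
        env["GGML_CUDA_DISABLE_GRAPHS"] = "1"
    else:
        env.pop("GGML_CUDA_DISABLE_GRAPHS", None)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) * 1000.0


def ab_best_of(cmd, rounds=5):
    """Alternate graphs ON and OFF runs and keep the best time of each."""
    best = {"on": float("inf"), "off": float("inf")}
    for _ in range(rounds):
        # Interleaving spreads thermal and cache drift over both variants.
        best["on"] = min(best["on"], time_once(cmd, disable_graphs=False))
        best["off"] = min(best["off"], time_once(cmd, disable_graphs=True))
    return best


# Example with a placeholder command; substitute the real transcription
# command line, which is not given in this PR.
result = ab_best_of([sys.executable, "-c", "pass"], rounds=2)
print(f"graphs ON: {result['on']:.1f} ms, graphs OFF: {result['off']:.1f} ms")
```

With a gain this small, best-of timing over interleaved rounds is less sensitive to noise than a single pair of runs.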

🤖 Generated with Claude Code

ggml leaves GGML_CUDA_GRAPHS off by default. Turning it on lets the CUDA
backend capture and replay the compute graph, a small but free speedup
(about 1% measured on a GB10: 1.4% on tdt-1.1b, 0.4% on tdt-0.6b-v3, and
never negative across interleaved runs).

Enable it whenever we forward CUDA so every CUDA build (docker, release,
local) inherits it. The runtime kill-switch GGML_CUDA_DISABLE_GRAPHS=1
still disables it for A/B testing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mudler merged commit b8012f1 into master on Jun 12, 2026
8 checks passed
adyranov pushed a commit to adyranov/parakeet.cpp that referenced this pull request Jun 13, 2026
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>