server : print graphs reused in slot timings#23279
Merged
Merged
Conversation
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi
ServeurpersoCom
approved these changes
May 18, 2026
kgrama
pushed a commit
to kgrama/llama.cpp
that referenced
this pull request
May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
xxmustafacooTR
pushed a commit
to xxPlayground/llama-cpp-turboquant
that referenced
this pull request
May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
rsenthilkumar6
pushed a commit
to rsenthilkumar6/llama.cpp
that referenced
this pull request
May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
ArberSephirotheca
pushed a commit
to ArberSephirotheca/llama.cpp
that referenced
this pull request
May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
fhnmor21
pushed a commit
to fhnmor21/llama-cpp-turboquant
that referenced
this pull request
May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
dbrain
pushed a commit
to dbrain/hbd-llama-cpp-turboquant
that referenced
this pull request
May 21, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
baramofme
pushed a commit
to baramofme/llama-cpp-turboquant
that referenced
this pull request
May 23, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
Jcfunk
added a commit
to Jcfunk/llama.cpp
that referenced
this pull request
May 23, 2026
* upstream/HEAD: ci : install server kleidiai runner dependencies (ggml-org#23259) server-context: guarantee there is at least 1 token to decode (ggml-org#23280) server : print graphs reused in slot timings (ggml-org#23279) save-load-state : refactor tests and improve readability (ggml-org#23196) llama-eval : add per-task summary stats (ggml-org#23151) ggml-webgpu : extend GDN for K>1 (ggml-org#23299) [SCYL] add chapter for performance reference in SYCL.md (ggml-org#23315) convert : filter lora tensor names (ggml-org#23077) sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (ggml-org#22153) rpc : keep last_graph_uid in the device context (ggml-org#23273)
srossitto79
pushed a commit
to srossitto79/llama.cpp
that referenced
this pull request
May 23, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
carlosfundora
pushed a commit
to carlosfundora/llama.cpp-1-bit-turbo
that referenced
this pull request
May 24, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> (cherry picked from commit 3c81c8d)
fewtarius
pushed a commit
to fewtarius/llama.cpp
that referenced
this pull request
May 30, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
turbo-tan
pushed a commit
to turbo-tan/llama.cpp-tq3
that referenced
this pull request
Jun 2, 2026
Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
Jcfunk
added a commit
to Jcfunk/llama.cpp
that referenced
this pull request
Jun 11, 2026
* upstream/HEAD: (25 commits) metal : optimize pad + cpy (ggml-org#23354) snapdragon: update toolchain to v0.6 (ggml-org#23369) ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349) opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303) hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317) refactor: Chat Screen UI rendering (ggml-org#23333) github: mention --log-file in issue templates (ggml-org#23277) common: fix --help for --verbosity (ggml-org#23278) common: fix --fit verbosity with --verbosity 4 (ggml-org#23282) convert : update mtp related help (ggml-org#23334) hexagon: enable support for NORM op (ggml-org#23319) model : clarify MTP layer comment in qwen35.cpp [no ci] (ggml-org#23338) llama : MTP clean-up (ggml-org#23269) ui: Bump packages + address build warnings (ggml-org#23300) ci : install libssl-dev (ggml-org#23325) ci : install server kleidiai runner dependencies (ggml-org#23259) server-context: guarantee there is at least 1 token to decode (ggml-org#23280) server : print graphs reused in slot timings (ggml-org#23279) save-load-state : refactor tests and improve readability (ggml-org#23196) llama-eval : add per-task summary stats (ggml-org#23151) ...
carlosfundora
pushed a commit
to carlosfundora/llama.cpp-1-bit-turbo
that referenced
this pull request
Jun 13, 2026
- ggml-cpu/ops.cpp: drop duplicate quant case labels created by the merge (RotorQuant PLANAR/ISO/Q1_0/TQ3_0 cases were stacked twice in switch groups) - server-context.cpp: re-merge with codex spec-decode base + integration audio output; fix pre-existing ctx_tgt/ctx_dft/spec references that have no matching member declarations (inherited from master's partial upstream PR ggml-org#23279/ggml-org#23461 merge) -> use ctx and the per-slot common_speculative_free path - tests/CMakeLists.txt: drop duplicate test-llama-kv-cells registration CPU build clean (llama-quantize/server/cli, mtmd, liquid-audio, tests). test-llama-kv-cells passes. test-quantize-fns segfaults at tq3_0 — PRE-EXISTING on integration (identical crash+error), tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Add
graphs reusedcounter to the per-slot timing output in the server. The value is obtained viallama_perf_context().n_reused, matching the existing behavior inllama_perf_context_print().Note: these are the total accumulated reused graphs for all generations
Output example:
Requirements