server : print graphs reused in slot timings by ggerganov · Pull Request #23279 · ggml-org/llama.cpp

ggerganov · 2026-05-18T13:53:28Z

Overview

Add a graphs-reused counter to the per-slot timing output in the server. The value is read from llama_perf_context().n_reused, matching the existing behavior of llama_perf_context_print().

Note: the value is the accumulated total of reused graphs across all generations, not a per-generation count.
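Because the counter is cumulative, a per-generation figure would have to be derived by the caller. The sketch below shows one way to do that by remembering the previous total. It is not part of this PR: `reuse_tracker` and `delta` are hypothetical names, and the sketch assumes the counter is not reset between generations.

```cpp
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): turns a cumulative counter,
// such as the n_reused value discussed above, into per-generation deltas.
struct reuse_tracker {
    int32_t last_total = 0; // cumulative value seen at the previous call

    // Returns how many graphs were reused since the previous call
    // and remembers the new cumulative total.
    int32_t delta(int32_t current_total) {
        const int32_t d = current_total - last_total;
        last_total = current_total;
        return d;
    }
};
```

A caller would pass the current cumulative value after each generation. If the perf counters were reset in between, the delta would be negative and would need separate handling.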

Output example:

prompt eval time =     184.43 ms /   142 tokens (    1.30 ms per token,   769.93 tokens per second)
       eval time =  469503.36 ms / 35716 tokens (   13.15 ms per token,    76.07 tokens per second)
      total time =  469687.79 ms / 35858 tokens
   graphs reused =         42
draft acceptance = 0.74482 (24673 accepted / 33126 generated)
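As a rough illustration of how the "graphs reused" line lines up with the other timing lines, here is a minimal, self-contained sketch. `perf_data_sketch` is a hypothetical stand-in for the data returned by llama_perf_context(), and the format string is inferred from the sample output above rather than copied from the server source.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the perf data returned by llama_perf_context();
// only the field discussed in this PR is modeled.
struct perf_data_sketch {
    int32_t n_reused; // cumulative number of reused compute graphs
};

// Formats the counter with a 16-character label, " = ", and a 10-character
// right-aligned value. The layout is an assumption based on the sample output.
std::string format_graphs_reused(const perf_data_sketch & data) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "   graphs reused = %10d",
                  static_cast<int>(data.n_reused));
    return std::string(buf);
}
```

In the server itself the value comes from the slot's context and is logged together with the other timing lines. The stand-in struct only keeps the sketch independent of llama.h.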

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. llama.cpp + pi

Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi

Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

* upstream/HEAD:
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ggml-webgpu : extend GDN for K>1 (ggml-org#23299)
  [SCYL] add chapter for performance reference in SYCL.md (ggml-org#23315)
  convert : filter lora tensor names (ggml-org#23077)
  sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (ggml-org#22153)
  rpc : keep last_graph_uid in the device context (ggml-org#23273)

Add graphs reused counter to the per-slot timing output, printed via llama_perf_context(). Assisted-by: llama.cpp:local pi Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> (cherry picked from commit 3c81c8d)

* upstream/HEAD: (25 commits)
  metal : optimize pad + cpy (ggml-org#23354)
  snapdragon: update toolchain to v0.6 (ggml-org#23369)
  ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)
  opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303)
  hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317)
  refactor: Chat Screen UI rendering (ggml-org#23333)
  github: mention --log-file in issue templates (ggml-org#23277)
  common: fix --help for --verbosity (ggml-org#23278)
  common: fix --fit verbosity with --verbosity 4 (ggml-org#23282)
  convert : update mtp related help (ggml-org#23334)
  hexagon: enable support for NORM op (ggml-org#23319)
  model : clarify MTP layer comment in qwen35.cpp [no ci] (ggml-org#23338)
  llama : MTP clean-up (ggml-org#23269)
  ui: Bump packages + address build warnings (ggml-org#23300)
  ci : install libssl-dev (ggml-org#23325)
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ...

- ggml-cpu/ops.cpp: drop duplicate quant case labels created by the merge (RotorQuant PLANAR/ISO/Q1_0/TQ3_0 cases were stacked twice in switch groups)
- server-context.cpp: re-merge with codex spec-decode base + integration audio output; fix pre-existing ctx_tgt/ctx_dft/spec references that have no matching member declarations (inherited from master's partial upstream PR ggml-org#23279/ggml-org#23461 merge) -> use ctx and the per-slot common_speculative_free path
- tests/CMakeLists.txt: drop duplicate test-llama-kv-cells registration

CPU build clean (llama-quantize/server/cli, mtmd, liquid-audio, tests). test-llama-kv-cells passes. test-quantize-fns segfaults at tq3_0; PRE-EXISTING on integration (identical crash+error), tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

server : print graphs reused in slot timings

e8560c4

ggerganov marked this pull request as ready for review May 18, 2026 14:22

ggerganov requested a review from a team as a code owner May 18, 2026 14:22

ServeurpersoCom approved these changes May 18, 2026

github-actions bot added the examples and server labels May 18, 2026

ggerganov merged commit 3c81c8d into master May 19, 2026
46 of 49 checks passed

ggerganov deleted the gg/server-print-graphs-reused branch May 19, 2026 06:47
