server : print graphs reused in slot timings#23279

Merged
ggerganov merged 1 commit into master from gg/server-print-graphs-reused on May 19, 2026
Conversation

@ggerganov ggerganov commented May 18, 2026

Member

Overview

Add graphs reused counter to the per-slot timing output in the server. The value is obtained via llama_perf_context().n_reused, matching the existing behavior in llama_perf_context_print().

Note: the value is the total number of reused graphs accumulated across all generations.
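Because the counter is cumulative, a per-generation figure has to be derived by comparing two readings. The following is only a sketch of that idea: `perf_snapshot` and `graphs_reused_delta` are hypothetical names introduced for illustration, and only the field name `n_reused` comes from the description above. It also assumes the counter can drop back to a lower value if the perf counters are reset between readings.

```cpp
#include <cstdint>

// Hypothetical stand-in for the perf data discussed above; only the
// field name n_reused is taken from the PR description.
struct perf_snapshot {
    int32_t n_reused; // cumulative number of reused graphs at the time of the reading
};

// Number of graphs reused between two readings of the same context.
// Returns 0 when the counter went backwards (assumed to mean it was reset in between).
inline int32_t graphs_reused_delta(const perf_snapshot & before, const perf_snapshot & after) {
    const int32_t delta = after.n_reused - before.n_reused;
    return delta > 0 ? delta : 0;
}
```

A caller would take one reading before a generation starts and one after it finishes, then pass both to `graphs_reused_delta`.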

Output example:

prompt eval time =     184.43 ms /   142 tokens (    1.30 ms per token,   769.93 tokens per second)
       eval time =  469503.36 ms / 35716 tokens (   13.15 ms per token,    76.07 tokens per second)
      total time =  469687.79 ms / 35858 tokens
   graphs reused =         42
draft acceptance = 0.74482 (24673 accepted / 33126 generated)
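The `graphs reused` line in the output above could be produced roughly as follows. This is a sketch, not the code from the commit: `format_graphs_reused` is a hypothetical helper, and the label padding and field width of 10 are inferred from the sample output rather than taken from the server source.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats the "graphs reused" line so that the "="
// lines up with the other timing lines in the sample output.
// The width of 10 is an assumption inferred from that sample.
inline std::string format_graphs_reused(int n_reused) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "   graphs reused = %10d", n_reused);
    return std::string(buf);
}
```

In the server the argument would come from `llama_perf_context().n_reused`, as stated in the overview; here it is passed in directly so the sketch stays self-contained.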

Requirements

Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi
@ggerganov ggerganov marked this pull request as ready for review May 18, 2026 14:22
@ggerganov ggerganov requested a review from a team as a code owner May 18, 2026 14:22
@ggerganov ggerganov merged commit 3c81c8d into master May 19, 2026
46 of 49 checks passed
@ggerganov ggerganov deleted the gg/server-print-graphs-reused branch May 19, 2026 06:47
kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request May 19, 2026
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 19, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026
fhnmor21 pushed a commit to fhnmor21/llama-cpp-turboquant that referenced this pull request May 19, 2026
dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request May 23, 2026
* upstream/HEAD:
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ggml-webgpu : extend GDN for K>1 (ggml-org#23299)
  [SCYL] add chapter for performance reference in SYCL.md (ggml-org#23315)
  convert : filter lora tensor names (ggml-org#23077)
  sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (ggml-org#22153)
  rpc : keep last_graph_uid in the device context (ggml-org#23273)
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026
(cherry picked from commit 3c81c8d)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request Jun 11, 2026
* upstream/HEAD: (25 commits)
  metal : optimize pad + cpy (ggml-org#23354)
  snapdragon: update toolchain to v0.6 (ggml-org#23369)
  ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)
  opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303)
  hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317)
  refactor: Chat Screen UI rendering (ggml-org#23333)
  github: mention --log-file in issue templates (ggml-org#23277)
  common: fix --help for --verbosity (ggml-org#23278)
  common: fix --fit verbosity with --verbosity 4 (ggml-org#23282)
  convert : update mtp related help (ggml-org#23334)
  hexagon: enable support for NORM op (ggml-org#23319)
  model : clarify MTP layer comment in qwen35.cpp [no ci] (ggml-org#23338)
  llama : MTP clean-up (ggml-org#23269)
  ui: Bump packages + address build warnings (ggml-org#23300)
  ci : install libssl-dev (ggml-org#23325)
  ci : install server kleidiai runner dependencies (ggml-org#23259)
  server-context: guarantee there is at least 1 token to decode (ggml-org#23280)
  server : print graphs reused in slot timings (ggml-org#23279)
  save-load-state : refactor tests and improve readability (ggml-org#23196)
  llama-eval : add per-task summary stats (ggml-org#23151)
  ...
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request Jun 13, 2026
- ggml-cpu/ops.cpp: drop duplicate quant case labels created by the merge
  (RotorQuant PLANAR/ISO/Q1_0/TQ3_0 cases were stacked twice in switch groups)
- server-context.cpp: re-merge with codex spec-decode base + integration audio
  output; fix pre-existing ctx_tgt/ctx_dft/spec references that have no matching
  member declarations (inherited from master's partial upstream PR ggml-org#23279/ggml-org#23461
  merge) -> use ctx and the per-slot common_speculative_free path
- tests/CMakeLists.txt: drop duplicate test-llama-kv-cells registration

CPU build clean (llama-quantize/server/cli, mtmd, liquid-audio, tests).
test-llama-kv-cells passes. test-quantize-fns segfaults at tq3_0 — PRE-EXISTING
on integration (identical crash+error), tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>