feat(metrics): expose raw KV cache pool token counts as prometheus gauges #22726

Merged
ishandhanani merged 1 commit into main from ishan/morestat
Apr 14, 2026

@ishandhanani (Collaborator) commented Apr 13, 2026

Summary

Expose three Prometheus gauges for the raw KV cache pool token counts:

  • sglang:kv_available_tokens -- free pool slots
  • sglang:kv_evictable_tokens -- radix-cached, reclaimable slots
  • sglang:kv_used_tokens -- actively pinned slots

Motivation

The existing sglang:token_usage metric reports only non-evictable tokens (active requests + pinned sessions). Evictable radix cache nodes are excluded, making the pool appear emptier than it is. This matters for agentic workloads where subagent KV lingers in the radix tree after completion -- token_usage shows ~2% while physical GPU memory is 72% consumed.

Exposing the raw counts at the most natural granularity lets operators derive any ratio they need in PromQL/Grafana, e.g.:

  • Physical usage: 1 - (kv_available_tokens / (kv_available_tokens + kv_evictable_tokens + kv_used_tokens))
  • Evictable fraction: kv_evictable_tokens / (kv_available_tokens + kv_evictable_tokens + kv_used_tokens)
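
The two ratios above can also be computed outside PromQL, for example in a test or an alerting script. The following is an illustrative sketch, not code from this PR: the function name `kv_pool_ratios` and the sample numbers are assumptions, with the numbers chosen so that used tokens are 2% of the pool and physical usage is 72%, mirroring the figures in the Motivation.

```python
def kv_pool_ratios(available: int, evictable: int, used: int) -> dict[str, float]:
    """Derive pool ratios from the three raw token counts."""
    total = available + evictable + used
    if total == 0:
        # Empty or uninitialized pool: report zeros instead of dividing by zero.
        return {"physical_usage": 0.0, "evictable_fraction": 0.0}
    return {
        # Everything that is not free counts as physically occupied.
        "physical_usage": 1 - available / total,
        # Share of the pool held by reclaimable radix-cache entries.
        "evictable_fraction": evictable / total,
    }


# Hypothetical counts: 28% free, 70% evictable, 2% used.
ratios = kv_pool_ratios(available=28_000, evictable=70_000, used=2_000)
print(ratios)
```

With these inputs the used share is only 2% of the pool, while physical usage works out to 72% because the evictable slots still occupy memory.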

Changes

  • metrics_collector.py: Add kv_available_tokens, kv_evictable_tokens, kv_used_tokens fields to SchedulerStats, add Gauges to SchedulerMetricsCollector, log in log_stats()
  • scheduler_runtime_checker_mixin.py: Plumb raw counts from PoolStats.update_scheduler_stats()
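
To make the shape of the change concrete, here is a minimal stdlib-only sketch of three stats fields being rendered as gauges in the Prometheus text exposition format. It is not the PR's implementation: `KvPoolStats`, `HELP_TEXT`, and `render_gauges` are hypothetical names, the help strings are paraphrased from the Summary, and the labels the real collector would attach are omitted.

```python
from dataclasses import dataclass


@dataclass
class KvPoolStats:
    """Hypothetical stand-in for the three new stats fields."""
    kv_available_tokens: int = 0
    kv_evictable_tokens: int = 0
    kv_used_tokens: int = 0


# Help strings paraphrased from the PR summary (assumed wording).
HELP_TEXT = {
    "kv_available_tokens": "Free KV cache pool slots",
    "kv_evictable_tokens": "Radix-cached, reclaimable KV cache pool slots",
    "kv_used_tokens": "Actively pinned KV cache pool slots",
}


def render_gauges(stats: KvPoolStats) -> str:
    """Render the counts as unlabeled gauges in Prometheus text format."""
    lines = []
    for field_name, help_text in HELP_TEXT.items():
        metric = f"sglang:{field_name}"
        lines.append(f"# HELP {metric} {help_text}")
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {getattr(stats, field_name)}")
    return "\n".join(lines) + "\n"


# Hypothetical counts for illustration.
output = render_gauges(KvPoolStats(28_000, 70_000, 2_000))
print(output)
```

A real deployment would rely on a Prometheus client library rather than hand-rendered text; the sketch only shows what a scrape of the three gauges would roughly look like.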

@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ishandhanani ishandhanani changed the title from "total kv stat" to "feat(metrics): add kv_physical_usage gauge for physical KV cache occupancy" Apr 13, 2026
@ishandhanani ishandhanani marked this pull request as ready for review April 13, 2026 21:34

@ishandhanani (Collaborator, Author)

/rerun-stage stage-a-test-1-gpu-small

@ishandhanani (Collaborator, Author)

/rerun-stage stage-b-test-1-gpu-small

@github-actions (Contributor)

✅ Triggered stage-b-test-1-gpu-small to run independently (skipping dependencies). View workflow run

@github-actions (Contributor)

✅ Triggered stage-a-test-1-gpu-small to run independently (skipping dependencies). View workflow run

@ishandhanani ishandhanani changed the title from "feat(metrics): add kv_physical_usage gauge for physical KV cache occupancy" to "feat(metrics): expose raw KV cache pool token counts as prometheus gauges" Apr 13, 2026

@ishandhanani (Collaborator, Author)

/rerun-stage stage-a-test-1-gpu-small

@ishandhanani (Collaborator, Author)

/rerun-stage stage-b-test-1-gpu-small

@github-actions (Contributor)

✅ Triggered stage-a-test-1-gpu-small to run independently (skipping dependencies). View workflow run

@github-actions (Contributor)

✅ Triggered stage-b-test-1-gpu-small to run independently (skipping dependencies). View workflow run

@ishandhanani (Collaborator, Author)

The relevant CI stages passed.

@ishandhanani ishandhanani merged commit cc449ac into main Apr 14, 2026
106 of 114 checks passed
@ishandhanani ishandhanani deleted the ishan/morestat branch April 14, 2026 01:30
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
ishandhanani added a commit to ai-dynamo/dynamo that referenced this pull request May 6, 2026
Closes #8151 (now unblocked: sgl-project/sglang#22726 landed in v0.5.11,
which is the new floor after this PR).

Adds a "KV Pool Detail" row to deploy/observability/grafana_dashboards/sglang.json
with two new panels driven by the gauges added in 0.5.11:

* `KV Pool Breakdown (tokens)` — stacked timeseries of
  `sglang:kv_used_tokens` (locked by running requests),
  `sglang:kv_evictable_tokens` (radix-cached, reclaimable), and
  `sglang:kv_available_tokens` (free). The three series sum to
  <= `sglang:max_total_num_tokens` per the invariant documented in
  SGLang's metrics_collector.py.

* `KV Pool Physical Usage %` — `(1 - kv_available / (kv_available +
  kv_evictable + kv_used)) * 100`. Captures true pool occupancy
  including evictable slots, vs. `sglang:token_usage` which excludes
  them. 90% threshold drawn in red for the "no headroom even after
  evict" case.

The existing `GPU KV Cache Usage %` panel (driven by
`sglang:token_usage`) is unchanged — it's still useful as the
"bottleneck across full / SWA / mamba pools" view that the new
gauges don't replicate.

Verified live on a Qwen/Qwen3-0.6B agg worker: all three gauges
export at `<system_port>/metrics`, and `kv_available + kv_evictable
+ kv_used` = `max_total_num_tokens` after a real request.
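
The live check described above can be approximated offline by parsing the text a `/metrics` endpoint returns and comparing the three gauges against `sglang:max_total_num_tokens`. This is a rough sketch under stated assumptions, not the procedure actually used: `parse_gauges`, `pool_sum_within_capacity`, the `model_name` label, and the sample values are all hypothetical, and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_gauges(metrics_text: str) -> dict[str, float]:
    """Collect metric values by name, summing across label sets."""
    values: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token (no timestamps assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


def pool_sum_within_capacity(values: dict[str, float]) -> bool:
    """Check available + evictable + used <= max_total_num_tokens."""
    pool_sum = sum(
        values[f"sglang:kv_{kind}_tokens"]
        for kind in ("available", "evictable", "used")
    )
    return pool_sum <= values["sglang:max_total_num_tokens"]


# Hypothetical scrape output; the label name and numbers are made up.
sample = """\
# TYPE sglang:kv_available_tokens gauge
sglang:kv_available_tokens{model_name="demo"} 28000
sglang:kv_evictable_tokens{model_name="demo"} 70000
sglang:kv_used_tokens{model_name="demo"} 2000
sglang:max_total_num_tokens{model_name="demo"} 100000
"""

gauges = parse_gauges(sample)
print(pool_sum_within_capacity(gauges))
```

Because values are summed across label sets, a scrape that covers several workers would need to be grouped per worker before applying the check.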