feat(metrics): expose raw KV cache pool token counts as prometheus gauges #22726
ishandhanani merged 1 commit into main
/rerun-stage stage-a-test-1-gpu-small
/rerun-stage stage-b-test-1-gpu-small
✅ Triggered
✅ Triggered
2eee7f6 to 3edcc0f
/rerun-stage stage-a-test-1-gpu-small
/rerun-stage stage-b-test-1-gpu-small
✅ Triggered
✅ Triggered
The correct CI passed
Closes #8151 (now unblocked: sgl-project/sglang#22726 landed in v0.5.11, which is the new floor after this PR).

Adds a "KV Pool Detail" row to deploy/observability/grafana_dashboards/sglang.json with two new panels driven by the gauges added in 0.5.11:

* `KV Pool Breakdown (tokens)`: stacked timeseries of `sglang:kv_used_tokens` (locked by running requests), `sglang:kv_evictable_tokens` (radix-cached, reclaimable), and `sglang:kv_available_tokens` (free). The three series sum to <= `sglang:max_total_num_tokens` per the invariant documented in SGLang's metrics_collector.py.
* `KV Pool Physical Usage %`: `(1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100`. Captures true pool occupancy including evictable slots, vs. `sglang:token_usage` which excludes them. 90% threshold drawn in red for the "no headroom even after evict" case.

The existing `GPU KV Cache Usage %` panel (driven by `sglang:token_usage`) is unchanged. It's still useful as the "bottleneck across full / SWA / mamba pools" view that the new gauges don't replicate.

Verified live on a Qwen/Qwen3-0.6B agg worker: all three gauges export at `<system_port>/metrics`, and `kv_available + kv_evictable + kv_used` = `max_total_num_tokens` after a real request.
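The live verification described above could be repeated offline against a saved scrape. The following is a minimal sketch, not the check that was actually run: it assumes the Prometheus text exposition format with one sample per metric and no trailing timestamps, and the `model_name` label, the helper names, and all numbers are hypothetical.

```python
def parse_gauges(exposition_text: str) -> dict[str, float]:
    """Map metric name -> value, ignoring labels, comments and blank lines.

    Simplification (assumption): one sample per metric, no trailing timestamps.
    """
    values: dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The sample value is the last whitespace-separated token on the line.
        name_and_labels, raw_value = line.rsplit(None, 1)
        # Drop the optional {label="..."} part to get the bare metric name.
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


def check_pool_invariant(values: dict[str, float]) -> tuple[float, float]:
    """Return (sum of the three pool gauges, max_total_num_tokens)."""
    pool_sum = (
        values["sglang:kv_available_tokens"]
        + values["sglang:kv_evictable_tokens"]
        + values["sglang:kv_used_tokens"]
    )
    return pool_sum, values["sglang:max_total_num_tokens"]


# Hypothetical scrape; the label and the numbers are made up for illustration.
SAMPLE = """\
# TYPE sglang:kv_available_tokens gauge
sglang:kv_available_tokens{model_name="Qwen/Qwen3-0.6B"} 28000.0
sglang:kv_evictable_tokens{model_name="Qwen/Qwen3-0.6B"} 70000.0
sglang:kv_used_tokens{model_name="Qwen/Qwen3-0.6B"} 2000.0
sglang:max_total_num_tokens{model_name="Qwen/Qwen3-0.6B"} 100000.0
"""

pool_sum, pool_max = check_pool_invariant(parse_gauges(SAMPLE))
# The three pool gauges should never exceed the pool capacity.
assert pool_sum <= pool_max
```

In practice the text would come from the worker's `<system_port>/metrics` endpoint rather than from an inline string.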
Summary
Expose three Prometheus gauges for the raw KV cache pool token counts:
* `sglang:kv_available_tokens` -- free pool slots
* `sglang:kv_evictable_tokens` -- radix-cached, reclaimable slots
* `sglang:kv_used_tokens` -- actively pinned slots

Motivation
The existing `sglang:token_usage` metric reports only non-evictable tokens (active requests + pinned sessions). Evictable radix cache nodes are excluded, making the pool appear emptier than it is. This matters for agentic workloads where subagent KV lingers in the radix tree after completion -- `token_usage` shows ~2% while physical GPU memory is 72% consumed.

Exposing the raw counts at the most natural granularity lets operators derive any ratio they need in PromQL/Grafana, e.g.:
* `1 - (kv_available_tokens / (kv_available_tokens + kv_evictable_tokens + kv_used_tokens))`
* `kv_evictable_tokens / (kv_available_tokens + kv_evictable_tokens + kv_used_tokens)`

Changes
* `metrics_collector.py`: Add `kv_available_tokens`, `kv_evictable_tokens`, `kv_used_tokens` fields to `SchedulerStats`, add Gauges to `SchedulerMetricsCollector`, log in `log_stats()`
* `scheduler_runtime_checker_mixin.py`: Plumb raw counts from `PoolStats.update_scheduler_stats()`
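The two ratios from the Motivation section can also be worked through in plain Python. This is only an illustrative sketch: `pool_ratios` is a hypothetical helper, not part of SGLang, and the token counts are invented so that they land on the roughly 2% vs. 72% gap described above. `used_share` is only a rough analog of `sglang:token_usage`, which is computed by SGLang itself.

```python
def pool_ratios(available: int, evictable: int, used: int) -> dict[str, float]:
    """Derive occupancy ratios from the three raw KV pool token counts."""
    total = available + evictable + used
    if total == 0:
        # Empty or uninitialized pool: avoid dividing by zero.
        return {"physical_usage": 0.0, "evictable_share": 0.0, "used_share": 0.0}
    return {
        # 1 - available / (available + evictable + used)
        "physical_usage": 1 - available / total,
        # evictable / (available + evictable + used)
        "evictable_share": evictable / total,
        # Share held by running requests only (evictable slots excluded).
        "used_share": used / total,
    }


# Invented counts: 2% pinned, 70% radix-cached, 28% free.
ratios = pool_ratios(available=28_000, evictable=70_000, used=2_000)
for name, value in ratios.items():
    print(f"{name}: {value:.0%}")
```

With these counts the pool looks almost idle if only pinned tokens are counted, yet most of it is physically occupied by evictable radix-cache entries.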