feat(sglang): bump to 0.5.11 and prune 0.5.9 fallbacks in _compat #9230

Open

ishandhanani wants to merge 21 commits into main from idhanani/sgl-to-0.5.11-and-cleanups
Conversation


@ishandhanani ishandhanani commented May 6, 2026

Summary

Bumps the SGLang dependency from 0.5.10.post1 to 0.5.11. All testable launch scripts in examples/backends/sglang/launch/ pass on 0.5.11 with no handler/init code changes — the entire bump is the version pin plus pruning 0.5.9 fallbacks from components/src/dynamo/sglang/_compat.py per the N/N-1 support policy in components/src/dynamo/sglang/CLAUDE.md.

Also closes #8151: sgl-project/sglang#22726 (raw KV pool token gauges) landed in 0.5.11, so the dashboard work it was blocked on is now actionable. Bundled into this PR rather than as a follow-up since both pieces gate on the same version floor.

Compat shim deprecation

New support window after this PR: N = 0.5.11, N-1 = 0.5.10. 0.5.9-targeted branches removed:

| Symbol / branch | Action | Rationale |
|---|---|---|
| `NetworkAddress` polyfill class (~70 LoC) | Removed | `sglang.srt.utils.network.NetworkAddress` is canonical from 0.5.10 onward. |
| `try/except ImportError` around the network import | Removed + inlined at call sites | After the 0.5.9 fallback was pruned, the re-exports were trivial pass-throughs. `NetworkAddress` / `get_local_ip_auto` / `get_zmq_socket` now import directly from `sglang.srt.utils.network` in `register.py`, `publisher.py`, and `request_handlers/handler_base.py`. |
| `mm_encode()` wrapper | Removed | 0.5.9 single-arg `_encode(mm_items)` is out of window. Both 0.5.10 and 0.5.11 accept `(mm_items, modality)` and return the same `(grid_dim, embedding, aux_data)` 3-tuple, so the two call sites in `encode_worker_handler.py` now invoke `await self.encoder._encode(...)` directly. |
| `enable_disjoint_streaming_output` `stream_output` fallback | Removed | The field has been `incremental_streaming_output` since at-or-before 0.5.10 (`ServerArgs.__dataclass_fields__` in 0.5.11 has `incremental_streaming_output` and not `stream_output`). The wrapper itself stays — diffusion `SimpleNamespace` stubs need the attribute-absent no-op path. |
| `import ipaddress, socket` (only used by polyfill) | Removed | Dead. |
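For concreteness, the shape of the new encoder call site (a sketch, not the literal handler code; the `Modality` import path below is an assumption, not taken from this PR):

```python
from sglang.srt.managers.schedule_batch import Modality  # assumed import path


async def encode_image(encoder, mm_items):
    # 0.5.10 and 0.5.11 share this two-arg signature and 3-tuple return;
    # the removed mm_encode wrapper existed only to bridge 0.5.9's
    # single-arg _encode(mm_items).
    grid_dim, embedding, aux_data = await encoder._encode(mm_items, Modality.IMAGE)
    return grid_dim, embedding, aux_data
```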

Kept (still doing real version-bridging): ensure_sglang_top_level_exports, filter_supported_async_generate_kwargs, get_scheduler_info (multi-attribute fork fallbacks), enable_disjoint_streaming_output (no-op for diffusion SimpleNamespace stubs).

components/src/dynamo/sglang/CLAUDE.md example version comment bumped to reflect 0.5.10 as the new N-1.

Grafana dashboard (#8151)

Adds a KV Pool Detail (sglang ≥ 0.5.11) row to deploy/observability/grafana_dashboards/sglang.json:

  • KV Pool Breakdown (tokens) — stacked timeseries of sglang:kv_used_tokens (locked), sglang:kv_evictable_tokens (radix-cached), sglang:kv_available_tokens (free). Series sum to ≤ max_total_num_tokens.
  • KV Pool Physical Usage % — (1 - sglang:kv_available_tokens / sglang:max_total_num_tokens) * 100. Captures true pool occupancy including used + evictable + any internally protected slots, vs. sglang:token_usage (which is the bottleneck across full/SWA/mamba pools and excludes evictable/protected slots). 90% threshold drawn in red. Normalizing against max_total_num_tokens rather than the sum of the three gauges avoids inflating usage when protected slots are present, per @dynamo-ops review.

Existing GPU KV Cache Usage % panel (driven by sglang:token_usage) is unchanged — still useful as the bottleneck-pool view.
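The panel math, spelled out against a live scrape (a sketch of the check described in the test plan below; port 8081 assumes the default system port, overridable via DYN_SYSTEM_PORT, and the parser is naive line-matching rather than a real Prometheus client):

```python
import urllib.request


def gauge(lines, name):
    # First sample line whose metric name matches (labels, if any, follow the name).
    for line in lines:
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError(name)


lines = (
    urllib.request.urlopen("http://localhost:8081/metrics").read().decode().splitlines()
)
used = gauge(lines, "sglang:kv_used_tokens")
evictable = gauge(lines, "sglang:kv_evictable_tokens")
available = gauge(lines, "sglang:kv_available_tokens")
max_total = gauge(lines, "sglang:max_total_num_tokens")

# SGLang invariant: used + evictable + available <= max_total_num_tokens.
assert used + evictable + available <= max_total

# Panel expression: normalize against max_total_num_tokens (not the 3-gauge
# sum) so internally protected slots don't inflate the percentage.
print(f"KV pool physical usage: {(1 - available / max_total) * 100:.1f}%")
```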

Container runtime image bump

Bumps lmsysorg/sglang:v0.5.10.post1-runtime → v0.5.11-runtime (and the -cu130-runtime variant) in container/context.yaml and container/compliance/README.md. Verified the tags resolve on Docker Hub (digest sha256:6f81caf1...).

These tags were published by sgl-project/sglang's Release Docker Runtime Images run 25470428234 (workflow_dispatch from main with version=0.5.11). The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source — sidestepping the broken-on-aarch64 silent-cubin-skip bug that killed the prior tag-triggered runs.

Also drop the libjsoncpp25 apt workaround from container/templates/sglang_runtime.Dockerfile: v0.5.11's runtime image now bundles libjsoncpp inside the mooncake wheel (mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5), so from mooncake.engine import TransferEngine succeeds without the system-level package. Verified via docker run lmsysorg/sglang:v0.5.11-runtime ....

A future-proofing cherry-pick of sgl-project/sglang#24234 onto release/v0.5.11 is open as sgl-project/sglang#24567 (draft) — not strictly required for this PR since the workflow dispatch from main works, but it lets any future tag-triggered or v0.5.11.post1 build include the fix.

Launch script results (2x L40S)

| # | Script | Status |
|---|---|---|
| 1 | agg.sh | PASS |
| 2 | agg_embed.sh | PASS |
| 3 | agg_router.sh | PASS |
| 4 | agg_vision.sh | PASS (with SGLANG_DISABLE_CUDNN_CHECK=1, unchanged from 0.5.9) |
| 5 | disagg.sh | PASS |
| 6 | diffusion_llada.sh | PASS |
| 7 | image_diffusion.sh | PASS (FLUX.1-dev, 207 KB PNG) |
| 8 | text-to-video-diffusion.sh | PASS (Wan2.1-1.3B, 162 KB MP4) |
| 9 | multimodal_epd.sh | PASS (verified pre- and post-compat prune) |
| 10 | disagg_router.sh | SKIP — needs ≥4 GPUs |
| 11 | multimodal_disagg.sh | SKIP — needs ≥3 GPUs |
| 12 | disagg_same_gpu.sh | SKIP — optional per the bump skill |

Test plan

  • All 9 launch scripts in examples/backends/sglang/launch/ validated end-to-end with health probes (chat / embeddings / images / videos / multimodal vision)
  • multimodal_epd.sh re-validated after _compat.py prune to confirm inlined _encode calls work
  • pytest -k compat components/src/dynamo/sglang/tests/test_sglang_unit.py — 6/6 pass
  • ruff check + ruff format --check clean on changed files
  • uvx pre-commit run --all-files — green (incl. the Report pytest markers hook, after adding sglang.srt.utils.network to its mock list)
  • New gauges (sglang:kv_used_tokens, sglang:kv_evictable_tokens, sglang:kv_available_tokens, sglang:max_total_num_tokens) verified scraping live from <system_port>/metrics on a Qwen/Qwen3-0.6B agg worker; invariant kv_used + kv_evictable + kv_available <= max_total_num_tokens confirmed after a real request.
  • Dashboard JSON validated (json.load clean, no duplicate panel IDs)
  • lmsysorg/sglang:v0.5.11-runtime and v0.5.11-cu130-runtime resolve on Docker Hub; mooncake import probed clean without the apt workaround.
  • CI green
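
The dashboard-JSON bullet amounts to roughly this (a sketch; the traversal assumes Grafana's usual rows-nest-panels layout):

```python
import json

with open("deploy/observability/grafana_dashboards/sglang.json") as f:
    dashboard = json.load(f)  # raises if the JSON is malformed


def panel_ids(panels):
    # Collapsed Grafana rows nest their children under a "panels" key.
    for panel in panels:
        if "id" in panel:
            yield panel["id"]
        yield from panel_ids(panel.get("panels", []))


ids = list(panel_ids(dashboard.get("panels", [])))
assert len(ids) == len(set(ids)), "duplicate panel IDs"
```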

All 9 testable launch scripts in examples/backends/sglang/launch/ pass on
0.5.11 with no handler/init code changes. (disagg_router needs 4 GPUs and
multimodal_disagg needs 3 GPUs; both skipped on the test box's 2x L40S.)

The new support window is N=0.5.11, N-1=0.5.10, so the 0.5.9-targeted
branches in _compat.py come out as part of this bump per the
N/N-1 policy in components/src/dynamo/sglang/CLAUDE.md.

Removed:
* `NetworkAddress` polyfill class + `try/except` around its import
  (`sglang.srt.utils.network` is canonical from 0.5.10 onward).
* `mm_encode()` wrapper. Both 0.5.10 and 0.5.11 take
  `_encode(mm_items, modality)` and return the same 3-tuple, so the
  call sites in `encode_worker_handler.py` now invoke
  `await self.encoder._encode(...)` directly.
* `enable_disjoint_streaming_output` `stream_output` fallback. The
  field has been `incremental_streaming_output` since at-or-before
  0.5.10 (verified: `ServerArgs.__dataclass_fields__` in 0.5.11 has
  `incremental_streaming_output` and not `stream_output`). The
  wrapper itself stays — diffusion `SimpleNamespace` stubs need the
  attribute-absent no-op path.
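
For reference, the kept wrapper's shape (a paraphrase of the behavior described above, not the literal _compat.py code; that it sets the field True is inferred from the function's name):

```python
from types import SimpleNamespace


def enable_disjoint_streaming_output(server_args) -> None:
    # Real ServerArgs on 0.5.10/0.5.11 has the field; diffusion
    # SimpleNamespace stubs don't and must stay a silent no-op.
    if hasattr(server_args, "incremental_streaming_output"):
        server_args.incremental_streaming_output = True


enable_disjoint_streaming_output(SimpleNamespace())  # no-op on a diffusion stub
```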

Kept (not version-bound):
* `ensure_sglang_top_level_exports()`, `filter_supported_async_generate_kwargs`,
  `get_scheduler_info` (the latter still probes fork/experimental
  attribute paths).

CLAUDE.md example version bumped to reflect 0.5.10 as the new N-1.
@github-actions github-actions Bot added documentation Improvements or additions to documentation backend::sglang Relates to the sglang backend multimodal labels May 6, 2026
@ishandhanani ishandhanani marked this pull request as ready for review May 6, 2026 22:06
@ishandhanani ishandhanani requested review from a team as code owners May 6, 2026 22:06
Closes #8151 (now unblocked: sgl-project/sglang#22726 landed in v0.5.11,
which is the new floor after this PR).

Adds a "KV Pool Detail" row to deploy/observability/grafana_dashboards/sglang.json
with two new panels driven by the gauges added in 0.5.11:

* `KV Pool Breakdown (tokens)` — stacked timeseries of
  `sglang:kv_used_tokens` (locked by running requests),
  `sglang:kv_evictable_tokens` (radix-cached, reclaimable), and
  `sglang:kv_available_tokens` (free). The three series sum to
  <= `sglang:max_total_num_tokens` per the invariant documented in
  SGLang's metrics_collector.py.

* `KV Pool Physical Usage %` — `(1 - kv_available / (kv_available +
  kv_evictable + kv_used)) * 100`. Captures true pool occupancy
  including evictable slots, vs. `sglang:token_usage` which excludes
  them. 90% threshold drawn in red for the "no headroom even after
  evict" case.

The existing `GPU KV Cache Usage %` panel (driven by
`sglang:token_usage`) is unchanged — it's still useful as the
"bottleneck across full / SWA / mamba pools" view that the new
gauges don't replicate.

Verified live on a Qwen/Qwen3-0.6B agg worker: all three gauges
export at `<system_port>/metrics`, and `kv_available + kv_evictable
+ kv_used` = `max_total_num_tokens` after a real request.
@ishandhanani ishandhanani requested a review from a team as a code owner May 6, 2026 22:10
@ishandhanani ishandhanani changed the title sglang: bump to 0.5.11 and prune 0.5.9 fallbacks in _compat feat(sglang): bump to 0.5.11 and prune 0.5.9 fallbacks in _compat May 6, 2026

coderabbitai Bot commented May 6, 2026

Walkthrough

The PR modernizes SGLang compatibility by removing legacy import fallbacks, simplifying streaming configuration logic, and replacing the mm_encode compatibility wrapper with direct MMEncoder._encode() calls. The sglang dependency is updated to version 0.5.11 to support these changes.

Changes

SGLang Compatibility Modernization

| Layer / File(s) | Summary |
|---|---|
| Dependency update — pyproject.toml | Bumps sglang[diffusion] from 0.5.10.post1 to 0.5.11. |
| Compatibility module refactoring — components/src/dynamo/sglang/_compat.py | Removes the try/except fallback for the NetworkAddress import (now required), simplifies enable_disjoint_streaming_output to only check attribute presence, and removes mm_encode from __all__ exports. |
| Handler integration — components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py | Switches from the mm_encode(...) wrapper to direct self.encoder._encode(..., Modality.IMAGE) calls in both cached and uncached vision encoding paths; removes the legacy import. |
| Documentation — components/src/dynamo/sglang/CLAUDE.md | Updates the _compat.py description to reflect its new role as an async-compatibility shim rather than a network-imports polyfill. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately describes the main changes: bumping SGLang to 0.5.11 and removing 0.5.9 compatibility fallbacks from _compat, which aligns with the changeset across pyproject.toml and _compat.py. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the version bump rationale, the compat shim deprecation table, test results, and validation across launch scripts. |


@github-actions github-actions Bot added the feat label May 6, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


Inline comments:
In `@components/src/dynamo/sglang/_compat.py`:
- Around line 23-27: The module currently hard-imports NetworkAddress,
get_local_ip_auto, and get_zmq_socket and unconditionally calls
ensure_sglang_top_level_exports(), which causes ModuleNotFoundError at import
time when sglang isn't installed; wrap the top-level import of
NetworkAddress/get_local_ip_auto/get_zmq_socket in a try/except ImportError and
provide safe fallbacks (e.g., set those names to None or simple proxy callables)
so consumers will get a clear runtime error when used, and also guard the call
to ensure_sglang_top_level_exports() in the same try/except (or remove the
top-level call and call it lazily inside a guarded accessor) so sglang import
failures are deferred to runtime. Ensure you reference the exact symbols
NetworkAddress, get_local_ip_auto, get_zmq_socket and the function
ensure_sglang_top_level_exports() when updating the file.

Comment thread components/src/dynamo/sglang/_compat.py Outdated
`tests/report_pytest_markers.py` mocks sglang submodules so test collection
succeeds in the pre-commit isolated venv (which doesn't install sglang).
Now that `dynamo.sglang._compat` imports from `sglang.srt.utils.network`
unconditionally (the 0.5.9 fallback was pruned in the same PR), the
mock list needs that submodule too — otherwise marker reporting fails
to collect any sglang test file.
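
A hypothetical sketch of that stubbing (the actual mechanism in tests/report_pytest_markers.py may differ):

```python
import sys
from unittest import mock

# Stub sglang and its submodules so importing dynamo.sglang._compat during
# test collection doesn't raise ModuleNotFoundError in the isolated venv.
for name in (
    "sglang",
    "sglang.srt",
    "sglang.srt.utils",
    "sglang.srt.utils.network",  # newly required after the 0.5.9 fallback prune
):
    sys.modules.setdefault(name, mock.MagicMock())
```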
Comment thread deploy/observability/grafana_dashboards/sglang.json Outdated
Now that the 0.5.9 fallback for `NetworkAddress` is gone, the
re-exports in `_compat` are pure pass-throughs of
`sglang.srt.utils.network.{NetworkAddress,get_local_ip_auto,
get_zmq_socket}`. Per the policy in `components/src/dynamo/sglang/CLAUDE.md`,
trivial re-exports belong at call sites, not in `_compat` — the shim
is reserved for symbols that have actually broken across versions.

Move the three imports to:
- `register.py`             (NetworkAddress, get_local_ip_auto)
- `publisher.py`            (NetworkAddress, get_local_ip_auto, get_zmq_socket)
- `request_handlers/handler_base.py` (NetworkAddress, get_local_ip_auto)

`_compat.py` keeps `ensure_sglang_top_level_exports`,
`filter_supported_async_generate_kwargs`, `get_scheduler_info`, and
`enable_disjoint_streaming_output` — all of which are still doing
real version-bridging work (signature probing, multi-attribute
fallbacks, SimpleNamespace stub handling).
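
Post-prune, each consumer carries the one-line import itself, e.g.:

```python
# publisher.py (register.py and handler_base.py import only the first two):
from sglang.srt.utils.network import (
    NetworkAddress,
    get_local_ip_auto,
    get_zmq_socket,
)
```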

github-actions Bot commented May 6, 2026

Per @dynamo-ops review on #9230: dividing by
(kv_available + kv_evictable + kv_used) inflates the percentage when
internally protected slots are present, since the SGLang invariant is
`available + evictable + used <= max_total_num_tokens` (note <=, not ==).
Using `sglang:max_total_num_tokens` as the divisor gives true pool
occupancy regardless of how slots are accounted internally.

Expression:
  before: (1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100
  after:  (1 - kv_available / max_total_num_tokens) * 100

Also rewrote the panel description so it reflects the new framing
("not free, normalized against max_total_num_tokens") rather than the
incorrect "physically allocated (used + evictable)" wording, and called
out what `sglang:token_usage` actually measures (bottleneck across
full/SWA/mamba pools) for clearer reviewer context.
Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` →
`lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant)
across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub:
  digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313

These were published by sgl-project/sglang's `Release Docker Runtime
Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234
(workflow_dispatch from main with version=0.5.11). The workflow uses
the dispatched ref's Dockerfile, so main's already-fixed Dockerfile
(sgl-project/sglang#24234) builds against v0.5.11 source — no need
to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from
container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image
bundles libjsoncpp inside the mooncake wheel itself
(/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/
libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import
TransferEngine` now succeeds without the system-level package. Verified
via `docker run --rm --runtime=runc --entrypoint bash
lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine
import TransferEngine"'`.
…nt hang

SGLang 0.5.11 silently hangs requests when prompt+max_tokens nears
max_total_tokens (no scheduler activity, no error). Bisected threshold
is ~1040 for the chat_payload_default() case (~16-token prompt +
max_tokens=1000); KV=1024 hangs, KV=1040 works (truncated to 962
tokens), KV=1056+ works fully. Test was at KV=96, well below threshold.
@ishandhanani
Contributor Author

Fix for test_sglang_deployment[aggregated-2] / [aggregated_unified-2] hangs

Pushed d27b2717bcd — bumped requested_sglang_kv_tokens from 96 → 2048 for both configs.

What was happening

The previous CI failures showed both tests hanging with requests.exceptions.ReadTimeout (60s read timeout) on the very first chat completion. From the GH Actions logs and a local repro on this branch:

  • Frontend received the request, dynamo push_handler forwarded it to dynamo.backend.generate, then 60s of total silence — no SGLang scheduler/tokenizer log activity.
  • py-spy on the SGLang scheduler subprocess: blocked in recv_pyobj (zmq) — request never reached it.
  • py-spy on the tokenizer-manager (asyncio loop): idle in epoll_pwait — engine.async_generate(...) returned a generator that produced nothing.
  • The Subprocess scheduler_0 (pid=...) crashed with exit code -15 line in the logs is teardown noise (SIGTERM cascade from pytest cleanup), not the cause.

Why bumping pytest.mark.timeout didn't help

b7165adc329 raised pytest.mark.timeout 195 → 293s, but the killer is the per-request 60s read timeout in tests/utils/client.py, not the test-level pytest_timeout. So that bump was a no-op for this failure mode.

Root cause: SGLang 0.5.11 silently queues over-budget requests

Reproed without dynamo at all:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B \
  --max-total-tokens 1024 --page-size 16 --disable-piecewise-cuda-graph \
  --mem-fraction-static 0.9
# any chat completion with max_tokens=1000 → silent 60s+ hang, no error
```

Bisecting the threshold (chat payload with ~16-token prompt + max_tokens=1000):

| --max-total-tokens | Result |
|---|---|
| 1024 | hang |
| 1038 | hang |
| 1040 | OK, response truncated to 962 tokens |
| 1056 | OK, full 1000 tokens |
| 2048 | OK, full 1000 tokens |

Likely culprit is PrefillAdder.add_one_req in sglang/srt/managers/schedule_policy.py:740-754:

```python
cur_rem_tokens = self.cur_rem_tokens - self.ceil_paged_tokens(req.extend_input_len)
for i, (tokens_left, tokens_occupied) in enumerate(self.req_states):
    bs = len(self.req_states) - i
    min_free_tokens = cur_rem_tokens + tokens_freed - tokens_left * bs
    if min_free_tokens <= IGNORE_EOS_RESERVE_TOKENS * bs:
        return AddReqResult.NO_TOKEN
```

with tokens_left = max_new_tokens * new_token_ratio (default 0.7). When the request's projected occupancy approaches max_total_tokens, min_free_tokens goes ≤ IGNORE_EOS_RESERVE_TOKENS and the request goes back to waiting_queue — scheduler retries forever instead of clamping or rejecting. Worth filing upstream; SGLang should clamp max_new_tokens (already does at KV=1040+ — the 962-truncation case) or surface an error rather than spin.

Why 2048 specifically

  • chat_payload_default sends max_tokens=1000; with chat-template padding the prompt is ~16 tokens. Need max_total_tokens > 1000 + 16 + scheduler reserve.
  • Local probe shows ~1040 is the floor; 1056 lets generation hit the full 1000 tokens; 2048 leaves comfortable headroom for the secondary payloads (completion_payload_default, responses_payload_default, metric_payload_default(min_num_requests=6) — the latter sends 6 concurrent requests, so the budget needs to fit several small completions in flight).
  • The disaggregated_same_gpu config is unaffected (already at requested_sglang_kv_tokens(37472)).

Local verification

```console
$ pytest tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] \
         tests/serve/test_sglang.py::test_sglang_deployment[aggregated_unified-2]
========================= 2 passed in 81.01s (0:01:21) =========================
# previously: 2 failed in 1184s after both timed out at 60s each
```

Follow-ups (not in this PR)

  • pytest.mark.timeout(293) from b7165adc329 is now unnecessary (each test runs ~40s); could be reverted to ~195 in a later cleanup, but harmless to leave.
  • Upstream issue against SGLang for the silent-hang behavior — want to file once we have a minimal repro script written up.
  • The requested_sglang_kv_tokens profiler logic (tests/utils/profile_pytest.py) found min=48 historically. On 0.5.11 the profiler will now find a much higher min (or worse, hang during profiling) — worth re-running the auto-profile next time we re-derive these markers.

Comment thread container/context.yaml
```diff
 runtime_image: lmsysorg/sglang
 base_image_tag: 25.06-cuda12.9-devel-ubuntu24.04
-runtime_image_tag: v0.5.10.post1-runtime
+runtime_image_tag: v0.5.11-runtime
```

The CUDA 12.9 SGLang runtime now points at v0.5.11-runtime, which aliases to the CUDA 13.0 image, so the cuda12.9 build uses the wrong upstream CUDA stack. Fix: use v0.5.11-cu129-runtime here and update the compliance README entry to match.

The real fix for the aggregated test hangs was the KV budget bump
(d27b271), not the pytest timeout. Revert to the original profiled
value.
Comment thread container/context.yaml

The CUDA 12.9 SGLang build still uses v0.5.11-runtime, whose image config reports CUDA_VERSION=13.0.1, so this target is built against the wrong upstream CUDA runtime. Fix: use v0.5.11-cu129-runtime here and update the compliance README row to the same tag.

The dynamo, sglang, and kvbm dashboards reference prometheus uid
P1809F7CD0C75ACF3, which doesn't match the provisioned uid `prometheus`,
causing "datasource not found" panels. Add a second provisioned entry
pointing at the same prometheus URL with the legacy uid so both groups
of dashboards resolve cleanly.
…ation and per-pool KV breakdown

The bundled dashboard previously mixed Dynamo-frontend and SGLang-engine
data without source attribution, omitted the per-pool-type gauges added
in sglang 0.5.11 (full_token_usage / swa_token_usage / mamba_usage), and
referenced a few stale upstream metric names (sglang:hicache_eviction_*,
sglang:hicache_load_back_*, sglang:num_retracted_reqs_total) that no
longer exist in 0.5.11.

Reorganize into 9 explicitly-labeled rows:

  1. Overview — stat-row summary (success rate, totals, avg TTFT/ITL/E2E)
  2. Dynamo Frontend (:8000) — RPS, E2E + TTFT + ITL p50/p90/p99,
     request-outcome breakdown, inflight vs queued, ISL/OSL/cached tokens
  3. Dynamo KV Router — kv_hit_rate distribution, multi-stage routing
     overhead (block_hashing / indexer_find_matches / seq_hashing /
     scheduling / total), per-worker inflight, total KV blocks
  4. SGLang Engine (:8081) — running/queued/paused, gen throughput,
     engine-side TTFT/ITL p50/p90/p99, retractions + new_token_ratio
  5. SGLang Engine — KV Pool Breakdown — full / swa / mamba pool usage,
     absolute used/evictable/available, cache_hit_rate, pending_prealloc
     (PD), streaming_session_held_tokens, lora_pool_*, capacity reference

Collapsed by default (no spam in agg/non-feature runs):
  6. P/D Disagg Queues — num_prefill_*_queue_reqs / num_decode_*_queue_reqs,
     grammar queue, spec accept rate/length
  7. HiCache — host_used vs total, eviction & load-back rate by cache_type,
     latency p99 by cache_type, prefetch / backup
  8. Per-Worker Detail — req rate / cache hit / KV util / E2E p99 fanned by worker_id
  9. GPU & Runtime Health — DCGM, dynamo_component_gpu_cache_usage_percent,
     tokio_worker_busy_ratio, request_plane_roundtrip_ttft

Every panel description names its source metric so the data provenance
(Dynamo vs SGLang vs DCGM) is explicit. Dashboard uid stays sglang-engine
so existing links keep working.
_test_agg.sh: aggregated launch with all 0.5.11 observability features
turned on (--enable-hierarchical-cache, --enable-streaming-session,
--enable-metrics-for-all-schedulers, --enable-mfu-metrics) plus a
larger mem-fraction-static / max-running-requests / chunked-prefill /
page-size so a load test against it populates every non-disagg row of
the Dynamo + SGLang dashboard.

Dashboard: rows + descriptions referenced "port 8000" / "port 8081"
explicitly, but those are overridable via DYN_HTTP_PORT and
DYN_SYSTEM_PORT. Drop the hardcoded numbers; describe the source layer
instead. Per-pool gauges (full/swa/mamba) only populate on hybrid
attention models — note that in the script header so users know to
swap models if they want SWA/Mamba panels to move.
…o-0.5.11-and-cleanups

Pulls in observability work that pairs with the 0.5.11 bump:
- legacy prometheus uid alias so bundled dashboards stop showing "datasource not found"
- SGLang dashboard rebuild with Dynamo<->SGLang separation, per-pool KV breakdown
  (full / SWA / Mamba), and collapsed disagg/hicache/per-worker rows
- _test_agg.sh: launch with all 0.5.11 observability features on so a load
  test against it lights up every applicable dashboard panel

# Conflicts:
#	container/context.yaml
… under Router

- Inflight vs Queued: explain the difference (inflight = dispatched, awaiting
  completion; queued = received but not yet dispatched). The combined shape
  is the steady-state diagnostic — flat inflight + growing queue means
  workers are saturated.
- Pool Utilization panel: drop the redundant "overall (legacy)" series. On
  non-hybrid models it duplicates full_token_usage and adds noise; on
  hybrid models its semantics overlap with the dominant pool. Spell out
  what each remaining series (full / SWA / Mamba) means in the description.
- Move Per-Worker Detail row to sit directly under the Router row, since
  per-worker fanout is conceptually a drill-down of routing distribution.
Speculative decoding accept rate / accept length are throughput-level
signals (they directly explain tokens-per-step), not disagg-queue signals.
Move them out of the disagg row and into "SGLang Engine — throughput &
batching" as two side-by-side panels with explanations of when they
populate. Disagg row keeps just the grammar queue.
Restructure _test_agg.sh from single-worker to two-worker behind a
KV-aware router, matching agg_router.sh's topology:

  - frontend with --router-mode kv (KV events on by default, --approx
    falls back to approximate routing without KV events)
  - 2 SGLang workers, one per GPU (CUDA_VISIBLE_DEVICES=0,1)
  - per-worker DYN_SYSTEM_PORT (8081, 8082) and per-worker ZMQ KV-events
    endpoints (5557, 5558)
  - all 0.5.11 observability features still on for both workers

Update prometheus scrape config to pick up both worker metric ports;
without 8082 in the dynamo-backend job target list, Grafana would only
show worker 1 in the Per-Worker rows.