feat(sglang): bump to 0.5.11 and prune 0.5.9 fallbacks in _compat #9230
ishandhanani wants to merge 21 commits into main from …
Conversation
All 9 testable launch scripts in examples/backends/sglang/launch/ pass on 0.5.11 with no handler/init code changes. (disagg_router needs 4 GPUs and multimodal_disagg needs 3 GPUs; both skipped on the test box's 2x L40S.) The new support window is N = 0.5.11, N-1 = 0.5.10, so the 0.5.9-targeted branches in _compat.py come out as part of this bump, per the N/N-1 policy in components/src/dynamo/sglang/CLAUDE.md.

Removed:
* `NetworkAddress` polyfill class + `try/except` around its import (`sglang.srt.utils.network` is canonical from 0.5.10 onward).
* `mm_encode()` wrapper. Both 0.5.10 and 0.5.11 take `_encode(mm_items, modality)` and return the same 3-tuple, so the call sites in `encode_worker_handler.py` now invoke `await self.encoder._encode(...)` directly.
* `enable_disjoint_streaming_output`'s `stream_output` fallback. The field has been `incremental_streaming_output` since at or before 0.5.10 (verified: `ServerArgs.__dataclass_fields__` in 0.5.11 has `incremental_streaming_output` and not `stream_output`). The wrapper itself stays — diffusion `SimpleNamespace` stubs need the attribute-absent no-op path.

Kept (not version-bound):
* `ensure_sglang_top_level_exports()`, `filter_supported_async_generate_kwargs`, `get_scheduler_info` (the latter still probes fork/experimental attribute paths).

The CLAUDE.md example version was bumped to reflect 0.5.10 as the new N-1.
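For illustration, the retained wrapper's post-prune shape might look like this minimal sketch (inferred from the description above; the actual `_compat.py` implementation may differ):

```python
def enable_disjoint_streaming_output(server_args) -> None:
    # Illustrative only. Diffusion workers pass SimpleNamespace stubs that lack
    # the field, so the attribute-absent case stays a no-op; the removed part is
    # the 0.5.9 fallback branch that set the old `stream_output` field instead.
    if hasattr(server_args, "incremental_streaming_output"):
        server_args.incremental_streaming_output = True
```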
Closes #8151 (now unblocked: sgl-project/sglang#22726 landed in v0.5.11, which is the new floor after this PR). Adds a "KV Pool Detail" row to deploy/observability/grafana_dashboards/sglang.json with two new panels driven by the gauges added in 0.5.11:

* `KV Pool Breakdown (tokens)` — stacked timeseries of `sglang:kv_used_tokens` (locked by running requests), `sglang:kv_evictable_tokens` (radix-cached, reclaimable), and `sglang:kv_available_tokens` (free). The three series sum to ≤ `sglang:max_total_num_tokens`, per the invariant documented in SGLang's metrics_collector.py.
* `KV Pool Physical Usage %` — `(1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100`. Captures true pool occupancy including evictable slots, vs. `sglang:token_usage`, which excludes them. A 90% threshold is drawn in red for the "no headroom even after evict" case.

The existing `GPU KV Cache Usage %` panel (driven by `sglang:token_usage`) is unchanged — it's still useful as the "bottleneck across full / SWA / mamba pools" view that the new gauges don't replicate.

Verified live on a Qwen/Qwen3-0.6B agg worker: all three gauges export at `<system_port>/metrics`, and `kv_available + kv_evictable + kv_used` = `max_total_num_tokens` after a real request.
Walkthrough
The PR modernizes SGLang compatibility by removing legacy import fallbacks, simplifying streaming configuration logic, and replacing the …

Changes
SGLang Compatibility Modernization
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/sglang/_compat.py`:
- Around line 23-27: The module currently hard-imports NetworkAddress,
get_local_ip_auto, and get_zmq_socket and unconditionally calls
ensure_sglang_top_level_exports(), which causes ModuleNotFoundError at import
time when sglang isn't installed; wrap the top-level import of
NetworkAddress/get_local_ip_auto/get_zmq_socket in a try/except ImportError and
provide safe fallbacks (e.g., set those names to None or simple proxy callables)
so consumers will get a clear runtime error when used, and also guard the call
to ensure_sglang_top_level_exports() in the same try/except (or remove the
top-level call and call it lazily inside a guarded accessor) so sglang import
failures are deferred to runtime. Ensure you reference the exact symbols
NetworkAddress, get_local_ip_auto, get_zmq_socket and the function
ensure_sglang_top_level_exports() when updating the file.
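A minimal sketch of the guarded-import shape this comment suggests (illustrative only; the proxy-callable fallback is one option, not the repo's actual fix):

```python
# Hypothetical _compat.py prologue, per the suggestion above.
try:
    from sglang.srt.utils.network import (  # canonical location since sglang 0.5.10
        NetworkAddress,
        get_local_ip_auto,
        get_zmq_socket,
    )
except ImportError as exc:  # sglang absent, e.g. lint/collection-only environments
    _sglang_import_error = exc

    def _require_sglang(*_args, **_kwargs):
        # Defer the failure to first use instead of module import time.
        raise RuntimeError("sglang is required for this code path") from _sglang_import_error

    NetworkAddress = get_local_ip_auto = get_zmq_socket = _require_sglang

# The module-level ensure_sglang_top_level_exports() call would sit inside the
# same guard (or move into a lazily-called accessor) so it, too, defers.
```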
📒 Files selected for processing (4)
- components/src/dynamo/sglang/CLAUDE.md
- components/src/dynamo/sglang/_compat.py
- components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py
- pyproject.toml
`tests/report_pytest_markers.py` mocks sglang submodules so test collection succeeds in the pre-commit isolated venv (which doesn't install sglang). Now that `dynamo.sglang._compat` imports from `sglang.srt.utils.network` unconditionally (the 0.5.9 fallback was pruned in the same PR), the mock list needs that submodule too — otherwise marker reporting fails to collect any sglang test file.
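A hedged sketch of the kind of `sys.modules` stubbing described (the actual mechanism in `tests/report_pytest_markers.py` may differ; only the `sglang.srt.utils.network` entry is dictated by this change):

```python
import sys
from unittest.mock import MagicMock

# Submodules to stub so sglang test files import cleanly during marker
# collection in the pre-commit venv (which does not install sglang).
MOCKED_SGLANG_MODULES = [
    "sglang",
    "sglang.srt",
    "sglang.srt.utils",
    "sglang.srt.utils.network",  # newly required: _compat now imports it unconditionally
]

for module_name in MOCKED_SGLANG_MODULES:
    sys.modules.setdefault(module_name, MagicMock(name=module_name))
```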
Now that the 0.5.9 fallback for `NetworkAddress` is gone, the
re-exports in `_compat` are pure pass-throughs of
`sglang.srt.utils.network.{NetworkAddress,get_local_ip_auto,
get_zmq_socket}`. Per the policy in `components/src/dynamo/sglang/CLAUDE.md`,
trivial re-exports belong at call sites, not in `_compat` — the shim
is reserved for symbols that have actually broken across versions.
Move the three imports to:
- `register.py` (NetworkAddress, get_local_ip_auto)
- `publisher.py` (NetworkAddress, get_local_ip_auto, get_zmq_socket)
- `request_handlers/handler_base.py` (NetworkAddress, get_local_ip_auto)
`_compat.py` keeps `ensure_sglang_top_level_exports`,
`filter_supported_async_generate_kwargs`, `get_scheduler_info`, and
`enable_disjoint_streaming_output` — all of which are still doing
real version-bridging work (signature probing, multi-attribute
fallbacks, SimpleNamespace stub handling).
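As an illustration, the relocated import at one of those call sites would look roughly like this (a sketch of `publisher.py`; assumes nothing else at the call site changes):

```python
# publisher.py — import directly from the canonical sglang location instead of
# re-exporting through dynamo.sglang._compat.
from sglang.srt.utils.network import (
    NetworkAddress,
    get_local_ip_auto,
    get_zmq_socket,
)
```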
Per @dynamo-ops review on #9230: dividing by (kv_available + kv_evictable + kv_used) skews the percentage when internally protected slots are present — the SGLang invariant is `available + evictable + used <= max_total_num_tokens` (note <=, not ==), so the smaller divisor overstates the free fraction and understates occupancy. Using `sglang:max_total_num_tokens` as the divisor gives true pool occupancy regardless of how slots are accounted internally.

Expression:
- before: `(1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100`
- after: `(1 - kv_available / max_total_num_tokens) * 100`

Also rewrote the panel description so it reflects the new framing ("not free, normalized against max_total_num_tokens") rather than the incorrect "physically allocated (used + evictable)" wording, and called out what `sglang:token_usage` actually measures (bottleneck across full/SWA/mamba pools) for clearer reviewer context.
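A toy calculation (made-up numbers; only the two expressions and the `<=` invariant come from the review) showing how the divisor choice shifts the reading when protected slots exist:

```python
# Hypothetical gauge snapshot: 1_000 internally protected slots keep the three
# gauges from summing all the way up to max_total_num_tokens.
max_total = 10_000
kv_used, kv_evictable, kv_available = 2_000, 1_000, 6_000

before = (1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100  # 33.3
after = (1 - kv_available / max_total) * 100                                 # 40.0

# The sum-of-gauges divisor overstates the free fraction (6_000/9_000 vs
# 6_000/10_000), so occupancy reads low whenever protected slots are present.
print(f"before: {before:.1f}%  after: {after:.1f}%")
```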
Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant) across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub: digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313. These were published by sgl-project/sglang's `Release Docker Runtime Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234 (workflow_dispatch from main with version=0.5.11). The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source — no need to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image bundles libjsoncpp inside the mooncake wheel itself (/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import TransferEngine` now succeeds without the system-level package. Verified via `docker run --rm --runtime=runc --entrypoint bash lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine import TransferEngine"'`.
…nt hang

SGLang 0.5.11 silently hangs requests when prompt + max_tokens nears max_total_tokens (no scheduler activity, no error). The bisected threshold is ~1040 for the chat_payload_default() case (~16-token prompt + max_tokens=1000); KV=1024 hangs, KV=1040 works (truncated to 962 tokens), KV=1056+ works fully. The test was at KV=96, well below threshold.
Fix for
| --max-total-tokens | Result |
|---|---|
| 1024 | hang |
| 1038 | hang |
| 1040 | OK, response truncated to 962 tokens |
| 1056 | OK, full 1000 tokens |
| 2048 | OK, full 1000 tokens |
Likely culprit is PrefillAdder.add_one_req in sglang/srt/managers/schedule_policy.py:740-754:

```python
cur_rem_tokens = self.cur_rem_tokens - self.ceil_paged_tokens(req.extend_input_len)

for i, (tokens_left, tokens_occupied) in enumerate(self.req_states):
    bs = len(self.req_states) - i
    min_free_tokens = cur_rem_tokens + tokens_freed - tokens_left * bs
    if min_free_tokens <= IGNORE_EOS_RESERVE_TOKENS * bs:
        return AddReqResult.NO_TOKEN
```

with tokens_left = max_new_tokens * new_token_ratio (default 0.7). When the request's projected occupancy approaches max_total_tokens, min_free_tokens goes ≤ IGNORE_EOS_RESERVE_TOKENS and the request goes back to waiting_queue — the scheduler retries forever instead of clamping or rejecting. Worth filing upstream; SGLang should clamp max_new_tokens (it already does at KV=1040+ — the 962-truncation case) or surface an error rather than spin.
Why 2048 specifically
- `chat_payload_default` sends `max_tokens=1000`; with chat-template padding the prompt is ~16 tokens. Need `max_total_tokens > 1000 + 16 + scheduler reserve`.
- Local probe shows ~1040 is the floor; 1056 lets generation hit the full 1000 tokens; 2048 leaves comfortable headroom for the secondary payloads (`completion_payload_default`, `responses_payload_default`, `metric_payload_default` (`min_num_requests=6`) — the latter sends 6 concurrent requests, so the budget needs to fit several small completions in flight).
- The `disaggregated_same_gpu` config is unaffected (already at `requested_sglang_kv_tokens(37472)`).
Local verification
```
$ pytest tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] \
    tests/serve/test_sglang.py::test_sglang_deployment[aggregated_unified-2]
========================= 2 passed in 81.01s (0:01:21) =========================
# previously: 2 failed in 1184s after both timed out at 60s each
```
Follow-ups (not in this PR)
- `pytest.mark.timeout(293)` from b7165adc329 is now unnecessary (each test runs ~40s); could be reverted to ~195 in a later cleanup, but harmless to leave.
- Upstream issue against SGLang for the silent-hang behavior — want to file once we have a minimal repro script written up.
- The `requested_sglang_kv_tokens` profiler logic (tests/utils/profile_pytest.py) found min=48 historically. On 0.5.11 the profiler will now find a much higher min (or worse, hang during profiling) — worth re-running the auto-profile next time we re-derive these markers.
```diff
  runtime_image: lmsysorg/sglang
  base_image_tag: 25.06-cuda12.9-devel-ubuntu24.04
- runtime_image_tag: v0.5.10.post1-runtime
+ runtime_image_tag: v0.5.11-runtime
```
The CUDA 12.9 SGLang runtime now points at v0.5.11-runtime, which aliases to the CUDA 13.0 image, so the cuda12.9 build uses the wrong upstream CUDA stack. Fix: use v0.5.11-cu129-runtime here and update the compliance README entry to match.
The real fix for the aggregated test hangs was the KV budget bump (d27b271), not the pytest timeout. Revert to the original profiled value.
```diff
  runtime_image: lmsysorg/sglang
  base_image_tag: 25.06-cuda12.9-devel-ubuntu24.04
- runtime_image_tag: v0.5.10.post1-runtime
+ runtime_image_tag: v0.5.11-runtime
```
The CUDA 12.9 SGLang build still uses v0.5.11-runtime, whose image config reports CUDA_VERSION=13.0.1, so this target is built against the wrong upstream CUDA runtime. Fix: use v0.5.11-cu129-runtime here and update the compliance README row to the same tag.
The dynamo, sglang, and kvbm dashboards reference prometheus uid P1809F7CD0C75ACF3, which doesn't match the provisioned uid `prometheus`, causing "datasource not found" panels. Add a second provisioned entry pointing at the same prometheus URL with the legacy uid so both groups of dashboards resolve cleanly.
…ation and per-pool KV breakdown
The bundled dashboard previously mixed Dynamo-frontend and SGLang-engine
data without source attribution, omitted the per-pool-type gauges added
in sglang 0.5.11 (full_token_usage / swa_token_usage / mamba_usage), and
referenced a few stale upstream metric names (sglang:hicache_eviction_*,
sglang:hicache_load_back_*, sglang:num_retracted_reqs_total) that no
longer exist in 0.5.11.
Reorganize into 9 explicitly-labeled rows:
1. Overview — stat-row summary (success rate, totals, avg TTFT/ITL/E2E)
2. Dynamo Frontend (:8000) — RPS, E2E + TTFT + ITL p50/p90/p99,
request-outcome breakdown, inflight vs queued, ISL/OSL/cached tokens
3. Dynamo KV Router — kv_hit_rate distribution, multi-stage routing
overhead (block_hashing / indexer_find_matches / seq_hashing /
scheduling / total), per-worker inflight, total KV blocks
4. SGLang Engine (:8081) — running/queued/paused, gen throughput,
engine-side TTFT/ITL p50/p90/p99, retractions + new_token_ratio
5. SGLang Engine — KV Pool Breakdown — full / swa / mamba pool usage,
absolute used/evictable/available, cache_hit_rate, pending_prealloc
(PD), streaming_session_held_tokens, lora_pool_*, capacity reference
Collapsed by default (no spam in agg/non-feature runs):
6. P/D Disagg Queues — num_prefill_*_queue_reqs / num_decode_*_queue_reqs,
grammar queue, spec accept rate/length
7. HiCache — host_used vs total, eviction & load-back rate by cache_type,
latency p99 by cache_type, prefetch / backup
8. Per-Worker Detail — req rate / cache hit / KV util / E2E p99 fanned by worker_id
9. GPU & Runtime Health — DCGM, dynamo_component_gpu_cache_usage_percent,
tokio_worker_busy_ratio, request_plane_roundtrip_ttft
Every panel description names its source metric so the data provenance
(Dynamo vs SGLang vs DCGM) is explicit. Dashboard uid stays sglang-engine
so existing links keep working.
_test_agg.sh: aggregated launch with all 0.5.11 observability features turned on (--enable-hierarchical-cache, --enable-streaming-session, --enable-metrics-for-all-schedulers, --enable-mfu-metrics) plus larger mem-fraction-static / max-running-requests / chunked-prefill / page-size values, so a load test against it populates every non-disagg row of the Dynamo + SGLang dashboard.

Dashboard: rows and descriptions referenced "port 8000" / "port 8081" explicitly, but those are overridable via DYN_HTTP_PORT and DYN_SYSTEM_PORT. Drop the hardcoded numbers and describe the source layer instead. Per-pool gauges (full/swa/mamba) only populate on hybrid-attention models — note that in the script header so users know to swap models if they want the SWA/Mamba panels to move.
…o-0.5.11-and-cleanups

Pulls in observability work that pairs with the 0.5.11 bump:
- legacy prometheus uid alias so bundled dashboards stop showing "datasource not found"
- SGLang dashboard rebuild with Dynamo<->SGLang separation, per-pool KV breakdown (full / SWA / Mamba), and collapsed disagg/hicache/per-worker rows
- _test_agg.sh: launch with all 0.5.11 observability features on so a load test against it lights up every applicable dashboard panel

# Conflicts:
#	container/context.yaml
… under Router

- Inflight vs Queued: explain the difference (inflight = dispatched, awaiting completion; queued = received but not yet dispatched). The combined shape is the steady-state diagnostic — flat inflight + growing queue means workers are saturated.
- Pool Utilization panel: drop the redundant "overall (legacy)" series. On non-hybrid models it duplicates full_token_usage and adds noise; on hybrid models its semantics overlap with the dominant pool. Spell out what each remaining series (full / SWA / Mamba) means in the description.
- Move the Per-Worker Detail row to sit directly under the Router row, since per-worker fanout is conceptually a drill-down of routing distribution.
Speculative decoding accept rate / accept length are throughput-level signals (they directly explain tokens-per-step), not disagg-queue signals. Move them out of the disagg row and into "SGLang Engine — throughput & batching" as two side-by-side panels with explanations of when they populate. Disagg row keeps just the grammar queue.
Restructure _test_agg.sh from single-worker to two-worker behind a
KV-aware router, matching agg_router.sh's topology:
- frontend with --router-mode kv (KV events on by default, --approx
falls back to approximate routing without KV events)
- 2 SGLang workers, one per GPU (CUDA_VISIBLE_DEVICES=0,1)
- per-worker DYN_SYSTEM_PORT (8081, 8082) and per-worker ZMQ KV-events
endpoints (5557, 5558)
- all 0.5.11 observability features still on for both workers
Update prometheus scrape config to pick up both worker metric ports;
without 8082 in the dynamo-backend job target list, Grafana would only
show worker 1 in the Per-Worker rows.
Summary
Bumps the SGLang dependency from `0.5.10.post1` to `0.5.11`. All testable launch scripts in `examples/backends/sglang/launch/` pass on 0.5.11 with no handler/init code changes — the entire bump is the version pin plus pruning 0.5.9 fallbacks from `components/src/dynamo/sglang/_compat.py` per the N/N-1 support policy in `components/src/dynamo/sglang/CLAUDE.md`.

Also closes #8151 — sgl-project/sglang#22726 (raw KV pool token gauges) landed in 0.5.11, so the dashboard work it was blocked on is now actionable. Bundled into this PR rather than a follow-up since both pieces gate on the same version floor.

Compat shim deprecation

New support window after this PR: N = 0.5.11, N-1 = 0.5.10. 0.5.9-targeted branches removed:

- `NetworkAddress` polyfill class (~70 LoC) — `sglang.srt.utils.network.NetworkAddress` is canonical from 0.5.10 onward.
- `try/except ImportError` around the network import — `NetworkAddress` / `get_local_ip_auto` / `get_zmq_socket` now import directly from `sglang.srt.utils.network` in `register.py`, `publisher.py`, and `request_handlers/handler_base.py`.
- `mm_encode()` wrapper — `_encode(mm_items)` is out of window. Both 0.5.10 and 0.5.11 accept `(mm_items, modality)` and return the same `(grid_dim, embedding, aux_data)` 3-tuple, so the two call sites in `encode_worker_handler.py` now invoke `await self.encoder._encode(...)` directly.
- `enable_disjoint_streaming_output`'s `stream_output` fallback — the field has been `incremental_streaming_output` since at or before 0.5.10 (`ServerArgs.__dataclass_fields__` in 0.5.11 has `incremental_streaming_output` and not `stream_output`). The wrapper itself stays — diffusion `SimpleNamespace` stubs need the attribute-absent no-op path.
- `import ipaddress, socket` (only used by the polyfill).

Kept (still doing real version-bridging): `ensure_sglang_top_level_exports`, `filter_supported_async_generate_kwargs`, `get_scheduler_info` (multi-attribute fork fallbacks), `enable_disjoint_streaming_output` (no-op for diffusion `SimpleNamespace` stubs). The `components/src/dynamo/sglang/CLAUDE.md` example version comment was bumped to reflect 0.5.10 as the new N-1.
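A hypothetical excerpt of the inlined call described above (the handler scaffolding is illustrative; only the `await self.encoder._encode(mm_items, modality)` shape and the 3-tuple come from the PR text):

```python
class EncodeWorkerHandler:
    def __init__(self, encoder):
        self.encoder = encoder  # sglang multimodal encoder instance

    async def encode(self, mm_items, modality):
        # 0.5.10 and 0.5.11 share this signature and return type, so the
        # removed mm_encode() compat wrapper is no longer needed in between.
        grid_dim, embedding, aux_data = await self.encoder._encode(mm_items, modality)
        return grid_dim, embedding, aux_data
```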
Grafana dashboard (#8151)

Adds a `KV Pool Detail (sglang ≥ 0.5.11)` row to `deploy/observability/grafana_dashboards/sglang.json`:

- Stacked breakdown of `sglang:kv_used_tokens` (locked), `sglang:kv_evictable_tokens` (radix-cached), and `sglang:kv_available_tokens` (free). Series sum to ≤ `max_total_num_tokens`.
- Usage %: `(1 - sglang:kv_available_tokens / sglang:max_total_num_tokens) * 100`. Captures true pool occupancy including used + evictable + any internally protected slots, vs. `sglang:token_usage` (which is the bottleneck across full/SWA/mamba pools and excludes evictable/protected slots). 90% threshold drawn in red. Normalizing against `max_total_num_tokens` rather than the sum of the three gauges keeps the reading from understating usage when protected slots are present, per @dynamo-ops review.

The existing `GPU KV Cache Usage %` panel (driven by `sglang:token_usage`) is unchanged — still useful as the bottleneck-pool view.
Container runtime image bump

Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `v0.5.11-runtime` (and the `-cu130-runtime` variant) in `container/context.yaml` and `container/compliance/README.md`. Verified the tags resolve on Docker Hub (digest `sha256:6f81caf1...`).

These tags were published by sgl-project/sglang's Release Docker Runtime Images run 25470428234 — `workflow_dispatch` from main with `version=0.5.11`. The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source — sidestepping the broken-on-aarch64 silent-cubin-skip bug that killed the prior tag-triggered runs.

Also drops the `libjsoncpp25` apt workaround from `container/templates/sglang_runtime.Dockerfile`: v0.5.11's runtime image now bundles `libjsoncpp` inside the mooncake wheel (`mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5`), so `from mooncake.engine import TransferEngine` succeeds without the system-level package. Verified via `docker run lmsysorg/sglang:v0.5.11-runtime ...`.

A future-proofing cherry-pick of sgl-project/sglang#24234 onto `release/v0.5.11` is open as sgl-project/sglang#24567 (draft) — not strictly required for this PR since the workflow dispatch from main works, but it lets any future tag-triggered or `v0.5.11.post1` build include the fix.
Launch script results (2x L40S)

- agg.sh
- agg_embed.sh
- agg_router.sh
- agg_vision.sh (SGLANG_DISABLE_CUDNN_CHECK=1, unchanged from 0.5.9)
- disagg.sh
- diffusion_llada.sh
- image_diffusion.sh
- text-to-video-diffusion.sh
- multimodal_epd.sh
- disagg_router.sh (skipped — needs 4 GPUs)
- multimodal_disagg.sh (skipped — needs 3 GPUs)
- disagg_same_gpu.sh

Test plan
- `examples/backends/sglang/launch/` validated end-to-end with health probes (chat / embeddings / images / videos / multimodal vision).
- `multimodal_epd.sh` re-validated after the `_compat.py` prune to confirm the inlined `_encode` calls work.
- `pytest -k compat components/src/dynamo/sglang/tests/test_sglang_unit.py` — 6/6 pass.
- `ruff check` + `ruff format --check` clean on changed files.
- `uvx pre-commit run --all-files` — green (incl. Report pytest markers after adding `sglang.srt.utils.network` to the mock list).
- New gauges (`sglang:kv_used_tokens`, `sglang:kv_evictable_tokens`, `sglang:kv_available_tokens`, `sglang:max_total_num_tokens`) verified scraping live from `<system_port>/metrics` on a Qwen/Qwen3-0.6B agg worker; invariant `kv_used + kv_evictable + kv_available <= max_total_num_tokens` confirmed after a real request.
- Dashboard JSON: `json.load` clean, no duplicate panel IDs.
- `lmsysorg/sglang:v0.5.11-runtime` and `v0.5.11-cu130-runtime` resolve on Docker Hub; mooncake import probed clean without the apt workaround.
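A hypothetical spot-check of that gauge invariant against a worker's `/metrics` endpoint (not the script used for the verification above; the gauge names come from the PR text, and the parsing is deliberately naive):

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8081/metrics"  # substitute the worker's system port

def gauge(body: str, name: str) -> float:
    # First sample for the gauge, ignoring any labels.
    pattern = r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)\s*$"
    match = re.search(pattern, body, flags=re.MULTILINE)
    if match is None:
        raise KeyError(f"gauge {name} not found in scrape")
    return float(match.group(1))

body = urllib.request.urlopen(METRICS_URL).read().decode()
used = gauge(body, "sglang:kv_used_tokens")
evictable = gauge(body, "sglang:kv_evictable_tokens")
available = gauge(body, "sglang:kv_available_tokens")
max_total = gauge(body, "sglang:max_total_num_tokens")

assert used + evictable + available <= max_total, "KV gauge invariant violated"
print(f"{used=} {evictable=} {available=} {max_total=}")
```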