feat(sglang): bump to 0.5.11 and prune 0.5.9 fallbacks in _compat #9230
ishandhanani wants to merge 21 commits into main from …
Conversation
All 9 testable launch scripts in examples/backends/sglang/launch/ pass on 0.5.11 with no handler/init code changes. (disagg_router needs 4 GPUs and multimodal_disagg needs 3 GPUs; both skipped on the test box's 2x L40S.) The new support window is N = 0.5.11, N-1 = 0.5.10, so the 0.5.9-targeted branches in _compat.py come out as part of this bump, per the N/N-1 policy in components/src/dynamo/sglang/CLAUDE.md.

Removed:
* `NetworkAddress` polyfill class + `try/except` around its import (`sglang.srt.utils.network` is canonical from 0.5.10 onward).
* `mm_encode()` wrapper. Both 0.5.10 and 0.5.11 take `_encode(mm_items, modality)` and return the same 3-tuple, so the call sites in `encode_worker_handler.py` now invoke `await self.encoder._encode(...)` directly.
* `enable_disjoint_streaming_output`'s `stream_output` fallback. The field has been `incremental_streaming_output` since at or before 0.5.10 (verified: `ServerArgs.__dataclass_fields__` in 0.5.11 has `incremental_streaming_output` and not `stream_output`). The wrapper itself stays — diffusion `SimpleNamespace` stubs need the attribute-absent no-op path.

Kept (not version-bound):
* `ensure_sglang_top_level_exports()`, `filter_supported_async_generate_kwargs`, `get_scheduler_info` (the latter still probes fork/experimental attribute paths).

The CLAUDE.md example version was bumped to reflect 0.5.10 as the new N-1.
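For illustration, the retained wrapper's post-prune shape might look like this minimal sketch (inferred from the description above; the actual `_compat.py` implementation may differ):

```python
def enable_disjoint_streaming_output(server_args) -> None:
    # Illustrative only. Diffusion workers pass SimpleNamespace stubs that lack
    # the field, so the attribute-absent case stays a no-op; the removed part is
    # the 0.5.9 fallback branch that set the old `stream_output` field instead.
    if hasattr(server_args, "incremental_streaming_output"):
        server_args.incremental_streaming_output = True
```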
Closes #8151 (now unblocked: sgl-project/sglang#22726 landed in v0.5.11, which is the new floor after this PR). Adds a "KV Pool Detail" row to deploy/observability/grafana_dashboards/sglang.json with two new panels driven by the gauges added in 0.5.11:

* `KV Pool Breakdown (tokens)` — stacked timeseries of `sglang:kv_used_tokens` (locked by running requests), `sglang:kv_evictable_tokens` (radix-cached, reclaimable), and `sglang:kv_available_tokens` (free). The three series sum to ≤ `sglang:max_total_num_tokens`, per the invariant documented in SGLang's metrics_collector.py.
* `KV Pool Physical Usage %` — `(1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100`. Captures true pool occupancy including evictable slots, vs. `sglang:token_usage`, which excludes them. A 90% threshold is drawn in red for the "no headroom even after evict" case.

The existing `GPU KV Cache Usage %` panel (driven by `sglang:token_usage`) is unchanged — it's still useful as the "bottleneck across full / SWA / mamba pools" view that the new gauges don't replicate.

Verified live on a Qwen/Qwen3-0.6B agg worker: all three gauges export at `<system_port>/metrics`, and `kv_available + kv_evictable + kv_used` = `max_total_num_tokens` after a real request.
Walkthrough
The PR modernizes SGLang compatibility by removing legacy import fallbacks, simplifying streaming configuration logic, and replacing the …

Changes
SGLang Compatibility Modernization
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/sglang/_compat.py`:
- Around line 23-27: The module currently hard-imports NetworkAddress,
get_local_ip_auto, and get_zmq_socket and unconditionally calls
ensure_sglang_top_level_exports(), which causes ModuleNotFoundError at import
time when sglang isn't installed; wrap the top-level import of
NetworkAddress/get_local_ip_auto/get_zmq_socket in a try/except ImportError and
provide safe fallbacks (e.g., set those names to None or simple proxy callables)
so consumers will get a clear runtime error when used, and also guard the call
to ensure_sglang_top_level_exports() in the same try/except (or remove the
top-level call and call it lazily inside a guarded accessor) so sglang import
failures are deferred to runtime. Ensure you reference the exact symbols
NetworkAddress, get_local_ip_auto, get_zmq_socket and the function
ensure_sglang_top_level_exports() when updating the file.
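A minimal sketch of the guarded-import shape this comment suggests (illustrative only; the proxy-callable fallback is one option, not the repo's actual fix):

```python
# Hypothetical _compat.py prologue, per the suggestion above.
try:
    from sglang.srt.utils.network import (  # canonical location since sglang 0.5.10
        NetworkAddress,
        get_local_ip_auto,
        get_zmq_socket,
    )
except ImportError as exc:  # sglang absent, e.g. lint/collection-only environments
    _sglang_import_error = exc

    def _require_sglang(*_args, **_kwargs):
        # Defer the failure to first use instead of module import time.
        raise RuntimeError("sglang is required for this code path") from _sglang_import_error

    NetworkAddress = get_local_ip_auto = get_zmq_socket = _require_sglang

# The module-level ensure_sglang_top_level_exports() call would sit inside the
# same guard (or move into a lazily-called accessor) so it, too, defers.
```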
📒 Files selected for processing (4)
- components/src/dynamo/sglang/CLAUDE.md
- components/src/dynamo/sglang/_compat.py
- components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py
- pyproject.toml
`tests/report_pytest_markers.py` mocks sglang submodules so test collection succeeds in the pre-commit isolated venv (which doesn't install sglang). Now that `dynamo.sglang._compat` imports from `sglang.srt.utils.network` unconditionally (the 0.5.9 fallback was pruned in the same PR), the mock list needs that submodule too — otherwise marker reporting fails to collect any sglang test file.
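A hedged sketch of the kind of `sys.modules` stubbing described (the actual mechanism in `tests/report_pytest_markers.py` may differ; only the `sglang.srt.utils.network` entry is dictated by this change):

```python
import sys
from unittest.mock import MagicMock

# Submodules to stub so sglang test files import cleanly during marker
# collection in the pre-commit venv (which does not install sglang).
MOCKED_SGLANG_MODULES = [
    "sglang",
    "sglang.srt",
    "sglang.srt.utils",
    "sglang.srt.utils.network",  # newly required: _compat now imports it unconditionally
]

for module_name in MOCKED_SGLANG_MODULES:
    sys.modules.setdefault(module_name, MagicMock(name=module_name))
```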
Now that the 0.5.9 fallback for `NetworkAddress` is gone, the
re-exports in `_compat` are pure pass-throughs of
`sglang.srt.utils.network.{NetworkAddress,get_local_ip_auto,
get_zmq_socket}`. Per the policy in `components/src/dynamo/sglang/CLAUDE.md`,
trivial re-exports belong at call sites, not in `_compat` — the shim
is reserved for symbols that have actually broken across versions.
Move the three imports to:
- `register.py` (NetworkAddress, get_local_ip_auto)
- `publisher.py` (NetworkAddress, get_local_ip_auto, get_zmq_socket)
- `request_handlers/handler_base.py` (NetworkAddress, get_local_ip_auto)
`_compat.py` keeps `ensure_sglang_top_level_exports`,
`filter_supported_async_generate_kwargs`, `get_scheduler_info`, and
`enable_disjoint_streaming_output` — all of which are still doing
real version-bridging work (signature probing, multi-attribute
fallbacks, SimpleNamespace stub handling).
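As an illustration, the relocated import at one of those call sites would look roughly like this (a sketch of `publisher.py`; assumes nothing else at the call site changes):

```python
# publisher.py — import directly from the canonical sglang location instead of
# re-exporting through dynamo.sglang._compat.
from sglang.srt.utils.network import (
    NetworkAddress,
    get_local_ip_auto,
    get_zmq_socket,
)
```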
Per @dynamo-ops review on #9230: dividing by (kv_available + kv_evictable + kv_used) skews the percentage when internally protected slots are present — the SGLang invariant is `available + evictable + used <= max_total_num_tokens` (note <=, not ==), so the smaller divisor overstates the free fraction and understates occupancy. Using `sglang:max_total_num_tokens` as the divisor gives true pool occupancy regardless of how slots are accounted internally.

Expression:
- before: `(1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100`
- after: `(1 - kv_available / max_total_num_tokens) * 100`

Also rewrote the panel description so it reflects the new framing ("not free, normalized against max_total_num_tokens") rather than the incorrect "physically allocated (used + evictable)" wording, and called out what `sglang:token_usage` actually measures (bottleneck across full/SWA/mamba pools) for clearer reviewer context.
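A toy calculation (made-up numbers; only the two expressions and the `<=` invariant come from the review) showing how the divisor choice shifts the reading when protected slots exist:

```python
# Hypothetical gauge snapshot: 1_000 internally protected slots keep the three
# gauges from summing all the way up to max_total_num_tokens.
max_total = 10_000
kv_used, kv_evictable, kv_available = 2_000, 1_000, 6_000

before = (1 - kv_available / (kv_available + kv_evictable + kv_used)) * 100  # 33.3
after = (1 - kv_available / max_total) * 100                                 # 40.0

# The sum-of-gauges divisor overstates the free fraction (6_000/9_000 vs
# 6_000/10_000), so occupancy reads low whenever protected slots are present.
print(f"before: {before:.1f}%  after: {after:.1f}%")
```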
Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `lmsysorg/sglang:v0.5.11-runtime` (and the `-cu130-runtime` variant) across:
- container/context.yaml
- container/compliance/README.md

Verified the tags resolve on Docker Hub: digest sha256:6f81caf1d2a24b2cfc212410900c14d633302bc16e6ec8379c0d382e625ab313. These were published by sgl-project/sglang's `Release Docker Runtime Images` workflow run https://github.com/sgl-project/sglang/actions/runs/25470428234 (workflow_dispatch from main with version=0.5.11). The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source — no need to wait for that fix to be cherry-picked into release/v0.5.11.

Also drop the libjsoncpp25 apt workaround from container/templates/sglang_runtime.Dockerfile. v0.5.11's runtime image bundles libjsoncpp inside the mooncake wheel itself (/usr/local/lib/python3.12/dist-packages/mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5), so `from mooncake.engine import TransferEngine` now succeeds without the system-level package. Verified via `docker run --rm --runtime=runc --entrypoint bash lmsysorg/sglang:v0.5.11-runtime -c 'python3 -c "from mooncake.engine import TransferEngine"'`.
…nt hang

SGLang 0.5.11 silently hangs requests when prompt + max_tokens nears max_total_tokens (no scheduler activity, no error). The bisected threshold is ~1040 for the chat_payload_default() case (~16-token prompt + max_tokens=1000); KV=1024 hangs, KV=1040 works (truncated to 962 tokens), KV=1056+ works fully. The test was at KV=96, well below threshold.
Fix for
| --max-total-tokens | Result |
|---|---|
| 1024 | hang |
| 1038 | hang |
| 1040 | OK, response truncated to 962 tokens |
| 1056 | OK, full 1000 tokens |
| 2048 | OK, full 1000 tokens |
Likely culprit is PrefillAdder.add_one_req in sglang/srt/managers/schedule_policy.py:740-754:

```python
cur_rem_tokens = self.cur_rem_tokens - self.ceil_paged_tokens(req.extend_input_len)

for i, (tokens_left, tokens_occupied) in enumerate(self.req_states):
    bs = len(self.req_states) - i
    min_free_tokens = cur_rem_tokens + tokens_freed - tokens_left * bs
    if min_free_tokens <= IGNORE_EOS_RESERVE_TOKENS * bs:
        return AddReqResult.NO_TOKEN
```

with tokens_left = max_new_tokens * new_token_ratio (default 0.7). When the request's projected occupancy approaches max_total_tokens, min_free_tokens goes ≤ IGNORE_EOS_RESERVE_TOKENS and the request goes back to waiting_queue — the scheduler retries forever instead of clamping or rejecting. Worth filing upstream; SGLang should clamp max_new_tokens (it already does at KV=1040+ — the 962-truncation case) or surface an error rather than spin.
Why 2048 specifically
- `chat_payload_default` sends `max_tokens=1000`; with chat-template padding the prompt is ~16 tokens. Need `max_total_tokens > 1000 + 16 + scheduler reserve`.
- Local probe shows ~1040 is the floor; 1056 lets generation hit the full 1000 tokens; 2048 leaves comfortable headroom for the secondary payloads (`completion_payload_default`, `responses_payload_default`, `metric_payload_default` (`min_num_requests=6`) — the latter sends 6 concurrent requests, so the budget needs to fit several small completions in flight).
- The `disaggregated_same_gpu` config is unaffected (already at `requested_sglang_kv_tokens(37472)`).
Local verification
```
$ pytest tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] \
    tests/serve/test_sglang.py::test_sglang_deployment[aggregated_unified-2]
========================= 2 passed in 81.01s (0:01:21) =========================
# previously: 2 failed in 1184s after both timed out at 60s each
```
Follow-ups (not in this PR)
- `pytest.mark.timeout(293)` from b7165adc329 is now unnecessary (each test runs ~40s); could be reverted to ~195 in a later cleanup, but harmless to leave.
- Upstream issue against SGLang for the silent-hang behavior — want to file once we have a minimal repro script written up.
- The `requested_sglang_kv_tokens` profiler logic (tests/utils/profile_pytest.py) found min=48 historically. On 0.5.11 the profiler will now find a much higher min (or worse, hang during profiling) — worth re-running the auto-profile next time we re-derive these markers.
```diff
  runtime_image: lmsysorg/sglang
  base_image_tag: 25.06-cuda12.9-devel-ubuntu24.04
- runtime_image_tag: v0.5.10.post1-runtime
+ runtime_image_tag: v0.5.11-runtime
```
The CUDA 12.9 SGLang runtime now points at v0.5.11-runtime, which aliases to the CUDA 13.0 image, so the cuda12.9 build uses the wrong upstream CUDA stack. Fix: use v0.5.11-cu129-runtime here and update the compliance README entry to match.
The real fix for the aggregated test hangs was the KV budget bump (d27b271), not the pytest timeout. Revert to the original profiled value.
```diff
  runtime_image: lmsysorg/sglang
  base_image_tag: 25.06-cuda12.9-devel-ubuntu24.04
- runtime_image_tag: v0.5.10.post1-runtime
+ runtime_image_tag: v0.5.11-runtime
```
The CUDA 12.9 SGLang build still uses v0.5.11-runtime, whose image config reports CUDA_VERSION=13.0.1, so this target is built against the wrong upstream CUDA runtime. Fix: use v0.5.11-cu129-runtime here and update the compliance README row to the same tag.
The dynamo, sglang, and kvbm dashboards reference prometheus uid P1809F7CD0C75ACF3, which doesn't match the provisioned uid `prometheus`, causing "datasource not found" panels. Add a second provisioned entry pointing at the same prometheus URL with the legacy uid so both groups of dashboards resolve cleanly.
…ation and per-pool KV breakdown
The bundled dashboard previously mixed Dynamo-frontend and SGLang-engine
data without source attribution, omitted the per-pool-type gauges added
in sglang 0.5.11 (full_token_usage / swa_token_usage / mamba_usage), and
referenced a few stale upstream metric names (sglang:hicache_eviction_*,
sglang:hicache_load_back_*, sglang:num_retracted_reqs_total) that no
longer exist in 0.5.11.
Reorganize into 9 explicitly-labeled rows:
1. Overview — stat-row summary (success rate, totals, avg TTFT/ITL/E2E)
2. Dynamo Frontend (:8000) — RPS, E2E + TTFT + ITL p50/p90/p99,
request-outcome breakdown, inflight vs queued, ISL/OSL/cached tokens
3. Dynamo KV Router — kv_hit_rate distribution, multi-stage routing
overhead (block_hashing / indexer_find_matches / seq_hashing /
scheduling / total), per-worker inflight, total KV blocks
4. SGLang Engine (:8081) — running/queued/paused, gen throughput,
engine-side TTFT/ITL p50/p90/p99, retractions + new_token_ratio
5. SGLang Engine — KV Pool Breakdown — full / swa / mamba pool usage,
absolute used/evictable/available, cache_hit_rate, pending_prealloc
(PD), streaming_session_held_tokens, lora_pool_*, capacity reference
Collapsed by default (no spam in agg/non-feature runs):
6. P/D Disagg Queues — num_prefill_*_queue_reqs / num_decode_*_queue_reqs,
grammar queue, spec accept rate/length
7. HiCache — host_used vs total, eviction & load-back rate by cache_type,
latency p99 by cache_type, prefetch / backup
8. Per-Worker Detail — req rate / cache hit / KV util / E2E p99 fanned by worker_id
9. GPU & Runtime Health — DCGM, dynamo_component_gpu_cache_usage_percent,
tokio_worker_busy_ratio, request_plane_roundtrip_ttft
Every panel description names its source metric so the data provenance
(Dynamo vs SGLang vs DCGM) is explicit. Dashboard uid stays sglang-engine
so existing links keep working.
_test_agg.sh: aggregated launch with all 0.5.11 observability features turned on (--enable-hierarchical-cache, --enable-streaming-session, --enable-metrics-for-all-schedulers, --enable-mfu-metrics) plus larger mem-fraction-static / max-running-requests / chunked-prefill / page-size values, so a load test against it populates every non-disagg row of the Dynamo + SGLang dashboard.

Dashboard: rows and descriptions referenced "port 8000" / "port 8081" explicitly, but those are overridable via DYN_HTTP_PORT and DYN_SYSTEM_PORT. Drop the hardcoded numbers and describe the source layer instead. Per-pool gauges (full/swa/mamba) only populate on hybrid-attention models — note that in the script header so users know to swap models if they want the SWA/Mamba panels to move.
…o-0.5.11-and-cleanups

Pulls in observability work that pairs with the 0.5.11 bump:
- legacy prometheus uid alias so bundled dashboards stop showing "datasource not found"
- SGLang dashboard rebuild with Dynamo<->SGLang separation, per-pool KV breakdown (full / SWA / Mamba), and collapsed disagg/hicache/per-worker rows
- _test_agg.sh: launch with all 0.5.11 observability features on so a load test against it lights up every applicable dashboard panel

# Conflicts:
#	container/context.yaml
… under Router

- Inflight vs Queued: explain the difference (inflight = dispatched, awaiting completion; queued = received but not yet dispatched). The combined shape is the steady-state diagnostic — flat inflight + growing queue means workers are saturated.
- Pool Utilization panel: drop the redundant "overall (legacy)" series. On non-hybrid models it duplicates full_token_usage and adds noise; on hybrid models its semantics overlap with the dominant pool. Spell out what each remaining series (full / SWA / Mamba) means in the description.
- Move the Per-Worker Detail row to sit directly under the Router row, since per-worker fanout is conceptually a drill-down of routing distribution.
Speculative decoding accept rate / accept length are throughput-level signals (they directly explain tokens-per-step), not disagg-queue signals. Move them out of the disagg row and into "SGLang Engine — throughput & batching" as two side-by-side panels with explanations of when they populate. Disagg row keeps just the grammar queue.
Restructure _test_agg.sh from single-worker to two-worker behind a
KV-aware router, matching agg_router.sh's topology:
- frontend with --router-mode kv (KV events on by default, --approx
falls back to approximate routing without KV events)
- 2 SGLang workers, one per GPU (CUDA_VISIBLE_DEVICES=0,1)
- per-worker DYN_SYSTEM_PORT (8081, 8082) and per-worker ZMQ KV-events
endpoints (5557, 5558)
- all 0.5.11 observability features still on for both workers
Update prometheus scrape config to pick up both worker metric ports;
without 8082 in the dynamo-backend job target list, Grafana would only
show worker 1 in the Per-Worker rows.
Summary
Bumps the SGLang dependency from `0.5.10.post1` to `0.5.11`. All testable launch scripts in `examples/backends/sglang/launch/` pass on 0.5.11 with no handler/init code changes — the entire bump is the version pin plus pruning 0.5.9 fallbacks from `components/src/dynamo/sglang/_compat.py` per the N/N-1 support policy in `components/src/dynamo/sglang/CLAUDE.md`.

Also closes #8151 — sgl-project/sglang#22726 (raw KV pool token gauges) landed in 0.5.11, so the dashboard work it was blocked on is now actionable. Bundled into this PR rather than a follow-up since both pieces gate on the same version floor.

Compat shim deprecation

New support window after this PR: N = 0.5.11, N-1 = 0.5.10. 0.5.9-targeted branches removed:

- `NetworkAddress` polyfill class (~70 LoC) — `sglang.srt.utils.network.NetworkAddress` is canonical from 0.5.10 onward.
- `try/except ImportError` around the network import — `NetworkAddress` / `get_local_ip_auto` / `get_zmq_socket` now import directly from `sglang.srt.utils.network` in `register.py`, `publisher.py`, and `request_handlers/handler_base.py`.
- `mm_encode()` wrapper — `_encode(mm_items)` is out of window. Both 0.5.10 and 0.5.11 accept `(mm_items, modality)` and return the same `(grid_dim, embedding, aux_data)` 3-tuple, so the two call sites in `encode_worker_handler.py` now invoke `await self.encoder._encode(...)` directly.
- `enable_disjoint_streaming_output`'s `stream_output` fallback — the field has been `incremental_streaming_output` since at or before 0.5.10 (`ServerArgs.__dataclass_fields__` in 0.5.11 has `incremental_streaming_output` and not `stream_output`). The wrapper itself stays — diffusion `SimpleNamespace` stubs need the attribute-absent no-op path.
- `import ipaddress, socket` (only used by the polyfill).

Kept (still doing real version-bridging): `ensure_sglang_top_level_exports`, `filter_supported_async_generate_kwargs`, `get_scheduler_info` (multi-attribute fork fallbacks), `enable_disjoint_streaming_output` (no-op for diffusion `SimpleNamespace` stubs). The `components/src/dynamo/sglang/CLAUDE.md` example version comment was bumped to reflect 0.5.10 as the new N-1.
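A hypothetical excerpt of the inlined call described above (the handler scaffolding is illustrative; only the `await self.encoder._encode(mm_items, modality)` shape and the 3-tuple come from the PR text):

```python
class EncodeWorkerHandler:
    def __init__(self, encoder):
        self.encoder = encoder  # sglang multimodal encoder instance

    async def encode(self, mm_items, modality):
        # 0.5.10 and 0.5.11 share this signature and return type, so the
        # removed mm_encode() compat wrapper is no longer needed in between.
        grid_dim, embedding, aux_data = await self.encoder._encode(mm_items, modality)
        return grid_dim, embedding, aux_data
```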
Grafana dashboard (#8151)

Adds a `KV Pool Detail (sglang ≥ 0.5.11)` row to `deploy/observability/grafana_dashboards/sglang.json`:

- Stacked breakdown of `sglang:kv_used_tokens` (locked), `sglang:kv_evictable_tokens` (radix-cached), and `sglang:kv_available_tokens` (free). Series sum to ≤ `max_total_num_tokens`.
- Usage %: `(1 - sglang:kv_available_tokens / sglang:max_total_num_tokens) * 100`. Captures true pool occupancy including used + evictable + any internally protected slots, vs. `sglang:token_usage` (which is the bottleneck across full/SWA/mamba pools and excludes evictable/protected slots). 90% threshold drawn in red. Normalizing against `max_total_num_tokens` rather than the sum of the three gauges keeps the reading from understating usage when protected slots are present, per @dynamo-ops review.

The existing `GPU KV Cache Usage %` panel (driven by `sglang:token_usage`) is unchanged — still useful as the bottleneck-pool view.
Container runtime image bump

Bumps `lmsysorg/sglang:v0.5.10.post1-runtime` → `v0.5.11-runtime` (and the `-cu130-runtime` variant) in `container/context.yaml` and `container/compliance/README.md`. Verified the tags resolve on Docker Hub (digest `sha256:6f81caf1...`).

These tags were published by sgl-project/sglang's Release Docker Runtime Images run 25470428234 — `workflow_dispatch` from main with `version=0.5.11`. The workflow uses the dispatched ref's Dockerfile, so main's already-fixed Dockerfile (sgl-project/sglang#24234) builds against v0.5.11 source — sidestepping the broken-on-aarch64 silent-cubin-skip bug that killed the prior tag-triggered runs.

Also drops the `libjsoncpp25` apt workaround from `container/templates/sglang_runtime.Dockerfile`: v0.5.11's runtime image now bundles `libjsoncpp` inside the mooncake wheel (`mooncake_transfer_engine_cuda13.libs/libjsoncpp-7d699962.so.1.9.5`), so `from mooncake.engine import TransferEngine` succeeds without the system-level package. Verified via `docker run lmsysorg/sglang:v0.5.11-runtime ...`.

A future-proofing cherry-pick of sgl-project/sglang#24234 onto `release/v0.5.11` is open as sgl-project/sglang#24567 (draft) — not strictly required for this PR since the workflow dispatch from main works, but it lets any future tag-triggered or `v0.5.11.post1` build include the fix.
Launch script results (2x L40S)

- agg.sh
- agg_embed.sh
- agg_router.sh
- agg_vision.sh (SGLANG_DISABLE_CUDNN_CHECK=1, unchanged from 0.5.9)
- disagg.sh
- diffusion_llada.sh
- image_diffusion.sh
- text-to-video-diffusion.sh
- multimodal_epd.sh
- disagg_router.sh (skipped — needs 4 GPUs)
- multimodal_disagg.sh (skipped — needs 3 GPUs)
- disagg_same_gpu.sh

Test plan
- `examples/backends/sglang/launch/` validated end-to-end with health probes (chat / embeddings / images / videos / multimodal vision).
- `multimodal_epd.sh` re-validated after the `_compat.py` prune to confirm the inlined `_encode` calls work.
- `pytest -k compat components/src/dynamo/sglang/tests/test_sglang_unit.py` — 6/6 pass.
- `ruff check` + `ruff format --check` clean on changed files.
- `uvx pre-commit run --all-files` — green (incl. Report pytest markers after adding `sglang.srt.utils.network` to the mock list).
- New gauges (`sglang:kv_used_tokens`, `sglang:kv_evictable_tokens`, `sglang:kv_available_tokens`, `sglang:max_total_num_tokens`) verified scraping live from `<system_port>/metrics` on a Qwen/Qwen3-0.6B agg worker; invariant `kv_used + kv_evictable + kv_available <= max_total_num_tokens` confirmed after a real request.
- Dashboard JSON: `json.load` clean, no duplicate panel IDs.
- `lmsysorg/sglang:v0.5.11-runtime` and `v0.5.11-cu130-runtime` resolve on Docker Hub; mooncake import probed clean without the apt workaround.
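A hypothetical spot-check of that gauge invariant against a worker's `/metrics` endpoint (not the script used for the verification above; the gauge names come from the PR text, and the parsing is deliberately naive):

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8081/metrics"  # substitute the worker's system port

def gauge(body: str, name: str) -> float:
    # First sample for the gauge, ignoring any labels.
    pattern = r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)\s*$"
    match = re.search(pattern, body, flags=re.MULTILINE)
    if match is None:
        raise KeyError(f"gauge {name} not found in scrape")
    return float(match.group(1))

body = urllib.request.urlopen(METRICS_URL).read().decode()
used = gauge(body, "sglang:kv_used_tokens")
evictable = gauge(body, "sglang:kv_evictable_tokens")
available = gauge(body, "sglang:kv_available_tokens")
max_total = gauge(body, "sglang:max_total_num_tokens")

assert used + evictable + available <= max_total, "KV gauge invariant violated"
print(f"{used=} {evictable=} {available=} {max_total=}")
```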