[model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels #15160
Merged
BREAKING CHANGE: All sgl_router_* metrics have been removed.

This commit completes Phase 3 of the metrics migration plan, removing the legacy metrics system in favor of the new layered SmgMetrics architecture.

## What Changed

Removed the entire RouterMetrics struct (~700 lines) and all sgl_router_* prefixed metrics from 16 files across the codebase. The new smg_* metrics with the layered architecture remain as the sole metrics system.

## Why

The legacy sgl_router_* metrics had several issues:

- Inconsistent labeling across different metric types
- No clear separation of concerns between layers
- High cardinality labels in some metrics
- Difficult to correlate metrics across the request lifecycle

The new SmgMetrics architecture provides:

- 6 clear layers: HTTP → Router → Worker → Discovery → MCP → Database
- Consistent labeling with predefined constants (smg_labels module)
- Low cardinality where possible with bounded label values
- Better observability for gRPC streaming (TTFT, TPOT, token counts)
- Support for Responses API (MCP tool calls, database operations)

## Migration Required

Users must update their dashboards and alerts to use the new smg_* metrics. See .claude/metrics-architecture.md for the full metrics reference and example PromQL queries.

Old metrics (removed): sgl_router_requests_total, sgl_router_request_duration_seconds, sgl_router_worker_health, sgl_router_processed_requests_total, etc.

New metrics (use these): smg_router_requests_total, smg_router_request_duration_seconds, smg_worker_pool_size, smg_worker_requests_active, etc.
Simplify the metrics API naming for better ergonomics:

- SmgMetrics → Metrics
- smg_labels → metrics_labels

The smg_ prefix on the struct name was redundant since the module path already provides context (e.g., metrics::Metrics). The exported Prometheus metric names retain their smg_* prefix for consistency in dashboards.

No functional changes - this is purely a rename refactor.
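To illustrate the rename, here is a minimal sketch of a simplified module layout. Only the names metrics::Metrics, metrics_labels, and smg_router_requests_total come from this PR; the label constants and the request_selector helper are hypothetical and exist purely for illustration. The real module is much larger and may look quite different.

```rust
// Hypothetical, heavily simplified layout; not the gateway's actual code.
mod metrics {
    /// Label constants (formerly `smg_labels`).
    pub mod metrics_labels {
        // Assumed label key and value, chosen only for this example.
        pub const ROUTER_TYPE: &str = "router_type";
        pub const ROUTER_GRPC: &str = "grpc";
    }

    /// Formerly `SmgMetrics`; the module path already says "metrics".
    pub struct Metrics;

    impl Metrics {
        /// The exported Prometheus name keeps its `smg_` prefix.
        pub const ROUTER_REQUESTS_TOTAL: &'static str = "smg_router_requests_total";

        /// Hypothetical helper: render a PromQL-style selector for the counter.
        pub fn request_selector(router_type: &str) -> String {
            format!(
                "{}{{{}=\"{}\"}}",
                Self::ROUTER_REQUESTS_TOTAL,
                metrics_labels::ROUTER_TYPE,
                router_type
            )
        }
    }
}
```

Call sites would then read metrics::Metrics and metrics::metrics_labels, while dashboards keep querying smg_* names unchanged.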
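BREAKING CHANGE: All sgl_router_* metrics have been removed.

This commit removes the legacy metrics system in favor of the new layered SmgMetrics architecture.

## What Changed

Removed the entire RouterMetrics struct (~700 lines) and all sgl_router_* prefixed metrics from 16 files across the codebase. The new smg_* metrics remain as the sole metrics system.

## Why

The legacy sgl_router_* metrics had several issues:

- Inconsistent labeling across different metric types
- No clear separation of concerns between layers
- High cardinality labels in some metrics
- Difficult to correlate metrics across the request lifecycle

The new SmgMetrics architecture provides:

- 6 clear layers: HTTP → Router → Worker → Discovery → MCP → Database
- Consistent labeling with predefined constants (smg_labels module)
- Low cardinality where possible with bounded label values
- Better observability for gRPC streaming (TTFT, TPOT, token counts)
- Support for Responses API (MCP tool calls, database operations)

## Migration Required

Users must update their dashboards and alerts to use the new smg_* metrics.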
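One way to find dashboards and alert rules that still reference removed metrics is to scan their text for the sgl_router_ prefix. The sketch below assumes the dashboard or rules file has already been read into a string. find_legacy_metrics is a hypothetical helper written for this example, not part of the gateway.

```rust
use std::collections::BTreeSet;

/// Prefix shared by every legacy metric removed in this PR.
const LEGACY_PREFIX: &str = "sgl_router_";

/// Collect the distinct legacy metric names that appear in `text`
/// (for example, dashboard JSON or an alert rules file).
/// Hypothetical helper for illustration only.
fn find_legacy_metrics(text: &str) -> BTreeSet<String> {
    let mut found = BTreeSet::new();
    let mut rest = text;
    while let Some(start) = rest.find(LEGACY_PREFIX) {
        let candidate = &rest[start..];
        // A Prometheus metric name runs over [A-Za-z0-9_:] characters.
        let end = candidate
            .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == ':'))
            .unwrap_or(candidate.len());
        found.insert(candidate[..end].to_string());
        // Continue scanning after the name just recorded.
        rest = &candidate[end..];
    }
    found
}
```

Each reported name then needs a manual decision, since labels and semantics may differ between the old and new metrics and a plain find-and-replace of the prefix is unlikely to be enough. Histogram series such as names ending in _bucket, _sum, or _count are reported as written.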
## Sample PromQL

- Request Rate by Router Type
- P99 Latency by Model
- Error Rate by Backend Type
- Worker Pool Health
- Pipeline Stage Breakdown
- P50 TTFT by Model (gRPC)
- Average TPOT by Model (gRPC)
- Token Throughput (tokens/sec)
- MCP Tool Call Success Rate