[model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels#15160

Merged: slin1237 merged 2 commits into main from metric-n/12 on Dec 15, 2025

slin1237 commented on Dec 15, 2025

BREAKING CHANGE: All sgl_router_* metrics have been removed.

This commit removes the legacy metrics system in favor of the new layered SmgMetrics architecture.

What Changed

Removed the entire RouterMetrics struct (~700 lines) and all sgl_router_* prefixed metrics from 16 files across the codebase. The new smg_* metrics remain as the sole metrics system.

Why

The legacy sgl_router_* metrics had several issues:

  • Inconsistent labeling across different metric types
  • No clear separation of concerns between layers
  • Difficult to correlate metrics across the request lifecycle

The new SmgMetrics architecture provides:

  • 6 clear layers: HTTP → Router → Worker → Discovery → MCP → Database
  • Consistent labeling with predefined constants (smg_labels module)
  • Low cardinality where possible with bounded label values
  • Better observability for gRPC streaming (TTFT, TPOT, token counts)
  • Support for Responses API (MCP tool calls, database operations)
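The "predefined constants" and "bounded label values" points can be illustrated with a minimal sketch. Everything named here is hypothetical: the module `labels_sketch`, its constants, and the `bounded_error_type` helper are not taken from the actual `smg_labels` module, and the three error categories are assumptions made for illustration. Only the label names `router_type` and `error_type` and the value `grpc` appear in the queries in this description.

```rust
/// Hypothetical label constants, in the spirit of a shared labels module.
/// The real `smg_labels` module may use different names and values.
pub mod labels_sketch {
    pub const ROUTER_TYPE: &str = "router_type";
    pub const ROUTER_TYPE_GRPC: &str = "grpc";
    pub const ERROR_TYPE: &str = "error_type";

    // Assumed, illustrative error categories (not from the PR).
    pub const ERROR_TIMEOUT: &str = "timeout";
    pub const ERROR_CONNECTION: &str = "connection";
    pub const ERROR_OTHER: &str = "other";
}

/// Collapses a free-form error message into one of a fixed set of label
/// values, so the `error_type` label can never grow without bound.
pub fn bounded_error_type(raw: &str) -> &'static str {
    let lower = raw.to_ascii_lowercase();
    if lower.contains("timeout") || lower.contains("timed out") {
        labels_sketch::ERROR_TIMEOUT
    } else if lower.contains("connection") {
        labels_sketch::ERROR_CONNECTION
    } else {
        labels_sketch::ERROR_OTHER
    }
}
```

Routing every label value through a function that returns `&'static str` makes it hard to leak request-specific strings (user IDs, full error messages) into a label by accident.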

Migration Required

Users must update their dashboards and alerts to use the new smg_* metrics.
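One way to find what needs updating is to scan existing dashboard and alert expressions for the removed `sgl_router_` prefix. The helper below is a rough sketch, not part of this PR: the function name `legacy_metric_names` is hypothetical, and it tokenizes on characters that cannot appear in a Prometheus metric name rather than parsing PromQL, so a quoted label value that happens to start with the prefix would also be reported.

```rust
/// Returns each distinct token in `query` that starts with the removed
/// `sgl_router_` prefix, in order of first appearance.
/// This is a plain text scan, not a PromQL parser.
pub fn legacy_metric_names(query: &str) -> Vec<String> {
    const LEGACY_PREFIX: &str = "sgl_router_";
    let mut found: Vec<String> = Vec::new();

    // Metric names consist of ASCII letters, digits, '_' and ':'.
    let is_name_char = |c: char| c.is_ascii_alphanumeric() || c == '_' || c == ':';

    for token in query.split(|c: char| !is_name_char(c)) {
        if token.starts_with(LEGACY_PREFIX) && !found.iter().any(|f| f == token) {
            found.push(token.to_string());
        }
    }
    found
}
```

Running this over every expression exported from a dashboard gives a list of panels and alerts that will stop returning data after the upgrade.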

Sample PromQL

Request Rate by Router Type

sum(rate(smg_router_requests_total[5m])) by (router_type)

P99 Latency by Model

histogram_quantile(0.99, sum(rate(smg_router_request_duration_seconds_bucket[5m])) by (le, model_id))
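For readers unfamiliar with how `histogram_quantile` arrives at a number, the sketch below shows the core idea: find the cumulative bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation; the function name `estimate_quantile` and the input shape are assumptions, and it assumes buckets are sorted by upper bound with a final `+Inf` bucket.

```rust
/// Simplified quantile estimate from cumulative histogram buckets.
/// `buckets` holds (upper_bound, cumulative_count) pairs sorted by
/// upper bound; the last entry is expected to be the +Inf bucket.
pub fn estimate_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;

    // The first bucket is assumed to start at 0.
    let mut lower_bound = 0.0;
    let mut lower_count = 0.0;
    for &(upper_bound, count) in buckets {
        if count >= rank {
            if upper_bound.is_infinite() {
                // Rank falls in the +Inf bucket: report the highest finite bound.
                return Some(lower_bound);
            }
            let in_bucket = count - lower_count;
            if in_bucket <= 0.0 {
                return Some(upper_bound);
            }
            // Linear interpolation within the bucket.
            let fraction = (rank - lower_count) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = count;
    }
    None
}
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a P99 that lands in a wide bucket is only an estimate within that bucket's range.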

Error Rate by Backend Type

sum(rate(smg_router_request_errors_total[5m])) by (backend_type, error_type)
/ on (backend_type) group_left
sum(rate(smg_router_requests_total[5m])) by (backend_type)

Worker Pool Health

sum(smg_worker_pool_size{worker_type="regular"})
-
sum(rate(smg_worker_errors_total{worker_type="regular"}[5m]))

Pipeline Stage Breakdown

sum(rate(smg_router_stage_duration_seconds_sum[5m])) by (stage)
/
sum(rate(smg_router_stage_duration_seconds_count[5m])) by (stage)
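Dividing the rate of a histogram's `_sum` by the rate of its `_count` yields the mean observation over the window, because the time interval cancels out. A small sketch of that arithmetic follows; the `HistogramSample` type and `mean_over_window` function are hypothetical names, and counter resets are ignored for simplicity.

```rust
/// One scrape of a histogram's `_sum` and `_count` series.
#[derive(Clone, Copy, Debug)]
pub struct HistogramSample {
    pub sum: f64,
    pub count: f64,
}

/// Mean observed value between two scrapes:
/// (delta_sum / dt) / (delta_count / dt) == delta_sum / delta_count.
/// Returns None when no observations were recorded in the window.
pub fn mean_over_window(start: HistogramSample, end: HistogramSample) -> Option<f64> {
    let delta_count = end.count - start.count;
    if delta_count <= 0.0 {
        return None;
    }
    Some((end.sum - start.sum) / delta_count)
}
```

For example, if a stage accumulated 6 seconds across 30 new requests in the window, its mean duration is 0.2 seconds per request.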

P50 TTFT by Model (gRPC)

histogram_quantile(0.50, sum(rate(smg_router_ttft_seconds_bucket{router_type="grpc"}[5m])) by (le, model_id))

Average TPOT by Model (gRPC)

sum(rate(smg_router_tpot_seconds_sum{router_type="grpc"}[5m])) by (model_id)
/
sum(rate(smg_router_tpot_seconds_count{router_type="grpc"}[5m])) by (model_id)

Token Throughput (tokens/sec)

sum(rate(smg_router_tokens_total{token_type="output"}[5m])) by (model_id)

MCP Tool Call Success Rate

sum(rate(smg_mcp_tool_calls_total{result="success"}[5m])) by (tool_name)
/
sum(rate(smg_mcp_tool_calls_total[5m])) by (tool_name)

Checklist

BREAKING CHANGE: All sgl_router_* metrics have been removed.

This commit completes Phase 3 of the metrics migration plan, removing the
legacy metrics system in favor of the new layered SmgMetrics architecture.

## What Changed

Removed the entire RouterMetrics struct (~700 lines) and all sgl_router_*
prefixed metrics from 16 files across the codebase. The new smg_* metrics
with the layered architecture remain as the sole metrics system.

## Why

The legacy sgl_router_* metrics had several issues:
- Inconsistent labeling across different metric types
- No clear separation of concerns between layers
- High cardinality labels in some metrics
- Difficult to correlate metrics across the request lifecycle

The new SmgMetrics architecture provides:
- 6 clear layers: HTTP → Router → Worker → Discovery → MCP → Database
- Consistent labeling with predefined constants (smg_labels module)
- Low cardinality where possible with bounded label values
- Better observability for gRPC streaming (TTFT, TPOT, token counts)
- Support for Responses API (MCP tool calls, database operations)

## Migration Required

Users must update their dashboards and alerts to use the new smg_* metrics.
See .claude/metrics-architecture.md for the full metrics reference and
example PromQL queries.

Old metrics (removed):
  sgl_router_requests_total, sgl_router_request_duration_seconds,
  sgl_router_worker_health, sgl_router_processed_requests_total, etc.

New metrics (use these):
  smg_router_requests_total, smg_router_request_duration_seconds,
  smg_worker_pool_size, smg_worker_requests_active, etc.

Simplify the metrics API naming for better ergonomics:
- SmgMetrics → Metrics
- smg_labels → metrics_labels

The Smg/smg_ prefix on the struct and module names was redundant since the module path
already provides context (e.g., metrics::Metrics). The exported Prometheus
metric names retain their smg_* prefix for consistency in dashboards.

No functional changes - this is purely a rename refactor.
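The effect of the rename can be pictured with a minimal sketch: the Rust type loses its prefix because the module path already qualifies it, while the exported metric name is untouched. The module layout, the constant, and the `requests_counter_name` method below are hypothetical placeholders, not the actual model-gateway code; only the names `Metrics`, `metrics::Metrics`, and `smg_router_requests_total` come from this PR.

```rust
pub mod metrics {
    /// Exported Prometheus name; keeps its `smg_` prefix after the rename.
    pub const ROUTER_REQUESTS_TOTAL: &str = "smg_router_requests_total";

    /// Previously `SmgMetrics`; callers now refer to `metrics::Metrics`.
    /// Placeholder struct for illustration only.
    pub struct Metrics;

    impl Metrics {
        /// Hypothetical accessor showing that the exported name is unchanged.
        pub fn requests_counter_name() -> &'static str {
            ROUTER_REQUESTS_TOTAL
        }
    }
}
```

Dashboards built on `smg_*` names therefore need no changes for this second commit; only Rust call sites are affected.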
@slin1237 slin1237 linked an issue Dec 15, 2025 that may be closed by this pull request
@slin1237 slin1237 merged commit 3518b33 into main Dec 15, 2025
69 checks passed
@slin1237 slin1237 deleted the metric-n/12 branch December 15, 2025 12:57
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 15, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Development

Successfully merging this pull request may close these issues.

SMG Metrics Design