
[Observability] Add pending token count to prefill log and get_load #22480

Merged

ch-wan merged 3 commits into main from chengwan/add-pending-token-metric on Apr 10, 2026

[Observability] Add pending token count to prefill log and get_load#22480
ch-wan merged 3 commits intomainfrom
chengwan/add-pending-token-metric

Conversation


@ch-wan ch-wan commented Apr 10, 2026

Motivation

For prefill requests, the engine log only shows #queue-req and #running-req, which is insufficient for understanding load — especially for long-context requests where a single request can have hundreds of thousands of tokens pending prefill across multiple chunks.

This PR adds a #pending-token metric that shows the total number of tokens still waiting to be prefilled. This is useful for:

  • Monitoring chunked prefill progress in engine logs
  • Balancing load across engines when handling long-context requests via /get_load
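As a sketch of the second use case, a router could compare the `num_pending_tokens` field reported by each engine's `/get_load` response and send new long-context requests to the least loaded engine. Only the field name comes from this PR; the engine URLs, response shapes, and `pick_least_loaded` helper below are illustrative assumptions:

```python
# Illustrative load-balancing sketch. Only the `num_pending_tokens` field
# is from this PR; engine URLs and the surrounding structure are made up.

def pick_least_loaded(loads: dict) -> str:
    """loads maps engine URL -> parsed /get_load response dict."""
    return min(loads, key=lambda url: loads[url].get("num_pending_tokens", 0))

loads = {
    "http://engine-a:30000": {"num_pending_tokens": 21825},
    "http://engine-b:30000": {"num_pending_tokens": 5441},
}
print(pick_least_loaded(loads))  # http://engine-b:30000
```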

Modifications

  • PrefillStats (scheduler_metrics_mixin.py): Added num_pending_tokens field, snapshotted at batch-scheduling time for correct reporting under the overlap scheduler.
  • _get_num_pending_tokens() (scheduler_metrics_mixin.py): New shared helper that computes pending tokens from the waiting queue plus remaining tokens of the currently chunked request. Accepts a chunk_deduct parameter to handle the timing difference between scheduling time (where prefix_indices has not yet been updated) and query time (get_load, where it has).
  • Prefill batch log (scheduler_metrics_mixin.py): Added #pending-token: {N} to the log line.
  • GetLoadReqOutput (io_struct.py): Added num_pending_tokens field returned by /get_load.
  • get_load() (scheduler_metrics_mixin.py): Populates num_pending_tokens using the shared helper.
  • get_new_batch_prefill() (scheduler.py): Snapshots num_pending_tokens into PrefillStats at scheduling time.
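A minimal sketch of how such a shared helper might compute the count, assuming scheduler-style request objects with `origin_input_ids` and `prefix_indices` attributes. These names mirror common SGLang scheduler conventions but are assumptions, not the merged code:

```python
from dataclasses import dataclass, field

@dataclass
class Req:
    origin_input_ids: list  # full prompt token ids
    prefix_indices: list = field(default_factory=list)  # tokens already prefilled/cached

def get_num_pending_tokens(waiting_queue, chunked_req=None, chunk_deduct=0):
    # Tokens of requests still sitting in the waiting queue.
    pending = sum(
        len(r.origin_input_ids) - len(r.prefix_indices) for r in waiting_queue
    )
    # Remaining tokens of the request currently being chunk-prefilled.
    if chunked_req is not None:
        remaining = len(chunked_req.origin_input_ids) - len(chunked_req.prefix_indices)
        # At scheduling time prefix_indices has not yet advanced past the chunk
        # just scheduled, so the caller passes that chunk's size as chunk_deduct;
        # at /get_load query time, prefix_indices is up to date and it passes 0.
        pending += max(remaining - chunk_deduct, 0)
    return pending
```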

Example log output with a 30K-token chunked prefill (chunk size 8192):

Prefill batch, #new-seq: 1, #new-token: 8192, ..., #pending-token: 21825
Prefill batch, #new-seq: 1, #new-token: 8192, ..., #pending-token: 13633
Prefill batch, #new-seq: 1, #new-token: 8192, ..., #pending-token: 5441
Prefill batch, #new-seq: 1, #new-token: 5441, ..., #pending-token: 0
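The arithmetic behind these lines: each pending count is the prompt length minus the tokens scheduled so far (a 30017-token prompt here, inferred from the log). The sequence can be reproduced with a quick check:

```python
def chunk_schedule(total_tokens, chunk_size):
    """Return (new_token, pending_token) pairs for a chunked prefill."""
    out, remaining = [], total_tokens
    while remaining > 0:
        new = min(chunk_size, remaining)
        remaining -= new
        out.append((new, remaining))
    return out

for new, pending in chunk_schedule(30017, 8192):
    print(f"#new-token: {new}, #pending-token: {pending}")
# #new-token: 8192, #pending-token: 21825
# #new-token: 8192, #pending-token: 13633
# #new-token: 8192, #pending-token: 5441
# #new-token: 5441, #pending-token: 0
```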

Accuracy Tests

N/A — observability-only change, no model output affected.

Speed Tests and Profiling

N/A — adds a lightweight sum over the waiting queue during logging and load queries.

Checklist

Add a `#pending-token` metric to the prefill batch log and a
`num_pending_tokens` field to the `GetLoadReqOutput` returned by
`/get_load`. This shows the total number of tokens still waiting to be
prefilled, including remaining tokens from the currently chunked
request. This is particularly useful for load balancing long-context
requests across engines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


ch-wan commented Apr 10, 2026

/tag-and-rerun-ci

ch-wan and others added 2 commits April 9, 2026 20:00
Removed outdated comments regarding prefix_indices and chunk_deduct.
@ch-wan ch-wan merged commit 37107be into main Apr 10, 2026
173 of 213 checks passed
@ch-wan ch-wan deleted the chengwan/add-pending-token-metric branch April 10, 2026 09:05
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
…22480)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
