[Feature] Add monotonically increasing step counter to vLLM scheduler#1

Merged
sriumcp merged 3 commits into main from stepcounter on Jan 23, 2026
Conversation


@sriumcp sriumcp commented Jan 23, 2026

Summary

Adds a monotonically increasing scheduler step counter to track scheduler invocations. This counter is included in SchedulerOutput and serves as a building block for future trace streams (step stream, KV cache transfer stream) and request tracing correlation.

Motivation

The scheduler currently lacks a way to uniquely identify and correlate scheduling iterations across the system. This counter provides:

  • Trace stream infrastructure: Foundation for step-level tracing and debugging
  • Request correlation: Ability to track requests across scheduling iterations
  • KV cache tracing: Correlation of KV cache operations with specific scheduler steps
  • Performance analysis: Temporal markers for profiling and optimization

Implementation

Core Changes

  1. Scheduler class (vllm/v1/core/sched/scheduler.py)

    • Added scheduler_step_counter: int = 0 instance variable
    • Increments at the start of every schedule() call
    • First call produces step=1; subsequent calls increment monotonically
  2. SchedulerOutput dataclass (vllm/v1/core/sched/output.py)

    • Added scheduler_step: int = 0 field with default value
    • Placed at end of dataclass to avoid field ordering issues
    • Default value ensures backward compatibility
  3. Unit test (tests/v1/core/test_scheduler.py)

    • Added test_scheduler_step_counter() verifying:
      • First schedule() produces step=1
      • Subsequent calls increment (2, 3, 4...)
      • Empty schedules still increment counter
      • Counter continues after reset_prefix_cache()
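The core changes above can be condensed into a minimal sketch (class bodies reduced to the counter logic only; the real classes in vllm/v1/core/sched/ carry many more fields):

```python
# Condensed sketch of the counter behavior described above; not the
# actual vLLM classes.
from dataclasses import dataclass

@dataclass
class SchedulerOutput:
    # New field placed last, with a default, so existing construction
    # sites remain valid.
    scheduler_step: int = 0

class Scheduler:
    def __init__(self) -> None:
        # Starts at 0; the first schedule() call produces step 1.
        self.scheduler_step_counter: int = 0

    def schedule(self) -> SchedulerOutput:
        # Increment at the very start, so even empty schedules and
        # early returns advance the counter.
        self.scheduler_step_counter += 1
        return SchedulerOutput(scheduler_step=self.scheduler_step_counter)

sched = Scheduler()
steps = [sched.schedule().scheduler_step for _ in range(3)]
print(steps)  # [1, 2, 3]
```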

Design Decisions

  • Truly monotonic: Never resets throughout scheduler lifetime
  • Always increments: Even on empty schedules and early returns
  • First step = 1: Initialized to 0, incremented at start
  • Clear naming: scheduler_step (not step) to distinguish from decode steps
  • Backward compatible: Default value prevents breaking existing code
  • AsyncScheduler compatible: Inherits correctly via super().schedule()

Testing

All tests pass with no regressions:

  • ✅ New test: test_scheduler_step_counter - PASSED
  • ✅ All scheduler tests: 81/81 PASSED
  • ✅ Async scheduler tests: 8/8 PASSED
  • ✅ Prefix caching tests: 46/46 PASSED
  • ✅ Output module tests: 2/2 PASSED
  • ✅ Attention tests: 12/12 PASSED
  • Total: 149 tests PASSED

Backward Compatibility Verified

  • SchedulerOutput.make_empty() works without modification
  • Manual construction without scheduler_step uses default value
  • Existing test code continues to work
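The compatibility argument can be illustrated with a toy dataclass (the field name other than scheduler_step is an illustrative stand-in, not the real SchedulerOutput schema):

```python
from dataclasses import dataclass, field

@dataclass
class SchedulerOutput:
    # Pre-existing required field (illustrative stand-in).
    scheduled_new_reqs: list = field(default_factory=list)
    # New trailing field: callers that predate it never notice it.
    scheduler_step: int = 0

# Old-style construction without scheduler_step still works and
# silently gets the default value:
legacy = SchedulerOutput(scheduled_new_reqs=[])
print(legacy.scheduler_step)  # 0
```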

Use Cases

This counter enables:

  • Step stream: Track all scheduler operations per step
  • KV cache transfer stream: Correlate KV operations with scheduler steps
  • Request tracing: Follow requests through scheduling iterations
  • Distributed tracing: Correlate events across workers using step numbers
  • Performance debugging: Identify scheduling bottlenecks by step

Example Usage

# In engine/worker code
output = scheduler.schedule()
print(f"Scheduler step: {output.scheduler_step}")
# First call prints 1; later calls print 2, 3, ... (monotonically increasing)

# Even with no requests (idle periods)
output = scheduler.schedule()  # Empty schedule
print(f"Scheduler step: {output.scheduler_step}")  # Still increments

Notes

  • Counter is per-scheduler instance (each EngineCore has independent counter)
  • Python ints don't overflow, so the counter is safe for long-running services
  • Counter tracks scheduler invocations, not token generation steps
  • May advance during idle periods when engine ticks scheduler

sriumcp and others added 3 commits January 23, 2026 11:40
Adds scheduler_step counter to track scheduler invocations for trace
streams and request tracing. Counter increments with each schedule()
call, never resets, and is included in SchedulerOutput with
backward-compatible default value.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds comprehensive repository guide to help AI assistants work
effectively with the vLLM codebase. Includes structure overview,
conventions, testing patterns, and common tasks.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Renames repository guide to CLAUDE.md (consistent with README.md,
CONTRIBUTING.md) and removes it from .gitignore to ensure it's
tracked in the repository for future use.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@sriumcp sriumcp left a comment


lgtm

@sriumcp sriumcp merged commit 2566135 into main Jan 23, 2026
sriumcp added a commit that referenced this pull request Jan 26, 2026
/9) (#8)

* [Docs] Update journey tracing plan to reflect completed PR #0

Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Initialize OTEL tracer in scheduler for journey tracing

Add tracer initialization in Scheduler.__init__() to support dual-stream
journey tracing architecture. This is the foundation for PR #2 which will
create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
Extends the centralized cleanup method to handle journey tracing state
alongside core span cleanup. Fixes memory leak on natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
  - Cleanup #1: Core spans (always runs, independent of flags)
  - Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
* [Feature] Add journey state cleanup to scheduler (PR #3/9)

Extends the centralized cleanup method to handle journey tracing state
alongside core span cleanup. Fixes memory leak on natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
  - Cleanup #1: Core spans (always runs, independent of flags)
  - Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #3 as completed in journey tracing plan

Updates:
- Mark PR #3 as COMPLETED in PR sequence summary
- Update PR dependencies to show PR #3 complete
- Add PR #3 to Implementation History section with full details
- Document commit hash (f4cf790) and PR number (vllm-project#33126)
- Record test results, code review process, and key achievements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 28, 2026
Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 28, 2026
…it (#18)

* [Docs] Fix journey tracing documentation inconsistencies

Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Tests] Remove obsolete journey buffering tests and add regression audit

Remove two failing tests that reference the legacy journey event buffering
system removed in PR #9 (commit 1d9b9f3):

- test_no_events_when_span_none: Referenced _journey_events_buffer_by_client
- test_legacy_buffering_still_works: Tested parallel buffering (no longer exists)

These tests validated the legacy buffering pathway that was intentionally
removed. Comprehensive coverage of the new span-based tracing exists in
tests/v1/core/test_pr9_no_buffering.py (16 tests, 337 lines).

Add REGRESSION_AUDIT_REPORT.md documenting comprehensive regression analysis
from v0.0.1 to HEAD:
- 42 files changed analyzed (10,824 insertions, 1,074 deletions)
- All production code paths verified safe
- Zero regressions to existing functionality
- Proper backward compatibility maintained
- OTEL imports optional and safe
- Metrics work independently of tracing

Test Results: 99 passed (all non-journey scheduler tests)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860
sriumcp added a commit that referenced this pull request Jan 29, 2026
…ty (PR #5) (#27)

* [Feature] Add rich request snapshot stream (PR #5)

Implements subsampled per-request detailed progress events with KV metrics:

- Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
- Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
- Use PR #4 get_per_request_kv_metrics() for KV cache data
- Two-stage sampling: batch summary sampled AND rich subsampled
- SpanAttributes: 10 new constants for per-request metrics
- Emission after batch summary, before _update_after_schedule()

Also fixes PR #3 CLI wiring bug:
- Wire step_tracing_enabled/sample_rate through EngineArgs
- Add fields to EngineArgs dataclass
- Pass to ObservabilityConfig constructor
- Add test_step_tracing_cli_wiring() for regression prevention

Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update step tracing plan with implementation progress

Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Removes all float equality comparisons (e.g., assert ts.monotonic == value)
from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues as specified in
the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
* [Feature] Add journey tracing probabilistic sampling

Implements PR #2: Journey Tracing API-Side Sampling in vLLM.

Changes:
- Add journey_tracing_sample_rate config (default 1.0, backward compatible)
- API layer makes probabilistic sampling decision per request
- Custom header x-vllm-journey-sampled propagates decision to engine
- Engine obeys API decision (authority model)
- End-to-end atomic: both API+engine spans exist or neither
- Independent of OTEL traceparent sampled bit
- Centralized header injection helper across all endpoints
- Robustness fix: normalize to mutable dict (handles immutable Mapping)

Tests:
- 10 new tests verify atomicity and backward compatibility
- All existing tests pass (backward compatible)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
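The sampling flow above can be sketched in a few lines (only the header name comes from the commit message; the helper names here are hypothetical):

```python
# Illustrative sketch of the API-side sampling decision and its
# propagation via the x-vllm-journey-sampled header.
import random

SAMPLED_HEADER = "x-vllm-journey-sampled"

def decide_sampling(sample_rate: float) -> bool:
    # The API layer makes one probabilistic decision per request;
    # rate=1.0 samples everything (the backward-compatible default).
    return random.random() < sample_rate

def inject_sampling_header(headers, sampled: bool) -> dict:
    # Normalize to a mutable dict so immutable Mapping inputs
    # (the robustness fix mentioned above) are handled uniformly.
    headers = dict(headers or {})
    headers[SAMPLED_HEADER] = "1" if sampled else "0"
    return headers

def engine_obeys(headers) -> bool:
    # The engine never re-samples; it obeys the API's decision, which
    # keeps API and core spans atomic (both exist or neither does).
    return headers.get(SAMPLED_HEADER) == "1"

hdrs = inject_sampling_header({}, decide_sampling(1.0))
print(engine_obeys(hdrs))  # True
```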

* [Docs] Update JOURNEY_TRACING.md for sampling feature

Update user-facing documentation to reflect PR #2 implementation.

Changes:
- Add comprehensive "Sampling for Production" section with 3 strategies
- Document new --journey-tracing-sample-rate flag (default 1.0)
- Explain vLLM native sampling vs OTEL sampling vs collector sampling
- Add comparison table for choosing the right sampling strategy
- Update configuration examples with sampling use cases
- Add Technical Details section on sampling architecture
- Add FAQ entries: vLLM vs OTEL sampling, atomicity guarantees
- Update Performance Impact section with sampling overhead details
- Update troubleshooting section with vLLM sampling solutions
- Add early mention of sampling capability in introduction

Key messages for users:
- Default behavior unchanged (sample_rate=1.0, backward compatible)
- vLLM native sampling reduces all overhead (recommended for production)
- End-to-end atomic: either both spans exist or neither (no partial traces)
- Independent from OTEL traceparent sampled bit
- Recommended rates: 10% for 1K-10K RPS, 1% for >10K RPS

* [Docs] Fix JOURNEY_TRACING.md accuracy issues and contradictions

Critical fixes:
- Fix service name vs tracer scope confusion in Jaeger navigation
  (service.name is what users select, scope.name is span attribute)
- Correct AsyncLLM span creation claims (was: "creates only core span",
  now: "creates no spans by default, core-only if manual header set")
- Eliminate contradiction: early doc claimed AsyncLLM creates spans,
  later sections correctly said no spans without manual header
- Qualify "every request creates two spans" to "when using vllm serve"
- Qualify sampling sections to explicitly state vllm serve requirement

Accuracy improvements:
- Soften overhead numbers: "~200-300ns" → "sub-microsecond" (less brittle)
- Qualify authority model as "OpenAI API Server" (not generic "API layer")
- Add comprehensive AsyncLLM FAQ with working code examples
- Add deployment modes section distinguishing vllm serve vs AsyncLLM

Impact: Prevents user confusion about AsyncLLM behavior (expecting
automatic tracing → getting zero traces → filing bugs). Documentation
now accurately reflects codebase reality verified in scheduler.py and
test_journey_tracing_sampling.py.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Bugfix] Add missing API span finalization for non-streaming completions

Non-streaming completion requests (/v1/completions with stream=false) were
missing all _finalize_api_span() calls, causing llm_request spans to never
export to OTLP collectors. This resulted in incomplete traces with only
llm_core (engine layer) spans visible, while llm_request (API layer) spans
remained orphaned in memory.

Root cause: The non-streaming code path (lines 319-368) had no finalization
on success, error paths, or fake stream generator (beam search with stream=true).

Added comprehensive span finalization matching the pattern used in streaming
completions and chat completions:
- Error paths: Finalize with ABORTED for CancelledError, GenerationError, ValueError
- Fake stream generator: Added try-finally with DEPARTED before [DONE]
- Success path: Finalize with DEPARTED before returning response
- Outer finally block: Unconditional cleanup for any uncaught exceptions

Impact:
- Fixes: Non-streaming /v1/completions now exports complete API-layer traces
- Preserves: Streaming completions continue to work (no changes to that path)
- Matches: Behavior now consistent with /v1/chat/completions endpoint

Testing:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Test", "max_tokens": 20}'

Expected result: Both llm_request (scope: vllm.api) and llm_core
(scope: vllm.scheduler) spans now appear in OTLP traces with proper
parent-child relationship.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Add nanosecond-precision timestamps to journey events

Adds ts_monotonic_ns field to RequestJourneyEvent for improved timestamp
precision. Uses single clock read with exact consistency (derive float from
int) to ensure both ts_monotonic and ts_monotonic_ns represent identical
instant. Fully backward compatible with default value of 0 for legacy code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
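The single-clock-read pattern from this commit can be sketched as follows (field names taken from the commit message; the constructor helper is hypothetical):

```python
# Read monotonic_ns once and derive the float from the int, so both
# fields describe the exact same instant.
import time
from dataclasses import dataclass

@dataclass
class RequestJourneyEvent:
    ts_monotonic: float = 0.0
    ts_monotonic_ns: int = 0  # default 0 keeps legacy callers working

def make_event() -> RequestJourneyEvent:
    ns = time.monotonic_ns()  # single clock read
    return RequestJourneyEvent(ts_monotonic=ns / 1e9, ts_monotonic_ns=ns)

ev = make_event()
# Exact consistency holds by construction, not by luck:
assert ev.ts_monotonic == ev.ts_monotonic_ns / 1e9
```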

* [Misc] Remove completed STEP_TRACING_PR_PLAN.md

Step tracing work is complete. Removing planning document.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Test] Remove float equality assertions from journey timestamp tests

Removes all float equality comparisons (e.g., assert ts.monotonic == value)
from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues as specified in
the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
