[Feature] Add monotonically increasing step counter to vLLM scheduler #1
Merged
Conversation
Adds scheduler_step counter to track scheduler invocations for trace streams and request tracing. Counter increments with each schedule() call, never resets, and is included in SchedulerOutput with backward-compatible default value. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
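The counter behavior this commit describes (increments on each schedule() call, never resets, backward-compatible default in SchedulerOutput) can be sketched roughly as below. This is a simplified stand-in, not the actual vLLM code; the real classes live in vllm/v1/core/sched/, and the choice of 1 as the first step value is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class SchedulerOutput:
    # Default value keeps existing construction sites (e.g. make_empty())
    # working without modification.
    scheduler_step: int = 0


class Scheduler:
    def __init__(self) -> None:
        self.scheduler_step_counter: int = 0

    def schedule(self) -> SchedulerOutput:
        # Increment on every invocation; the counter never resets for the
        # lifetime of the scheduler, so step numbers are strictly increasing.
        self.scheduler_step_counter += 1
        return SchedulerOutput(scheduler_step=self.scheduler_step_counter)
```

Because the field carries a default, code that constructs SchedulerOutput without the new argument continues to work unchanged.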
Adds comprehensive repository guide to help AI assistants work effectively with the vLLM codebase. Includes structure overview, conventions, testing patterns, and common tasks. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Renames repository guide to CLAUDE.md (consistent with README.md, CONTRIBUTING.md) and removes it from .gitignore to ensure it's tracked in the repository for future use. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This was referenced Jan 23, 2026
sriumcp added a commit that referenced this pull request on Jan 26, 2026
/9) (#8)

* [Docs] Update journey tracing plan to reflect completed PR #0

  Update plan document to account for completed work:
  - Document PR #0 (EngineCoreEvent removal) as completed prerequisite
  - Clarify that do_tracing() is the current OTEL mechanism (not legacy)
  - Update PR #9 to keep the RequestJourneyEvent dataclass (needed for Prometheus)
  - Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
  - Add PR #0 to the dependencies, timeline, and progress tracking sections

  Key corrections:
  - do_tracing() will NOT be removed (it is the current system)
  - The RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
  - Only the buffering LOGIC will be removed in PR #9

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Initialize OTEL tracer in scheduler for journey tracing

  Add tracer initialization in Scheduler.__init__() to support the dual-stream journey tracing architecture. This is the foundation for PR #2, which will create and manage core spans.

  Changes:
  - Add defensive SpanAttributes import with None fallback
  - Initialize the tracer when enable_journey_tracing=True and an endpoint is configured
  - Add try/except with a warning log for graceful degradation
  - Add otlp_traces_endpoint parameter to test utilities
  - Add 4 comprehensive tests with proper mocking

  Safety guarantees:
  - Zero per-request state (tracer is class-level only)
  - Zero overhead when disabled (boolean + endpoint guard)
  - No spans created (initialization only)
  - No cleanup needed (shared tracer instance)
  - Backward compatible (all parameters optional)

  Test results: All 85 tests passing (81 existing + 4 new)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 27, 2026
Extends the centralized cleanup method to handle journey tracing state alongside core span cleanup. Fixes a memory leak on the natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
- Cleanup #1: Core spans (always runs, independent of flags)
- Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 27, 2026
* [Feature] Add journey state cleanup to scheduler (PR #3/9)

  Extends the centralized cleanup method to handle journey tracing state alongside core span cleanup. Fixes a memory leak on the natural completion path.

  Changes:
  - Extend _end_core_span_and_cleanup() with decoupled cleanup logic
  - Cleanup #1: Core spans (always runs, independent of flags)
  - Cleanup #2: Journey state (only if journey tracing enabled)
  - Remove duplicate inline cleanup from finish_requests()
  - Add 4 tests verifying state cleanup on all termination paths

  Tests:
  - test_journey_state_created: Verify state initialization
  - test_journey_state_cleaned_on_finish: Explicit abort cleanup
  - test_journey_state_cleaned_on_completion: Natural completion cleanup
  - test_no_state_leak: No accumulation over 20 iterations

  All 95 tests passing (4 new + 91 existing).

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #3 as completed in journey tracing plan

  Updates:
  - Mark PR #3 as COMPLETED in the PR sequence summary
  - Update PR dependencies to show PR #3 complete
  - Add PR #3 to the Implementation History section with full details
  - Document the commit hash (f4cf790) and PR number (vllm-project#33126)
  - Record test results, the code review process, and key achievements

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
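The decoupled two-stage cleanup the commit above describes can be sketched as follows. The class shape, state containers, and method signature here are assumptions based on the commit message, not the actual vLLM implementation.

```python
class Scheduler:
    """Simplified stand-in for the vLLM scheduler (sketch only)."""

    def __init__(self, enable_journey_tracing: bool = False) -> None:
        self.enable_journey_tracing = enable_journey_tracing
        self._core_spans: dict[str, object] = {}    # request_id -> active span
        self._journey_state: dict[str, dict] = {}   # request_id -> journey data

    def _end_core_span_and_cleanup(self, request_id: str) -> None:
        # Cleanup #1: core spans always run, independent of any feature flag.
        span = self._core_spans.pop(request_id, None)
        if span is not None and hasattr(span, "end"):
            span.end()
        # Cleanup #2: journey state is released only when journey tracing is
        # enabled. Calling this on every termination path (natural completion
        # and explicit abort alike) is what closes the memory leak.
        if self.enable_journey_tracing:
            self._journey_state.pop(request_id, None)
```

Routing both completion and abort through this single method is what makes the leak fix robust: no path can forget the journey-state half of the cleanup.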
This was referenced Jan 27, 2026
sriumcp added a commit that referenced this pull request on Jan 28, 2026
Address review feedback on journey tracing documentation:
- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix the PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 28, 2026
…it (#18)

* [Docs] Fix journey tracing documentation inconsistencies

  Address review feedback on journey tracing documentation:
  - Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
  - Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
  - Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
  - Fix the PR #6 streaming snippet to show finalize before yield [DONE]
  - Label overhead numbers as ballpark estimates
  - Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
  - Explain trace context propagation (HTTP headers vs internal dict)
  - Document error flow edge cases (truncated core events on early abort)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Tests] Remove obsolete journey buffering tests and add regression audit

  Remove two failing tests that reference the legacy journey event buffering system removed in PR #9 (commit 1d9b9f3):
  - test_no_events_when_span_none: Referenced _journey_events_buffer_by_client
  - test_legacy_buffering_still_works: Tested parallel buffering (no longer exists)

  These tests validated the legacy buffering pathway that was intentionally removed. Comprehensive coverage of the new span-based tracing exists in tests/v1/core/test_pr9_no_buffering.py (16 tests, 337 lines).

  Add REGRESSION_AUDIT_REPORT.md documenting a comprehensive regression analysis from v0.0.1 to HEAD:
  - 42 changed files analyzed (10,824 insertions, 1,074 deletions)
  - All production code paths verified safe
  - Zero regressions to existing functionality
  - Proper backward compatibility maintained
  - OTEL imports optional and safe
  - Metrics work independently of tracing

  Test Results: 99 passed (all non-journey scheduler tests)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 29, 2026
Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about the wiring gap
- Update PR #5 behavioral contract to document the CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to the PR #5 checklist

Current state: PR #5 ready for merge at commit f951860
sriumcp added a commit that referenced this pull request on Jan 29, 2026
…ty (PR #5) (#27)

* [Feature] Add rich request snapshot stream (PR #5)

  Implements subsampled per-request detailed progress events with KV metrics:
  - Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
  - Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
  - Use PR #4 get_per_request_kv_metrics() for KV cache data
  - Two-stage sampling: batch summary sampled AND rich subsampled
  - SpanAttributes: 10 new constants for per-request metrics
  - Emission after batch summary, before _update_after_schedule()

  Also fixes the PR #3 CLI wiring bug:
  - Wire step_tracing_enabled/sample_rate through EngineArgs
  - Add fields to the EngineArgs dataclass
  - Pass to the ObservabilityConfig constructor
  - Add test_step_tracing_cli_wiring() for regression prevention

  Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update step tracing plan with implementation progress

  Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

  Progress tracking:
  - Add Implementation Progress section with status table
  - Mark PR #3, #4, #5 as complete with commit hashes
  - Mark PR #1, #2 as deferred (low priority, orthogonal)
  - Update dependency graph with status indicators

  Historical corrections:
  - PR #3: CLI args defined but wiring missing (fixed in PR #5)
  - PR #5: Added CLI wiring fix for all 3 step tracing flags
  - Add NOTE in PR #3 section about the wiring gap
  - Update PR #5 behavioral contract to document the CLI fix

  Technical corrections:
  - Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
  - Update test file references: test_scheduler.py → test_step_tracing.py
  - Change test count "15/15" → "test suite passing" (future-proof)

  Verification updates:
  - Mark all PR #3, #4, #5 checklist items as complete
  - Add CLI wiring regression test item to the PR #5 checklist

  Current state: PR #5 ready for merge at commit f951860

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
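The two-stage sampling described above (batch summary sampled AND rich subsampled) can be sketched as a single predicate. The function name is hypothetical; only the 0.001 default rate comes from the commit message.

```python
import random


def should_emit_rich_snapshot(batch_sampled: bool,
                              rich_subsample_rate: float = 0.001) -> bool:
    # Stage 1: the batch summary for this step must itself have been sampled.
    # Stage 2: the request must also pass the much rarer rich subsample draw
    # (default 0.001, i.e. roughly 0.1% of requests in sampled batches).
    return batch_sampled and random.random() < rich_subsample_rate
```

Multiplying the two stages keeps the expensive per-request KV snapshots rare even when the batch summary stream is sampled aggressively.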
sriumcp added a commit that referenced this pull request on Jan 29, 2026
Removes all float equality comparisons (e.g., assert ts.monotonic == value) from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues, as specified in the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
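The "single clock read, derive the float from the int" pattern behind these assertions can be illustrated as below. This is a sketch of the idea only; the helper name is hypothetical and the real field layout lives in RequestJourneyEvent.

```python
import time


def capture_timestamps() -> tuple[float, int]:
    # Single clock read: the float is derived from the integer, so both
    # fields describe the identical instant ("exact consistency").
    ts_monotonic_ns = time.monotonic_ns()
    return ts_monotonic_ns / 1e9, ts_monotonic_ns


ts_monotonic, ts_monotonic_ns = capture_timestamps()

# Robust test style: assert presence, types, and the integer round-trip,
# never float equality against an independently computed value.
assert isinstance(ts_monotonic, float)
assert isinstance(ts_monotonic_ns, int)
assert ts_monotonic == ts_monotonic_ns / 1e9  # exact: derived, not re-read
```

The final equality is safe precisely because the float is recomputed from the same integer; comparing against a second clock read would be brittle.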
sriumcp added a commit that referenced this pull request on Jan 29, 2026
* [Feature] Add journey tracing probabilistic sampling

  Implements PR #2: Journey Tracing API-Side Sampling in vLLM.

  Changes:
  - Add journey_tracing_sample_rate config (default 1.0, backward compatible)
  - API layer makes a probabilistic sampling decision per request
  - Custom header x-vllm-journey-sampled propagates the decision to the engine
  - Engine obeys the API decision (authority model)
  - End-to-end atomic: both API and engine spans exist, or neither
  - Independent of the OTEL traceparent sampled bit
  - Centralized header injection helper across all endpoints
  - Robustness fix: normalize to a mutable dict (handles immutable Mapping)

  Tests:
  - 10 new tests verify atomicity and backward compatibility
  - All existing tests pass (backward compatible)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update JOURNEY_TRACING.md for sampling feature

  Update user-facing documentation to reflect the PR #2 implementation.

  Changes:
  - Add comprehensive "Sampling for Production" section with 3 strategies
  - Document the new --journey-tracing-sample-rate flag (default 1.0)
  - Explain vLLM native sampling vs OTEL sampling vs collector sampling
  - Add a comparison table for choosing the right sampling strategy
  - Update configuration examples with sampling use cases
  - Add a Technical Details section on the sampling architecture
  - Add FAQ entries: vLLM vs OTEL sampling, atomicity guarantees
  - Update the Performance Impact section with sampling overhead details
  - Update the troubleshooting section with vLLM sampling solutions
  - Add an early mention of the sampling capability in the introduction

  Key messages for users:
  - Default behavior unchanged (sample_rate=1.0, backward compatible)
  - vLLM native sampling reduces all overhead (recommended for production)
  - End-to-end atomic: either both spans exist or neither (no partial traces)
  - Independent from the OTEL traceparent sampled bit
  - Recommended rates: 10% for 1K-10K RPS, 1% for >10K RPS

* [Docs] Fix JOURNEY_TRACING.md accuracy issues and contradictions

  Critical fixes:
  - Fix service name vs tracer scope confusion in Jaeger navigation (service.name is what users select; scope.name is a span attribute)
  - Correct AsyncLLM span creation claims (was: "creates only core span"; now: "creates no spans by default, core-only if a manual header is set")
  - Eliminate a contradiction: the early doc claimed AsyncLLM creates spans, while later sections correctly said no spans without a manual header
  - Qualify "every request creates two spans" to "when using vllm serve"
  - Qualify sampling sections to explicitly state the vllm serve requirement

  Accuracy improvements:
  - Soften overhead numbers: "~200-300ns" → "sub-microsecond" (less brittle)
  - Qualify the authority model as "OpenAI API Server" (not a generic "API layer")
  - Add a comprehensive AsyncLLM FAQ with working code examples
  - Add a deployment modes section distinguishing vllm serve vs AsyncLLM

  Impact: Prevents user confusion about AsyncLLM behavior (expecting automatic tracing → getting zero traces → filing bugs). Documentation now accurately reflects codebase reality verified in scheduler.py and test_journey_tracing_sampling.py.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Bugfix] Add missing API span finalization for non-streaming completions

  Non-streaming completion requests (/v1/completions with stream=false) were missing all _finalize_api_span() calls, causing llm_request spans never to export to OTLP collectors. This resulted in incomplete traces with only llm_core (engine layer) spans visible, while llm_request (API layer) spans remained orphaned in memory.

  Root cause: The non-streaming code path (lines 319-368) had no finalization on success, on error paths, or in the fake stream generator (beam search with stream=true).

  Added comprehensive span finalization matching the pattern used in streaming completions and chat completions:
  - Error paths: Finalize with ABORTED for CancelledError, GenerationError, ValueError
  - Fake stream generator: Added try-finally with DEPARTED before [DONE]
  - Success path: Finalize with DEPARTED before returning the response
  - Outer finally block: Unconditional cleanup for any uncaught exceptions

  Impact:
  - Fixes: Non-streaming /v1/completions now exports complete API-layer traces
  - Preserves: Streaming completions continue to work (no changes to that path)
  - Matches: Behavior now consistent with the /v1/chat/completions endpoint

  Testing:
  curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Test", "max_tokens": 20}'

  Expected result: Both llm_request (scope: vllm.api) and llm_core (scope: vllm.scheduler) spans now appear in OTLP traces with a proper parent-child relationship.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Add nanosecond-precision timestamps to journey events

  Adds a ts_monotonic_ns field to RequestJourneyEvent for improved timestamp precision. Uses a single clock read with exact consistency (derive the float from the int) to ensure ts_monotonic and ts_monotonic_ns represent the identical instant. Fully backward compatible, with a default value of 0 for legacy code.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Misc] Remove completed STEP_TRACING_PR_PLAN.md

  Step tracing work is complete. Removing the planning document.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Test] Remove float equality assertions from journey timestamp tests

  Removes all float equality comparisons (e.g., assert ts.monotonic == value) from integration tests. Tests now only verify:
  - Presence of both timestamp fields
  - Type correctness (float/int)
  - Exact consistency via integer round-trip validation

  This ensures robustness against float precision issues, as specified in the PR #1 constraints.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
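The API-side sampling decision and header propagation described in the sampling commit could look roughly like this. The header name and the mutable-dict normalization come from the commit message; the function name and signature are assumptions for illustration.

```python
import random
from collections.abc import Mapping

SAMPLED_HEADER = "x-vllm-journey-sampled"  # header name from the commit message


def inject_sampling_decision(headers: Mapping[str, str],
                             sample_rate: float = 1.0) -> dict[str, str]:
    # Normalize to a mutable dict: incoming headers may be an immutable
    # Mapping, which is the robustness fix the commit mentions.
    out = dict(headers)
    # One probabilistic decision per request, made at the API layer. The
    # engine obeys this header rather than sampling independently, which is
    # what keeps the API and core spans atomic (both exist or neither).
    out[SAMPLED_HEADER] = "1" if random.random() < sample_rate else "0"
    return out
```

With the default sample_rate of 1.0 every request is sampled, preserving the pre-existing behavior.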
Summary
Adds a monotonically increasing scheduler step counter to track scheduler invocations. This counter is included in SchedulerOutput and serves as a building block for future trace streams (step stream, KV cache transfer stream) and request tracing correlation.

Motivation

The scheduler currently lacks a way to uniquely identify and correlate scheduling iterations across the system; this counter provides that identifier.
Implementation
Core Changes
Scheduler class (vllm/v1/core/sched/scheduler.py)
- Adds a scheduler_step_counter: int = 0 instance variable
- Increments the counter on each schedule() call

SchedulerOutput dataclass (vllm/v1/core/sched/output.py)
- Adds a scheduler_step: int = 0 field with a backward-compatible default value

Unit test (tests/v1/core/test_scheduler.py)
- Adds test_scheduler_step_counter(), verifying counter behavior, including across reset_prefix_cache()

Design Decisions

- Named scheduler_step (not step) to distinguish it from decode steps
- The increment lives in schedule(), so subclasses calling super().schedule() inherit it

Testing
All tests pass with no regressions:
- test_scheduler_step_counter - PASSED

Backward Compatibility Verified

- SchedulerOutput.make_empty() works without modification
- scheduler_step uses the default value

Use Cases
This counter enables future trace streams (step stream, KV cache transfer stream) and request tracing correlation keyed by scheduling iteration.
Example Usage
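A minimal consumption sketch: downstream code tags trace records with the scheduler_step from each SchedulerOutput so events from the same iteration can be correlated. The SchedulerOutput stand-in and record_step_event helper here are illustrative, not the real vLLM API.

```python
from dataclasses import dataclass


@dataclass
class SchedulerOutput:
    scheduler_step: int = 0  # stand-in for the field added by this PR


def record_step_event(output: SchedulerOutput,
                      trace: list[tuple[int, str]]) -> None:
    # Tag each trace record with the step that produced it, so downstream
    # consumers can group events from the same scheduling iteration.
    trace.append((output.scheduler_step, "batch_scheduled"))


trace: list[tuple[int, str]] = []
for step in (1, 2, 3):  # values successive schedule() calls would assign
    record_step_event(SchedulerOutput(scheduler_step=step), trace)

assert [s for s, _ in trace] == [1, 2, 3]  # strictly increasing, never resets
```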
Notes