[Feature] Add request journey event tracing to v1 scheduler #2
Merged
Conversation
Implements sparse lifecycle event tracking for requests with 5 event types: QUEUED, SCHEDULED (with FIRST/RESUME), FIRST_TOKEN, PREEMPTED, FINISHED.

Key features:
- Prefill progress tracking that survives preemption via scheduler-side high-water mark dict (_journey_prefill_hiwater)
- Per-client event buffering with guaranteed flush
- O(events) complexity - no full request iteration
- Near-zero overhead when disabled (single boolean check)
- msgspec.Struct compatibility for IPC serialization
- Backward compatible with optional EngineCoreOutputs.journey_events field

Events delivered via EngineCoreOutputs.journey_events with full progress snapshots (prefill/decode tokens, phase, scheduler_step, preemption count).

Config: ObservabilityConfig.enable_journey_tracing (default False)

Tests: 8 new journey event tests + 89 existing tests pass (no regressions)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
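The event payload described above can be sketched roughly as follows. The five event types and the snapshot fields come from this description, but the exact field names are assumptions, and a plain dataclass stands in for the msgspec.Struct the real code uses for IPC serialization:

```python
from dataclasses import dataclass
from enum import Enum, auto


class JourneyEventType(Enum):
    # The five emitted event types; SCHEDULED additionally carries a
    # FIRST/RESUME distinction in the real implementation.
    QUEUED = auto()
    SCHEDULED = auto()
    FIRST_TOKEN = auto()
    PREEMPTED = auto()
    FINISHED = auto()


@dataclass
class JourneyEvent:
    """Sparse lifecycle event with a full progress snapshot (illustrative)."""
    request_id: str
    event_type: JourneyEventType
    scheduler_step: int
    num_prefill_tokens: int  # tracked via a high-water mark, survives preemption
    num_decode_tokens: int
    num_preemptions: int


ev = JourneyEvent("req-0", JourneyEventType.QUEUED, 0, 0, 0, 0)
```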
Adds detailed documentation explaining request journey tracing:
- What it is and why use it
- Quick start guide with code examples
- Complete event type reference with examples
- Common use cases (latency analysis, preemption tracking, monitoring)
- Progress tracking explanation (high-water mark approach)
- Performance considerations
- Architecture overview
- Troubleshooting guide
- FAQ section

Makes it easy for new contributors to understand and use journey tracing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Addresses all must-fix items from review:
- Clarify scope (within scheduler, not end-to-end system)
- Fix Quick Start to show VllmConfig (not LLM API)
- Convert performance numbers to qualitative statements
- Clarify sampling is consumer-side implementation
- Change event ordering to typical sequences
- Add Semantics & Guarantees section
- Clarify TTFT definition (scheduler-QUEUED → first token)
- Tone down export language (not built-in)
- Mark DEPARTED as reserved/unused

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Clarify 6 event types defined, 5 currently emitted (DEPARTED reserved)
- Improve flush mechanism guarantee wording

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 26, 2026
This plan enforces the critical discipline: if a PR creates resources (spans, dicts, sets), that same PR must clean them on all termination paths.

Key improvements over V1:
- PR #2 includes span cleanup (not in separate PR)
- PR #6 includes DEPARTED/ABORTED (not in separate PR)
- Every PR is independently safe when merged
- No 'we'll fix it later' patterns
- Explicit termination path coverage for each PR

9 PRs total (~2 weeks):
- Phase 1 (Core): 4 PRs with span lifecycle complete
- Phase 2 (API): 4 PRs with full closure paths
- Phase 3 (Cleanup): 1 PR removing legacy buffering

Each PR is 15-30 minutes to review vs hours for a large PR.
sriumcp added a commit that referenced this pull request on Jan 26, 2026
Add tracer initialization in Scheduler.__init__() to support the dual-stream journey tracing architecture. This is the foundation for PR #2, which will create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
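The defensive-import-plus-guard pattern this commit describes can be sketched as below. The function name, tracer scope string, and log messages are illustrative, not the actual vLLM code; only the guard conditions (flag and endpoint) and graceful-degradation behavior come from the commit message:

```python
import logging

logger = logging.getLogger(__name__)

# Defensive import: fall back to None when the OTEL extras are not installed.
try:
    from opentelemetry import trace as otel_trace
except ImportError:
    otel_trace = None


def init_journey_tracer(enable_journey_tracing, otlp_traces_endpoint):
    """Return a tracer, or None when tracing is disabled or unavailable."""
    # Zero overhead when disabled: a boolean + endpoint guard, nothing else.
    if not enable_journey_tracing or not otlp_traces_endpoint:
        return None
    if otel_trace is None:
        logger.warning("journey tracing enabled but OpenTelemetry is not installed")
        return None
    try:
        return otel_trace.get_tracer("vllm.scheduler")
    except Exception:
        # Graceful degradation: never let tracer setup break the scheduler.
        logger.warning("tracer initialization failed; continuing without tracing")
        return None
```

Because the tracer is shared and no per-request state is created here, there is nothing to clean up later, which matches the commit's safety guarantees.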
sriumcp added a commit that referenced this pull request on Jan 26, 2026
/9) (#8)

* [Docs] Update journey tracing plan to reflect completed PR #0

Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Initialize OTEL tracer in scheduler for journey tracing

Add tracer initialization in Scheduler.__init__() to support the dual-stream journey tracing architecture. This is the foundation for PR #2, which will create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 27, 2026
Extends the centralized cleanup method to handle journey tracing state alongside core span cleanup. Fixes a memory leak on the natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
- Cleanup #1: Core spans (always runs, independent of flags)
- Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
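The decoupled two-stage cleanup described in this commit can be sketched as a small class. The attribute names follow the PR text, but the class itself and the fake span are illustrative scaffolding, not the real scheduler:

```python
class _FakeSpan:
    """Stand-in span that records whether end() was called (scaffolding)."""
    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True


class SchedulerCleanup:
    """Sketch of the centralized cleanup on all termination paths."""
    def __init__(self, enable_journey_tracing):
        self.enable_journey_tracing = enable_journey_tracing
        self._core_spans = {}
        self._journey_prefill_hiwater = {}

    def _end_core_span_and_cleanup(self, request_id):
        # Cleanup #1: core spans always end, independent of tracing flags.
        span = self._core_spans.pop(request_id, None)
        if span is not None:
            span.end()
        # Cleanup #2: journey state is removed only when the feature is on,
        # closing the leak on the natural completion path.
        if self.enable_journey_tracing:
            self._journey_prefill_hiwater.pop(request_id, None)


sched = SchedulerCleanup(enable_journey_tracing=True)
span = _FakeSpan()
sched._core_spans["r1"] = span
sched._journey_prefill_hiwater["r1"] = 128
sched._end_core_span_and_cleanup("r1")
```

Calling the one method from every termination path (abort, natural completion, error) is what makes the "no accumulation over N iterations" test property hold.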
sriumcp added a commit that referenced this pull request on Jan 27, 2026
* [Feature] Add journey state cleanup to scheduler (PR #3/9)

Extends the centralized cleanup method to handle journey tracing state alongside core span cleanup. Fixes a memory leak on the natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
- Cleanup #1: Core spans (always runs, independent of flags)
- Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #3 as completed in journey tracing plan

Updates:
- Mark PR #3 as COMPLETED in PR sequence summary
- Update PR dependencies to show PR #3 complete
- Add PR #3 to Implementation History section with full details
- Document commit hash (f4cf790) and PR number (vllm-project#33126)
- Record test results, code review process, and key achievements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This was referenced Jan 27, 2026
sriumcp added a commit that referenced this pull request on Jan 27, 2026
…/9)

This PR implements W3C Trace Context propagation from API spans to core spans, enabling parent-child linkage in distributed traces. Completes the handshake between PR #6 (API span lifecycle) and PR #2 (core span lifecycle).

Changes:
- Add inject_trace_context() helper to vllm/tracing.py
- Inject API span context into trace_headers after span creation
- Context flows to engine.generate() and scheduler for parent-child linkage
- Defensive error handling: injection failures never break requests
- Zero overhead when tracing disabled (early return)

Behavioral guarantees verified by tests:
- G1: Trace ID continuity (API and core spans share same trace_id)
- G2: W3C Trace Context format (traceparent header valid)
- G3: Trace continuation (trace_id preserved through Client→API→Core)
- G4: Graceful degradation (request continues on injection failure)
- G5: No exception propagation (injection failures caught)
- G6: Conditional injection (only when API span exists)

Invariants:
- I1: Backward compatibility (early return when tracing disabled)
- I2: Zero overhead when disabled (no propagator/allocation access)
- I3: No resource leaks (only modifies existing trace_headers dict)

Test coverage:
- 12 new tests (100% pass) covering all unit-testable properties
- 17 existing API span lifecycle tests pass (no regressions)
- Tests focus on behavioral properties, not implementation details

Safety properties:
- Zero new resources (only modifies existing dict)
- No cleanup obligations (dict managed by request lifecycle)
- Stateless transformation (span context → headers)
- Single injection point (strict ordering preserved)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
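The helper's contract (never raise, return the carrier unchanged when no span exists) can be illustrated with a dependency-free sketch. The real inject_trace_context() in vllm/tracing.py works with OTEL span objects and the W3C propagator; the tuple-based span context here is purely illustrative:

```python
def inject_trace_context(span_context, carrier=None):
    """Sketch: write a W3C traceparent header into the carrier.

    span_context is an illustrative (trace_id, span_id, sampled) tuple,
    standing in for the OTEL span context the real helper consumes.
    """
    carrier = dict(carrier or {})  # normalize to a mutable dict
    if span_context is None:
        return carrier  # G6: inject only when an API span exists
    try:
        trace_id, span_id, sampled = span_context
        flags = "01" if sampled else "00"
        # W3C Trace Context format: version-traceid(32 hex)-spanid(16 hex)-flags
        carrier["traceparent"] = f"00-{trace_id:032x}-{span_id:016x}-{flags}"
    except Exception:
        pass  # G4/G5: injection failures must never break the request
    return carrier


headers = inject_trace_context((0x1F, 0x2A, True), {"x-request-id": "abc"})
```

Because the core span later extracts this same trace_id from trace_headers, API and core spans end up in one trace with a parent-child link (G1/G3).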
sriumcp added a commit that referenced this pull request on Jan 27, 2026
…/9) (#15)

* [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9)

This PR implements W3C Trace Context propagation from API spans to core spans, enabling parent-child linkage in distributed traces. Completes the handshake between PR #6 (API span lifecycle) and PR #2 (core span lifecycle).

Changes:
- Add inject_trace_context() helper to vllm/tracing.py
- Inject API span context into trace_headers after span creation
- Context flows to engine.generate() and scheduler for parent-child linkage
- Defensive error handling: injection failures never break requests
- Zero overhead when tracing disabled (early return)

Behavioral guarantees verified by tests:
- G1: Trace ID continuity (API and core spans share same trace_id)
- G2: W3C Trace Context format (traceparent header valid)
- G3: Trace continuation (trace_id preserved through Client→API→Core)
- G4: Graceful degradation (request continues on injection failure)
- G5: No exception propagation (injection failures caught)
- G6: Conditional injection (only when API span exists)

Invariants:
- I1: Backward compatibility (early return when tracing disabled)
- I2: Zero overhead when disabled (no propagator/allocation access)
- I3: No resource leaks (only modifies existing trace_headers dict)

Test coverage:
- 12 new tests (100% pass) covering all unit-testable properties
- 17 existing API span lifecycle tests pass (no regressions)
- Tests focus on behavioral properties, not implementation details

Safety properties:
- Zero new resources (only modifies existing dict)
- No cleanup obligations (dict managed by request lifecycle)
- Stateless transformation (span context → headers)
- Single injection point (strict ordering preserved)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Polish] Improve inject_trace_context docstring and strengthen test

Two quality improvements following code review:

1. Clarify inject_trace_context() docstring:
- Previous: "or None if injection failed" (misleading)
- Now: explicitly documents when the carrier is returned unchanged
- Details all three early-return paths (OTEL unavailable, span None, exception)

2. Strengthen test_trace_id_preserved_through_chain():
- Mock propagator now actually reads span.get_span_context()
- Extracts trace_id and span_id from the span context
- Generates traceparent using those values (simulates real OTEL behavior)
- Asserts get_span_context() was called
- Better proves G1/G3 guarantees without requiring a real OTLP exporter

Test results: All 29 tests pass (12 context propagation + 17 lifecycle)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #7 as completed in journey tracing plan

Updates to reflect PR #7 completion:
- PR sequence table: mark #7 as COMPLETED with 12 tests
- Dependency chain: mark #6 and #7 as COMPLETED
- PR #7 section: add completion status with commit hashes
- Document deliverables: inject_trace_context(), tests, guarantees

Remaining: PRs #8 (API events), #9 (remove buffering)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* removing PR7_summary

Signed-off-by: Srinivasan Parthasarathy <spartha@us.ibm.com>

---------

Signed-off-by: Srinivasan Parthasarathy <spartha@us.ibm.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This was referenced Jan 28, 2026
sriumcp added a commit that referenced this pull request on Jan 29, 2026
Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860
sriumcp added a commit that referenced this pull request on Jan 29, 2026
…ty (PR #5) (#27)

* [Feature] Add rich request snapshot stream (PR #5)

Implements subsampled per-request detailed progress events with KV metrics:
- Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
- Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
- Use PR #4 get_per_request_kv_metrics() for KV cache data
- Two-stage sampling: batch summary sampled AND rich subsampled
- SpanAttributes: 10 new constants for per-request metrics
- Emission after batch summary, before _update_after_schedule()

Also fixes PR #3 CLI wiring bug:
- Wire step_tracing_enabled/sample_rate through EngineArgs
- Add fields to EngineArgs dataclass
- Pass to ObservabilityConfig constructor
- Add test_step_tracing_cli_wiring() for regression prevention

Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update step tracing plan with implementation progress

Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 29, 2026
Implements PR #2: Journey Tracing API-Side Sampling in vLLM.

Changes:
- Add journey_tracing_sample_rate config (default 1.0, backward compatible)
- API layer makes probabilistic sampling decision per request
- Custom header x-vllm-journey-sampled propagates decision to engine
- Engine obeys API decision (authority model)
- End-to-end atomic: both API+engine spans exist or neither
- Independent of OTEL traceparent sampled bit
- Centralized header injection helper across all endpoints
- Robustness fix: normalize to mutable dict (handles immutable Mapping)

Tests:
- 10 new tests verify atomicity and backward compatibility
- All existing tests pass (backward compatible)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
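The authority model above (API decides once, engine obeys) can be sketched in a few lines. The header name comes from the commit message; the helper names and "1"/"0" encoding are assumptions:

```python
import random

# Header name taken from the commit message; everything else is illustrative.
JOURNEY_SAMPLED_HEADER = "x-vllm-journey-sampled"


def decide_journey_sampling(sample_rate, rng=None):
    """API-side probabilistic decision; rate 1.0 keeps always-on behavior."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    return (rng or random).random() < sample_rate


def inject_sampling_header(headers, sampled):
    # Normalize to a mutable dict so immutable Mapping inputs are handled.
    out = dict(headers or {})
    out[JOURNEY_SAMPLED_HEADER] = "1" if sampled else "0"
    return out


h = inject_sampling_header(None, decide_journey_sampling(1.0))
```

Making the decision exactly once at the API layer and propagating it in a header is what gives the end-to-end atomicity: both the API span and the engine span follow the same coin flip, so partial traces cannot occur.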
sriumcp added a commit that referenced this pull request on Jan 29, 2026
Update user-facing documentation to reflect PR #2 implementation.

Changes:
- Add comprehensive "Sampling for Production" section with 3 strategies
- Document new --journey-tracing-sample-rate flag (default 1.0)
- Explain vLLM native sampling vs OTEL sampling vs collector sampling
- Add comparison table for choosing the right sampling strategy
- Update configuration examples with sampling use cases
- Add Technical Details section on sampling architecture
- Add FAQ entries: vLLM vs OTEL sampling, atomicity guarantees
- Update Performance Impact section with sampling overhead details
- Update troubleshooting section with vLLM sampling solutions
- Add early mention of sampling capability in introduction

Key messages for users:
- Default behavior unchanged (sample_rate=1.0, backward compatible)
- vLLM native sampling reduces all overhead (recommended for production)
- End-to-end atomic: either both spans exist or neither (no partial traces)
- Independent from OTEL traceparent sampled bit
- Recommended rates: 10% for 1K-10K RPS, 1% for >10K RPS
sriumcp added a commit that referenced this pull request on Jan 29, 2026
* [Feature] Add journey tracing probabilistic sampling

Implements PR #2: Journey Tracing API-Side Sampling in vLLM.

Changes:
- Add journey_tracing_sample_rate config (default 1.0, backward compatible)
- API layer makes probabilistic sampling decision per request
- Custom header x-vllm-journey-sampled propagates decision to engine
- Engine obeys API decision (authority model)
- End-to-end atomic: both API+engine spans exist or neither
- Independent of OTEL traceparent sampled bit
- Centralized header injection helper across all endpoints
- Robustness fix: normalize to mutable dict (handles immutable Mapping)

Tests:
- 10 new tests verify atomicity and backward compatibility
- All existing tests pass (backward compatible)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update JOURNEY_TRACING.md for sampling feature

Update user-facing documentation to reflect PR #2 implementation.

Changes:
- Add comprehensive "Sampling for Production" section with 3 strategies
- Document new --journey-tracing-sample-rate flag (default 1.0)
- Explain vLLM native sampling vs OTEL sampling vs collector sampling
- Add comparison table for choosing the right sampling strategy
- Update configuration examples with sampling use cases
- Add Technical Details section on sampling architecture
- Add FAQ entries: vLLM vs OTEL sampling, atomicity guarantees
- Update Performance Impact section with sampling overhead details
- Update troubleshooting section with vLLM sampling solutions
- Add early mention of sampling capability in introduction

Key messages for users:
- Default behavior unchanged (sample_rate=1.0, backward compatible)
- vLLM native sampling reduces all overhead (recommended for production)
- End-to-end atomic: either both spans exist or neither (no partial traces)
- Independent from OTEL traceparent sampled bit
- Recommended rates: 10% for 1K-10K RPS, 1% for >10K RPS

* [Docs] Fix JOURNEY_TRACING.md accuracy issues and contradictions

Critical fixes:
- Fix service name vs tracer scope confusion in Jaeger navigation (service.name is what users select, scope.name is a span attribute)
- Correct AsyncLLM span creation claims (was: "creates only core span", now: "creates no spans by default, core-only if manual header set")
- Eliminate contradiction: early doc claimed AsyncLLM creates spans, later sections correctly said no spans without manual header
- Qualify "every request creates two spans" to "when using vllm serve"
- Qualify sampling sections to explicitly state the vllm serve requirement

Accuracy improvements:
- Soften overhead numbers: "~200-300ns" → "sub-microsecond" (less brittle)
- Qualify authority model as "OpenAI API Server" (not generic "API layer")
- Add comprehensive AsyncLLM FAQ with working code examples
- Add deployment modes section distinguishing vllm serve vs AsyncLLM

Impact: prevents user confusion about AsyncLLM behavior (expecting automatic tracing → getting zero traces → filing bugs). Documentation now accurately reflects codebase reality verified in scheduler.py and test_journey_tracing_sampling.py.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Bugfix] Add missing API span finalization for non-streaming completions

Non-streaming completion requests (/v1/completions with stream=false) were missing all _finalize_api_span() calls, causing llm_request spans to never export to OTLP collectors. This resulted in incomplete traces with only llm_core (engine layer) spans visible, while llm_request (API layer) spans remained orphaned in memory.

Root cause: the non-streaming code path (lines 319-368) had no finalization on success, error paths, or the fake stream generator (beam search with stream=true).

Added comprehensive span finalization matching the pattern used in streaming completions and chat completions:
- Error paths: finalize with ABORTED for CancelledError, GenerationError, ValueError
- Fake stream generator: added try-finally with DEPARTED before [DONE]
- Success path: finalize with DEPARTED before returning response
- Outer finally block: unconditional cleanup for any uncaught exceptions

Impact:
- Fixes: non-streaming /v1/completions now exports complete API-layer traces
- Preserves: streaming completions continue to work (no changes to that path)
- Matches: behavior now consistent with the /v1/chat/completions endpoint

Testing:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Test", "max_tokens": 20}'

Expected result: both llm_request (scope: vllm.api) and llm_core (scope: vllm.scheduler) spans now appear in OTLP traces with a proper parent-child relationship.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Add nanosecond-precision timestamps to journey events

Adds a ts_monotonic_ns field to RequestJourneyEvent for improved timestamp precision. Uses a single clock read with exact consistency (derive float from int) to ensure both ts_monotonic and ts_monotonic_ns represent the identical instant. Fully backward compatible with a default value of 0 for legacy code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Misc] Remove completed STEP_TRACING_PR_PLAN.md

Step tracing work is complete. Removing the planning document.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Test] Remove float equality assertions from journey timestamp tests

Removes all float equality comparisons (e.g., assert ts.monotonic == value) from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues as specified in the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request on Jan 30, 2026
Fixes a critical bug where OpenAIServing.__init__() did not initialize self.observability_config, causing an AttributeError when journey tracing accessed self.observability_config.journey_tracing_sample_rate.

Root cause:
- PR #2 (b242cc3) added journey tracing probabilistic sampling
- _create_api_span() accessed self.observability_config.journey_tracing_sample_rate
- But OpenAIServing.__init__() never initialized self.observability_config
- All serving endpoints (completions, chat, embeddings, pooling, score) inherited the bug

The bug:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Once upon a time", "max_tokens": 20}'

Response: {"error":{"message":"'OpenAIServingCompletion' object has no attribute 'observability_config'",...}}

The fix:
- Add one line to OpenAIServing.__init__() (line 265): self.observability_config = engine_client.vllm_config.observability_config
- Follows the same pattern as v1/engine/async_llm.py and v1/core/sched/scheduler.py
- Fixes all endpoints via inheritance (single point of fix)

Testing:
- Added comprehensive integration test suite (5 tests)
- Tests verify observability_config initialization and actual usage
- Tests would have caught this bug (verified by temporarily removing the fix)
- All existing tests pass (no regressions)

Impact: fixes all journey tracing-enabled endpoints:
- /v1/completions (OpenAIServingCompletion)
- /v1/chat/completions (OpenAIServingChat)
- /v1/embeddings (EmbeddingMixin)
- /v1/classify (ClassificationMixin)
- /v1/pooling (OpenAIServingPooling)
- /v1/score and /v1/rerank (ServingScores)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
Adds comprehensive request lifecycle event tracing to the v1 scheduler, enabling detailed observability of request journeys through the system. The feature emits sparse lifecycle events (QUEUED, SCHEDULED, FIRST_TOKEN, PREEMPTED, FINISHED) with full progress snapshots, making it well suited for debugging, monitoring, and performance analysis.
Motivation
Request journey tracing provides critical visibility into per-request queueing, scheduling, preemption, and token-generation behavior.
Key Features
5 Lifecycle Events
Accurate Progress Tracking
Prefill progress survives preemption via the scheduler-side high-water mark dict (_journey_prefill_hiwater).

Performance Optimized
Production Ready
Usage
Enable Journey Tracing
Access Events
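Since these two sections reduce to one config flag and one optional output field, the intended shape can be sketched with stub classes. Only enable_journey_tracing and journey_events are names from this PR; the stub classes and helper are illustrative, not the real vLLM API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObservabilityConfig:
    # Field name from this PR; the real config has many more fields.
    enable_journey_tracing: bool = False  # default off: near-zero overhead


@dataclass
class EngineCoreOutputs:
    # Optional field added by this PR; stays None when tracing is disabled,
    # so older consumers are unaffected.
    journey_events: Optional[list] = None


def collect_journey_events(outputs):
    """Consumer-side helper (illustrative): tolerate the None default."""
    return outputs.journey_events or []


cfg = ObservabilityConfig(enable_journey_tracing=True)
outs = EngineCoreOutputs(journey_events=[("req-0", "QUEUED")])
```

Defaulting the field to None rather than an empty list is what keeps serialization backward compatible for clients that predate the feature.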
Implementation Details
Event Data Structure
Emission Points
- add_request() - after adding to waiting queue
- schedule() - after RUNNING transition
- update_from_output() - after token append
- _preempt_request() - after status change
- finish_requests() - after status change

Prefill Progress Tracking
Problem: num_computed_tokens resets to 0 on preemption, and num_cached_tokens is the cache-hit length, not processing progress.

Solution: the scheduler maintains a high-water mark dict:
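A minimal sketch of the high-water mark update; the helper function is hypothetical, while the dict itself (_journey_prefill_hiwater) lives on the scheduler as described above:

```python
def update_prefill_hiwater(hiwater, request_id, num_computed_tokens):
    """Record the maximum prefill progress ever observed for a request.

    Because the dict is scheduler-side, the mark survives preemption even
    though num_computed_tokens itself resets to 0 when a request is preempted.
    """
    mark = max(hiwater.get(request_id, 0), num_computed_tokens)
    hiwater[request_id] = mark
    return mark


hiwater = {}
update_prefill_hiwater(hiwater, "req-0", 512)  # partial prefill progress
update_prefill_hiwater(hiwater, "req-0", 0)    # preemption resets the counter
```

After the simulated preemption, the dict still reports 512 tokens of prefill progress, which is exactly the property the events' progress snapshots rely on.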
Flush Mechanism
Events are buffered per-client and flushed in update_from_output().

Performance Impact
When Disabled (Default)
When Enabled
Testing
Test Coverage
Test Categories
Breaking Changes
None. Fully backward compatible:
- EngineCoreOutputs.journey_events (defaults to None)
- ObservabilityConfig.enable_journey_tracing (defaults to False)

Files Changed
New Files (2)
- vllm/v1/core/sched/journey_events.py - Event data structures
- tests/v1/core/test_journey_events.py - Comprehensive test suite

Modified Files (5)
- vllm/v1/core/sched/scheduler.py - Core implementation (+250 lines)
- vllm/config/observability.py - Config flag (+6 lines)
- vllm/v1/engine/__init__.py - EngineCoreOutputs field (+2 lines)
- vllm/v1/core/sched/interface.py - Interface signature (+3 lines)
- tests/v1/core/utils.py - Test utilities (+7 lines)

Total: 706 insertions(+), 4 deletions(-)
Future Work (Out of Scope)
Checklist
Ready for review! This feature provides critical observability infrastructure for vLLM v1 scheduler with minimal overhead and zero impact when disabled.