Journey Tracing: Complete Implementation (PRs #0-#9) + Regression Audit by sriumcp · Pull Request #18 · inference-sim/vllm

sriumcp · 2026-01-28T03:11:34Z

Summary

This PR completes the journey tracing dual-stream architecture implementation across 10 PRs (PR #0 prerequisite + PRs #1-#9), including a comprehensive regression audit and documentation.

What's Included

Implementation (PRs #0-#9)

✅ PR #0: Remove EngineCoreEvent system (prerequisite cleanup)
✅ PR #1: Initialize OTEL tracer in scheduler
✅ PR #2: Core span lifecycle management
✅ PR #3: Journey state cleanup
✅ PR #4: Emit journey events to core spans
✅ PR #5: API span tracking infrastructure
✅ PR #6: API parent span lifecycle
✅ PR #7: API↔Engine context propagation
✅ PR #8: API lifecycle events and request attributes
✅ PR #9: Remove journey event buffering (real-time emission)

Documentation & Quality

✅ Comprehensive end-user documentation (JOURNEY_TRACING.md)
✅ Detailed implementation plan (JOURNEY_TRACING_PR_PLAN.md)
✅ Repository guide for contributors (CLAUDE.md)
✅ Regression audit report (REGRESSION_AUDIT_REPORT.md)

Architecture

Dual-Stream Design:

API Layer: Parent spans (llm_request) track end-to-end request lifecycle
Engine Core: Child spans (llm_core) track scheduler-level processing
Real-Time Emission: Events emitted directly to OTEL spans (no buffering)
Parent-Child Linkage: W3C Trace Context propagation
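
The parent-child linkage can be illustrated with a minimal, stdlib-only sketch of the W3C `traceparent` header format (`version-traceid-parentid-flags`). This is not the code from this PR, which relies on OpenTelemetry for propagation; the names `SpanContext`, `inject`, `extract`, and `start_child` are hypothetical, and the validation is simplified.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context "traceparent": 2-hex version, 32-hex trace id,
# 16-hex parent span id, 2-hex flags, separated by dashes.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    """Hypothetical, minimal span identity (not an OpenTelemetry type)."""
    trace_id: str
    span_id: str
    sampled: bool


def new_root_context() -> SpanContext:
    """Create a fresh context, as the API layer would for its parent span."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def inject(ctx: SpanContext, carrier: dict) -> None:
    """Write the context into a carrier dict (HTTP headers or an internal dict)."""
    flags = "01" if ctx.sampled else "00"
    carrier["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(carrier: dict) -> Optional[SpanContext]:
    """Parse a traceparent value; return None if it is missing or malformed."""
    match = _TRACEPARENT_RE.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    # All-zero trace or span ids are invalid per the specification.
    if set(match["trace_id"]) == {"0"} or set(match["span_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return SpanContext(match["trace_id"], match["span_id"], sampled)


def start_child(parent: SpanContext) -> SpanContext:
    """A child span keeps the parent's trace id and gets a new span id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

In this sketch the API layer would call `inject` before handing the request to the engine, and the engine side would call `extract` followed by `start_child`, so both spans share one trace id.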

Testing

Comprehensive Test Coverage:

88 new journey tracing tests (5,000+ lines)
All existing tests passing (99/99 scheduler tests)
Integration tests for API→Engine flow
Backwards compatibility tests
Zero-overhead-when-disabled tests

Regression Audit Results

Verdict: ✅ NO PRODUCTION REGRESSIONS FOUND

Verified Safe:

✅ Backward compatible (all existing APIs preserved)
✅ Zero overhead when disabled (proper early-return guards)
✅ OTEL imports optional (graceful degradation without opentelemetry)
✅ Metrics independent (Prometheus works without tracing)
✅ Exception handling improved (better cleanup guarantees)
✅ No memory leaks (centralized cleanup with try/finally)
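
The "early-return guards" and "centralized cleanup with try/finally" items above can be sketched as follows. This is an illustrative pattern only, assuming a per-request span table; `JourneyTracker` and `FakeSpan` are hypothetical names and do not reflect the actual scheduler code.

```python
from typing import Callable, Dict, List, Optional


class FakeSpan:
    """Stand-in for a tracing span: records events and whether it was ended."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.ended = False

    def add_event(self, name: str) -> None:
        self.events.append(name)

    def end(self) -> None:
        self.ended = True


class JourneyTracker:
    """Hypothetical tracker showing guard-first methods and a single cleanup path."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled
        self._spans: Dict[str, FakeSpan] = {}

    def start(self, request_id: str) -> None:
        if not self.enabled:  # early return: no allocation when disabled
            return
        self._spans[request_id] = FakeSpan()

    def emit(self, request_id: str, event: str) -> None:
        if not self.enabled:  # early return: no lookup when disabled
            return
        span = self._spans.get(request_id)
        if span is not None:
            span.add_event(event)

    def finish(
        self,
        request_id: str,
        finalize: Optional[Callable[[FakeSpan], None]] = None,
    ) -> None:
        if not self.enabled:
            return
        span = self._spans.get(request_id)
        if span is None:
            return
        try:
            if finalize is not None:
                finalize(span)  # may raise
        finally:
            # Runs on success and on error, so the span is always ended
            # and its table entry is always released.
            span.end()
            self._spans.pop(request_id, None)
```

With the guard as the first statement of each method, a disabled tracker does no per-request work beyond one attribute check, and the `finally` block keeps the span table from growing when finalization fails.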

Changes Analyzed:

42 files changed (+10,824 insertions, -1,074 deletions)
All production code paths verified
OpenAI API correctness maintained
Scheduler behavior unchanged
Metrics/stats timestamp capture improved

Key Features

Optional Feature - Disabled by default (--enable-journey-tracing)
OTEL Native - Uses OpenTelemetry for industry-standard observability
Real-Time - Events emitted immediately (no buffering/deferred export)
Progress Tracking - Accurate token counts that survive preemption
Dual Stream - API and engine-core events in linked parent-child spans

Usage

# Enable journey tracing with OTEL endpoint
vllm serve MODEL --enable-journey-tracing --otlp-traces-endpoint http://localhost:4317

# View traces in Jaeger/Tempo/other OTEL backends

Breaking Changes

None - Fully backward compatible:

New CLI flag is optional (default: false)
No new required dependencies
OTEL optional (works without opentelemetry installed)
Existing APIs unchanged
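
One common way to keep a dependency such as OpenTelemetry optional is a guarded import with a `None` fallback. The sketch below shows that general pattern and is not copied from `vllm/tracing.py`; the helper names `is_otel_available` and `get_tracer_or_none` are hypothetical.

```python
try:
    # Optional dependency: present only if the opentelemetry packages are installed.
    from opentelemetry import trace as otel_trace
except ImportError:
    otel_trace = None


def is_otel_available() -> bool:
    """Report whether the OpenTelemetry API could be imported."""
    return otel_trace is not None


def get_tracer_or_none(name: str):
    """Return an OpenTelemetry tracer, or None when the package is missing."""
    if otel_trace is None:
        return None
    return otel_trace.get_tracer(name)
```

Callers then check for `None` once at startup and skip all tracing work if no tracer is available.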

Files Changed

Core Implementation:

vllm/v1/core/sched/scheduler.py (+467 lines)
vllm/entrypoints/openai/chat_completion/serving.py (major refactor)
vllm/entrypoints/openai/engine/serving.py (+265 lines)
vllm/tracing.py (+49 lines)
vllm/config/observability.py (+6 lines)

Tests:

9 new test files (5,000+ lines)
Test cleanup (removed 2 obsolete buffering tests)

Documentation:

JOURNEY_TRACING.md (623 lines) - End-user guide
JOURNEY_TRACING_PR_PLAN.md (2,226 lines) - Implementation plan
REGRESSION_AUDIT_REPORT.md (530 lines) - Audit results
CLAUDE.md (397 lines) - Repository guide

Review Focus

Regression Audit Report - Comprehensive analysis of all changes
Test Coverage - 88 new tests verify all scenarios
Documentation - Clear end-user and implementation docs
Backward Compatibility - All existing functionality preserved

Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
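
The commit message does not say why the span should be finalized before `[DONE]` is yielded. One plausible reason, sketched below with plain Python generators, is that a consumer which stops reading at `[DONE]` closes the generator, so any code placed after that final `yield` never runs. `FakeSpan` and both stream functions are hypothetical and are not the serving code from this PR.

```python
from typing import Iterable, Iterator, List

DONE = "data: [DONE]\n\n"


class FakeSpan:
    """Stand-in for a tracing span that only records whether it was ended."""

    def __init__(self) -> None:
        self.ended = False

    def end(self) -> None:
        self.ended = True


def stream_finalize_first(chunks: Iterable[str], span: FakeSpan) -> Iterator[str]:
    for chunk in chunks:
        yield f"data: {chunk}\n\n"
    span.end()  # finalize before the terminal marker is handed out
    yield DONE


def stream_finalize_last(chunks: Iterable[str], span: FakeSpan) -> Iterator[str]:
    for chunk in chunks:
        yield f"data: {chunk}\n\n"
    yield DONE
    span.end()  # skipped if the consumer closes the generator at [DONE]


def consume_until_done(gen: Iterator[str]) -> List[str]:
    """Read until [DONE], then close the generator as a disconnecting client would."""
    received = []
    for item in gen:
        received.append(item)
        if item == DONE:
            break
    gen.close()  # raises GeneratorExit inside the generator at its paused yield
    return received
```

Running `consume_until_done` on each variant leaves the span ended in the first case and still open in the second.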

Remove two failing tests that reference the legacy journey event buffering system removed in PR #9 (commit 1d9b9f3):

- test_no_events_when_span_none: Referenced _journey_events_buffer_by_client
- test_legacy_buffering_still_works: Tested parallel buffering (no longer exists)

These tests validated the legacy buffering pathway that was intentionally removed. Comprehensive coverage of the new span-based tracing exists in tests/v1/core/test_pr9_no_buffering.py (16 tests, 337 lines).

Add REGRESSION_AUDIT_REPORT.md documenting comprehensive regression analysis from v0.0.1 to HEAD:

- 42 files changed analyzed (10,824 insertions, 1,074 deletions)
- All production code paths verified safe
- Zero regressions to existing functionality
- Proper backward compatibility maintained
- OTEL imports optional and safe
- Metrics work independently of tracing

Test Results: 99 passed (all non-journey scheduler tests)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

sriumcp and others added 2 commits January 27, 2026 21:27

sriumcp merged commit 519c0a7 into main Jan 28, 2026

sriumcp merged 2 commits into main from sanitycheck