
Journey Tracing: Complete Implementation (PRs #0-#9) + Regression Audit #18

Merged
sriumcp merged 2 commits into main from sanitycheck
Jan 28, 2026

sriumcp commented Jan 28, 2026

Summary

This PR completes the journey tracing dual-stream architecture implementation across 10 PRs (PR #0 prerequisite + PRs #1-#9), including comprehensive regression audit and documentation.

What's Included

Implementation (PRs #0-#9)

Documentation & Quality

  • ✅ Comprehensive end-user documentation (JOURNEY_TRACING.md)
  • ✅ Detailed implementation plan (JOURNEY_TRACING_PR_PLAN.md)
  • ✅ Repository guide for contributors (CLAUDE.md)
  • ✅ Regression audit report (REGRESSION_AUDIT_REPORT.md)

Architecture

Dual-Stream Design:

  • API Layer: Parent spans (llm_request) track end-to-end request lifecycle
  • Engine Core: Child spans (llm_core) track scheduler-level processing
  • Real-Time Emission: Events emitted directly to OTEL spans (no buffering)
  • Parent-Child Linkage: W3C Trace Context propagation
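As a rough illustration of the parent-child linkage, the sketch below builds and parses a W3C `traceparent` header using only the standard library. It is not the code from this PR: the helper names (`make_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical, and only version `00` of the header format is handled. The point is that the child span keeps the parent's trace ID and gets a fresh span ID.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00:
#   "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header (hypothetical helper)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(parent_header: str) -> Optional[str]:
    """Derive a child context: same trace ID, new random span ID."""
    parent = parse_traceparent(parent_header)
    if parent is None:
        return None
    return make_traceparent(
        parent["trace_id"], secrets.token_hex(8), parent["sampled"]
    )
```

Under these assumptions, the API layer would hold the parent header for its `llm_request` span and hand a derived context to the engine core, so the `llm_core` span lands in the same trace.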

Testing

Comprehensive Test Coverage:

  • 88 new journey tracing tests (5,000+ lines)
  • All existing tests passing (99/99 scheduler tests)
  • Integration tests for API→Engine flow
  • Backwards compatibility tests
  • Tests confirming zero overhead when tracing is disabled

Regression Audit Results

Verdict: ✅ NO PRODUCTION REGRESSIONS FOUND

Verified Safe:

  • ✅ Backward compatible (all existing APIs preserved)
  • ✅ Zero overhead when disabled (proper early-return guards)
  • ✅ OTEL imports optional (graceful degradation without opentelemetry)
  • ✅ Metrics independent (Prometheus works without tracing)
  • ✅ Exception handling improved (better cleanup guarantees)
  • ✅ No memory leaks (centralized cleanup with try/finally)
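The three patterns named above (optional OTEL import, early-return guards, cleanup in try/finally) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the scheduler code from this PR: `JourneyTracer` and `process` are hypothetical names, and the span object is only assumed to expose `add_event()` and `end()`, as OpenTelemetry spans do.

```python
try:
    # Optional dependency: the module still imports when it is missing.
    from opentelemetry import trace as otel_trace
    OTEL_AVAILABLE = True
except ImportError:
    otel_trace = None
    OTEL_AVAILABLE = False


class JourneyTracer:
    """Hypothetical helper that emits events directly to per-request spans."""

    def __init__(self, enabled: bool):
        self.enabled = enabled
        self._spans = {}  # request_id -> span

    def register(self, request_id, span):
        # Early return: a disabled tracer stores nothing.
        if not self.enabled or span is None:
            return
        self._spans[request_id] = span

    def emit(self, request_id, name, attributes=None):
        # Early-return guards keep the disabled path to a single check.
        if not self.enabled:
            return
        span = self._spans.get(request_id)
        if span is None:
            return
        span.add_event(name, attributes=attributes or {})

    def finish(self, request_id):
        # Single cleanup point: drop the reference and end the span.
        span = self._spans.pop(request_id, None)
        if span is not None:
            span.end()


def process(tracer, request_id, span, work):
    """Run `work`, guaranteeing span cleanup even if it raises."""
    tracer.register(request_id, span)
    try:
        tracer.emit(request_id, "journey.QUEUED")
        return work()
    finally:
        tracer.finish(request_id)
```

Because `finish` runs in the `finally` clause, the span map cannot keep growing when a request fails midway, which is the property the "no memory leaks" item refers to.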

Changes Analyzed:

  • 42 files changed (+10,824 insertions, -1,074 deletions)
  • All production code paths verified
  • OpenAI API correctness maintained
  • Scheduler behavior unchanged
  • Metrics/stats timestamp capture improved

Key Features

  1. Optional Feature - Disabled by default (--enable-journey-tracing)
  2. OTEL Native - Uses OpenTelemetry for industry-standard observability
  3. Real-Time - Events emitted immediately (no buffering/deferred export)
  4. Progress Tracking - Accurate token counts that survive preemption
  5. Dual Stream - API and engine-core events in linked parent-child spans
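One way to picture token counts that survive preemption is a high-water mark: if the scheduler resets a request's computed-token counter on preemption, the reported progress keeps the largest value seen so far. The sketch below is an assumption for illustration only. `ProgressTracker` is a hypothetical class, and this PR description does not say that the implementation works this way.

```python
class ProgressTracker:
    """Hypothetical tracker: reported progress never moves backwards."""

    def __init__(self, prompt_len: int):
        self.prompt_len = prompt_len
        self._high_water = 0  # most tokens ever computed for this request

    def update(self, num_computed_tokens: int) -> None:
        # A preemption may reset the scheduler's counter to 0; keeping the
        # maximum means the reported progress is not lost.
        self._high_water = max(self._high_water, num_computed_tokens)

    def snapshot(self) -> tuple:
        """Return (prefill_tokens_done, decode_tokens_done)."""
        prefill = min(self._high_water, self.prompt_len)
        decode = max(0, self._high_water - self.prompt_len)
        return (prefill, decode)


tracker = ProgressTracker(prompt_len=4)
tracker.update(6)   # prompt processed, two tokens decoded
tracker.update(0)   # preempted: scheduler counter reset
tracker.update(5)   # recomputation in progress
print(tracker.snapshot())  # (4, 2)
```

The trade-off in this sketch is that recomputed tokens are not counted twice, so the numbers describe how far the request has ever progressed rather than how much work was performed.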

Usage

# Enable journey tracing with OTEL endpoint
vllm serve MODEL --enable-journey-tracing --otlp-traces-endpoint http://localhost:4317

# View traces in Jaeger/Tempo/other OTEL backends

Breaking Changes

None - Fully backward compatible:

  • New CLI flag is optional (default: false)
  • No new required dependencies
  • OTEL optional (works without opentelemetry installed)
  • Existing APIs unchanged

Files Changed

Core Implementation:

  • vllm/v1/core/sched/scheduler.py (+467 lines)
  • vllm/entrypoints/openai/chat_completion/serving.py (major refactor)
  • vllm/entrypoints/openai/engine/serving.py (+265 lines)
  • vllm/tracing.py (+49 lines)
  • vllm/config/observability.py (+6 lines)

Tests:

  • 9 new test files (5,000+ lines)
  • Test cleanup (removed 2 obsolete buffering tests)

Documentation:

  • JOURNEY_TRACING.md (623 lines) - End-user guide
  • JOURNEY_TRACING_PR_PLAN.md (2,226 lines) - Implementation plan
  • REGRESSION_AUDIT_REPORT.md (530 lines) - Audit results
  • CLAUDE.md (397 lines) - Repository guide

Review Focus

  1. Regression Audit Report - Comprehensive analysis of all changes
  2. Test Coverage - 88 new tests verify all scenarios
  3. Documentation - Clear end-user and implementation docs
  4. Backward Compatibility - All existing functionality preserved

Related


🤖 Generated with Claude Code

sriumcp and others added 2 commits January 27, 2026 21:27
Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Remove two failing tests that reference the legacy journey event buffering
system removed in PR #9 (commit 1d9b9f3):

- test_no_events_when_span_none: Referenced _journey_events_buffer_by_client
- test_legacy_buffering_still_works: Tested parallel buffering (no longer exists)

These tests validated the legacy buffering pathway that was intentionally
removed. Comprehensive coverage of the new span-based tracing exists in
tests/v1/core/test_pr9_no_buffering.py (16 tests, 337 lines).

Add REGRESSION_AUDIT_REPORT.md documenting comprehensive regression analysis
from v0.0.1 to HEAD:
- 42 files changed analyzed (10,824 insertions, 1,074 deletions)
- All production code paths verified safe
- Zero regressions to existing functionality
- Proper backward compatibility maintained
- OTEL imports optional and safe
- Metrics work independently of tracing

Test Results: 99 passed (all non-journey scheduler tests)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sriumcp sriumcp merged commit 519c0a7 into main Jan 28, 2026