
[Feature] Initialize OTEL tracer in scheduler for journey tracing (PR #1/9) #8

Merged: sriumcp merged 2 commits into main from pr1ofjourney on Jan 26, 2026

@sriumcp sriumcp commented Jan 26, 2026

Overview

This is the 1st of 9 PRs in the journey tracing dual-stream architecture implementation. This PR initializes an OpenTelemetry tracer in the scheduler without creating any per-request state or spans. It establishes the foundation for the next PR, which will create and manage core spans.

Branch: pr1ofjourney
Depends on: #7 (EngineCoreEvent removal - already merged)
Next: The next PR will use this tracer to create core spans with complete lifecycle management


What This PR Does

Adds tracer initialization to Scheduler.__init__() that:

  • Creates a tracer instance when both enable_journey_tracing=True AND otlp_traces_endpoint is configured
  • Uses defensive programming with try/except and warning logs
  • Provides graceful degradation if OTEL packages are unavailable
  • Does NOT create any spans - just initialization
  • Does NOT introduce per-request state - the tracer is a single attribute on the scheduler instance

Changes

Production Code (19 lines)

vllm/v1/core/sched/scheduler.py:

# Defensive import at top of file
try:
    from vllm.tracing import SpanAttributes
except Exception:
    SpanAttributes = None  # type: ignore

# Tracer initialization in __init__()
self.tracer: Any | None = None
if self._enable_journey_tracing:
    endpoint = self.observability_config.otlp_traces_endpoint
    if endpoint is not None:
        try:
            from vllm.tracing import init_tracer
            self.tracer = init_tracer("vllm.scheduler", endpoint)
        except Exception as e:
            logger.warning(
                "Failed to initialize tracer for journey tracing: %s", e
            )
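
The guard and fallback above can be exercised in isolation. The following is a minimal sketch, not the PR's code: `SchedulerSketch` and `fake_init_tracer` are hypothetical stand-ins (the real `vllm.tracing.init_tracer` is not imported), and the endpoint check inside the stand-in exists only to provoke a failure.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("scheduler_sketch")


def fake_init_tracer(name: str, endpoint: str) -> dict:
    """Hypothetical stand-in for vllm.tracing.init_tracer (assumption)."""
    if not endpoint.startswith("http"):
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    return {"name": name, "endpoint": endpoint}


class SchedulerSketch:
    """Toy class that mirrors only the tracer-initialization guard."""

    def __init__(
        self,
        enable_journey_tracing: bool,
        otlp_traces_endpoint: Optional[str],
        init_tracer: Callable[[str, str], Any] = fake_init_tracer,
    ) -> None:
        self.tracer: Optional[Any] = None
        if enable_journey_tracing and otlp_traces_endpoint is not None:
            try:
                self.tracer = init_tracer("vllm.scheduler", otlp_traces_endpoint)
            except Exception as e:
                # Tracing is best-effort: log the failure and keep tracer=None.
                logger.warning(
                    "Failed to initialize tracer for journey tracing: %s", e
                )


ok = SchedulerSketch(True, "http://localhost:4317")
off = SchedulerSketch(False, "http://localhost:4317")
broken = SchedulerSketch(True, "not-a-url")
```

In this sketch only `ok` ends up with a tracer; `off` never calls the initializer, and `broken` logs a warning but still constructs successfully.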

Test Changes (112 lines)

tests/v1/core/utils.py (2 lines):

  • Added otlp_traces_endpoint: str | None = None parameter to create_scheduler()
  • Pass to ObservabilityConfig

tests/v1/core/test_scheduler.py (110 lines):

  • Added patch import for mocking
  • Added 4 comprehensive tests (all use mocking, no external dependencies):
    1. test_tracer_init_when_endpoint_set() - Positive path
    2. test_tracer_none_when_endpoint_not_set() - Negative paths (3 cases)
    3. test_scheduler_init_succeeds_with_tracing_enabled() - Smoke test
    4. test_tracer_init_handles_failure_gracefully() - Error handling

Safety Guarantees

✅ No Per-Request State

  • Tracer is stored as self.tracer (one instance attribute per scheduler, not per request)
  • Shared across all requests
  • No per-request cleanup needed

✅ Zero Overhead When Disabled

  • Guarded by TWO conditions:
    • enable_journey_tracing=False → tracer stays None
    • otlp_traces_endpoint is None → tracer stays None
  • No OTEL imports on hot paths
  • No tracer creation unless explicitly enabled
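
The two-condition guard can be read as a small truth table. The sketch below is illustrative only: `maybe_init` and `init_tracer_stub` are hypothetical names, and the stub simply records calls so that the "no tracer creation unless explicitly enabled" claim can be checked.

```python
from itertools import product

calls = []


def init_tracer_stub(name, endpoint):
    """Hypothetical initializer that records every invocation."""
    calls.append((name, endpoint))
    return object()


def maybe_init(enabled, endpoint):
    # Both conditions must hold before any tracer work happens.
    if not enabled:
        return None
    if endpoint is None:
        return None
    return init_tracer_stub("vllm.scheduler", endpoint)


ENDPOINT = "http://localhost:4317"  # placeholder value for illustration
results = {
    (enabled, endpoint): maybe_init(enabled, endpoint) is not None
    for enabled, endpoint in product([False, True], [None, ENDPOINT])
}
```

Of the four combinations, only enabled-with-endpoint reaches the initializer, so `calls` holds exactly one entry afterwards.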

✅ No Spans Created

  • This PR only initializes the tracer
  • Spans will be created in the next PR (with cleanup in same PR)
  • Zero tracing activity in this PR

✅ Graceful Degradation

  • SpanAttributes import wrapped in try/except (None fallback)
  • init_tracer() wrapped in try/except (warning log on failure)
  • Scheduler initialization succeeds even if OTEL fails
  • Warning log helps with debugging
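
The None-fallback import pattern can be shown without vLLM or OTEL installed. This is a generalized sketch under stated assumptions: `optional_import` is a hypothetical helper (the PR uses a plain try/except around a `from ... import`), and the missing module name is made up to simulate an absent OTEL package.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return module.attr, or None if the import or lookup fails."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except Exception:
        return None


# Made-up module name standing in for a missing OTEL install (assumption).
SpanAttributes = optional_import("otel_package_that_is_not_installed", "SpanAttributes")

# A module that does exist, to show the success path.
OrderedDict = optional_import("collections", "OrderedDict")


def describe_attribute_support() -> str:
    # Callers check for None before touching the optional symbol.
    return "disabled" if SpanAttributes is None else "enabled"
```

Any code that later uses the optional symbol has to check for None first, which is the cost of keeping the import failure non-fatal.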

✅ Backward Compatible

  • All new parameters have defaults
  • otlp_traces_endpoint defaults to None
  • Existing code works unchanged
  • No behavior changes unless explicitly enabled

✅ Legacy Tracing Untouched

  • RequestJourneyEvent buffering still works
  • OutputProcessor.do_tracing() still functional
  • No changes to existing journey tracing code

Test Results

All 85 tests passing (81 existing + 4 new):

pytest tests/v1/core/test_scheduler.py -v
# 85 passed, 16 warnings in 24.53s

Test Coverage:

  • ✅ Positive path: Tracer initialized when configured
  • ✅ Negative paths: Tracer is None when disabled/unconfigured
  • ✅ Smoke test: Scheduler initializes successfully
  • ✅ Error handling: Graceful failure with warning log
  • ✅ No regressions: All existing tests pass

Test Quality:

  • All tests use mocking (no external OTEL collector needed)
  • Deterministic and isolated
  • Fast execution
  • No flakiness

Code Review Notes

Issue identified during review: Test 3 initially called real init_tracer()
Fix applied: Added @patch decorator for deterministic testing
Result: All 4 tests now properly mocked and consistent


Resource Safety Checklist

  • ✅ No spans created → no spans to close
  • ✅ No per-request state → no per-request cleanup
  • ✅ No buffering when tracer absent → graceful degradation
  • ✅ Legacy tracing untouched → no regression risk
  • ✅ Tests cover the disabled paths → tracer is asserted to be None whenever tracing is disabled or unconfigured
  • ✅ Defensive error handling → tracing failures don't break scheduler
  • ✅ Zero overhead when disabled → two cheap guard checks, no tracer allocated

Architecture Context

This PR is part of the dual-stream journey tracing architecture (9 PRs total):

  • Core Layer (PRs 1-4): Initialize tracer, create child spans, emit events
  • API Layer (PRs 5-8): Add metadata, create parent spans, emit events
  • Linkage (PR 7): W3C Trace Context propagation
  • Cleanup (PR 9): Remove legacy buffering

This PR establishes the foundation for core layer tracing by initializing the tracer that will be used in the next PR.
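
As background for the Linkage item, a W3C Trace Context `traceparent` header carries four dash-separated hex fields: version, trace ID (32 hex digits), parent span ID (16 hex digits), and flags. The sketch below formats and parses version `00` headers by hand. It is not part of this PR and does not use the OTEL propagator API; `SpanContextSketch` and both functions are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Version 00 only: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContextSketch(NamedTuple):
    trace_id: int
    span_id: int
    sampled: bool


def format_traceparent(ctx: SpanContextSketch) -> str:
    flags = 0x01 if ctx.sampled else 0x00
    return f"00-{ctx.trace_id:032x}-{ctx.span_id:016x}-{flags:02x}"


def parse_traceparent(header: str) -> Optional[SpanContextSketch]:
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2), 16)
    if trace_id == 0 or span_id == 0:
        # All-zero IDs are invalid in the W3C format.
        return None
    sampled = bool(int(match.group(3), 16) & 0x01)
    return SpanContextSketch(trace_id, span_id, sampled)


api_ctx = SpanContextSketch(
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
    sampled=True,
)
header = format_traceparent(api_ctx)
core_ctx = parse_traceparent(header)
```

A child span created from `core_ctx` would share the API span's trace ID, which is the parent-child linkage the architecture relies on.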

See JOURNEY_TRACING_PR_PLAN.md for the complete implementation roadmap.


Next Steps

The next PR (2nd of 9) will:

  • Use self.tracer to create core spans in add_request()
  • Add _core_spans: dict[str, Span] to track active spans
  • Implement _end_core_span_and_cleanup() for all termination paths
  • Create spans AND cleanup in the same PR (Iron Rule compliance)
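
One way the planned span tracking could look is sketched below. The names `_core_spans`, `add_request()`, and `_end_core_span_and_cleanup()` come from the list above, but everything else is an assumption: `FakeSpan`, `FakeTracer`, `CoreSpanTracker`, and the span name are hypothetical, and the next PR's real implementation may differ.

```python
from typing import Dict, Optional


class FakeSpan:
    """Hypothetical span that counts how often end() is called."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.end_calls = 0

    def end(self) -> None:
        self.end_calls += 1


class FakeTracer:
    """Hypothetical tracer exposing only start_span()."""

    def start_span(self, name: str) -> FakeSpan:
        return FakeSpan(name)


class CoreSpanTracker:
    def __init__(self, tracer: Optional[FakeTracer]) -> None:
        self.tracer = tracer
        self._core_spans: Dict[str, FakeSpan] = {}

    def add_request(self, request_id: str) -> None:
        if self.tracer is None:
            return  # tracing disabled: no per-request state is created
        # "core_request" is a made-up span name for illustration.
        self._core_spans[request_id] = self.tracer.start_span("core_request")

    def _end_core_span_and_cleanup(self, request_id: str) -> None:
        # pop() makes cleanup idempotent across all termination paths.
        span = self._core_spans.pop(request_id, None)
        if span is not None:
            span.end()


tracker = CoreSpanTracker(FakeTracer())
tracker.add_request("req-1")
span = tracker._core_spans["req-1"]
tracker._end_core_span_and_cleanup("req-1")
tracker._end_core_span_and_cleanup("req-1")  # second call is a no-op

disabled = CoreSpanTracker(None)
disabled.add_request("req-2")
```

Removing the entry before ending the span means a request that hits two termination paths still ends its span exactly once.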

Related Documentation

  • Implementation Plan: JOURNEY_TRACING_PR_PLAN.md (updated with completion status)
  • User Guide: JOURNEY_TRACING.md (no changes needed - internal only)

Reviewer Checklist

When reviewing, please verify:

  • No per-request state introduced
  • No spans created (initialization only)
  • Zero overhead when disabled
  • Graceful error handling
  • All 4 tests pass and use mocking
  • No regressions in existing tests
  • Backward compatible (optional parameters)

Size: 4 files changed, 236 insertions(+), 32 deletions(-)
Review Time: ~10 minutes
Safe to merge: Yes - no per-request state, no spans, complete test coverage

sriumcp and others added 2 commits January 26, 2026 09:14
Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add tracer initialization in Scheduler.__init__() to support dual-stream
journey tracing architecture. This is the foundation for PR #2 which will
create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sriumcp sriumcp merged commit 24f2636 into main Jan 26, 2026
sriumcp added a commit that referenced this pull request Jan 27, 2026
Updates to reflect PR #7 completion:
- PR sequence table: Mark #7 as COMPLETED with 12 tests
- Dependency chain: Mark #6 and #7 as COMPLETED
- PR #7 section: Add completion status with commit hashes
- Document deliverables: inject_trace_context(), tests, guarantees

Remaining: PRs #8 (API events), #9 (remove buffering)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
…/9) (#15)

* [Feature] Add API↔Engine context propagation for journey tracing (PR #7/9)

This PR implements W3C Trace Context propagation from API spans to core spans,
enabling parent-child linkage in distributed traces. Completes the handshake
between PR #6 (API span lifecycle) and PR #2 (core span lifecycle).

Changes:
- Add inject_trace_context() helper to vllm/tracing.py
- Inject API span context into trace_headers after span creation
- Context flows to engine.generate() and scheduler for parent-child linkage
- Defensive error handling: injection failures never break requests
- Zero overhead when tracing disabled (early return)

Behavioral guarantees verified by tests:
- G1: Trace ID continuity (API and core spans share same trace_id)
- G2: W3C Trace Context format (traceparent header valid)
- G3: Trace continuation (trace_id preserved through Client→API→Core)
- G4: Graceful degradation (request continues on injection failure)
- G5: No exception propagation (injection failures caught)
- G6: Conditional injection (only when API span exists)

Invariants:
- I1: Backward compatibility (early return when tracing disabled)
- I2: Zero overhead when disabled (no propagator/allocation access)
- I3: No resource leaks (only modifies existing trace_headers dict)

Test coverage:
- 12 new tests (100% pass) covering all unit-testable properties
- 17 existing API span lifecycle tests pass (no regressions)
- Tests focus on behavioral properties, not implementation details

Safety properties:
- Zero new resources (only modifies existing dict)
- No cleanup obligations (dict managed by request lifecycle)
- Stateless transformation (span context → headers)
- Single injection point (strict ordering preserved)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Polish] Improve inject_trace_context docstring and strengthen test

Two quality improvements following code review:

1. Clarify inject_trace_context() docstring:
   - Previous: "or None if injection failed" (misleading)
   - Now: Explicitly documents when carrier is returned unchanged
   - Details all three early-return paths (OTEL unavailable, span None, exception)

2. Strengthen test_trace_id_preserved_through_chain():
   - Mock propagator now actually reads span.get_span_context()
   - Extracts trace_id and span_id from span context
   - Generates traceparent using those values (simulates real OTEL behavior)
   - Asserts get_span_context() was called
   - Better proves G1/G3 guarantees without requiring real OTLP exporter

Test results: All 29 tests pass (12 context propagation + 17 lifecycle)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #7 as completed in journey tracing plan

Updates to reflect PR #7 completion:
- PR sequence table: Mark #7 as COMPLETED with 12 tests
- Dependency chain: Mark #6 and #7 as COMPLETED
- PR #7 section: Add completion status with commit hashes
- Document deliverables: inject_trace_context(), tests, guarantees

Remaining: PRs #8 (API events), #9 (remove buffering)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* removing PR7_summary

Signed-off-by: Srinivasan Parthasarathy <spartha@us.ibm.com>

---------

Signed-off-by: Srinivasan Parthasarathy <spartha@us.ibm.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
Implements journey tracing PR #8:
- Add EVENT_TS_MONOTONIC attribute for API event timestamps
- Emit HANDOFF_TO_CORE event after engine.generate()
- Emit FIRST_RESPONSE_FROM_CORE event on first response (streaming and non-streaming)
- Set request attributes on API spans (model, prompt tokens, sampling params)
- Add _update_first_response_time() helper to track first response timing
- All span operations wrapped defensively (G7 compliance)
- Zero overhead when span not recording (G6 compliance)
- 12 behavioral tests covering G1, G3-G7 (G2 verified by code inspection)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
* [Feature] Add API lifecycle events and request attributes (PR #8)

Implements journey tracing PR #8:
- Add EVENT_TS_MONOTONIC attribute for API event timestamps
- Emit HANDOFF_TO_CORE event after engine.generate()
- Emit FIRST_RESPONSE_FROM_CORE event on first response (streaming and non-streaming)
- Set request attributes on API spans (model, prompt tokens, sampling params)
- Add _update_first_response_time() helper to track first response timing
- All span operations wrapped defensively (G7 compliance)
- Zero overhead when span not recording (G6 compliance)
- 12 behavioral tests covering G1, G3-G7 (G2 verified by code inspection)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Update master plan: Mark PR #8 as completed

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>