[Feature] Add monotonically increasing step counter to vLLM scheduler#1

Merged
sriumcp merged 3 commits into main from stepcounter on Jan 23, 2026
Conversation


@sriumcp sriumcp commented Jan 23, 2026

Summary

Adds a monotonically increasing scheduler step counter to track scheduler invocations. This counter is included in SchedulerOutput and serves as a building block for future trace streams (step stream, KV cache transfer stream) and request tracing correlation.

Motivation

The scheduler currently lacks a way to uniquely identify and correlate scheduling iterations across the system. This counter provides:

  • Trace stream infrastructure: Foundation for step-level tracing and debugging
  • Request correlation: Ability to track requests across scheduling iterations
  • KV cache tracing: Correlation of KV cache operations with specific scheduler steps
  • Performance analysis: Temporal markers for profiling and optimization

Implementation

Core Changes

  1. Scheduler class (vllm/v1/core/sched/scheduler.py)

    • Added scheduler_step_counter: int = 0 instance variable
    • Increments at the start of every schedule() call
    • First call produces step=1; subsequent calls increment monotonically
  2. SchedulerOutput dataclass (vllm/v1/core/sched/output.py)

    • Added scheduler_step: int = 0 field with default value
    • Placed at end of dataclass to avoid field ordering issues
    • Default value ensures backward compatibility
  3. Unit test (tests/v1/core/test_scheduler.py)

    • Added test_scheduler_step_counter() verifying:
      • First schedule() produces step=1
      • Subsequent calls increment (2, 3, 4...)
      • Empty schedules still increment counter
      • Counter continues after reset_prefix_cache()
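The core changes above can be condensed into a minimal sketch (class bodies reduced to the counter logic only; the real classes in vllm/v1/core/sched/ carry many more fields):

```python
# Condensed sketch of the counter behavior described above; not the
# actual vLLM classes.
from dataclasses import dataclass

@dataclass
class SchedulerOutput:
    # New field placed last, with a default, so existing construction
    # sites remain valid.
    scheduler_step: int = 0

class Scheduler:
    def __init__(self) -> None:
        # Starts at 0; the first schedule() call produces step 1.
        self.scheduler_step_counter: int = 0

    def schedule(self) -> SchedulerOutput:
        # Increment at the very start, so even empty schedules and
        # early returns advance the counter.
        self.scheduler_step_counter += 1
        return SchedulerOutput(scheduler_step=self.scheduler_step_counter)

sched = Scheduler()
steps = [sched.schedule().scheduler_step for _ in range(3)]
print(steps)  # [1, 2, 3]
```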

Design Decisions

  • Truly monotonic: Never resets throughout scheduler lifetime
  • Always increments: Even on empty schedules and early returns
  • First step = 1: Initialized to 0, incremented at start
  • Clear naming: scheduler_step (not step) to distinguish from decode steps
  • Backward compatible: Default value prevents breaking existing code
  • AsyncScheduler compatible: Inherits correctly via super().schedule()

Testing

All tests pass with no regressions:

  • ✅ New test: test_scheduler_step_counter - PASSED
  • ✅ All scheduler tests: 81/81 PASSED
  • ✅ Async scheduler tests: 8/8 PASSED
  • ✅ Prefix caching tests: 46/46 PASSED
  • ✅ Output module tests: 2/2 PASSED
  • ✅ Attention tests: 12/12 PASSED
  • Total: 149 tests PASSED

Backward Compatibility Verified

  • SchedulerOutput.make_empty() works without modification
  • Manual construction without scheduler_step uses default value
  • Existing test code continues to work
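The compatibility argument can be illustrated with a toy dataclass (the field name other than scheduler_step is an illustrative stand-in, not the real SchedulerOutput schema):

```python
from dataclasses import dataclass, field

@dataclass
class SchedulerOutput:
    # Pre-existing required field (illustrative stand-in).
    scheduled_new_reqs: list = field(default_factory=list)
    # New trailing field: callers that predate it never notice it.
    scheduler_step: int = 0

# Old-style construction without scheduler_step still works and
# silently gets the default value:
legacy = SchedulerOutput(scheduled_new_reqs=[])
print(legacy.scheduler_step)  # 0
```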

Use Cases

This counter enables:

  • Step stream: Track all scheduler operations per step
  • KV cache transfer stream: Correlate KV operations with scheduler steps
  • Request tracing: Follow requests through scheduling iterations
  • Distributed tracing: Correlate events across workers using step numbers
  • Performance debugging: Identify scheduling bottlenecks by step

Example Usage

# In engine/worker code
output = scheduler.schedule()
print(f"Scheduler step: {output.scheduler_step}")
# First call prints 1; later calls print 2, 3, ... (monotonically increasing)

# Even with no requests (idle periods)
output = scheduler.schedule()  # Empty schedule
print(f"Scheduler step: {output.scheduler_step}")  # Still increments

Notes

  • Counter is per-scheduler instance (each EngineCore has independent counter)
  • Python ints don't overflow, so the counter is safe for long-running services
  • Counter tracks scheduler invocations, not token generation steps
  • May advance during idle periods when engine ticks scheduler

sriumcp and others added 3 commits January 23, 2026 11:40
Adds scheduler_step counter to track scheduler invocations for trace
streams and request tracing. Counter increments with each schedule()
call, never resets, and is included in SchedulerOutput with
backward-compatible default value.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds comprehensive repository guide to help AI assistants work
effectively with the vLLM codebase. Includes structure overview,
conventions, testing patterns, and common tasks.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Renames repository guide to CLAUDE.md (consistent with README.md,
CONTRIBUTING.md) and removes it from .gitignore to ensure it's
tracked in the repository for future use.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@sriumcp sriumcp left a comment


lgtm

@sriumcp sriumcp merged commit 2566135 into main Jan 23, 2026
sriumcp added a commit that referenced this pull request Jan 26, 2026
/9) (#8)

* [Docs] Update journey tracing plan to reflect completed PR #0

Update plan document to account for completed work:
- Document PR #0 (EngineCoreEvent removal) as completed prerequisite
- Clarify that do_tracing() is current OTEL mechanism (not legacy)
- Update PR #9 to keep RequestJourneyEvent dataclass (needed for Prometheus)
- Fix terminology: 'legacy' = EngineCoreEvent (removed), 'current' = RequestJourneyEvent
- Add PR #0 to dependencies, timeline, and progress tracking sections

Key corrections:
- do_tracing() will NOT be removed (it's the current system)
- RequestJourneyEvent dataclass will NOT be removed (needed for metrics)
- Only buffering LOGIC will be removed in PR #9

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Initialize OTEL tracer in scheduler for journey tracing

Add tracer initialization in Scheduler.__init__() to support dual-stream
journey tracing architecture. This is the foundation for PR #2 which will
create and manage core spans.

Changes:
- Add defensive SpanAttributes import with None fallback
- Initialize tracer when enable_journey_tracing=True and endpoint configured
- Add try/except with warning log for graceful degradation
- Add otlp_traces_endpoint parameter to test utilities
- Add 4 comprehensive tests with proper mocking

Safety guarantees:
- Zero per-request state (tracer is class-level only)
- Zero overhead when disabled (boolean + endpoint guard)
- No spans created (initialization only)
- No cleanup needed (shared tracer instance)
- Backward compatible (all parameters optional)

Test results: All 85 tests passing (81 existing + 4 new)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
Extends the centralized cleanup method to handle journey tracing state
alongside core span cleanup. Fixes memory leak on natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
  - Cleanup #1: Core spans (always runs, independent of flags)
  - Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
* [Feature] Add journey state cleanup to scheduler (PR #3/9)

Extends the centralized cleanup method to handle journey tracing state
alongside core span cleanup. Fixes memory leak on natural completion path.

Changes:
- Extend _end_core_span_and_cleanup() with decoupled cleanup logic
  - Cleanup #1: Core spans (always runs, independent of flags)
  - Cleanup #2: Journey state (only if journey tracing enabled)
- Remove duplicate inline cleanup from finish_requests()
- Add 4 tests verifying state cleanup on all termination paths

Tests:
- test_journey_state_created: Verify state initialization
- test_journey_state_cleaned_on_finish: Explicit abort cleanup
- test_journey_state_cleaned_on_completion: Natural completion cleanup
- test_no_state_leak: No accumulation over 20 iterations

All 95 tests passing (4 new + 91 existing).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Mark PR #3 as completed in journey tracing plan

Updates:
- Mark PR #3 as COMPLETED in PR sequence summary
- Update PR dependencies to show PR #3 complete
- Add PR #3 to Implementation History section with full details
- Document commit hash (f4cf790) and PR number (vllm-project#33126)
- Record test results, code review process, and key achievements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 28, 2026
Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 28, 2026
…it (#18)

* [Docs] Fix journey tracing documentation inconsistencies

Address review feedback on journey tracing documentation:

- Fix PR count: clarify 10 PRs total (PR #0 prerequisite + PRs #1-#9)
- Correct test counts: 88 new tests (was inconsistently stated as 27+/45+)
- Add event naming clarification (api.ARRIVED, journey.QUEUED prefixes)
- Fix PR #6 streaming snippet to show finalize before yield [DONE]
- Label overhead numbers as ballpark estimates
- Clarify time domain usage (monotonic vs epoch, seconds vs nanoseconds)
- Explain trace context propagation (HTTP headers vs internal dict)
- Document error flow edge cases (truncated core events on early abort)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Tests] Remove obsolete journey buffering tests and add regression audit

Remove two failing tests that reference the legacy journey event buffering
system removed in PR #9 (commit 1d9b9f3):

- test_no_events_when_span_none: Referenced _journey_events_buffer_by_client
- test_legacy_buffering_still_works: Tested parallel buffering (no longer exists)

These tests validated the legacy buffering pathway that was intentionally
removed. Comprehensive coverage of the new span-based tracing exists in
tests/v1/core/test_pr9_no_buffering.py (16 tests, 337 lines).

Add REGRESSION_AUDIT_REPORT.md documenting comprehensive regression analysis
from v0.0.1 to HEAD:
- 42 files changed analyzed (10,824 insertions, 1,074 deletions)
- All production code paths verified safe
- Zero regressions to existing functionality
- Proper backward compatibility maintained
- OTEL imports optional and safe
- Metrics work independently of tracing

Test Results: 99 passed (all non-journey scheduler tests)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860
sriumcp added a commit that referenced this pull request Jan 29, 2026
…ty (PR #5) (#27)

* [Feature] Add rich request snapshot stream (PR #5)

Implements subsampled per-request detailed progress events with KV metrics:

- Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
- Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
- Use PR #4 get_per_request_kv_metrics() for KV cache data
- Two-stage sampling: batch summary sampled AND rich subsampled
- SpanAttributes: 10 new constants for per-request metrics
- Emission after batch summary, before _update_after_schedule()

Also fixes PR #3 CLI wiring bug:
- Wire step_tracing_enabled/sample_rate through EngineArgs
- Add fields to EngineArgs dataclass
- Pass to ObservabilityConfig constructor
- Add test_step_tracing_cli_wiring() for regression prevention

Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update step tracing plan with implementation progress

Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Removes all float equality comparisons (e.g., assert ts.monotonic == value)
from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues as specified in
the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
* [Feature] Add journey tracing probabilistic sampling

Implements PR #2: Journey Tracing API-Side Sampling in vLLM.

Changes:
- Add journey_tracing_sample_rate config (default 1.0, backward compatible)
- API layer makes probabilistic sampling decision per request
- Custom header x-vllm-journey-sampled propagates decision to engine
- Engine obeys API decision (authority model)
- End-to-end atomic: both API+engine spans exist or neither
- Independent of OTEL traceparent sampled bit
- Centralized header injection helper across all endpoints
- Robustness fix: normalize to mutable dict (handles immutable Mapping)

Tests:
- 10 new tests verify atomicity and backward compatibility
- All existing tests pass (backward compatible)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
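The sampling flow above can be sketched in a few lines (only the header name comes from the commit message; the helper names here are hypothetical):

```python
# Illustrative sketch of the API-side sampling decision and its
# propagation via the x-vllm-journey-sampled header.
import random

SAMPLED_HEADER = "x-vllm-journey-sampled"

def decide_sampling(sample_rate: float) -> bool:
    # The API layer makes one probabilistic decision per request;
    # rate=1.0 samples everything (the backward-compatible default).
    return random.random() < sample_rate

def inject_sampling_header(headers, sampled: bool) -> dict:
    # Normalize to a mutable dict so immutable Mapping inputs
    # (the robustness fix mentioned above) are handled uniformly.
    headers = dict(headers or {})
    headers[SAMPLED_HEADER] = "1" if sampled else "0"
    return headers

def engine_obeys(headers) -> bool:
    # The engine never re-samples; it obeys the API's decision, which
    # keeps API and core spans atomic (both exist or neither does).
    return headers.get(SAMPLED_HEADER) == "1"

hdrs = inject_sampling_header({}, decide_sampling(1.0))
print(engine_obeys(hdrs))  # True
```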

* [Docs] Update JOURNEY_TRACING.md for sampling feature

Update user-facing documentation to reflect PR #2 implementation.

Changes:
- Add comprehensive "Sampling for Production" section with 3 strategies
- Document new --journey-tracing-sample-rate flag (default 1.0)
- Explain vLLM native sampling vs OTEL sampling vs collector sampling
- Add comparison table for choosing the right sampling strategy
- Update configuration examples with sampling use cases
- Add Technical Details section on sampling architecture
- Add FAQ entries: vLLM vs OTEL sampling, atomicity guarantees
- Update Performance Impact section with sampling overhead details
- Update troubleshooting section with vLLM sampling solutions
- Add early mention of sampling capability in introduction

Key messages for users:
- Default behavior unchanged (sample_rate=1.0, backward compatible)
- vLLM native sampling reduces all overhead (recommended for production)
- End-to-end atomic: either both spans exist or neither (no partial traces)
- Independent from OTEL traceparent sampled bit
- Recommended rates: 10% for 1K-10K RPS, 1% for >10K RPS

* [Docs] Fix JOURNEY_TRACING.md accuracy issues and contradictions

Critical fixes:
- Fix service name vs tracer scope confusion in Jaeger navigation
  (service.name is what users select, scope.name is span attribute)
- Correct AsyncLLM span creation claims (was: "creates only core span",
  now: "creates no spans by default, core-only if manual header set")
- Eliminate contradiction: early doc claimed AsyncLLM creates spans,
  later sections correctly said no spans without manual header
- Qualify "every request creates two spans" to "when using vllm serve"
- Qualify sampling sections to explicitly state vllm serve requirement

Accuracy improvements:
- Soften overhead numbers: "~200-300ns" → "sub-microsecond" (less brittle)
- Qualify authority model as "OpenAI API Server" (not generic "API layer")
- Add comprehensive AsyncLLM FAQ with working code examples
- Add deployment modes section distinguishing vllm serve vs AsyncLLM

Impact: Prevents user confusion about AsyncLLM behavior (expecting
automatic tracing → getting zero traces → filing bugs). Documentation
now accurately reflects codebase reality verified in scheduler.py and
test_journey_tracing_sampling.py.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Bugfix] Add missing API span finalization for non-streaming completions

Non-streaming completion requests (/v1/completions with stream=false) were
missing all _finalize_api_span() calls, causing llm_request spans to never
export to OTLP collectors. This resulted in incomplete traces with only
llm_core (engine layer) spans visible, while llm_request (API layer) spans
remained orphaned in memory.

Root cause: The non-streaming code path (lines 319-368) had no finalization
on success, error paths, or fake stream generator (beam search with stream=true).

Added comprehensive span finalization matching the pattern used in streaming
completions and chat completions:
- Error paths: Finalize with ABORTED for CancelledError, GenerationError, ValueError
- Fake stream generator: Added try-finally with DEPARTED before [DONE]
- Success path: Finalize with DEPARTED before returning response
- Outer finally block: Unconditional cleanup for any uncaught exceptions

Impact:
- Fixes: Non-streaming /v1/completions now exports complete API-layer traces
- Preserves: Streaming completions continue to work (no changes to that path)
- Matches: Behavior now consistent with /v1/chat/completions endpoint

Testing:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Test", "max_tokens": 20}'

Expected result: Both llm_request (scope: vllm.api) and llm_core
(scope: vllm.scheduler) spans now appear in OTLP traces with proper
parent-child relationship.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Feature] Add nanosecond-precision timestamps to journey events

Adds ts_monotonic_ns field to RequestJourneyEvent for improved timestamp
precision. Uses single clock read with exact consistency (derive float from
int) to ensure both ts_monotonic and ts_monotonic_ns represent identical
instant. Fully backward compatible with default value of 0 for legacy code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
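The single-clock-read pattern from this commit can be sketched as follows (field names taken from the commit message; the constructor helper is hypothetical):

```python
# Read monotonic_ns once and derive the float from the int, so both
# fields describe the exact same instant.
import time
from dataclasses import dataclass

@dataclass
class RequestJourneyEvent:
    ts_monotonic: float = 0.0
    ts_monotonic_ns: int = 0  # default 0 keeps legacy callers working

def make_event() -> RequestJourneyEvent:
    ns = time.monotonic_ns()  # single clock read
    return RequestJourneyEvent(ts_monotonic=ns / 1e9, ts_monotonic_ns=ns)

ev = make_event()
# Exact consistency holds by construction, not by luck:
assert ev.ts_monotonic == ev.ts_monotonic_ns / 1e9
```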

* [Misc] Remove completed STEP_TRACING_PR_PLAN.md

Step tracing work is complete. Removing planning document.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Test] Remove float equality assertions from journey timestamp tests

Removes all float equality comparisons (e.g., assert ts.monotonic == value)
from integration tests. Tests now only verify:
- Presence of both timestamp fields
- Type correctness (float/int)
- Exact consistency via integer round-trip validation

This ensures robustness against float precision issues as specified in
the PR #1 constraints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
