[Feature] Add step-level batch summary tracing (PR #3) #22

Merged
sriumcp merged 1 commit into main from pr3ofsteptracing
Jan 28, 2026
sriumcp commented Jan 28, 2026

Implements PR #3 from STEP_TRACING_PR_PLAN.md

Summary

Step-level observability with probabilistic sampling for the vLLM scheduler.

Features

  • CLI flags: --step-tracing-enabled, --step-tracing-sample-rate
  • 16 span attributes: queue depths, batch composition, token counts, KV metrics
  • Long-lived scheduler_steps span with step.BATCH_SUMMARY events
  • Deterministic sampling for tests (hash-based)
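The hash-based deterministic sampling mentioned above could look roughly like the sketch below. This is not the code from this PR: the function name `should_sample_step`, the use of a step counter as the hash input, and the choice of SHA-256 are all assumptions made for illustration. The point is only that hashing a stable key gives the same sampling decision on every run, which makes tests reproducible.

```python
import hashlib


def should_sample_step(step_id: int, sample_rate: float) -> bool:
    """Hypothetical helper: decide deterministically whether to trace a step.

    Assumes each scheduler step has an integer id and that sample_rate
    is the fraction of steps to trace (0.0 = none, 1.0 = all).
    """
    if sample_rate <= 0.0:
        return False
    if sample_rate >= 1.0:
        return True
    # Hash the step id so the decision depends only on the id,
    # not on a random number generator or wall-clock time.
    digest = hashlib.sha256(str(step_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest onto the unit interval.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    sampled = [s for s in range(20) if should_sample_step(s, 0.25)]
    print("sampled steps:", sampled)
```

Because the decision is a pure function of `step_id` and `sample_rate`, a test can assert on exactly which steps are traced without seeding a random generator.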

Testing

  • 9 comprehensive unit tests
  • All 111/111 tests passing
  • Zero regressions

Performance

  • O(n) complexity (optimized from initial O(n²))
  • Zero overhead when disabled
  • Failure-safe (try/except wrapper)
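As an illustration of the disabled-by-default and failure-safe properties listed above, here is a minimal sketch. The helper `emit_batch_summary` and its callback parameters are hypothetical and are not taken from the PR. It assumes the tracing path is gated by a single boolean and that any exception raised while building or emitting the summary should be swallowed so scheduling continues.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger(__name__)


def emit_batch_summary(
    enabled: bool,
    build_attributes: Callable[[], Mapping[str, int]],
    emit: Callable[[Mapping[str, int]], None],
) -> bool:
    """Hypothetical wrapper; returns True only if a summary was emitted."""
    if not enabled:
        # Disabled path: return before building any attributes,
        # so the only cost is this boolean check.
        return False
    try:
        emit(build_attributes())
        return True
    except Exception:
        # Tracing must never break the scheduler; log and carry on.
        logger.debug("step tracing failed; continuing", exc_info=True)
        return False
```

Passing the attribute builder as a callable, rather than a prebuilt dict, keeps the attribute computation off the disabled path entirely.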

Review Fixes Applied

  • Fixed O(n²) → O(n) complexity with dict-based lookup
  • Moved emission before _update_after_schedule() for spec compliance
  • Removed dead code, explicit endpoint handling
  • Gated timestamp capture, documented test assumptions
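The O(n²) → O(n) fix via dict-based lookup can be sketched as follows. The `Req` type, the `summarize_scheduled` function, and the attribute names are hypothetical stand-ins, not the scheduler's real data structures. The sketch assumes the quadratic version searched the request list once per scheduled id, and that building an id-to-request dict once removes the inner scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Req:
    """Hypothetical request record used only for this illustration."""
    request_id: str
    is_prefill: bool


def summarize_scheduled(
    requests: list[Req], scheduled_tokens: dict[str, int]
) -> dict[str, int]:
    """Split scheduled token counts into prefill and decode totals."""
    # Build the index once: O(n). A list scan per scheduled id would
    # instead cost O(n) per lookup, O(n^2) overall.
    by_id = {req.request_id: req for req in requests}
    prefill = 0
    decode = 0
    for request_id, num_tokens in scheduled_tokens.items():
        req = by_id.get(request_id)  # O(1) average-case lookup
        if req is None:
            continue  # unknown id: skip rather than fail
        if req.is_prefill:
            prefill += num_tokens
        else:
            decode += num_tokens
    return {"prefill_tokens": prefill, "decode_tokens": decode}


if __name__ == "__main__":
    reqs = [Req("a", True), Req("b", False), Req("c", False)]
    print(summarize_scheduled(reqs, {"a": 128, "b": 1, "c": 1}))
```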

Files Changed

  • 7 files changed, 1,462 insertions(+)
  • 2 new files: STEP_TRACING_PR_PLAN.md, test_step_tracing.py
  • 5 modified: config, CLI, constants, scheduler, test utils

🤖 Generated with Claude Code

Implements step-level observability with probabilistic sampling:
- CLI flags: --step-tracing-enabled, --step-tracing-sample-rate
- Emits batch summary events per sampled scheduler step
- 16 attributes: queue depths, batch composition, token counts, KV metrics
- O(n) complexity, failure-safe, disabled by default
- 9 comprehensive tests, zero regressions (111/111 tests pass)

Fixes applied based on review:
- Fixed O(n²) → O(n) complexity with dict-based lookup
- Moved emission before _update_after_schedule() for spec compliance
- Removed dead code, explicit endpoint handling, gated timestamp capture
- Documented test assumptions for spec decode compatibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp merged commit 4c1afa8 into main Jan 28, 2026