Skip to content

[Bugfix] Fix prefill progress tracking for chunked prefill preemption#5

Merged
sriumcp merged 1 commit intomainfrom
candidateissues
Jan 24, 2026
Merged

[Bugfix] Fix prefill progress tracking for chunked prefill preemption#5
sriumcp merged 1 commit intomainfrom
candidateissues

Conversation

@sriumcp
Copy link
Copy Markdown

@sriumcp sriumcp commented Jan 24, 2026

Summary

Fixes a bug where prefill progress tracking was incomplete for chunked prefill requests that get preempted after multiple scheduling iterations. This caused PREEMPTED journey events to show stale prefill progress (e.g., 256 tokens instead of 768 tokens actually scheduled).

Problem

The prefill hi-water mark (_journey_prefill_hiwater) was only updated when scheduling requests from the waiting queue, not from the running queue.

For chunked prefill (when long_prefill_token_threshold is set), requests are:

  1. Scheduled from waiting queue for first chunk (e.g., 256 tokens) → hiwater updated ✓
  2. Moved to running queue with status=RUNNING
  3. Scheduled again from running queue for next chunk (e.g., 256 more tokens) → hiwater NOT updated ✗
  4. If preempted at this point, the PREEMPTED event shows only 256 tokens progress instead of 512+

This affected:

  • Workloads with long prompts (> long_prefill_token_threshold)
  • High preemption pressure scenarios
  • OpenTelemetry trace consumers relying on accurate prefill progress

Solution

Centralized the hiwater update in _update_after_schedule() method, which runs for ALL scheduling paths:

  • Waiting queue schedules
  • Running queue schedules (chunked prefill continuation)
  • Resumed requests after preemption

This ensures hiwater accurately tracks cumulative prefill progress across all scheduling iterations.

Key changes:

  • Added hiwater update logic to _update_after_schedule() (lines 933-940 in scheduler.py)
  • Removed redundant update from waiting queue loop to avoid timing inconsistencies
  • Update happens after num_computed_tokens is incremented, ensuring correct values

Testing

Added 3 comprehensive tests covering the bug and edge cases:

  1. test_chunked_prefill_running_queue_hiwater: Verifies hiwater updates across multiple running queue scheduling iterations (chunked prefill)
  2. test_chunked_prefill_preemption_accurate_progress: Verifies PREEMPTED events show correct prefill progress after multiple chunks (core bug fix validation)
  3. test_hiwater_monotonic_across_preemption_cycles: Verifies hiwater survives preemption and never decreases

Updated 1 existing test to reflect new timing (hiwater updated after event emission).

Test results:

  • ✅ All 11 journey event tests pass
  • ✅ All 81 scheduler tests pass
  • ✅ No regressions

Impact

  • Backward compatible: No API/schema changes, only data accuracy improvements
  • Performance: Near-zero overhead when journey tracing is disabled (single boolean check)
  • Correctness: PREEMPTED events now show accurate prefill progress for chunked prefill scenarios
  • Coverage: Fixes apply to all request types (standard, prefix-cached, KV-transferred, speculative)

Example

Before fix:

Iteration 1: Schedule 256 tokens → hiwater = 256
Iteration 2: Schedule 256 tokens → hiwater = 256 (stale)
Iteration 3: Schedule 256 tokens → hiwater = 256 (stale)
Preemption → PREEMPTED event shows 256/1000 tokens (25.6% progress) ❌

After fix:

Iteration 1: Schedule 256 tokens → hiwater = 256
Iteration 2: Schedule 256 tokens → hiwater = 512
Iteration 3: Schedule 256 tokens → hiwater = 768
Preemption → PREEMPTED event shows 768/1000 tokens (76.8% progress) ✓

🤖 Generated with Claude Code

The prefill hi-water mark was only updated when scheduling from the
waiting queue, not when scheduling from the running queue. This caused
chunked prefill requests (scheduled multiple times across iterations)
to report stale progress in PREEMPTED events.

Fix: Centralize hiwater update in _update_after_schedule() to cover
all scheduling paths (waiting, running, resume). This ensures accurate
prefill progress tracking for chunked prefill scenarios.

Tests: Add comprehensive tests for chunked prefill hiwater updates,
preemption accuracy, and monotonicity invariants.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@sriumcp sriumcp merged commit a8f3dfc into main Jan 24, 2026
sriumcp added a commit that referenced this pull request Jan 27, 2026
Add span tracking dict and helper methods to OpenAIServing for future
API-side journey tracing. This PR adds infrastructure only - no spans
are created or managed yet (deferred to PR #6).

Changes:
- Add _api_spans dict to track (span, arrival_time, first_response_time)
- Add _cached_is_tracing_enabled flag with defensive error handling
- Add helper methods: _get_is_tracing_enabled, _store_api_span,
  _get_api_span_info, _cleanup_api_span
- Add 8 comprehensive unit tests for dict operations and caching

All state is private and lives outside Pydantic models to avoid
serialization issues. Safe to merge independently - introduces no
lifecycle obligations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 27, 2026
Mark PR #5 as completed with implementation summary:
- Status: Completed (PR #13 merged)
- Branch: pr5ofjourney
- Size: 67 lines production code, 177 lines test code
- Tests: 8 comprehensive tests, all passing
- Implementation details: Added _api_spans dict, helper methods, and
  defensive error handling

Updated code examples to match actual implementation including
error handling in _get_is_tracing_enabled() method.
sriumcp added a commit that referenced this pull request Jan 27, 2026
Reduce Implementation History from 229 lines to 53 lines (~77% reduction):
- Condensed PR #0-#5 to essential info only (commit, PR number, line stats)
- Removed verbose implementation details (already in individual PR sections)
- Added PR #5 completion entry
- Added summary line: 5 PRs completed, ~350 production lines, ~1022 test lines, 31 tests

Total file size: 2266 → 2089 lines (177 line reduction)
sriumcp added a commit that referenced this pull request Jan 27, 2026
* [Feature] Add API span tracking infrastructure (PR #5/9)

Add span tracking dict and helper methods to OpenAIServing for future
API-side journey tracing. This PR adds infrastructure only - no spans
are created or managed yet (deferred to PR #6).

Changes:
- Add _api_spans dict to track (span, arrival_time, first_response_time)
- Add _cached_is_tracing_enabled flag with defensive error handling
- Add helper methods: _get_is_tracing_enabled, _store_api_span,
  _get_api_span_info, _cleanup_api_span
- Add 8 comprehensive unit tests for dict operations and caching

All state is private and lives outside Pydantic models to avoid
serialization issues. Safe to merge independently - introduces no
lifecycle obligations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update journey tracing plan with PR #5 completion

Mark PR #5 as completed with implementation summary:
- Status: Completed (PR #13 merged)
- Branch: pr5ofjourney
- Size: 67 lines production code, 177 lines test code
- Tests: 8 comprehensive tests, all passing
- Implementation details: Added _api_spans dict, helper methods, and
  defensive error handling

Updated code examples to match actual implementation including
error handling in _get_is_tracing_enabled() method.

* [Docs] Condense Implementation History section

Reduce Implementation History from 229 lines to 53 lines (~77% reduction):
- Condensed PR #0-#5 to essential info only (commit, PR number, line stats)
- Removed verbose implementation details (already in individual PR sections)
- Added PR #5 completion entry
- Added summary line: 5 PRs completed, ~350 production lines, ~1022 test lines, 31 tests

Total file size: 2266 → 2089 lines (177 line reduction)

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Implements subsampled per-request detailed progress events with KV metrics:

- Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
- Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
- Use PR #4 get_per_request_kv_metrics() for KV cache data
- Two-stage sampling: batch summary sampled AND rich subsampled
- SpanAttributes: 10 new constants for per-request metrics
- Emission after batch summary, before _update_after_schedule()

Also fixes PR #3 CLI wiring bug:
- Wire step_tracing_enabled/sample_rate through EngineArgs
- Add fields to EngineArgs dataclass
- Pass to ObservabilityConfig constructor
- Add test_step_tracing_cli_wiring() for regression prevention

Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
sriumcp added a commit that referenced this pull request Jan 29, 2026
Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860
sriumcp added a commit that referenced this pull request Jan 29, 2026
…ty (PR #5) (#27)

* [Feature] Add rich request snapshot stream (PR #5)

Implements subsampled per-request detailed progress events with KV metrics:

- Add step_tracing_rich_subsample_rate config (default 0.001 = 0.1%)
- Emit step.REQUEST_SNAPSHOT events for running requests when subsampled
- Use PR #4 get_per_request_kv_metrics() for KV cache data
- Two-stage sampling: batch summary sampled AND rich subsampled
- SpanAttributes: 10 new constants for per-request metrics
- Emission after batch summary, before _update_after_schedule()

Also fixes PR #3 CLI wiring bug:
- Wire step_tracing_enabled/sample_rate through EngineArgs
- Add fields to EngineArgs dataclass
- Pass to ObservabilityConfig constructor
- Add test_step_tracing_cli_wiring() for regression prevention

Tests: 6 new tests (5 rich snapshot + 1 CLI wiring), all 15 pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* [Docs] Update step tracing plan with implementation progress

Refresh plan to capture completed PRs #3, #4, #5 with accurate history:

Progress tracking:
- Add Implementation Progress section with status table
- Mark PR #3, #4, #5 as complete with commit hashes
- Mark PR #1, #2 as deferred (low priority, orthogonal)
- Update dependency graph with status indicators

Historical corrections:
- PR #3: CLI args defined but wiring missing (fixed in PR #5)
- PR #5: Added CLI wiring fix for all 3 step tracing flags
- Add NOTE in PR #3 section about wiring gap
- Update PR #5 behavioral contract to document CLI fix

Technical corrections:
- Fix output tokens source: len(_output_token_ids) → num_output_tokens (property)
- Update test file references: test_scheduler.py → test_step_tracing.py
- Change test count "15/15" → "test suite passing" (future-proof)

Verification updates:
- Mark all PR #3, #4, #5 checklist items as complete
- Add CLI wiring regression test item to PR #5 checklist

Current state: PR #5 ready for merge at commit f951860

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant