[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts#30
Conversation
…ests
This commit fixes failing integration tests by removing tests for obsolete
buffering architecture and adding 2 NEW focused integration tests that verify
the post-PR#9 cross-component contracts.
CHANGES:
1. Deleted tests/v1/engine/test_journey_tracing_integration.py (12 tests)
- All tests were testing removed buffering architecture (RequestState.journey_events)
2. Updated test_journey_events_populate_prometheus_timestamps
- Changed to use EngineCoreOutput timestamp fields (new architecture)
- Removed deprecated journey_events parameter usage
3. Added 2 NEW integration tests (focused, high-signal):
a) test_timestamp_propagation_batch_processing
- Verifies timestamps propagate correctly for multiple concurrent requests
- Tests cross-component wiring: EngineCoreOutput → OutputProcessor → stats
- Why integration: Tests batch coordination and per-request isolation
b) test_request_finish_metrics_completeness
- Verifies RequestOutput.metrics is complete when request finishes
- Tests finish flow: EngineCore → OutputProcessor → API
- Why integration: Tests finish trigger mechanism across components
ROOT CAUSE:
PR #9 removed journey event buffering architecture. Journey events now emit
directly to OTEL spans in the scheduler. Metrics timestamps flow through
EngineCoreOutput fields, not journey_events parameter.
NEW INTEGRATION CONTRACTS (Post-PR#9):
1. Timestamp Propagation: EngineCoreOutput → OutputProcessor → RequestState.stats
- Works for batch processing (multiple concurrent requests)
- Maintains per-request isolation
- Independent of journey tracing
2. Metrics Completeness: RequestOutput.metrics fully populated on finish
- Timing metrics (arrival, queued, scheduled, first/last token)
- Token count metrics
- Finish reason propagated correctly
3. Deprecated API Inertness: journey_events parameter ignored (backward compat)
- Already tested by updated test_journey_events_populate_prometheus_timestamps
WHY THESE TESTS (Not 1:1 Porting):
- Deleted tests verified buffering (intentionally removed)
- New tests verify cross-component wiring (post-PR#9 architecture)
- Focused on stable public contracts, not internal implementation
- Complement existing coverage:
* test_pr9_no_buffering.py: Scheduler-level behavior (16 tests)
* New integration tests: Cross-component wiring (2 tests)
MAINTAINER NARRATIVE:
Why integration tests instead of unit tests?
- Timestamp batch processing: Unit tests verify single request, integration
verifies batch coordination and per-request isolation
- Finish metrics: Unit tests verify individual calculations, integration
verifies finish trigger mechanism across EngineCore→OutputProcessor→API
- These test WIRING, not just individual component behavior
Test Results:
- test_timestamp_propagation_batch_processing: PASSED
- test_request_finish_metrics_completeness: PASSED
- test_pr9_no_buffering.py (16 tests): PASSED
- test_journey_events_populate_prometheus_timestamps: PASSED
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improvements based on rigorous maintainer review and code verification: P0 - CRITICAL (Missing Coverage): - Add finish_reason propagation validation in test_request_finish_metrics_completeness * Verify FinishReason.LENGTH → "length" string conversion (via __str__) * Assert finish_reason appears in CompletionOutput.finish_reason field * Assert RequestOutput.finished reflects completion state * Verified: FinishReason.LENGTH stringifies to "length" via FINISH_REASON_STRINGS[1] P1 - HIGH (Strengthen Weak Assertions): - Replace `> 0.0` checks with exact equality for first_token_ts and last_token_ts * first_token_ts == 1100.0 (set on first update_from_output while is_prefilling=True) * last_token_ts == 1170.0 (set from engine_core_timestamp in final step) * Verified: Both are direct assignments from engine_core_timestamp (no transformation) - Add timing invariants: queued_ts <= scheduled_ts, first_token_ts <= last_token_ts * Removed scheduled_ts <= first_token_ts (not guaranteed in this test scenario) P2 - MEDIUM (Documentation): - Enhanced test_journey_events_populate_prometheus_timestamps docstring * Document first-write-wins conditions: source > 0.0 AND dest == 0.0 * Reference exact code locations (output_processor.py:535,537) * Add comment explaining copy condition sensitivity - Enhanced test_timestamp_propagation_batch_processing docstring * Define "isolation": each request's stats contain only its own timestamps * Clarify no cross-contamination between requests in batch - Add clock domain documentation to test_request_finish_metrics_completeness * arrival_time: wall-clock (time.time()) - verified from input_processor.py:477 * queued_ts/scheduled_ts/first/last_token_ts: monotonic - verified from scheduler.py * Warning: DO NOT compare across domains in production code P3 - LOW (Internal State Check): - Label request_states cleanup check as "leak-guard invariant" * Document guarantee: output_processor.py:587 unconditional pop on finish * Valid contract verification, not just implementation detail All changes verified against actual code behavior. No assumptions or speculation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CRITICAL CLARIFICATION based on code verification:
The field name "first_token_ts" is semantically misleading. It does NOT mean
"timestamp when first token is generated" but rather "timestamp of first
prefill update" (even if no tokens generated).
Code Evidence (stats.py:288-289):
```python
if is_prefilling:
req_stats.first_token_ts = engine_core_timestamp
```
Condition: ONLY checks `is_prefilling` flag (no check on token emission)
Order of Operations (output_processor.py:541-552):
1. _update_stats_from_output() called (sets first_token_ts)
2. THEN is_prefilling = False
Test Execution:
- Step 1 (new_token_ids=[], timestamp=1100.0): first_token_ts = 1100.0 ✓
- Step 2 (new_token_ids=[42], timestamp=1150.0): first_token_ts unchanged (is_prefilling=False)
Assertion is CORRECT (1100.0), but semantic meaning needed clarification.
Changes:
- Enhanced inline comment explaining condition (stats.py:288 reference)
- Note semantic confusion: "first_token_ts" ≠ "first token generated"
- Clarify it means "first prefill update timestamp"
- Add docstring section explaining this behavior
- Reference exact test steps to avoid future confusion
No behavior changes. Documentation only.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Revised wording to be neutral and professional: - "Semantic confusion" → "Note on first_token_ts definition" - "doesn't mean" → "Currently defined as" - "Despite the name" → "Currently defined as" - "Does NOT check" → "Condition checks ... only" - "NOT 1150.0" → "not step 2" Preserves all technical accuracy while avoiding judgmental language. No behavioral changes. Documentation wording only.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces 12 obsolete integration tests (1260 lines) that tested the removed buffering architecture with 3 focused integration tests that verify stable post-PR#9 contracts:
Changes
Deleted
tests/v1/engine/test_journey_tracing_integration.py(12 tests, 1260 lines)RequestState.journey_eventslist removed in PR [CI] Add Docker build and push workflow #9Updated
tests/v1/engine/test_output_processor.pytest_journey_events_populate_prometheus_timestampsEngineCoreOutput.queued_ts/scheduled_tsfields (stable public API)test_timestamp_propagation_batch_processing(193 lines)test_request_finish_metrics_completeness(147 lines)finish_reasonpropagation (FinishReason.LENGTH → "length")Test Results
All 34 post-v0.0.1 tests passing (100%):
# Verification commands pytest tests/v1/core/test_pr9_no_buffering.py -v pytest tests/v1/core/test_step_tracing.py -v pytest tests/v1/engine/test_output_processor.py::test_journey_events_populate_prometheus_timestamps \ tests/v1/engine/test_output_processor.py::test_timestamp_propagation_batch_processing \ tests/v1/engine/test_output_processor.py::test_request_finish_metrics_completeness -vKey Improvements
P0: Critical Coverage
P1: Strengthened Assertions
> 0.0checks with exact equalityP2: Enhanced Documentation
Why These Tests Are Better
Before (12 obsolete tests):
> 0.0)After (3 focused tests):
Code Evidence
All assertions verified against actual code:
FinishReason.__str__returns"length"(vllm/v1/engine/init.py:22-30)first_token_ts = engine_core_timestamp(vllm/v1/metrics/stats.py:289)last_token_ts = engine_core_timestamp(vllm/v1/metrics/stats.py:294)source > 0.0 and dest == 0.0(vllm/v1/engine/output_processor.py:535,537)🤖 Generated with Claude Code