[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts by sriumcp · Pull Request #30 · inference-sim/vllm

sriumcp · 2026-01-30T01:05:31Z

Summary

Replaces 12 obsolete integration tests (1260 lines) that tested the removed buffering architecture with 3 focused integration tests that verify stable post-PR#9 contracts:

Timestamp Propagation - First-write-wins semantics for metrics timestamps
Batch Processing Isolation - Per-request timestamp independence
Request Finish Completeness - Complete metrics + finish_reason propagation

Changes

Deleted

tests/v1/engine/test_journey_tracing_integration.py (12 tests, 1260 lines)
- All tests expected RequestState.journey_events list removed in PR [CI] Add Docker build and push workflow #9
- Tested internal buffering API that no longer exists

Updated

tests/v1/engine/test_output_processor.py
- Updated: test_journey_events_populate_prometheus_timestamps
  - Now uses EngineCoreOutput.queued_ts/scheduled_ts fields (stable public API)
  - Enhanced docstring documenting first-write-wins contract
- Added: test_timestamp_propagation_batch_processing (193 lines)
  - Verifies batch isolation - no timestamp cross-contamination
  - Tests concurrent request handling
- Added: test_request_finish_metrics_completeness (147 lines)
  - Critical: Validates finish_reason propagation (FinishReason.LENGTH → "length")
  - Verifies complete metrics on finish (all timestamps populated)
  - Asserts timing invariants (queued ≤ scheduled, first ≤ last)
  - Documents clock domains (wall-clock vs monotonic)

Test Results

All 34 post-v0.0.1 tests passing (100%):

Core PR [CI] Add Docker build and push workflow #9 tests: 16/16 ✅
Step tracing tests: 15/15 ✅
Our integration tests: 3/3 ✅

# Verification commands
pytest tests/v1/core/test_pr9_no_buffering.py -v
pytest tests/v1/core/test_step_tracing.py -v
pytest tests/v1/engine/test_output_processor.py::test_journey_events_populate_prometheus_timestamps \
       tests/v1/engine/test_output_processor.py::test_timestamp_propagation_batch_processing \
       tests/v1/engine/test_output_processor.py::test_request_finish_metrics_completeness -v

Key Improvements

P0: Critical Coverage

✅ Added finish_reason propagation validation
✅ Verified FinishReason.LENGTH → "length" string conversion
✅ Assert CompletionOutput.finish_reason == "length"

P1: Strengthened Assertions

✅ Replaced weak > 0.0 checks with exact equality
✅ Added timing invariants (queued ≤ scheduled, first ≤ last)
✅ Verified direct assignment (no transformation)

P2: Enhanced Documentation

✅ Documented first-write-wins conditions (source > 0.0 AND dest == 0.0)
✅ Clarified batch isolation guarantees
✅ Added clock domain warnings (wall-clock vs monotonic)
✅ Explained first_token_ts definition (first prefill update)

Why These Tests Are Better

Before (12 obsolete tests):

Tested removed internal buffering API
Fragile coupling to implementation details
No coverage of finish_reason propagation
Weak assertions (> 0.0)

After (3 focused tests):

Test stable cross-component contracts
Verify public API behavior
Critical coverage added (finish_reason)
Strong assertions (exact equality + invariants)
Comprehensive documentation

Code Evidence

All assertions verified against actual code:

✅ FinishReason.__str__ returns "length" (vllm/v1/engine/init.py:22-30)
✅ first_token_ts = engine_core_timestamp (vllm/v1/metrics/stats.py:289)
✅ last_token_ts = engine_core_timestamp (vllm/v1/metrics/stats.py:294)
✅ Copy condition: source > 0.0 and dest == 0.0 (vllm/v1/engine/output_processor.py:535,537)

🤖 Generated with Claude Code

…ests This commit fixes failing integration tests by removing tests for obsolete buffering architecture and adding 2 NEW focused integration tests that verify the post-PR#9 cross-component contracts. CHANGES: 1. Deleted tests/v1/engine/test_journey_tracing_integration.py (12 tests) - All tests were testing removed buffering architecture (RequestState.journey_events) 2. Updated test_journey_events_populate_prometheus_timestamps - Changed to use EngineCoreOutput timestamp fields (new architecture) - Removed deprecated journey_events parameter usage 3. Added 2 NEW integration tests (focused, high-signal): a) test_timestamp_propagation_batch_processing - Verifies timestamps propagate correctly for multiple concurrent requests - Tests cross-component wiring: EngineCoreOutput → OutputProcessor → stats - Why integration: Tests batch coordination and per-request isolation b) test_request_finish_metrics_completeness - Verifies RequestOutput.metrics is complete when request finishes - Tests finish flow: EngineCore → OutputProcessor → API - Why integration: Tests finish trigger mechanism across components ROOT CAUSE: PR #9 removed journey event buffering architecture. Journey events now emit directly to OTEL spans in the scheduler. Metrics timestamps flow through EngineCoreOutput fields, not journey_events parameter. NEW INTEGRATION CONTRACTS (Post-PR#9): 1. Timestamp Propagation: EngineCoreOutput → OutputProcessor → RequestState.stats - Works for batch processing (multiple concurrent requests) - Maintains per-request isolation - Independent of journey tracing 2. Metrics Completeness: RequestOutput.metrics fully populated on finish - Timing metrics (arrival, queued, scheduled, first/last token) - Token count metrics - Finish reason propagated correctly 3. Deprecated API Inertness: journey_events parameter ignored (backward compat) - Already tested by updated test_journey_events_populate_prometheus_timestamps WHY THESE TESTS (Not 1:1 Porting): - Deleted tests verified buffering (intentionally removed) - New tests verify cross-component wiring (post-PR#9 architecture) - Focused on stable public contracts, not internal implementation - Complement existing coverage: * test_pr9_no_buffering.py: Scheduler-level behavior (16 tests) * New integration tests: Cross-component wiring (2 tests) MAINTAINER NARRATIVE: Why integration tests instead of unit tests? - Timestamp batch processing: Unit tests verify single request, integration verifies batch coordination and per-request isolation - Finish metrics: Unit tests verify individual calculations, integration verifies finish trigger mechanism across EngineCore→OutputProcessor→API - These test WIRING, not just individual component behavior Test Results: - test_timestamp_propagation_batch_processing: PASSED - test_request_finish_metrics_completeness: PASSED - test_pr9_no_buffering.py (16 tests): PASSED - test_journey_events_populate_prometheus_timestamps: PASSED Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Improvements based on rigorous maintainer review and code verification: P0 - CRITICAL (Missing Coverage): - Add finish_reason propagation validation in test_request_finish_metrics_completeness * Verify FinishReason.LENGTH → "length" string conversion (via __str__) * Assert finish_reason appears in CompletionOutput.finish_reason field * Assert RequestOutput.finished reflects completion state * Verified: FinishReason.LENGTH stringifies to "length" via FINISH_REASON_STRINGS[1] P1 - HIGH (Strengthen Weak Assertions): - Replace `> 0.0` checks with exact equality for first_token_ts and last_token_ts * first_token_ts == 1100.0 (set on first update_from_output while is_prefilling=True) * last_token_ts == 1170.0 (set from engine_core_timestamp in final step) * Verified: Both are direct assignments from engine_core_timestamp (no transformation) - Add timing invariants: queued_ts <= scheduled_ts, first_token_ts <= last_token_ts * Removed scheduled_ts <= first_token_ts (not guaranteed in this test scenario) P2 - MEDIUM (Documentation): - Enhanced test_journey_events_populate_prometheus_timestamps docstring * Document first-write-wins conditions: source > 0.0 AND dest == 0.0 * Reference exact code locations (output_processor.py:535,537) * Add comment explaining copy condition sensitivity - Enhanced test_timestamp_propagation_batch_processing docstring * Define "isolation": each request's stats contain only its own timestamps * Clarify no cross-contamination between requests in batch - Add clock domain documentation to test_request_finish_metrics_completeness * arrival_time: wall-clock (time.time()) - verified from input_processor.py:477 * queued_ts/scheduled_ts/first/last_token_ts: monotonic - verified from scheduler.py * Warning: DO NOT compare across domains in production code P3 - LOW (Internal State Check): - Label request_states cleanup check as "leak-guard invariant" * Document guarantee: output_processor.py:587 unconditional pop on finish * Valid contract verification, not just implementation detail All changes verified against actual code behavior. No assumptions or speculation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

CRITICAL CLARIFICATION based on code verification: The field name "first_token_ts" is semantically misleading. It does NOT mean "timestamp when first token is generated" but rather "timestamp of first prefill update" (even if no tokens generated). Code Evidence (stats.py:288-289): ```python if is_prefilling: req_stats.first_token_ts = engine_core_timestamp ``` Condition: ONLY checks `is_prefilling` flag (no check on token emission) Order of Operations (output_processor.py:541-552): 1. _update_stats_from_output() called (sets first_token_ts) 2. THEN is_prefilling = False Test Execution: - Step 1 (new_token_ids=[], timestamp=1100.0): first_token_ts = 1100.0 ✓ - Step 2 (new_token_ids=[42], timestamp=1150.0): first_token_ts unchanged (is_prefilling=False) Assertion is CORRECT (1100.0), but semantic meaning needed clarification. Changes: - Enhanced inline comment explaining condition (stats.py:288 reference) - Note semantic confusion: "first_token_ts" ≠ "first token generated" - Clarify it means "first prefill update timestamp" - Add docstring section explaining this behavior - Reference exact test steps to avoid future confusion No behavior changes. Documentation only. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Revised wording to be neutral and professional: - "Semantic confusion" → "Note on first_token_ts definition" - "doesn't mean" → "Currently defined as" - "Despite the name" → "Currently defined as" - "Does NOT check" → "Condition checks ... only" - "NOT 1150.0" → "not step 2" Preserves all technical accuracy while avoiding judgmental language. No behavioral changes. Documentation wording only.

sriumcp and others added 4 commits January 29, 2026 19:07

sriumcp merged commit edbe616 into main Jan 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts#30

[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts#30
sriumcp merged 4 commits intomainfrom
fixtests

sriumcp commented Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sriumcp commented Jan 30, 2026

Summary

Changes

Deleted

Updated

Test Results

Key Improvements

P0: Critical Coverage

P1: Strengthened Assertions

P2: Enhanced Documentation

Why These Tests Are Better

Code Evidence

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant