Skip to content

[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts#30

Merged
sriumcp merged 4 commits intomainfrom
fixtests
Jan 30, 2026
Merged

[Testing] Replace obsolete journey tracing integration tests with focused post-PR#9 contracts#30
sriumcp merged 4 commits intomainfrom
fixtests

Conversation

@sriumcp
Copy link
Copy Markdown

@sriumcp sriumcp commented Jan 30, 2026

Summary

Replaces 12 obsolete integration tests (1260 lines) that tested the removed buffering architecture with 3 focused integration tests that verify stable post-PR#9 contracts:

  1. Timestamp Propagation - First-write-wins semantics for metrics timestamps
  2. Batch Processing Isolation - Per-request timestamp independence
  3. Request Finish Completeness - Complete metrics + finish_reason propagation

Changes

Deleted

  • tests/v1/engine/test_journey_tracing_integration.py (12 tests, 1260 lines)

Updated

  • tests/v1/engine/test_output_processor.py
    • Updated: test_journey_events_populate_prometheus_timestamps
      • Now uses EngineCoreOutput.queued_ts/scheduled_ts fields (stable public API)
      • Enhanced docstring documenting first-write-wins contract
    • Added: test_timestamp_propagation_batch_processing (193 lines)
      • Verifies batch isolation - no timestamp cross-contamination
      • Tests concurrent request handling
    • Added: test_request_finish_metrics_completeness (147 lines)
      • Critical: Validates finish_reason propagation (FinishReason.LENGTH → "length")
      • Verifies complete metrics on finish (all timestamps populated)
      • Asserts timing invariants (queued ≤ scheduled, first ≤ last)
      • Documents clock domains (wall-clock vs monotonic)

Test Results

All 34 post-v0.0.1 tests passing (100%):

# Verification commands
pytest tests/v1/core/test_pr9_no_buffering.py -v
pytest tests/v1/core/test_step_tracing.py -v
pytest tests/v1/engine/test_output_processor.py::test_journey_events_populate_prometheus_timestamps \
       tests/v1/engine/test_output_processor.py::test_timestamp_propagation_batch_processing \
       tests/v1/engine/test_output_processor.py::test_request_finish_metrics_completeness -v

Key Improvements

P0: Critical Coverage

  • ✅ Added finish_reason propagation validation
  • ✅ Verified FinishReason.LENGTH → "length" string conversion
  • ✅ Assert CompletionOutput.finish_reason == "length"

P1: Strengthened Assertions

  • ✅ Replaced weak > 0.0 checks with exact equality
  • ✅ Added timing invariants (queued ≤ scheduled, first ≤ last)
  • ✅ Verified direct assignment (no transformation)

P2: Enhanced Documentation

  • ✅ Documented first-write-wins conditions (source > 0.0 AND dest == 0.0)
  • ✅ Clarified batch isolation guarantees
  • ✅ Added clock domain warnings (wall-clock vs monotonic)
  • ✅ Explained first_token_ts definition (first prefill update)

Why These Tests Are Better

Before (12 obsolete tests):

  • Tested removed internal buffering API
  • Fragile coupling to implementation details
  • No coverage of finish_reason propagation
  • Weak assertions (> 0.0)

After (3 focused tests):

  • Test stable cross-component contracts
  • Verify public API behavior
  • Critical coverage added (finish_reason)
  • Strong assertions (exact equality + invariants)
  • Comprehensive documentation

Code Evidence

All assertions verified against actual code:

  • FinishReason.__str__ returns "length" (vllm/v1/engine/init.py:22-30)
  • first_token_ts = engine_core_timestamp (vllm/v1/metrics/stats.py:289)
  • last_token_ts = engine_core_timestamp (vllm/v1/metrics/stats.py:294)
  • ✅ Copy condition: source > 0.0 and dest == 0.0 (vllm/v1/engine/output_processor.py:535,537)

🤖 Generated with Claude Code

sriumcp and others added 4 commits January 29, 2026 19:07
…ests

This commit fixes failing integration tests by removing tests for obsolete
buffering architecture and adding 2 NEW focused integration tests that verify
the post-PR#9 cross-component contracts.

CHANGES:
1. Deleted tests/v1/engine/test_journey_tracing_integration.py (12 tests)
   - All tests were testing removed buffering architecture (RequestState.journey_events)

2. Updated test_journey_events_populate_prometheus_timestamps
   - Changed to use EngineCoreOutput timestamp fields (new architecture)
   - Removed deprecated journey_events parameter usage

3. Added 2 NEW integration tests (focused, high-signal):
   a) test_timestamp_propagation_batch_processing
      - Verifies timestamps propagate correctly for multiple concurrent requests
      - Tests cross-component wiring: EngineCoreOutput → OutputProcessor → stats
      - Why integration: Tests batch coordination and per-request isolation

   b) test_request_finish_metrics_completeness
      - Verifies RequestOutput.metrics is complete when request finishes
      - Tests finish flow: EngineCore → OutputProcessor → API
      - Why integration: Tests finish trigger mechanism across components

ROOT CAUSE:
PR #9 removed journey event buffering architecture. Journey events now emit
directly to OTEL spans in the scheduler. Metrics timestamps flow through
EngineCoreOutput fields, not journey_events parameter.

NEW INTEGRATION CONTRACTS (Post-PR#9):
1. Timestamp Propagation: EngineCoreOutput → OutputProcessor → RequestState.stats
   - Works for batch processing (multiple concurrent requests)
   - Maintains per-request isolation
   - Independent of journey tracing

2. Metrics Completeness: RequestOutput.metrics fully populated on finish
   - Timing metrics (arrival, queued, scheduled, first/last token)
   - Token count metrics
   - Finish reason propagated correctly

3. Deprecated API Inertness: journey_events parameter ignored (backward compat)
   - Already tested by updated test_journey_events_populate_prometheus_timestamps

WHY THESE TESTS (Not 1:1 Porting):
- Deleted tests verified buffering (intentionally removed)
- New tests verify cross-component wiring (post-PR#9 architecture)
- Focused on stable public contracts, not internal implementation
- Complement existing coverage:
  * test_pr9_no_buffering.py: Scheduler-level behavior (16 tests)
  * New integration tests: Cross-component wiring (2 tests)

MAINTAINER NARRATIVE:
Why integration tests instead of unit tests?
- Timestamp batch processing: Unit tests verify single request, integration
  verifies batch coordination and per-request isolation
- Finish metrics: Unit tests verify individual calculations, integration
  verifies finish trigger mechanism across EngineCore→OutputProcessor→API
- These test WIRING, not just individual component behavior

Test Results:
- test_timestamp_propagation_batch_processing: PASSED
- test_request_finish_metrics_completeness: PASSED
- test_pr9_no_buffering.py (16 tests): PASSED
- test_journey_events_populate_prometheus_timestamps: PASSED

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improvements based on rigorous maintainer review and code verification:

P0 - CRITICAL (Missing Coverage):
- Add finish_reason propagation validation in test_request_finish_metrics_completeness
  * Verify FinishReason.LENGTH → "length" string conversion (via __str__)
  * Assert finish_reason appears in CompletionOutput.finish_reason field
  * Assert RequestOutput.finished reflects completion state
  * Verified: FinishReason.LENGTH stringifies to "length" via FINISH_REASON_STRINGS[1]

P1 - HIGH (Strengthen Weak Assertions):
- Replace `> 0.0` checks with exact equality for first_token_ts and last_token_ts
  * first_token_ts == 1100.0 (set on first update_from_output while is_prefilling=True)
  * last_token_ts == 1170.0 (set from engine_core_timestamp in final step)
  * Verified: Both are direct assignments from engine_core_timestamp (no transformation)
- Add timing invariants: queued_ts <= scheduled_ts, first_token_ts <= last_token_ts
  * Removed scheduled_ts <= first_token_ts (not guaranteed in this test scenario)

P2 - MEDIUM (Documentation):
- Enhanced test_journey_events_populate_prometheus_timestamps docstring
  * Document first-write-wins conditions: source > 0.0 AND dest == 0.0
  * Reference exact code locations (output_processor.py:535,537)
  * Add comment explaining copy condition sensitivity
- Enhanced test_timestamp_propagation_batch_processing docstring
  * Define "isolation": each request's stats contain only its own timestamps
  * Clarify no cross-contamination between requests in batch
- Add clock domain documentation to test_request_finish_metrics_completeness
  * arrival_time: wall-clock (time.time()) - verified from input_processor.py:477
  * queued_ts/scheduled_ts/first/last_token_ts: monotonic - verified from scheduler.py
  * Warning: DO NOT compare across domains in production code

P3 - LOW (Internal State Check):
- Label request_states cleanup check as "leak-guard invariant"
  * Document guarantee: output_processor.py:587 unconditional pop on finish
  * Valid contract verification, not just implementation detail

All changes verified against actual code behavior. No assumptions or speculation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CRITICAL CLARIFICATION based on code verification:

The field name "first_token_ts" is semantically misleading. It does NOT mean
"timestamp when first token is generated" but rather "timestamp of first
prefill update" (even if no tokens generated).

Code Evidence (stats.py:288-289):
```python
if is_prefilling:
    req_stats.first_token_ts = engine_core_timestamp
```

Condition: ONLY checks `is_prefilling` flag (no check on token emission)

Order of Operations (output_processor.py:541-552):
1. _update_stats_from_output() called (sets first_token_ts)
2. THEN is_prefilling = False

Test Execution:
- Step 1 (new_token_ids=[], timestamp=1100.0): first_token_ts = 1100.0 ✓
- Step 2 (new_token_ids=[42], timestamp=1150.0): first_token_ts unchanged (is_prefilling=False)

Assertion is CORRECT (1100.0), but semantic meaning needed clarification.

Changes:
- Enhanced inline comment explaining condition (stats.py:288 reference)
- Note semantic confusion: "first_token_ts" ≠ "first token generated"
- Clarify it means "first prefill update timestamp"
- Add docstring section explaining this behavior
- Reference exact test steps to avoid future confusion

No behavior changes. Documentation only.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Revised wording to be neutral and professional:
- "Semantic confusion" → "Note on first_token_ts definition"
- "doesn't mean" → "Currently defined as"
- "Despite the name" → "Currently defined as"
- "Does NOT check" → "Condition checks ... only"
- "NOT 1150.0" → "not step 2"

Preserves all technical accuracy while avoiding judgmental language.

No behavioral changes. Documentation wording only.
@sriumcp sriumcp merged commit edbe616 into main Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant