Skip to content

[Model Runner V2] Fix test_spec_decode_acceptance_length Acceptance length regression#43685

Closed
yewentao256 wants to merge 1 commit into
mainfrom
wentao-fix-v2-test_spec_decode_acceptance_length
Closed

[Model Runner V2] Fix test_spec_decode_acceptance_length Acceptance length regression#43685
yewentao256 wants to merge 1 commit into
mainfrom
wentao-fix-v2-test_spec_decode_acceptance_length

Conversation

@yewentao256

@yewentao256 yewentao256 commented May 26, 2026

Copy link
Copy Markdown
Member

Purpose

Part of #41286

Fixes https://buildkite.com/vllm/ci/builds/67897#019e5781-68a4-4544-a7c6-04ba535ea293

VLLM_USE_V2_MODEL_RUNNER=1 bash tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh

Originally

        # ── Extract metrics from decode server ────────────────────────────
        n_drafts = _fetch_metric("vllm:spec_decode_num_drafts_total")
        n_accepted = _fetch_metric("vllm:spec_decode_num_accepted_tokens_total")
    
        assert n_drafts > 0, "No spec-decode drafts were generated"
    
        acceptance_length = 1 + (n_accepted / n_drafts)
        expected = config.expected_acceptance_length
    
        print(
            f"\n{config.id}: acceptance_length={acceptance_length:.3f} "
            f"(expected={expected:.3f})"
        )
        print(f"  Drafts: {n_drafts:.0f}, Accepted: {n_accepted:.0f}")
    
        # ── Assert acceptance length (all methods) ────────────────────────
        rel_error = abs(acceptance_length - expected) / expected
>       assert rel_error <= rtol, (
            f"Acceptance length regression for {config.id}! "
            f"Expected: {expected:.3f}, "
            f"Got: {acceptance_length:.3f}, "
            f"Relative error: {rel_error:.2%} (tolerance: {rtol:.0%})"
        )
E       AssertionError: Acceptance length regression for llama3-8b-eagle3! Expected: 2.600, Got: 1.961, Relative error: 24.58% (tolerance: 5%)
E       assert 0.24578209894195022 <= 0.05

tests/v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py:183: AssertionError
====================================== warnings summary =======================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: 14 warnings
  /home/yewentao256/.venv/lib/python3.12/site-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================== short test summary info ===================================
FAILED tests/v1/kv_connector/nixl_integration/test_spec_decode_acceptance.py::test_spec_decode_acceptance_length - AssertionError: Acceptance length regression for llama3-8b-eagle3! Expected: 2.600, Got: 1...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 failed, 16 warnings in 124.15s (0:02:04) ==========================

Now

================== 1 passed, 16 warnings in 97.33s (0:01:37) ==================

This PR fixes the issue (the drafter doesn't get the kv correctly)

For eagle3 + kv connector, we did it in an order like pre_forward -> target forward -> post_forward -> speculator.propose This makes the drafter can't get the correct kv.

We now do it in order as pre_forward -> target forward -> speculator.propose -> post_forward

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label May 26, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request defers the KV connector finalization in the GPU model runner until the drafter has written its KV cache. This is achieved by storing the scheduler output in _deferred_kv_connector_scheduler_output during execute_model under specific conditions (last PP rank, speculator present, and not a dummy run) and executing the post-forward step during sample_tokens. There are no review comments to address, and I have no additional feedback to provide.

@njhill

njhill commented May 27, 2026

Copy link
Copy Markdown
Member

Thanks @yewentao256! I included this in a slightly bigger refactor in #43719, included you as co-author.

@yewentao256 yewentao256 left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close this PR as #43719 landed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants