
Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDipatch#25110

Merged
vllm-bot merged 1 commit into vllm-project:main from laithsakka:something
Nov 14, 2025

Conversation

@laithsakka
Contributor

@laithsakka laithsakka commented Sep 17, 2025

Two main changes:

  1. Add an option to avoid the bytecode hook. torch 2.8 has a way to drop guards without the hack.
    We will switch over entirely in a follow-up PR, after internal validation of perf post-landing,
    since @zou3519 has concerns about the perf, though I highly doubt there will be issues.

  2. TorchCompileWrapperWithCustomDispatcher has complexities that are not used in the codebase.
    The current usage is to bypass guard evaluation, so I am introducing a much simpler TorchCompileWithNoGuardsWrapper.

In the next PR I will add a debug-mode option to keep the DS guards and fail if they are violated.
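The guard-dropping approach described above can be sketched as follows. This is a minimal illustration, assuming torch >= 2.8, where torch.compile accepts a "guard_filter_fn" entry in its options dict: the filter receives the list of guard entries Dynamo would install and returns, per entry, whether to keep it. The helper names here (drop_all_guards, compile_without_guards) are mine for illustration, not vLLM's:

```python
def drop_all_guards(guard_entries):
    # Return False for every entry: Dynamo then installs no guards, so
    # the compiled artifact is reused unconditionally -- the behavior
    # the bytecode hook previously provided.
    return [False for _ in guard_entries]


def compile_without_guards(fn):
    # Assumes torch >= 2.8, where torch.compile reads a
    # "guard_filter_fn" from its options dict.
    import torch

    return torch.compile(
        fn,
        fullgraph=True,
        options={"guard_filter_fn": drop_all_guards},
    )
```

Because no guards are installed, recompilation never triggers; correctness relies on the caller guaranteeing that inputs always match the traced shapes and types, which is exactly the contract the simpler wrapper encodes.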

Performance:

How does this affect runtime?
I ran the following benchmarks:

vllm bench latency \
  --model Qwen/Qwen2-1.5B-Instruct \
  --input-len 128 \
  --output-len 256 \
  --num-iters 50 \
  --dtype float16

after:

10% percentile latency: 0.9403901444951771 seconds
25% percentile latency: 0.9434001832487411 seconds
50% percentile latency: 0.9502700129960431 seconds
75% percentile latency: 0.9560497467537061 seconds
90% percentile latency: 0.9588484280975536 seconds
99% percentile latency: 0.9626266451006813 seconds

before:

10% percentile latency: 0.9399352639040444 seconds
25% percentile latency: 0.9432627985042927 seconds
50% percentile latency: 0.9482277854986023 seconds
75% percentile latency: 0.9578490345011232 seconds
90% percentile latency: 0.9600405636985669 seconds
99% percentile latency: 0.9649963746999856 seconds

vllm bench throughput --model Qwen/Qwen2-1.5B-Instruct --input-len 512 --output-len 128 --num-prompts 1000

after

Throughput: 130.14 requests/s, 83113.04 total tokens/s, 16658.08 output tokens/s
Total num prompt tokens:  510637
Total num output tokens:  128000

before

Throughput: 129.54 requests/s, 82726.85 total tokens/s, 16580.68 output tokens/s
Total num prompt tokens:  510637
Total num output tokens:  128000

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the torch.compile wrapper to simplify its implementation. It removes the complex bytecode hooking mechanism in favor of the guard_filter_fn option available in newer PyTorch versions to drop guards. The new TorchCompileGuardsStripWrapper is much cleaner. The changes also include updating the support_torch_compile decorator and related tests to use the new wrapper. The overall change improves code clarity and maintainability. I've found one critical issue in the test suite that needs to be addressed.

@laithsakka laithsakka marked this pull request as draft September 17, 2025 23:18
@mergify mergify bot removed the tpu Related to Google TPUs label Sep 17, 2025
@laithsakka laithsakka force-pushed the something branch 3 times, most recently from 6f72573 to bcc0f99 Compare September 18, 2025 00:05
@mergify mergify bot added the tpu Related to Google TPUs label Sep 18, 2025
@laithsakka laithsakka marked this pull request as ready for review September 18, 2025 00:21
@mergify

mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @laithsakka.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 21, 2025
@ProExpertProg
Collaborator

@bigPYJ1151 is planning to fix the CPU torch issue in vLLM so once that's done, we can upgrade to torch==2.8 everywhere and merge this PR

Collaborator

@ProExpertProg ProExpertProg left a comment


Looks good, can you reformat?

@mergify mergify bot added the llama Related to Llama models label Nov 11, 2025
@laithsakka laithsakka force-pushed the something branch 3 times, most recently from 2b04ba5 to 1e420a3 Compare November 12, 2025 05:24
@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @laithsakka.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added multi-modality Related to multi-modality (#4194) qwen Related to Qwen models labels Nov 12, 2025
…patcher with TorchCompileGuardsStripWrapper

Signed-off-by: Laith Sakka <lsakka@meta.com>
@laithsakka
Contributor Author

rebase again

@laithsakka
Contributor Author

@ProExpertProg can you help me land it before another rebase is needed :)

@hmellor
Member

hmellor commented Nov 15, 2025

FYI the docs build in this PR was failing so now it's failing on main too.

We're working on a fix.

@ProExpertProg
Collaborator

Sorry about that, I looked at the error message and it seemed unrelated; I should have checked nightly.

@DarkLight1337
Member

Fixed by #28772

@hmellor
Member

hmellor commented Nov 15, 2025

It was an issue with mkdocstrings which I've reported to them.

For future reference, the docs are built for every commit on main. So if the docs are green on main, it's a real failure.

Comment on lines +1904 to +1905
compiled_model.compiled = False
TorchCompileWithNoGuardsWrapper.__init__(compiled_model)
Contributor


This change broke TPU on v0.11.2. Calling __init__ here results in an attempt to construct a VllmConfig when none is defined, producing the following traceback:

   (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815] WorkerProc hit an exception.
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815] Traceback (most recent call last):
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 810, in worker_busy_loop
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     output = func(*args, **kwargs)
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]              ^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/v1/worker/tpu_worker.py", line 214, in determine_available_memory
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     self.model_runner.reset_dynamo_cache()
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/v1/worker/tpu_model_runner.py", line 1907, in reset_dynamo_cache
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     TorchCompileWithNoGuardsWrapper.__init__(compiled_model)
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/compilation/wrapper.py", line 91, in __init__
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     vllm_config = get_current_vllm_config()
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]                   ^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/config/vllm.py", line 1136, in get_current_vllm_config
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     return VllmConfig()
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]            ^^^^^^^^^^^^
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/config/vllm.py", line 609, in __post_init__
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     current_platform.check_and_update_config(self)
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/platforms/tpu.py", line 165, in check_and_update_config
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     cache_config.block_size = PallasAttentionBackend.get_page_size(vllm_config)  # type: ignore[assignment]
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]   File "/home/runner/actions-runner/_work/nm-cicd/nm-cicd/.venv/lib/python3.12/site-packages/vllm/v1/attention/backends/pallas.py", line 162, in get_page_size
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]     if vllm_config.model_config.max_model_len > 8192:
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815]        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815] AttributeError: 'NoneType' object has no attribute 'max_model_len'
    (Worker_TP2 pid=3714832) ERROR 12-04 10:29:38 [multiproc_executor.py:815] Traceback (most recent call last):

cc @mgoin
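The failure mode in the traceback above can be illustrated with a minimal, hypothetical sketch (all names here are mine, not vLLM's): the wrapper's __init__ reads an ambient config through a context manager and falls back to building a fresh default config when invoked outside that context. In vLLM, that fallback constructs a VllmConfig() whose model_config is None, which the TPU platform hook then dereferences:

```python
import contextvars
from contextlib import contextmanager

_current_config = contextvars.ContextVar("current_config", default=None)


@contextmanager
def set_current_config(cfg):
    # Installs cfg as the ambient config for the duration of the block,
    # mirroring model-construction time in vLLM.
    token = _current_config.set(cfg)
    try:
        yield
    finally:
        _current_config.reset(token)


def get_current_config():
    cfg = _current_config.get()
    if cfg is None:
        # Fallback: build a default config. The stand-in for a default
        # VllmConfig() here has model_config set to None.
        return {"model_config": None}
    return cfg


class Wrapper:
    def __init__(self):
        # Reads the ambient config at init time -- safe only inside a
        # set_current_config(...) block.
        self.config = get_current_config()
```

Re-invoking Wrapper.__init__ from something like reset_dynamo_cache() happens outside any set_current_config block, so the fallback default is used, mirroring the 'NoneType' object has no attribute 'max_model_len' failure above.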

Contributor


Fix is here: #30331


Labels

llama Related to Llama models multi-modality Related to multi-modality (#4194) qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed tpu Related to Google TPUs v1


8 participants