[BugFix] Graceful handling of torch symm mem errors. by ilmarkov · Pull Request #27671 · vllm-project/vllm

ilmarkov · 2025-10-28T17:11:08Z

Disable torch symm mem when torch raises an internal error during its initialization.
Re-enable torch symm mem by default.

Fixes #26922
The problem described in the issue appears to be a conflict between torch and the driver, which cannot be resolved in vLLM.
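A minimal sketch of the fallback described above, not the code from this PR. `SymmMemCommunicatorSketch`, `init_fn`, and `failing_init` are hypothetical names used only for illustration, and parsing `VLLM_ALLREDUCE_USE_SYMM_MEM` as an integer with a default of `1` is an assumption about how the flag is read.

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger("symm_mem_sketch")


def symm_mem_enabled_by_env() -> bool:
    # Assumption: the flag is parsed as an integer and defaults to on ("1").
    return bool(int(os.environ.get("VLLM_ALLREDUCE_USE_SYMM_MEM", "1")))


class SymmMemCommunicatorSketch:
    """Hypothetical stand-in for a communicator; not vLLM's actual class."""

    def __init__(self, init_fn: Callable[[], object]) -> None:
        self.disabled = True
        self.handle: Optional[object] = None
        if not symm_mem_enabled_by_env():
            return
        try:
            # init_fn stands in for the torch symmetric memory setup.
            self.handle = init_fn()
        except RuntimeError as exc:
            # Stay disabled instead of crashing the whole process.
            logger.warning("Symmetric memory init failed; disabling it: %s", exc)
            return
        self.disabled = False


def failing_init() -> object:
    # Simulates an internal torch error raised during setup.
    raise RuntimeError("simulated torch internal error")
```

With this shape, `SymmMemCommunicatorSketch(failing_init).disabled` stays `True`, so callers can fall back to another all-reduce path rather than failing at startup.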

Signed-off-by: ilmarkov <markovilya197@gmail.com>

gemini-code-assist

Code Review

This pull request introduces graceful error handling for torch_symm_mem initialization by catching RuntimeError and disabling the feature, preventing crashes. It also enables torch_symm_mem by default. My review confirms the logic of the error handling. I have one suggestion to improve the logging within the new except block to ensure exception details are correctly captured, which is crucial for debugging.
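One way to capture exception details inside an `except` block with the standard `logging` module is to pass `exc_info=True`. The sketch below is illustrative only; the logger name, message text, and `init_or_disable` helper are hypothetical and do not come from the PR.

```python
import io
import logging

# Route log output to an in-memory stream so it can be inspected.
stream = io.StringIO()
logger = logging.getLogger("symm_mem_logging_sketch")
logger.setLevel(logging.WARNING)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def init_or_disable() -> bool:
    """Return True if the (simulated) setup succeeded, False if disabled."""
    try:
        raise RuntimeError("simulated torch internal error")
    except RuntimeError:
        # exc_info=True appends the exception type, message, and traceback.
        logger.warning(
            "Symmetric memory setup failed; disabling it.", exc_info=True
        )
        return False


enabled = init_or_disable()
log_text = stream.getvalue()
```

Here `log_text` holds the warning followed by the traceback, which keeps the underlying torch error visible even though the failure is swallowed.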

vllm/distributed/device_communicators/symm_mem.py

yewentao256

We can land this, but I think we should find out what the conflict between torch and the driver is that can't be resolved in vLLM, and give users guidance on how to resolve it and enable the feature when this happens.
@mgoin CC

yewentao256

We can land this first, but we still need to figure out the root cause.

vllm/distributed/device_communicators/symm_mem.py

mgoin

I’m not really satisfied with this. Wrapping a try-except around the problem is not a good solution.
Have we spoken with the torch or NVIDIA team about this issue?

vllm/distributed/device_communicators/symm_mem.py

robertgshaw2-redhat · 2025-11-18T03:52:45Z

I find that this PR introduces a significant performance regression on my H200 box for DSR1.

robertgshaw2-redhat · 2025-11-18T03:56:22Z

MODEL := "deepseek-ai/DeepSeek-V3.1"

INPUT_LEN := "1000"
OUTPUT_LEN := "100"

launch_vllm:
    VLLM_ALLREDUCE_USE_SYMM_MEM=0 VLLM_MOE_USE_DEEP_GEMM=0 VLLM_USE_DEEP_GEMM=1 VLLM_TORCH_PROFILER_DIR=$(pwd)/profiles_tp chg run --gpus 8 -- vllm serve \
        {{MODEL}} --tensor-parallel-size 8

benchmark BATCH_SIZE NUM_PROMPTS:
    vllm bench serve \
        --model {{MODEL}} \
        --dataset-name random \
        --random-input-len {{INPUT_LEN}} \
        --random-output-len {{OUTPUT_LEN}} \
        --max-concurrency {{BATCH_SIZE}} \
        --num-prompts {{NUM_PROMPTS}} \
        --seed $(date +%M%H%M%S) \
        --percentile-metrics ttft,tpot,itl \
        --ignore-eos

sweep:
    just benchmark 4 40 && \
    just benchmark 8 80 && \
    just benchmark 16 160 && \
    just benchmark 32 320 && \
    just benchmark 64 640


start_profile:
    curl -X POST http://localhost:8000/start_profile

stop_profile:
    curl -X POST http://localhost:8000/stop_profile

before (VLLM_ALLREDUCE_USE_SYMM_MEM=0)

just launch_vllm
just benchmark 16 160

robertgshaw2-redhat · 2025-11-18T03:57:57Z

Oops, weirdly, it looks like this is happening on the other backend too.

robertgshaw2-redhat · 2025-11-18T03:58:43Z

robertgshaw2-redhat · 2025-11-18T03:58:53Z

might just be something wrong on my machine

Graceful handling of torch symm mem errors. (3bbb7ce)

gemini-code-assist bot reviewed Oct 28, 2025

Address gemini review (648c476)

yewentao256 reviewed Oct 29, 2025

yewentao256 approved these changes Nov 3, 2025

yewentao256 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 3, 2025

ilmarkov added 4 commits November 4, 2025 13:07
Address comments (8e7fa72)
pre-commit fix (0ef8dd6)
Fix (1ef9323)
Improve warning (b7341ae)

mgoin reviewed Nov 4, 2025

nvpohanh mentioned this pull request Nov 11, 2025
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds #24248 (Merged)

mgoin and others added 4 commits November 11, 2025 12:50
Change warning to warning_once for symmetric memory (0d346d2)
Merge branch 'main' into imarkov/enable_symm_mem (1f5922a)
Fix precommit (e6395d7)
Remove comment (2d9799c)

mgoin approved these changes Nov 12, 2025

mgoin merged commit 1788aa1 into vllm-project:main Nov 12, 2025
53 checks passed

yewentao256 deleted the imarkov/enable_symm_mem branch November 12, 2025 14:26

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025: [BugFix] Graceful handling of torch symm mem errors. (vllm-project#27671) (d747662)
nvyutwu pushed a commit to nvyutwu/vllm that referenced this pull request Feb 2, 2026: [BugFix] Graceful handling of torch symm mem errors. (vllm-project#27671) (9dc75b9)
nvyutwu added a commit to nvyutwu/vllm that referenced this pull request Feb 2, 2026: [BugFix] Graceful handling of torch symm mem errors. (vllm-project#27671) (95b8849)
nvyutwu pushed a commit to nvyutwu/vllm that referenced this pull request Feb 3, 2026: [BugFix] Graceful handling of torch symm mem errors. (vllm-project#27671) (396e0ea)

mgoin merged 10 commits into vllm-project:main from neuralmagic:imarkov/enable_symm_mem
