[BugFix] Graceful handling of torch symm mem errors. #27671
mgoin merged 10 commits into vllm-project:main from
Conversation
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Code Review
This pull request introduces graceful error handling for torch_symm_mem initialization by catching RuntimeError and disabling the feature, preventing crashes. It also enables torch_symm_mem by default. My review confirms the logic of the error handling. I have one suggestion to improve the logging within the new except block to ensure exception details are correctly captured, which is crucial for debugging.
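The reviewer's point about capturing exception details can be illustrated with a minimal sketch (the helper name and logger are hypothetical, not the actual vLLM code): inside the `except RuntimeError` block, use `logger.exception` (or pass `exc_info=True`) so the full traceback is recorded before the feature is disabled.

```python
import logging

logger = logging.getLogger("symm_mem_init")

def try_enable_symm_mem(init_fn) -> bool:
    """Attempt symm-mem initialization; disable gracefully on failure.

    Hypothetical helper illustrating the pattern in this PR: catch
    RuntimeError, log the full traceback, and report the feature as
    disabled instead of crashing the process.
    """
    try:
        init_fn()
        return True  # feature stays enabled
    except RuntimeError:
        # logger.exception attaches the traceback automatically,
        # which is what makes the failure debuggable later.
        logger.exception(
            "torch symm mem initialization failed; disabling the feature"
        )
        return False
```

The key detail is logging from inside the `except` block rather than after it, so the active exception context is still available to the logging call.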
Signed-off-by: ilmarkov <markovilya197@gmail.com>
yewentao256
left a comment
We can land this, but I think we should understand what the conflict between torch and the driver is that can't be resolved in vllm, and give users guidance on how to solve it and re-enable the feature if this happens.
@mgoin CC
yewentao256
left a comment
We can land this first, but we still need to figure out the root cause.
mgoin
left a comment
I'm not really satisfied with this. Wrapping a try-catch around the problem is not good.
Have we spoken with the torch or nvidia team about this issue?
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
MODEL := "deepseek-ai/DeepSeek-V3.1"
INPUT_LEN := "1000"
OUTPUT_LEN := "100"

launch_vllm:
    VLLM_ALLREDUCE_USE_SYMM_MEM=0 VLLM_MOE_USE_DEEP_GEMM=0 VLLM_USE_DEEP_GEMM=1 VLLM_TORCH_PROFILER_DIR=$(pwd)/profiles_tp chg run --gpus 8 -- vllm serve \
        {{MODEL}} --tensor-parallel-size 8

benchmark BATCH_SIZE NUM_PROMPTS:
    vllm bench serve \
        --model {{MODEL}} \
        --dataset-name random \
        --random-input-len {{INPUT_LEN}} \
        --random-output-len {{OUTPUT_LEN}} \
        --max-concurrency {{BATCH_SIZE}} \
        --num-prompts {{NUM_PROMPTS}} \
        --seed $(date +%M%H%M%S) \
        --percentile-metrics ttft,tpot,itl \
        --ignore-eos

sweep:
    just benchmark 4 40 && \
    just benchmark 8 80 && \
    just benchmark 16 160 && \
    just benchmark 32 320 && \
    just benchmark 64 640

start_profile:
    curl -X POST http://localhost:8000/start_profile

stop_profile:
    curl -X POST http://localhost:8000/stop_profile
just launch_vllm
just benchmark 16 160

oops, weirdly it looks like this is happening on other backends too

might just be something wrong on my machine


Disable torch symm mem in case of torch internal errors.
Re-enable torch symm mem by default.
Fixes #26922
The problem described in the issue appears to be a conflict between torch and the driver that can't be resolved in vllm.
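As a rough sketch of the enable-by-default-with-fallback behavior described here (the function name is illustrative, not the actual vLLM implementation), the `VLLM_ALLREDUCE_USE_SYMM_MEM` environment variable seen in the benchmark recipe above could gate the feature, with a failed initialization forcing it off regardless of the default:

```python
import os

def resolve_symm_mem(init_ok: bool) -> bool:
    """Decide whether symm-mem allreduce ends up enabled.

    Illustrative only: the feature is on by default (env var unset
    or "1"), can be forced off via VLLM_ALLREDUCE_USE_SYMM_MEM=0,
    and is disabled when initialization failed (init_ok=False),
    mirroring the graceful-degradation behavior this PR adds.
    """
    requested = os.environ.get("VLLM_ALLREDUCE_USE_SYMM_MEM", "1") != "0"
    return requested and init_ok
```

This keeps the user-facing default "enabled" while guaranteeing that a torch/driver conflict degrades to the regular allreduce path instead of crashing.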
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.