This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Originally posted by @github-actions in #2933
In issue #2933, the workaround for NVIDIA H100 MIG crashes was to use the environment variable SGLANG_GPU_MEMORY_TOTAL_FALLBACK. That fix was implemented in the server_args.py file.
However, in later versions (from v0.4.5 to the current v0.4.6.post2), the related logic seems to have been refactored into utils.py, specifically:
/sgl-workspace/sglang/python/sglang/srt/utils.py, line 1244, in get_device_memory_capacity
When comparing the current codebase with the commit that resolved #2933, it appears that the SGLANG_GPU_MEMORY_TOTAL_FALLBACK exception handling was omitted during the refactor.
As a result, the workaround no longer works, and the original issue resurfaces on H100 MIG environments.
Please restore the fallback logic using the SGLANG_GPU_MEMORY_TOTAL_FALLBACK environment variable or provide an alternative solution to safely handle GPU memory detection in such cases.
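For illustration, the requested behavior could be sketched roughly as follows. This is not the code from the commit that resolved #2933 or from the current utils.py. The helper name `memory_capacity_with_fallback`, the injected `query` callable, and the assumption that the environment variable holds a plain number in the same unit the normal query returns are all hypothetical.

```python
import os
from typing import Callable, Optional

# Environment variable named in this issue; its exact unit is an assumption here.
FALLBACK_ENV = "SGLANG_GPU_MEMORY_TOTAL_FALLBACK"


def memory_capacity_with_fallback(query: Callable[[], Optional[float]]) -> Optional[float]:
    """Return GPU memory capacity, falling back to the env var if the query fails.

    `query` stands in for whatever device query is normally used
    (hypothetical; it may raise or return None on MIG devices).
    """
    try:
        value = query()
        if value is not None and value > 0:
            return float(value)
    except Exception:
        # The device query failed (e.g. on a MIG slice); try the fallback below.
        pass

    raw = os.environ.get(FALLBACK_ENV)
    if raw is None:
        # No fallback configured; let the caller decide how to proceed.
        return None
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(f"{FALLBACK_ENV} must be numeric, got {raw!r}") from exc
```

With something like this in place, a user on an H100 MIG instance would set SGLANG_GPU_MEMORY_TOTAL_FALLBACK before launching the server, and the value would only be consulted when the normal detection path fails.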