[Bug] def get_nvgpu_memory_capacity() causes crash on NVIDIA H100 MIG #2933 #6110

@jiyol

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

Originally posted by @github-actions in #2933


In issue #2933, the crash on NVIDIA H100 MIG instances was worked around by setting the environment variable SGLANG_GPU_MEMORY_TOTAL_FALLBACK. That workaround was implemented in server_args.py.

However, in later versions (from v0.4.5 to the current v0.4.6.post2), the related logic seems to have been refactored into utils.py, specifically:

/sgl-workspace/sglang/python/sglang/srt/utils.py, line 1244, in get_device_memory_capacity

Comparing the current codebase with the commit that resolved #2933, it appears that the SGLANG_GPU_MEMORY_TOTAL_FALLBACK handling was dropped during the refactor.

As a result, the workaround no longer works, and the original issue resurfaces on H100 MIG environments.

Please restore the fallback logic using the SGLANG_GPU_MEMORY_TOTAL_FALLBACK environment variable or provide an alternative solution to safely handle GPU memory detection in such cases.
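
To illustrate the kind of fallback being requested, here is a minimal sketch. It is not the actual sglang code: the function names (parse_memory_values, query_nvidia_smi, get_memory_capacity_mib) are hypothetical, and it assumes that the memory query may return a non-numeric value (for example "[N/A]") on MIG and that the fallback variable holds a value in the same unit as the nvidia-smi output (MiB). The real unit and semantics should be taken from the commit that resolved #2933.

```python
import os
import subprocess
from typing import List, Optional

# Environment variable named in this issue; its unit (MiB) is an assumption here.
FALLBACK_ENV = "SGLANG_GPU_MEMORY_TOTAL_FALLBACK"


def parse_memory_values(smi_output: str) -> List[float]:
    """Parse one value per line, skipping blank or non-numeric lines."""
    values = []
    for line in smi_output.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            values.append(float(token))
        except ValueError:
            # Non-numeric output (assumed possible on MIG) is ignored, not fatal.
            continue
    return values


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def get_memory_capacity_mib(smi_output: Optional[str] = None) -> float:
    """Return the smallest reported GPU memory, or the env fallback if none parse."""
    if smi_output is None:
        try:
            smi_output = query_nvidia_smi()
        except (OSError, subprocess.CalledProcessError):
            smi_output = ""

    values = parse_memory_values(smi_output)
    if values:
        return min(values)

    fallback = os.environ.get(FALLBACK_ENV)
    if fallback is not None:
        return float(fallback)

    raise ValueError(
        f"Could not detect GPU memory; set {FALLBACK_ENV} to provide a value."
    )
```

With something along these lines, detection would only fail when nothing parses and the environment variable is unset, so the workaround from #2933 would keep working on H100 MIG.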
