[Bug] def get_nvgpu_memory_capacity() causes crash on NVIDIA H100 MIG #2933 #6110

> This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed. 

 _Originally posted by @github-actions in [#2933](https://github.com/sgl-project/sglang/issues/2933#issuecomment-2784912058)_


---
Issue #2933 was resolved with a workaround for the crash on NVIDIA H100 MIG: setting the environment variable `SGLANG_GPU_MEMORY_TOTAL_FALLBACK`. That fix was implemented in `server_args.py`.

However, in later versions (from v0.4.5 to the current v0.4.6.post2), the related logic seems to have been refactored into `utils.py`, specifically:

```text
/sgl-workspace/sglang/python/sglang/srt/utils.py, line 1244, in get_device_memory_capacity
```

Comparing the current codebase with the commit that resolved #2933, it appears that the exception handling that reads `SGLANG_GPU_MEMORY_TOTAL_FALLBACK` was omitted during the refactor.

As a result, setting the variable no longer has any effect, and the original crash resurfaces in H100 MIG environments.

Please restore the fallback logic using the `SGLANG_GPU_MEMORY_TOTAL_FALLBACK` environment variable or provide an alternative solution to safely handle GPU memory detection in such cases.
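
To illustrate the request, here is a minimal sketch of what a restored fallback could look like. It is not the code from the commit that resolved #2933, and it is not the current `get_device_memory_capacity` implementation. The helper names (`parse_memory_values`, `query_nvidia_smi`, `get_memory_capacity_with_fallback`) are hypothetical, and I am assuming the variable holds a value in MiB, the same unit `nvidia-smi` prints with `nounits`.

```python
import os
import re
import subprocess
from typing import Callable, List, Mapping

# Environment variable named in this issue; the MiB unit is an assumption.
FALLBACK_ENV = "SGLANG_GPU_MEMORY_TOTAL_FALLBACK"


def parse_memory_values(raw: str) -> List[float]:
    """Keep only purely numeric lines; skip anything else (e.g. "[N/A]")."""
    values = []
    for line in raw.splitlines():
        token = line.strip()
        if re.fullmatch(r"\d+(\.\d+)?", token):
            values.append(float(token))
    return values


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for the total memory of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def get_memory_capacity_with_fallback(
    query: Callable[[], str] = query_nvidia_smi,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Return the smallest detected GPU memory, else the env fallback."""
    try:
        values = parse_memory_values(query())
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or exited with an error.
        values = []

    if values:
        return min(values)

    fallback = env.get(FALLBACK_ENV)
    if fallback is not None:
        # Hypothetical behavior: trust the user-supplied value as-is.
        return float(fallback)

    raise RuntimeError(
        f"Could not detect GPU memory; set {FALLBACK_ENV} to override."
    )
```

With something along these lines, an environment where detection yields no usable number would only fail when the variable is unset, and the error message would point to the override.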


