
Revert "Implement return_hidden_states for the OpenAI API (#6137)"#6440

Merged
zhyncs merged 1 commit into main from zhyncs/revert on May 20, 2025

Conversation

@zhyncs
Collaborator

@zhyncs zhyncs commented May 20, 2025

This reverts commit 4f39bcf.


@zhyncs
Collaborator Author

zhyncs commented May 20, 2025

Hi @kyle-pena-kuzco @Qiaolin-Yu @CatherineSue This PR breaks test/srt/test_openai_server.py.

@zhyncs zhyncs merged commit b146555 into main May 20, 2025
1 of 37 checks passed
@zhyncs zhyncs deleted the zhyncs/revert branch May 20, 2025 01:21
@kyle-pena-kuzco
Contributor

> This PR breaks test/srt/test_openai_server.py.

Hi @zhyncs - thanks for the callout. We love the project and want to make sure our PRs meet the highest standards.

Could you help us understand which test case breaks and where you noticed the failure? That will help us pinpoint what you saw so we can address it. We are running test/srt/test_openai_server.py locally and all tests pass.

The only relevant GitHub Actions result we could find was here, but it looked like this failure may have been intermittent:
https://github.com/sgl-project/sglang/actions/runs/15107245347/job/42512558657#step:4:4522

@zhyncs
Collaborator Author

zhyncs commented May 20, 2025

> Could you help us understand which test case breaks and where you noticed the failure? We are running test/srt/test_openai_server.py locally and all tests pass.

Hi @kyle-pena-kuzco, are you running on an H100 or an H200? Could you try running on an H100?

@kyle-pena-kuzco
Contributor

> Are you running on an H100 or an H200? Could you try running on an H100?

Absolutely, we will try on an H100. We have been running our tests on a 4090.

Would you mind sharing what test failure you saw? That would help us to troubleshoot.

@zhyncs
Collaborator Author

zhyncs commented May 20, 2025

@BBuf can provide more detailed information.

@BBuf
Collaborator

BBuf commented May 20, 2025

> Would you mind sharing what test failure you saw? That would help us to troubleshoot.

In CUDA graph mode, memory usage is too high because each batch size captures its own CUDA graph and returns hidden states. We can set the cuda_graph_max_bs parameter to 8 in test/srt/test_openai_server.py on the H100 to avoid OOM; this does not affect accuracy.
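A hedged sketch of that workaround at the CLI level. The `--cuda-graph-max-bs` flag is assumed to be the command-line spelling of the `cuda_graph_max_bs` server argument, and the model path is a placeholder, not taken from this thread:

```shell
# Cap CUDA graph capture at batch size 8 to bound per-capture memory.
# <model> is a placeholder; substitute the model under test.
python -m sglang.launch_server \
  --model-path <model> \
  --cuda-graph-max-bs 8
```

In the test itself, the same setting would be passed through whatever extra-argument mechanism the test's server launcher exposes.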

@kyle-pena-kuzco
Contributor

kyle-pena-kuzco commented May 20, 2025

> In CUDA graph mode, memory usage is too high because each batch size captures its own CUDA graph and returns hidden states. We can set cuda_graph_max_bs to 8 in test/srt/test_openai_server.py on the H100 to avoid OOM; this does not affect accuracy.

I believe I understand the issue now.

As test_openai_server.py iterates through many test cases, return_hidden_states switches between on and off many times.

When return_hidden_states changes, it triggers a CUDA graph re-capture. This is by design. See:

> Note that each time you change the `return_hidden_states` parameter,

It looks like the old CUDA graph captures are not removed from memory, and as a result, available memory decreases over time, leading to eventual OOM.

Here is a screen capture demonstrating the available memory decreasing after every CUDA graph recapture:

So, I think the core issue is:
(a) Old CUDA graphs are not destroyed
(b) Requesting hidden states triggers a CUDA graph re-capture

If either of those problems is solved, I think that might resolve the issue.

@BBuf is this the problem that you encountered?
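The leak pattern described above can be sketched without a GPU. All names here are hypothetical illustrations, not SGLang internals: each re-capture allocates fresh graph buffers, and if stale captures are never freed, "device" memory grows with every toggle of `return_hidden_states`, while freeing old captures keeps it flat.

```python
# GPU-free sketch of the suspected leak: re-capturing without destroying
# old graphs accumulates memory; evicting stale captures (fix (a)) does not.

GRAPH_MB = 512  # assumed footprint of one capture's static buffers


class GraphRunner:
    def __init__(self, free_stale_captures: bool):
        self.free_stale_captures = free_stale_captures
        self.live_captures = []  # captures still holding "device" memory

    def capture(self, return_hidden_states: bool):
        if self.free_stale_captures:
            self.live_captures.clear()  # destroy old graphs before re-capture
        self.live_captures.append(return_hidden_states)

    def live_mb(self) -> int:
        return len(self.live_captures) * GRAPH_MB


leaky = GraphRunner(free_stale_captures=False)
fixed = GraphRunner(free_stale_captures=True)
for i in range(10):  # test cases toggling return_hidden_states on and off
    flag = bool(i % 2)
    leaky.capture(flag)
    fixed.capture(flag)

print(leaky.live_mb())  # grows with every toggle: 5120
print(fixed.live_mb())  # stays at one capture: 512
```

Under this model, either destroying old graphs on re-capture or avoiding the re-capture entirely keeps memory bounded, matching points (a) and (b) above.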



3 participants