Revert "Implement return_hidden_states for the OpenAI API (#6137)" #6440
This reverts commit 4f39bcf.
Hi @kyle-pena-kuzco @Qiaolin-Yu @CatherineSue This PR breaks
Hi @zhyncs - thanks for the callout. We love the project and we want to make sure that our PRs meet the highest standards. Could you help us understand which test case breaks and where you noticed the failure? That will help us pinpoint what you saw so we can address it. We are running The only relevant GitHub Actions result we could find was here, but it looked like the failure might be intermittent/random.
Hi @kyle-pena-kuzco Are you running on an H100 or H200? Could you try running on an H100?
Absolutely, we will try on an H100. We have been running our tests on a 4090. Would you mind sharing what test failure you saw? That would help us troubleshoot.
@BBuf can provide more detailed information. |
In CUDA graph mode, memory usage is too high because each batch size captures its own CUDA graph and returns a hidden state. So we can set the cuda_graph_max_bs parameter to 8 in
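A sketch of the suggested workaround as a launch command (flag names are assumed from sglang's server arguments and the model path is only a placeholder; the comment above may instead refer to setting the value inside a test file):

```shell
# Hypothetical launch: cap CUDA graph capture at batch size 8 to bound
# the extra memory used when hidden states are captured per batch size.
# Flag names assumed; model path is a placeholder.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-return-hidden-states \
  --cuda-graph-max-bs 8
```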
I believe I understand the issue now.

It looks like the old CUDA graph captures are not removed from memory, and as a result, available memory decreases over time, leading to eventual OOM. Here is a screen capture demonstrating the available memory decreasing after every CUDA graph recapture:

So, I think the core issue is:

If either of those problems is solved, I think that might resolve the issue. @BBuf, is this the problem that you encountered?
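The suspected leak pattern above can be illustrated with a toy, GPU-free sketch (all names here are hypothetical, not sglang's actual classes): if a recapture replaces an entry in a capture cache while something still holds a reference to the stale capture, its memory is never reclaimed and free memory shrinks on every recapture.

```python
class GraphCaptureCache:
    """Toy stand-in for a CUDA-graph capture cache (hypothetical names)."""

    def __init__(self):
        self.captures = {}  # batch_size -> capture buffer
        self.leaked = []    # stale captures still referenced (the bug)

    def recapture(self, batch_size, free_old=True):
        old = self.captures.get(batch_size)
        if old is not None and not free_old:
            # Bug: keep a live reference to the stale capture,
            # so its memory can never be reclaimed.
            self.leaked.append(old)
        self.captures[batch_size] = bytearray(1024)  # pretend allocation


cache = GraphCaptureCache()
for _ in range(3):
    cache.recapture(8, free_old=False)
print(len(cache.leaked))  # prints 2: each recapture after the first leaks one
```

With `free_old=True` the stale capture is dropped before being replaced and nothing accumulates, which is the behavior the fix would need to guarantee.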
