Conversation
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    static CUDAStreamInternals* default_streams;
    static constexpr int STREAMS_PER_POOL = 32;
    static constexpr unsigned int DEFAULT_FLAGS = cudaStreamNonBlocking;
    static int HIGH_PRIORITY = 0;
@colesbury @apaszke Any reviews?

@mruberry we probably need a rebase for this PR. @colesbury, @apaszke, @goldsborough, any suggestions?
Happy to rebase but we should get a review first. |
apaszke left a comment:
(Not a complete review. Some notes.)
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    low_priority_streams[device].resize(STREAMS_PER_POOL);
    high_priority_streams[device].resize(STREAMS_PER_POOL);

    for (auto i = decltype(STREAMS_PER_POOL){0}; i < STREAMS_PER_POOL; ++i) {
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    // Non-default streams
    static std::deque<std::once_flag> device_flags;
    static std::deque<std::atomic<int>> low_priority_counters;
    static std::deque<std::atomic<int>> high_priority_counters;
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    int modded = raw_idx % STREAMS_PER_POOL;
    if (raw_idx >= STREAMS_PER_POOL && modded == 0) {
      counter -= STREAMS_PER_POOL;
    }
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

      current_streams[device] = ptr;
    }
    const auto idx = get_idx(low_priority_counters[device]);
    return &low_priority_streams[device][idx];
    ~CUDAStreamInternals() {
      if (stream) cudaStreamDestroy(stream);
    }
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    // Non-default streams
    static std::deque<std::once_flag> device_flags;
    static std::deque<std::atomic<int>> low_priority_counters;
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    static std::deque<std::once_flag> device_flags;
    static std::deque<std::atomic<int>> low_priority_counters;
    static std::deque<std::atomic<int>> high_priority_counters;
    static std::vector<std::vector<CUDAStreamInternals>> low_priority_streams;
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

    default_streams[i].device = i;
    default_streams[i].stream = DEFAULT_STREAM;
    low_priority_counters[i] = 0;
    high_priority_counters[i] = 0;
aten/src/ATen/cuda/CUDAStream.cpp (outdated)

        , DEFAULT_FLAGS
        , HIGH_PRIORITY));
    #else
      AT_CUDA_CHECK(cudaStreamCreateWithFlags(
ezyang left a comment:
At a high level, it all looks good; my comments are just lower-level nits. In terms of priority, the most important change for me is changing how counter wraparound works.
We might need some documentation noting that the streams used here should be short-lived. The discussion in the upstream issue was nicely detailed, but people are unlikely to see it once this merges.
I merged with master and made the following changes:
I did not:
@pytorchbot retest this please

@pytorchbot retest this please
facebook-github-bot left a comment:
ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR creates a stream pool per issue #9646. When a new stream is requested, the device it's requested on lazily creates two pools, one low priority and one high priority, of 32 streams each. Streams are returned from these pools round-robin: stream 0 is returned, then stream 1, ... then stream 31, then stream 0 again. This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.

Change notes:
- There are now 3 sets of streams per device: the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
- All stream refcounting has been removed (the pool pattern replaces it).
- Setting a stream now sets it on its device. Streams are associated with a device, so the previous requirement to specify that device was unnecessary.
- There is no exposure for setting the flags on a stream. This may seem like a regression, but the flag was always set to cudaStreamNonBlocking.
- Streams are now low or high priority, whereas previously the priority could be set with an integer. In practice, however, the range of priorities is -1 to 0 on the latest hardware: -1 is high priority, 0 is low priority (aka default priority). Low vs. high actually clarifies this behavior for anyone attempting finer separations. (E.g., if someone tried streams with priorities 0, 1, and 2, historically they would all have had priority 0, and the intended behavior would not have been respected.)
- Unused THCStream and THCState stream-related functions were removed.
- A new test of pooling behavior was added in stream_test.

fyi: colesbury, apaszke, goldsborough

Pull Request resolved: pytorch/pytorch#9938
Reviewed By: SsnL
Differential Revision: D9569036
Pulled By: ezyang
fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
Per pytorch/pytorch#9938, which fixes pytorch/pytorch#9646, CUDA streams are now cheap to create under PyTorch. Let's have the benchmarking function create one per run instead of requiring its callers to do so.