
Creates stream pool #9938

Closed
mruberry wants to merge 6 commits into pytorch:master from mruberry:stream_pool

Conversation

@mruberry
Collaborator

This PR creates a stream pool per issue #9646. When a new stream is requested, the device it's requested on lazily creates two pools of 32 streams each, one low priority and one high priority. Streams are returned from these pools round-robin: stream 0, then stream 1, ... then stream 31, then stream 0 again. This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.

Change notes:

  • There are now three sets of streams per device: the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
  • All stream refcounting has been removed (the pool pattern replaces it).
  • Setting a stream now sets it on its device. Streams are associated with a device, so the previous requirement to specify that device was redundant.
  • There is no longer a way to set the flags on a stream. This may also seem like a regression, but the flag was always set to cudaStreamNonBlocking.
  • Streams are now low or high priority, whereas previously the priority could be set with an integer. In practice, however, the range of priorities on the latest hardware is -1 to 0: -1 is high priority and 0 is low (default) priority. "Low vs. high" clarifies this behavior for anyone attempting finer separations. (E.g., if someone historically requested streams with priorities 0, 1, and 2, they would all actually get priority 0, and the intended separation would not be respected.)
  • Unused THCStream and THCState stream-related functions were removed.
  • A new test of pooling behavior was added in stream_test.

fyi: @colesbury, @apaszke, @goldsborough
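The round-robin pooling described above can be sketched in isolation (a hypothetical, simplified stand-in for the actual ATen code; plain ints replace cudaStream_t so the indexing logic runs without a GPU):

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the per-device pool described above: 32 streams
// per priority class, handed out round-robin. Plain ints stand in for
// cudaStream_t so the logic can be exercised without a GPU.
constexpr int kStreamsPerPool = 32;

struct StreamPool {
  std::array<int, kStreamsPerPool> streams{};  // stand-ins for cudaStream_t
  std::atomic<std::uint32_t> counter{0};

  StreamPool() {
    for (int i = 0; i < kStreamsPerPool; ++i) streams[i] = i;
  }

  // Round-robin: stream 0, 1, ..., 31, then 0 again.
  int next_stream() {
    const std::uint32_t raw = counter++;
    return streams[raw % kStreamsPerPool];
  }
};
```

A real pool would create each stream with cudaStreamCreateWithPriority and destroy the streams at shutdown; only the round-robin handout is shown here.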

static CUDAStreamInternals* default_streams;
static constexpr int STREAMS_PER_POOL = 32;
static constexpr unsigned int DEFAULT_FLAGS = cudaStreamNonBlocking;
static int HIGH_PRIORITY = 0;


@weiyangfb added the ready for review (this tag is deprecated) label Jul 31, 2018
@yf225
Contributor

yf225 commented Aug 14, 2018

@colesbury @apaszke Any reviews?

@yf225
Contributor

yf225 commented Aug 28, 2018

@mruberry we probably need a rebase for this PR

@colesbury, @apaszke, @goldsborough any suggestions?

@mruberry
Collaborator Author

Happy to rebase but we should get a review first.

Contributor

@apaszke left a comment


(Not a complete review. Some notes)

low_priority_streams[device].resize(STREAMS_PER_POOL);
high_priority_streams[device].resize(STREAMS_PER_POOL);

for (auto i = decltype(STREAMS_PER_POOL){0}; i < STREAMS_PER_POOL; ++i) {


// Non-default streams
static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;
static std::deque<std::atomic<int>> high_priority_counters;
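In isolation, the lazy per-device initialization these once_flag deques support looks roughly like this (a hypothetical, simplified sketch; one int per device stands in for the stream pools):

```cpp
#include <deque>
#include <mutex>

// Hypothetical sketch of lazy per-device initialization with std::once_flag,
// as in the snippet above. std::deque is used because std::once_flag is
// neither movable nor copyable, and deque never relocates its elements.
constexpr int kNumDevices = 4;
static std::deque<std::once_flag> device_flags(kNumDevices);
static std::deque<int> device_state(kNumDevices, 0);

void init_device(int device) {
  // The initializer runs at most once per device, even under concurrency.
  std::call_once(device_flags[device], [device] {
    device_state[device] = 1;  // would create the stream pools here
  });
}
```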


int modded = raw_idx % STREAMS_PER_POOL;
if (raw_idx >= STREAMS_PER_POOL && modded == 0) {
counter -= STREAMS_PER_POOL;
}


current_streams[device] = ptr;
}
const auto idx = get_idx(low_priority_counters[device]);
return &low_priority_streams[device][idx];



~CUDAStreamInternals() {
if (stream) cudaStreamDestroy(stream);
}



// Non-default streams
static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;


static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;
static std::deque<std::atomic<int>> high_priority_counters;
static std::vector<std::vector<CUDAStreamInternals>> low_priority_streams;


default_streams[i].device = i;
default_streams[i].stream = DEFAULT_STREAM;
low_priority_counters[i] = 0;
high_priority_counters[i] = 0;


, DEFAULT_FLAGS
, HIGH_PRIORITY));
#else
AT_CUDA_CHECK(cudaStreamCreateWithFlags(


ezyang previously requested changes Aug 28, 2018
Contributor

@ezyang left a comment


At a high level, it all looks good; my comments are just lower level nits. In terms of priority, the most important change for me is changing how counter wraparound works.

We might need some documentation noting that the streams used here should be short-lived. The discussion in the upstream issue was nicely detailed, but people are unlikely to see it once this merges.

@mruberry
Collaborator Author

Thanks for taking a look @ezyang, @apaszke. Suggestions look good. Really like your point about commenting, @ezyang. I'll get us an update soon (have to finish splitting the fusion compiler first).

@mruberry
Collaborator Author

I merged with master and made the following changes:

  • Added a note to CUDAStream.h and additional comments to CUDAStream.cpp, clarifying the use of counters and flags in particular.
  • Updated constant names per @goldsborough and guarded values per @ezyang; used arrays per @ezyang; also simplified high vs. low priority so these values are simply initialized properly (the prior approach was needlessly general).
  • Changed the atomic counters to uint32_t and simplified the round-robin logic by allowing overflow.
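The overflow-based simplification mentioned in the last bullet can be illustrated like this (a hypothetical sketch, not the actual patch):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical illustration of the change described above: with a uint32_t
// counter, fetch-and-increment is simply allowed to overflow (well-defined
// for unsigned types), and the pool index is the counter modulo the pool
// size. No explicit "subtract STREAMS_PER_POOL past the end" step is needed.
constexpr std::uint32_t kStreamsPerPool = 32;

std::uint32_t get_idx(std::atomic<std::uint32_t>& counter) {
  return counter++ % kStreamsPerPool;
}
```

Because 2^32 is a multiple of 32, the round-robin sequence stays consistent even across the wraparound point.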

I did not:

  • Update CUDAStreamInternals to use a unique_ptr with a custom deleter, which I agree is probably more elegant but not necessary right now.
  • Change the loop unroll, since I think it's good enough as is.
  • Merge the high and low priority acquisitions in CUDAStream_createStream(). The lines of code here could be reduced, but I think the current statement is clear and the logic duplication is very small.
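For reference, the unique_ptr-with-custom-deleter pattern mentioned in the first bullet looks roughly like this (a generic, hypothetical sketch; a counter stands in for cudaStreamDestroy so the behavior is observable without CUDA):

```cpp
#include <memory>

// Generic sketch of the pattern discussed above: a unique_ptr whose custom
// deleter releases a C-style resource. A counter replaces cudaStreamDestroy
// so destruction is observable without a GPU.
static int destroyed_count = 0;

struct FakeStream { int id; };

struct StreamDeleter {
  void operator()(FakeStream* s) const {
    ++destroyed_count;  // would be cudaStreamDestroy(s->stream)
    delete s;
  }
};

using StreamPtr = std::unique_ptr<FakeStream, StreamDeleter>;

StreamPtr make_stream(int id) {
  // Would be cudaStreamCreateWithPriority in the real code.
  return StreamPtr(new FakeStream{id});
}
```

The deleter runs automatically when the pointer goes out of scope, which removes the need for an explicit destructor on the owning type.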

@ezyang
Contributor

ezyang commented Aug 30, 2018

@pytorchbot retest this please


Contributor

@facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 30, 2018
Pull Request resolved: pytorch/pytorch#9938

Reviewed By: SsnL

Differential Revision: D9569036

Pulled By: ezyang

fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
@mruberry deleted the stream_pool branch September 25, 2018 16:41
int3 added a commit to int3/triton-cpu that referenced this pull request Jul 25, 2024
Per pytorch/pytorch#9938, which fixes
pytorch/pytorch#9646, CUDA streams are now
cheap to create under PyTorch. Let's have the benchmarking function
create one per run instead of requiring its callers to do so.
int3 added a commit to int3/triton-cpu that referenced this pull request Jul 26, 2024
int3 added a commit to int3/triton-cpu that referenced this pull request Aug 9, 2024
Jokeren pushed a commit to triton-lang/triton that referenced this pull request Aug 9, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
liuyunqi20 pushed a commit to flagos-ai/FlagTree that referenced this pull request Oct 21, 2025