
Creates stream pool #9938

Closed
mruberry wants to merge 6 commits into pytorch:master from mruberry:stream_pool

Conversation

@mruberry
Collaborator

This PR creates a stream pool per issue #9646. When a new stream is requested, the device it's requested on lazily creates two pools of 32 streams each, one low priority and one high priority. Streams are returned from these pools round-robin: stream 0, then stream 1, ... then stream 31, then stream 0 again. This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.

Change notes:

  • There are now three sets of streams per device: the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
  • All stream refcounting has been removed (the pool pattern replaces it).
  • Setting a stream now sets it on its device. Streams are associated with a device, so the previous requirement to specify that device was redundant.
  • There is no longer a way to set the flags on a stream. This may also seem like a regression, but the flag was always set to cudaStreamNonBlocking.
  • Streams are now low or high priority, whereas previously the priority could be set with an integer. In practice, however, the range of priorities on the latest hardware is -1 to 0: -1 is high priority and 0 is low (default) priority. "Low vs. high" clarifies this behavior for anyone attempting finer separations. (E.g., if someone historically requested streams with priorities 0, 1, and 2, they would all actually get priority 0, and the intended separation would not be respected.)
  • Unused THCStream and THCState stream-related functions were removed.
  • A new test of pooling behavior was added in stream_test.

fyi: @colesbury, @apaszke, @goldsborough
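The round-robin pooling described above can be sketched in isolation (a hypothetical, simplified stand-in for the actual ATen code; plain ints replace cudaStream_t so the indexing logic runs without a GPU):

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the per-device pool described above: 32 streams
// per priority class, handed out round-robin. Plain ints stand in for
// cudaStream_t so the logic can be exercised without a GPU.
constexpr int kStreamsPerPool = 32;

struct StreamPool {
  std::array<int, kStreamsPerPool> streams{};  // stand-ins for cudaStream_t
  std::atomic<std::uint32_t> counter{0};

  StreamPool() {
    for (int i = 0; i < kStreamsPerPool; ++i) streams[i] = i;
  }

  // Round-robin: stream 0, 1, ..., 31, then 0 again.
  int next_stream() {
    const std::uint32_t raw = counter++;
    return streams[raw % kStreamsPerPool];
  }
};
```

A real pool would create each stream with cudaStreamCreateWithPriority and destroy the streams at shutdown; only the round-robin handout is shown here.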

static CUDAStreamInternals* default_streams;
static constexpr int STREAMS_PER_POOL = 32;
static constexpr unsigned int DEFAULT_FLAGS = cudaStreamNonBlocking;
static int HIGH_PRIORITY = 0;


@weiyangfb added the ready for review (this tag is deprecated) label Jul 31, 2018
@yf225
Contributor

yf225 commented Aug 14, 2018

@colesbury @apaszke Any reviews?

@yf225
Contributor

yf225 commented Aug 28, 2018

@mruberry we probably need a rebase for this PR

@colesbury, @apaszke, @goldsborough any suggestions?

@mruberry
Collaborator Author

Happy to rebase but we should get a review first.

Contributor

@apaszke left a comment


(Not a complete review. Some notes)

low_priority_streams[device].resize(STREAMS_PER_POOL);
high_priority_streams[device].resize(STREAMS_PER_POOL);

for (auto i = decltype(STREAMS_PER_POOL){0}; i < STREAMS_PER_POOL; ++i) {


// Non-default streams
static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;
static std::deque<std::atomic<int>> high_priority_counters;
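In isolation, the lazy per-device initialization these once_flag deques support looks roughly like this (a hypothetical, simplified sketch; one int per device stands in for the stream pools):

```cpp
#include <deque>
#include <mutex>

// Hypothetical sketch of lazy per-device initialization with std::once_flag,
// as in the snippet above. std::deque is used because std::once_flag is
// neither movable nor copyable, and deque never relocates its elements.
constexpr int kNumDevices = 4;
static std::deque<std::once_flag> device_flags(kNumDevices);
static std::deque<int> device_state(kNumDevices, 0);

void init_device(int device) {
  // The initializer runs at most once per device, even under concurrency.
  std::call_once(device_flags[device], [device] {
    device_state[device] = 1;  // would create the stream pools here
  });
}
```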


int modded = raw_idx % STREAMS_PER_POOL;
if (raw_idx >= STREAMS_PER_POOL && modded == 0) {
counter -= STREAMS_PER_POOL;
}


current_streams[device] = ptr;
}
const auto idx = get_idx(low_priority_counters[device]);
return &low_priority_streams[device][idx];



~CUDAStreamInternals() {
if (stream) cudaStreamDestroy(stream);
}



// Non-default streams
static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;


static std::deque<std::once_flag> device_flags;
static std::deque<std::atomic<int>> low_priority_counters;
static std::deque<std::atomic<int>> high_priority_counters;
static std::vector<std::vector<CUDAStreamInternals>> low_priority_streams;


default_streams[i].device = i;
default_streams[i].stream = DEFAULT_STREAM;
low_priority_counters[i] = 0;
high_priority_counters[i] = 0;


, DEFAULT_FLAGS
, HIGH_PRIORITY));
#else
AT_CUDA_CHECK(cudaStreamCreateWithFlags(


ezyang previously requested changes Aug 28, 2018
Contributor

@ezyang left a comment


At a high level, it all looks good; my comments are just lower level nits. In terms of priority, the most important change for me is changing how counter wraparound works.

We might need some documentation noting that the streams used here should be short-lived. The discussion in the upstream issue was nicely detailed, but people are unlikely to see it once this merges.

@mruberry
Collaborator Author

Thanks for taking a look @ezyang, @apaszke. Suggestions look good. Really like your point about commenting, @ezyang. I'll get us an update soon (have to finish splitting the fusion compiler first).

@mruberry
Collaborator Author

I merged with master and made the following changes:

  • Added a note to CUDAStream.h and additional comments to CUDAStream.cpp, clarifying the use of counters and flags in particular.
  • Updated constant names per @goldsborough and guarded values per @ezyang; used arrays per @ezyang; also simplified high vs. low priority so these values are simply initialized properly (the prior approach was needlessly general).
  • Changed the atomic counters to uint32_t and simplified the round-robin logic by allowing overflow.
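The overflow-based simplification mentioned in the last bullet can be illustrated like this (a hypothetical sketch, not the actual patch):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical illustration of the change described above: with a uint32_t
// counter, fetch-and-increment is simply allowed to overflow (well-defined
// for unsigned types), and the pool index is the counter modulo the pool
// size. No explicit "subtract STREAMS_PER_POOL past the end" step is needed.
constexpr std::uint32_t kStreamsPerPool = 32;

std::uint32_t get_idx(std::atomic<std::uint32_t>& counter) {
  return counter++ % kStreamsPerPool;
}
```

Because 2^32 is a multiple of 32, the round-robin sequence stays consistent even across the wraparound point.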

I did not:

  • Update CUDAStreamInternals to use a unique_ptr with a custom deleter, which I agree is probably more elegant but not necessary right now.
  • Change the loop unroll, since I think it's good enough as is.
  • Merge the high and low priority acquisitions in CUDAStream_createStream(). The lines of code here could be reduced, but I think the current statement is clear and the logic duplication is very small.
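For reference, the unique_ptr-with-custom-deleter pattern mentioned in the first bullet looks roughly like this (a generic, hypothetical sketch; a counter stands in for cudaStreamDestroy so the behavior is observable without CUDA):

```cpp
#include <memory>

// Generic sketch of the pattern discussed above: a unique_ptr whose custom
// deleter releases a C-style resource. A counter replaces cudaStreamDestroy
// so destruction is observable without a GPU.
static int destroyed_count = 0;

struct FakeStream { int id; };

struct StreamDeleter {
  void operator()(FakeStream* s) const {
    ++destroyed_count;  // would be cudaStreamDestroy(s->stream)
    delete s;
  }
};

using StreamPtr = std::unique_ptr<FakeStream, StreamDeleter>;

StreamPtr make_stream(int id) {
  // Would be cudaStreamCreateWithPriority in the real code.
  return StreamPtr(new FakeStream{id});
}
```

The deleter runs automatically when the pointer goes out of scope, which removes the need for an explicit destructor on the owning type.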

@ezyang
Contributor

ezyang commented Aug 30, 2018

@pytorchbot retest this please


Contributor

@facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 30, 2018
Pull Request resolved: pytorch/pytorch#9938

Reviewed By: SsnL

Differential Revision: D9569036

Pulled By: ezyang

fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
@mruberry deleted the stream_pool branch September 25, 2018 16:41
int3 added a commit to int3/triton-cpu that referenced this pull request Jul 25, 2024
Per pytorch/pytorch#9938, which fixes
pytorch/pytorch#9646, CUDA streams are now
cheap to create under PyTorch. Let's have the benchmarking function
create one per run instead of requiring its callers to do so.
int3 added a commit to int3/triton-cpu that referenced this pull request Jul 26, 2024
int3 added a commit to int3/triton-cpu that referenced this pull request Aug 9, 2024
Jokeren pushed a commit to triton-lang/triton that referenced this pull request Aug 9, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
liuyunqi20 pushed a commit to flagos-ai/FlagTree that referenced this pull request Oct 21, 2025