[CUDA][Green Context] Expose green context streams#171116

Closed
eqy wants to merge 14 commits into pytorch:main from eqy:greenstreamexposed

Conversation

@eqy
Collaborator

@eqy eqy commented Dec 22, 2025

Also uses a non-default stream in the green context, as passing around a default (null) stream seems sketchy.
The set/pop-context APIs still use the default stream.

cc @ptrblck @msaroufim @jerryzh168 @tinglvv @nWEIdia

@eqy eqy added module: cuda Related to torch.cuda, and CUDA support in general open source labels Dec 22, 2025
@eqy eqy requested a review from syed-ahmed as a code owner December 22, 2025 19:10
@eqy eqy added the release notes: cuda release notes category label Dec 22, 2025
@eqy eqy requested a review from Aidyn-A as a code owner December 22, 2025 19:10
@pytorch-bot

pytorch-bot bot commented Dec 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171116

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 38dca67 with merge base 5e30b70:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/b200 ciflow/h100 ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Dec 22, 2025
@@ -97,6 +100,7 @@ GreenContext::GreenContext(uint32_t device_id, uint32_t num_sms) {
green_ctx_ = std::exchange(other.green_ctx_, nullptr);
Collaborator

You should ifdef the entire move ctor and have these all be member initializers. Also that way the TORCH_CHECK error would properly give a stack trace instead of just terminating.
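The suggestion above, sketched with a placeholder handle type (`GreenContextSketch` and `Handle` are hypothetical names for illustration, not the actual PyTorch class; the real code would use `CUgreenCtx`, `CUcontext`, etc. behind the `HAS_CUDA_GREEN_CONTEXT()` ifdef):

```cpp
#include <cassert>
#include <utility>

// Placeholder for driver handle types like CUgreenCtx / CUcontext.
using Handle = void*;

class GreenContextSketch {
 public:
  GreenContextSketch() = default;
  explicit GreenContextSketch(Handle h) : green_ctx_(h) {}

  // Move ctor as suggested: every field has a member initializer, and
  // std::exchange in the init-list leaves the moved-from object null,
  // so nothing needs to run in the (possibly ifdef'd-out) body.
  GreenContextSketch(GreenContextSketch&& other) noexcept
      : green_ctx_(std::exchange(other.green_ctx_, nullptr)),
        context_(std::exchange(other.context_, nullptr)) {}

  Handle green_ctx() const { return green_ctx_; }
  Handle context() const { return context_; }

 private:
  Handle green_ctx_ = nullptr;  // member initializers give a defined
  Handle context_ = nullptr;    // default even when CUDA is disabled
};
```

With member initializers, a default-constructed or moved-from object is always in a well-defined null state, so a later TORCH_CHECK can fail cleanly with a stack trace instead of the process terminating on an uninitialized handle.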

CUgreenCtx green_ctx_ = nullptr;
CUcontext context_ = nullptr;
cudaStream_t parent_stream_ = nullptr;
CUstream green_ctx_stream_;
Collaborator

Shouldn't this also be nullptr initialized?

auto default_stream = c10::cuda::getDefaultCUDAStream();
ev.block(default_stream);
c10::cuda::setCurrentCUDAStream(default_stream);
auto green_ctx_stream = c10::cuda::getStreamFromExternal(green_ctx_stream_, device_id_);
Collaborator

can we create green_ctx_stream_ as a CUDAStream so you can use it directly here? nbd if no

Collaborator Author

Can just revert to using the default stream in this case given the comment below; no real reason this has to be the same stream as the one returned by Stream()


CUDAStream GreenContext::Stream() {
#if HAS_CUDA_GREEN_CONTEXT()
return c10::cuda::getStreamFromExternal(green_ctx_stream_, device_id_);
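The key property of wrapping an externally created raw handle this way is that the wrapper is non-owning. A toy sketch of that pattern (`ExternalStream` and `RawStream` are placeholder names for illustration, not the actual c10 types):

```cpp
#include <cassert>

// Placeholder for a raw driver handle like cudaStream_t / CUstream.
using RawStream = void*;

class ExternalStream {
 public:
  // Non-owning: whoever created `raw` (here, the green context) remains
  // responsible for its lifetime; the wrapper only carries the handle
  // together with the device index it belongs to.
  ExternalStream(RawStream raw, int device) : raw_(raw), device_(device) {}
  RawStream raw() const { return raw_; }
  int device_index() const { return device_; }

 private:
  RawStream raw_;
  int device_;
};
```

Because the wrapper never destroys the handle, whoever calls the driver create API has to manage teardown, which is what the leak discussion below is about.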
Collaborator

Ugh, this limits users to just one stream per green context? People are used to writing s1 = torch.cuda.Stream(); s2 = torch.cuda.Stream(); if ctx.Stream() has drastically different behavior this will be confusing. Also, is there a real reason for this?

Collaborator Author

I don't think so, if we make the tracking the user's responsibility

CUstream green_ctx_side_stream;
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuGreenCtxStreamCreate_(
&green_ctx_side_stream, green_ctx_, CU_STREAM_NON_BLOCKING, 0));
// implies we leak side-streams, but this has precedent in e.g., c10/cuda/CUDAStream.cpp
Collaborator

not really, CUDAStream.cpp creates a fixed number of streams; getStreamFromExternal implies that the external library that created the stream can also destroy it, but here we can potentially create and leak an unbounded number of streams, because it's very common to have code that creates and "destroys" streams like there's no tomorrow.
Can we instead go the CUDAStream.cpp route: precreate a fixed number of streams and dole them out as needed?
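The fixed-pool approach suggested here could look roughly like this (a sketch: `StreamPool`, `FakeStream`, and `kPoolSize` are hypothetical names, and the real code would fill the pool by calling cuGreenCtxStreamCreate once per slot at construction time, keeping the leak bounded by the pool size):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Placeholder for CUstream; the real pool would hold driver stream
// handles created once up front and never destroyed (a bounded leak,
// like the fixed pools in c10/cuda/CUDAStream.cpp).
struct FakeStream { int id; };

constexpr std::size_t kPoolSize = 4;  // fixed pool size, chosen for illustration

class StreamPool {
 public:
  StreamPool() {
    for (std::size_t i = 0; i < kPoolSize; ++i) {
      streams_[i] = FakeStream{static_cast<int>(i)};  // precreate every stream
    }
  }

  // Dole streams out round-robin; callers never destroy them, so the
  // total number of live streams stays bounded by kPoolSize no matter
  // how often user code "creates" and "destroys" streams.
  FakeStream next() { return streams_[idx_++ % kPoolSize]; }

 private:
  std::array<FakeStream, kPoolSize> streams_{};
  std::size_t idx_ = 0;
};
```

Handing out pool slots round-robin mirrors how CUDAStream.cpp recycles its per-device stream pools, trading stream uniqueness for a bounded resource footprint.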

@eqy
Collaborator Author

eqy commented Jan 5, 2026

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased greenstreamexposed onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout greenstreamexposed && git pull --rebase)

CUcontext context_ = nullptr;
cudaStream_t parent_stream_ = nullptr;
std::array<CUstream, kStreamPerGreenContextPool> green_ctx_streams_;
int32_t curr_stream_idx_ = -1;
Collaborator

needs to be atomic?
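If the next-stream selection can race across threads, a plain int32_t counter is a data race; an atomic fetch_add makes the round-robin index safe. A sketch (`next_stream_slot` and `kStreamsPerPool` are illustrative names, not the PR's actual identifiers):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kStreamsPerPool = 4;  // illustrative pool size (power of 2)

// Thread-safe round-robin index: fetch_add is a single atomic
// read-modify-write, so two threads can never observe the same value,
// unlike a plain `curr_stream_idx_++`, which is undefined behavior
// under concurrent access.
std::atomic<uint32_t> curr_stream_idx{0};

uint32_t next_stream_slot() {
  // Relaxed ordering suffices: only the counter itself must be atomic.
  // With a power-of-2 pool size, 2^32 % kStreamsPerPool == 0, so the
  // rotation stays consistent even when the uint32_t counter wraps.
  return curr_stream_idx.fetch_add(1, std::memory_order_relaxed) % kStreamsPerPool;
}
```

With a non-atomic counter, two threads incrementing concurrently could be handed the same pool slot (or tear the value entirely), which is presumably the concern behind this question.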

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 5, 2026
@eqy
Collaborator Author

eqy commented Jan 5, 2026

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 5, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Successfully rebased greenstreamexposed onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout greenstreamexposed && git pull --rebase)

@eqy
Collaborator Author

eqy commented Jan 12, 2026

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: Meta Internal-Only Changes Check

Details for Dev Infra team: raised by workflow job.

@eqy
Collaborator Author

eqy commented Jan 12, 2026

@pytorchmergebot merge -i

@pytorchmergebot
Collaborator

hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Jan 12, 2026
…71116)"

This reverts commit 5ecb35e.

Reverted pytorch#171116 on behalf of https://github.com/jeanschmidt due to breaks internal builds, see D90148243 ([comment](pytorch#171116 (comment)))
@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@ngimel
Collaborator

ngimel commented Jan 13, 2026

@pytorchbot merge -f "merge keeps timing out"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.



Labels

ci-no-td (Do not run TD on this PR), ciflow/b200, ciflow/h100, ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cuda (Related to torch.cuda, and CUDA support in general), open source, release notes: cuda (release notes category), Reverted, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
