Make CachingHostAllocator work with memory pools.#167507
galv wants to merge 14 commits into gh/galv/2/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167507
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job, 9 Unrelated Failures as of commit dfbb053 with merge base 3854d69.
NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I have found an issue while testing this locally. Still not quite there yet.
Both allocation to a CUDA graph's private pool via stream capture and allocation to a memory pool in non-stream-captured code are supported. In the case of stream capture, we refuse to reuse a host memory block as soon as record_event() is called on that block. This is to prevent a stream-captured CUDA kernel from reading different contents from a memory block than would be read if, counterfactually, this CUDA kernel were running eagerly on a CUDA stream. See pytorch/pytorch#161583 (comment) for elaboration. This is lacking test cases for pageable host memory copies. We must make sure that record_event() does not fail in that case. ghstack-source-id: 7f22464 Pull-Request: pytorch/pytorch#167507
|
The current issue is pageable host memory. It is legal to call copy_(non_blocking=True) on pageable host memory. Without loss of generality, suppose that t1 is a tensor backed by pageable host memory. If t1 was allocated during stream capture, the user would expect t1's backing memory to be kept alive (even if t1 itself dies), because those are the semantics of GPU and pinned CPU memory allocations. Meanwhile, if t1 was allocated before stream capture, it is the user's responsibility to keep t1 alive. We have no way to distinguish whether t1 was allocated before or after stream capture without doing something clunky like capturing the entire state of the process's memory map when capture_begin() is called. Even then, that does not work, since malloc() can reuse previously allocated pages. Fortunately, what we can do is use the CUDA APIs to distinguish whether a given host pointer points to pinned host memory or pageable host memory. In #167508, I can add a warning if a particular cudaMemcpyAsync() happens to read from pageable host memory. This is probably the best I can do without something weird like shadow memory.
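The pinned-vs-pageable distinction described above can be modeled in plain Python. The real implementation would ask the CUDA runtime about the pointer's attributes; the registry below, and every name in it, is invented purely to make the lookup logic visible:

```python
# Toy model of "is this host pointer pinned?". A real allocator would
# query the CUDA runtime for pointer attributes; here we model it with a
# registry of pinned [base, base + size) ranges. All names are hypothetical.
pinned_ranges = []  # list of (base, size) for pinned host allocations

def register_pinned(base, size):
    # Record a pinned allocation's address range.
    pinned_ranges.append((base, size))

def is_pinned(ptr):
    # A pointer is pinned if it falls inside any registered range.
    return any(base <= ptr < base + size for base, size in pinned_ranges)

register_pinned(0x1000, 0x100)
assert is_pinned(0x1000)       # start of the pinned range
assert is_pinned(0x10FF)       # last byte of the pinned range
assert not is_pinned(0x1100)   # pageable: a warning could fire here
```

In this sketch, a cudaMemcpyAsync() wrapper would call is_pinned() on its host source and emit the proposed warning when it returns False.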
ngimel
left a comment
Added a few comments, lmk if you want to change the pools design
// First, try to allocate from the free list of the chosen pool
auto* block = get_free_block(roundSize, pool);
if (block) {
  block->was_allocated_during_stream_capture_ = stream_is_capturing(get_current_stream());
Similarly, to protect perf of the common case, first check captures_underway. You may hide it in current_stream_is_capturing with no args, so inside that function you'd first check that captures_underway is non-empty, and then get the current stream and check its status.
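The suggested short-circuit can be sketched as follows. The function and its collaborators (captures_underway, the stream getter, the per-stream status query) are stand-ins for the allocator's internals, not PyTorch's actual API:

```python
# Sketch of ngimel's suggestion: check the cheap global condition
# (captures_underway non-empty) before the more expensive per-stream
# capture-status query. All names here are illustrative stand-ins.
def current_stream_is_capturing(captures_underway, get_current_stream,
                                stream_is_capturing):
    # Fast path for the common case: no capture anywhere, so we never
    # touch the current stream at all.
    if not captures_underway:
        return False
    return stream_is_capturing(get_current_stream())

# Demonstrate that the fast path skips the stream query entirely.
calls = []
def get_stream():
    calls.append("queried")
    return "s0"

assert current_stream_is_capturing([], get_stream, lambda s: True) is False
assert calls == []  # stream status was never queried on the fast path
assert current_stream_is_capturing(["capture0"], get_stream, lambda s: True) is True
assert calls == ["queried"]
```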
}
}
// TODO: Rethink how this is implemented. Should it take a pool id
Yes! I think it would be valuable to free just the pinned blocks associated with a pool
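The per-pool freeing suggested here could look like the following sketch. The block and pool bookkeeping is invented for illustration and does not mirror PyTorch's actual data structures:

```python
# Sketch of freeing only the pinned host blocks associated with one pool,
# rather than emptying the whole cache. Block/pool representation is
# hypothetical.
def empty_cache_for_pool(free_blocks, pool_id):
    """Split the free list into (kept, freed) for the given pool."""
    freed = [b for b in free_blocks if b["pool"] == pool_id]
    kept = [b for b in free_blocks if b["pool"] != pool_id]
    return kept, freed

blocks = [
    {"ptr": 1, "pool": "graph_a"},
    {"ptr": 2, "pool": "default"},
    {"ptr": 3, "pool": "graph_a"},
]
kept, freed = empty_cache_for_pool(blocks, "graph_a")
assert [b["ptr"] for b in freed] == [1, 3]  # only graph_a's blocks released
assert [b["ptr"] for b in kept] == [2]      # other pools untouched
```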
def test_unpinned_memory_use(self):
    # It is allowed to call copy_(non_blocking=True) on pageable
    # host memory. TODO: We should test that a warning is emitted
@eee4017 @eqy @syed-ahmed if you want to review this, now is the time to do so. I removed support for private pools outside of stream capture, which drastically reduces the size of the PR. The test cases have shrunk as well.
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 jobs have failed; the first few of them are: s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / test (default, 3, 10, linux.s390x), s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / test (default, 4, 10, linux.s390x), periodic / linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 7, 8, lf.linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Testing: `pytest -k test_split_with_sizes_copy_out test/test_torch.py` This is no longer needed after pytorch#167507 Fixes pytorch#169607 Pull Request resolved: pytorch#170710 Approved by: https://github.com/eqy
I initially wrote in #164264 that there was a missing wait_stream() call to put a stream into stream capture mode, but surprisingly, since I filed that issue the problem has been fixed. I was not able to locate the exact commit that coincidentally made that fix after a brief search. Since CachingHostAllocator has supported memory allocation during stream capture since #167507, the purpose of this PR is simply to make sure that the support does not regress. An important detail is that we need to make sure that CUDA graph still overlaps the all-gather and reduce-scatter streams with computation streams. To check for that, I applied this patch:

```
diff --git a/test/distributed/_composable/fsdp/test_fully_shard_training.py b/test/distributed/_composable/fsdp/test_fully_shard_training.py
index c0831d8..c0fecdf787d 100644
--- a/test/distributed/_composable/fsdp/test_fully_shard_training.py
+++ b/test/distributed/_composable/fsdp/test_fully_shard_training.py
@@ -1681,8 +1681,8 @@ class TestFullyShardCudaGraph(FSDPTest):
         device = torch.device(device_type.type, self.rank)
         torch.manual_seed(42)
         model = nn.Sequential(
-            nn.Linear(8, 8, bias=False),
-            nn.Linear(8, 8, bias=False),
+            nn.Linear(4096, 4096, bias=False),
+            nn.Linear(4096, 4096, bias=False),
         ).to(device)
         for param in model.parameters():
             dist.broadcast(param, src=0)
@@ -1694,7 +1694,7 @@ class TestFullyShardCudaGraph(FSDPTest):
         # warmup
         with torch.cuda.stream(stream):
-            input_tensor = torch.randn(4, 8, device=device)
+            input_tensor = torch.randn(4, 4096, device=device)
             output = model(input_tensor)
             output.sum().backward()
             model.zero_grad(set_to_none=True)
@@ -1711,7 +1711,7 @@ class TestFullyShardCudaGraph(FSDPTest):
         ]
         # equivalence check
-        with torch.cuda.stream(stream):
+        with torch.cuda.stream(stream), torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True) as prof:
             for _ in range(2):
                 replay_input = torch.randn(4, 8, device=device)
                 ref_output = model(replay_input)
@@ -1726,6 +1726,8 @@ class TestFullyShardCudaGraph(FSDPTest):
             for graph_grad, ref_grad in zip(static_output_grads, ref_grads):
                 self.assertTrue(torch.equal(graph_grad, ref_grad))
             model.zero_grad(set_to_none=True)
+        prof.step()
+        prof.export_chrome_trace(f"two_layer_fully_shard_cudagraph_{self.rank}.json")
 if __name__ == "__main__":
```

I then inspected the json file manually to check for overlap.

Closes #164264
Fixes #164264

Pull Request resolved: #171835
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Edward Yang <ezyang@meta.com>
Both allocation to a CUDA graph's private pool via stream capture and allocation to a memory pool in non-stream-captured code are supported. In the case of stream capture, we refuse to reuse a host memory block as soon as record_event() is called on that block. This is to prevent a stream-captured CUDA kernel from reading different contents from a memory block than would be read if, counterfactually, this CUDA kernel were running eagerly on a CUDA stream. See pytorch#161583 (comment) for elaboration. This is lacking test cases for pageable host memory copies. We must make sure that record_event() does not fail in that case. Pull Request resolved: pytorch#167507 Approved by: https://github.com/eqy, https://github.com/eee4017, https://github.com/ngimel
Only allocation to a CUDA graph's private pool via stream capture is supported. Allocation to a memory pool in non-stream-captured code is not supported; there is no obvious use case for that at this time. In stream capture, we refuse to reuse a host memory block as soon as record_event() is called on that block. This is to prevent a stream-captured CUDA kernel from reading different contents from a memory block than would be read if, counterfactually, this CUDA kernel were running eagerly on a CUDA stream. See pytorch/pytorch#161583 (comment) for elaboration. ghstack-source-id: d1e30fc Pull-Request: pytorch/pytorch#167507
… capturing (#174724) This matches the behavior of CUDACachingAllocator.cpp. If a user wants to prevent a pin_memory() call from a data loading thread from disrupting their stream capture, they use "thread_local" stream capture mode for that stream capture. Requested by @ngimel as a follow-up to #167507 Pull Request resolved: #174724 Approved by: https://github.com/ngimel
Stack from ghstack (oldest at bottom):
Both allocation to a cuda graph's private pool via stream capture and
allocation to a memory pool in non-stream-captured code are supported.
In the case of stream capture, we refuse to reuse a host memory block
as soon as record_event() is called on that block. This is to prevent
a stream-captured CUDA kernel from reading different contents from a
memory block than would be read if, counterfactually, this CUDA
kernel were running eagerly on a CUDA stream.
See
#161583 (comment)
for elaboration.
This is lacking test cases for pageable host memory copies. We must
make sure that record_event() does not fail in that case.
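The reuse rule described above — once record_event() is called on a block during capture, that block is never handed out again — can be modeled as a toy free list. All class and attribute names below are illustrative, not PyTorch's actual internals:

```python
# Toy model of the capture-safe reuse rule: a block that had an event
# recorded on it while a capture was underway is quarantined and never
# returned from the free list again. Names are hypothetical.
class Block:
    def __init__(self, ptr):
        self.ptr = ptr
        self.event_recorded_during_capture = False

class HostFreeList:
    def __init__(self):
        self.free = []

    def record_event(self, block, capturing):
        # During stream capture, mark the block as permanently tainted.
        if capturing:
            block.event_recorded_during_capture = True

    def release(self, block):
        self.free.append(block)

    def get_free_block(self):
        # Skip quarantined blocks; caller falls back to a fresh allocation.
        for b in self.free:
            if not b.event_recorded_during_capture:
                self.free.remove(b)
                return b
        return None

fl = HostFreeList()
a, b = Block(0xA), Block(0xB)
fl.record_event(a, capturing=True)   # a was touched by a captured kernel
fl.record_event(b, capturing=False)  # b was used eagerly
fl.release(a)
fl.release(b)
assert fl.get_free_block() is b      # a stays quarantined, b is reusable
assert fl.get_free_block() is None   # nothing reusable remains
```

This quarantine is what prevents a replayed graph from observing different block contents than the counterfactual eager execution would have seen.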