Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167771
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 5664cd8 with merge base 5a3930a. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
byte_multipler = 0
total_score = 0
for buf_name in buf_names:
```
More for my own understanding, but why would there be more than one buffer per access?
This is just the format of the normalized reads/writes:
pytorch/torch/_inductor/tiling_utils.py, lines 218 to 227 in a1db54f
It contains a mapping of sympy memory expr -> all buffers with that expression.
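As an illustration of why the scoring loops over several buffers per access, here is a toy sketch of that shape (invented names, plain-string keys standing in for sympy expressions; not the actual tiling_utils structures):

```python
# Normalized reads/writes: one index expression can be shared by several
# buffers, so the mapping is expr -> list of buffer names. (Hypothetical
# names; the real structure in torch/_inductor/tiling_utils.py keys on
# sympy expressions.)
index_to_bufs = {
    "x // 16": ["buf0", "buf3"],  # two buffers loaded with the same index
    "2 * x": ["buf1"],
}

def score_access(buf_names, bytes_per_elem=4):
    # Accumulate a byte-weighted score over every buffer sharing this
    # index expression -- hence a loop over buf_names per access.
    total_score = 0
    for _buf in buf_names:
        total_score += bytes_per_elem
    return total_score

scores = {expr: score_access(bufs) for expr, bufs in index_to_bufs.items()}
```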
```python
"""
Try to find the variable that this index is broadcast over.
A broadcast pattern is one where consecutive values of a variable
access the same memory location (e.g., x // 10).
"""
```
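The docstring's definition can be illustrated with a few lines of plain Python (my own toy check, not the inductor implementation):

```python
def is_broadcast_over(index_fn, var_values):
    """Return True if consecutive values of the variable map to the
    same memory location, i.e. the index is broadcast over it."""
    locs = [index_fn(v) for v in var_values]
    # Broadcast: at least one adjacent pair hits the same location.
    return any(a == b for a, b in zip(locs, locs[1:]))

# x // 10 revisits each location 10 times -> broadcast.
assert is_broadcast_over(lambda x: x // 10, range(100))
# 2 * x never repeats a location -> strided, not broadcast.
assert not is_broadcast_over(lambda x: 2 * x, range(100))
```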
In general, x % 10 can also be a broadcast? x % 10 vs. x // 10 just picks a different dimension.
In this case, x % 10 will be read as coalesced, so it should still work the same.
yea, but for stride * (x % 10), it's not coalesced
That won't be considered coalesced. See:
pytorch/torch/_inductor/tiling_utils.py, line 179 in cfe0425
I think the main thing that confuses me is that the code treats stride * (x % 10) and stride * (x // 10) differently, while they are both broadcasting.
x % 2048 is not broadcasting; it's only with a very small modulo that it is broadcasting. In this case we're treating both coalesced and broadcasting the same, so it shouldn't matter though.
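The small-vs-large modulo point can be sanity-checked numerically: a tiny modulus revisits only a handful of locations (so the data stays in cache, i.e. broadcast-like), while x % 2048 walks contiguously through every location in its period. A toy illustration, not inductor's actual classifier:

```python
def distinct_locations(index_fn, n):
    # Count how many distinct memory locations the index touches
    # over n consecutive values of the variable.
    return len({index_fn(i) for i in range(n)})

N = 1 << 16
# A tiny modulus revisits a handful of locations: effectively broadcast,
# since the data stays resident in cache.
assert distinct_locations(lambda x: x % 10, N) == 10
# x % 2048 touches 2048 distinct locations, contiguously within each
# period, so it reads as coalesced rather than broadcast.
assert distinct_locations(lambda x: x % 2048, N) == 2048
# x // 16 is the broadcast case: 16 consecutive x per location.
assert distinct_locations(lambda x: x // 16, N) == N // 16
```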
```python
import torch

@torch.compile
def f(x, y):
    return x[::2, None] + y[None, ::4]

x = torch.randn(1024, device="cuda")
y = torch.randn(2048, device="cuda")
f(x, y)
```
generates:
```python
@triton.jit
def triton_poi_fused_add_slice_unsqueeze_0(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 262144
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = tl.full([XBLOCK], True, tl.int1)[:]
    x1 = xindex // 512
    x0 = (xindex % 512)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (2*x1), None, eviction_policy='evict_last')
    tmp1 = tl.load(in_ptr1 + (4*x0), None, eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x2), tmp2, None)
```
Admittedly, the generated code replaces xindex // 512 with x1 and xindex % 512 with x0.
Maybe a more complex example can trigger the case where xindex // 512 and xindex % 512 show up in the memory address expression directly, but the tiny example above already shows the idea.
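The access pattern of the example kernel can be reproduced outside Triton by replaying its index arithmetic with plain integers; this shows the in_ptr0 load is broadcast over x0 (the same address for a whole run of 512 threads) while the in_ptr1 load is strided:

```python
# Replay the kernel's index arithmetic for the first two "rows" of 512.
addrs0 = []  # element offsets read from in_ptr0
addrs1 = []  # element offsets read from in_ptr1
for xindex in range(1024):
    x1 = xindex // 512
    x0 = xindex % 512
    addrs0.append(2 * x1)
    addrs1.append(4 * x0)

# in_ptr0's address is constant across each run of 512 consecutive
# threads: the load is broadcast over x0 (hence evict_last is fine,
# the value is reused from cache).
assert set(addrs0[:512]) == {0} and set(addrs0[512:]) == {2}
# in_ptr1's address advances by a stride of 4 elements per thread:
# not unit-stride, but it never repeats within a row.
assert addrs1[:4] == [0, 4, 8, 12]
```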
Fix for pytorch#166653. Two fixes:
- We were inducing a split for broadcasted loads, e.g. (x // 16). While a split of 16 here will make the load coalesced in one of the tile vars, since the load is already in cache it's not worth splitting, and it would make the other tile var load from memory that isn't in cache.
- Add a slight term for uncoalesced memory. This prevents doing tiling for loads which are a small % of the overall kernel.

Pull Request resolved: pytorch#167771
Approved by: https://github.com/v0i0
ghstack-source-id: 0102654 Pull Request resolved: pytorch/pytorch#167771
@pytorchbot revert -c "weird"
❌ 🤖 pytorchbot command failed: Try
@pytorchbot revert -m "needs one fix" -c weird
@pytorchbot successfully started a revert job. Check the current status here.
@eellison your PR has been successfully reverted.
This reverts commit 7ede33b. Reverted #167771 on behalf of https://github.com/eellison due to needs one fix ([comment](#167771 (comment)))
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: Lint / Test collect_env (with_torch, linux.24_04.4x), inductor / inductor-cpu-test / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.8xlarge.amx), inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 2, 2, linux.8xlarge.amx), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: dae7771 Pull Request resolved: pytorch/pytorch#167771
Stack from ghstack (oldest at bottom):
Fix for #166653.
Two fixes:
- We were inducing a split for broadcasted loads, e.g. (x // 16). While a split of 16 here will make the load coalesced in one of the tile vars, since the load is already in cache it's not worth splitting, and it would make the other tile var load from memory that isn't in cache.
- Add a slight term for uncoalesced memory. This prevents doing tiling for loads which are a small % of the overall kernel.
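A hypothetical sketch of the second fix's idea, with invented names and weights (not the actual inductor scoring code): add a small penalty for uncoalesced bytes so tiling is only chosen when the coalesced traffic it creates is a meaningful fraction of the kernel:

```python
def should_tile(coalesced_bytes, uncoalesced_bytes, total_bytes,
                uncoalesced_weight=0.05, min_fraction=0.1):
    """Illustrative heuristic (invented thresholds): skip tiling when
    the loads it would coalesce are a small fraction of the kernel's
    overall memory traffic."""
    # Penalize the bytes the split would leave uncoalesced.
    benefit = coalesced_bytes - uncoalesced_weight * uncoalesced_bytes
    return benefit > min_fraction * total_bytes

# A load that is only ~1% of kernel traffic: not worth retiling.
assert not should_tile(coalesced_bytes=100, uncoalesced_bytes=0,
                       total_bytes=10_000)
# A dominant load: tiling pays off despite some uncoalesced accesses.
assert should_tile(coalesced_bytes=8_000, uncoalesced_bytes=500,
                   total_bytes=10_000)
```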
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben