Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167771
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit 5664cd8 with merge base 5a3930a. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
byte_multipler = 0
total_score = 0
for buf_name in buf_names:
```
More for my own understanding, but why would there be more than one buffer per access?
This is just the format of the normalized reads/writes:
pytorch/torch/_inductor/tiling_utils.py, lines 218 to 227 in a1db54f
It contains a mapping of sympy memory expr -> all buffers with that expression.
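As an illustration of why the scoring loops over several buffers per access, here is a toy sketch of that shape (invented names, plain-string keys standing in for sympy expressions; not the actual tiling_utils structures):

```python
# Normalized reads/writes: one index expression can be shared by several
# buffers, so the mapping is expr -> list of buffer names. (Hypothetical
# names; the real structure in torch/_inductor/tiling_utils.py keys on
# sympy expressions.)
index_to_bufs = {
    "x // 16": ["buf0", "buf3"],  # two buffers loaded with the same index
    "2 * x": ["buf1"],
}

def score_access(buf_names, bytes_per_elem=4):
    # Accumulate a byte-weighted score over every buffer sharing this
    # index expression -- hence a loop over buf_names per access.
    total_score = 0
    for _buf in buf_names:
        total_score += bytes_per_elem
    return total_score

scores = {expr: score_access(bufs) for expr, bufs in index_to_bufs.items()}
```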
```python
"""
Try to find the variable that this index is broadcast over.
A broadcast pattern is one where consecutive values of a variable
access the same memory location (e.g., x // 10).
"""
```
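The docstring's definition can be illustrated with a few lines of plain Python (my own toy check, not the inductor implementation):

```python
def is_broadcast_over(index_fn, var_values):
    """Return True if consecutive values of the variable map to the
    same memory location, i.e. the index is broadcast over it."""
    locs = [index_fn(v) for v in var_values]
    # Broadcast: at least one adjacent pair hits the same location.
    return any(a == b for a, b in zip(locs, locs[1:]))

# x // 10 revisits each location 10 times -> broadcast.
assert is_broadcast_over(lambda x: x // 10, range(100))
# 2 * x never repeats a location -> strided, not broadcast.
assert not is_broadcast_over(lambda x: 2 * x, range(100))
```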
In general, x % 10 can also be a broadcast? x % 10 vs. x // 10 just picks a different dimension.
In this case, x % 10 will be read as coalesced, so it should still work the same.
yea, but for stride * (x % 10), it's not coalesced
That won't be considered coalesced. See:
pytorch/torch/_inductor/tiling_utils.py, line 179 in cfe0425
I think the main thing that confuses me is that the code treats stride * (x % 10) and stride * (x // 10) differently, while they are both broadcasting.
x % 2048 is not broadcasting; it's only with a very small modulo that it is broadcasting. In this case we're treating both coalesced and broadcasting the same, so it shouldn't matter though.
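The small-vs-large modulo point can be sanity-checked numerically: a tiny modulus revisits only a handful of locations (so the data stays in cache, i.e. broadcast-like), while x % 2048 walks contiguously through every location in its period. A toy illustration, not inductor's actual classifier:

```python
def distinct_locations(index_fn, n):
    # Count how many distinct memory locations the index touches
    # over n consecutive values of the variable.
    return len({index_fn(i) for i in range(n)})

N = 1 << 16
# A tiny modulus revisits a handful of locations: effectively broadcast,
# since the data stays resident in cache.
assert distinct_locations(lambda x: x % 10, N) == 10
# x % 2048 touches 2048 distinct locations, contiguously within each
# period, so it reads as coalesced rather than broadcast.
assert distinct_locations(lambda x: x % 2048, N) == 2048
# x // 16 is the broadcast case: 16 consecutive x per location.
assert distinct_locations(lambda x: x // 16, N) == N // 16
```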
```python
import torch

@torch.compile
def f(x, y):
    return x[::2, None] + y[None, ::4]

x = torch.randn(1024, device="cuda")
y = torch.randn(2048, device="cuda")
f(x, y)
```
generates:
```python
@triton.jit
def triton_poi_fused_add_slice_unsqueeze_0(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 262144
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = tl.full([XBLOCK], True, tl.int1)[:]
    x1 = xindex // 512
    x0 = (xindex % 512)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (2*x1), None, eviction_policy='evict_last')
    tmp1 = tl.load(in_ptr1 + (4*x0), None, eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x2), tmp2, None)
```
Admittedly, the generated code replaces xindex // 512 with x1 and xindex % 512 with x0.
Maybe a more complex example can trigger the case where xindex // 512 and xindex % 512 show up in the memory address expression directly, but the tiny example above already shows the idea.
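The access pattern of the example kernel can be reproduced outside Triton by replaying its index arithmetic with plain integers; this shows the in_ptr0 load is broadcast over x0 (the same address for a whole run of 512 threads) while the in_ptr1 load is strided:

```python
# Replay the kernel's index arithmetic for the first two "rows" of 512.
addrs0 = []  # element offsets read from in_ptr0
addrs1 = []  # element offsets read from in_ptr1
for xindex in range(1024):
    x1 = xindex // 512
    x0 = xindex % 512
    addrs0.append(2 * x1)
    addrs1.append(4 * x0)

# in_ptr0's address is constant across each run of 512 consecutive
# threads: the load is broadcast over x0 (hence evict_last is fine,
# the value is reused from cache).
assert set(addrs0[:512]) == {0} and set(addrs0[512:]) == {2}
# in_ptr1's address advances by a stride of 4 elements per thread:
# not unit-stride, but it never repeats within a row.
assert addrs1[:4] == [0, 4, 8, 12]
```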
Fix for pytorch#166653. Two fixes:
- We were inducing a split for broadcasted loads, e.g. (x // 16). While a split of 16 here will make the load coalesced in one of the tile vars, since the load is already in cache it's not worth splitting, and it would make the other tile var load from memory that isn't in cache.
- Add a slight term for uncoalesced memory. This prevents doing tiling for loads which are a small % of the overall kernel.

Pull Request resolved: pytorch#167771
Approved by: https://github.com/v0i0
ghstack-source-id: 0102654 Pull Request resolved: pytorch/pytorch#167771
@pytorchbot revert -c "weird"
❌ 🤖 pytorchbot command failed: Try
@pytorchbot revert -m "needs one fix" -c weird
@pytorchbot successfully started a revert job. Check the current status here.
@eellison your PR has been successfully reverted.
This reverts commit 7ede33b. Reverted #167771 on behalf of https://github.com/eellison due to needs one fix ([comment](#167771 (comment)))
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: Lint / Test collect_env (with_torch, linux.24_04.4x), inductor / inductor-cpu-test / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.8xlarge.amx), inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 2, 2, linux.8xlarge.amx), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: dae7771 Pull Request resolved: pytorch/pytorch#167771
Stack from ghstack (oldest at bottom):
Fix for #166653.
Two fixes:
- We were inducing a split for broadcasted loads, e.g. (x // 16). While a split of 16 here will make the load coalesced in one of the tile vars, since the load is already in cache it's not worth splitting, and it would make the other tile var load from memory that isn't in cache.
- Add a slight term for uncoalesced memory. This prevents doing tiling for loads which are a small % of the overall kernel.
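A hypothetical sketch of the second fix's idea, with invented names and weights (not the actual inductor scoring code): add a small penalty for uncoalesced bytes so tiling is only chosen when the coalesced traffic it creates is a meaningful fraction of the kernel:

```python
def should_tile(coalesced_bytes, uncoalesced_bytes, total_bytes,
                uncoalesced_weight=0.05, min_fraction=0.1):
    """Illustrative heuristic (invented thresholds): skip tiling when
    the loads it would coalesce are a small fraction of the kernel's
    overall memory traffic."""
    # Penalize the bytes the split would leave uncoalesced.
    benefit = coalesced_bytes - uncoalesced_weight * uncoalesced_bytes
    return benefit > min_fraction * total_bytes

# A load that is only ~1% of kernel traffic: not worth retiling.
assert not should_tile(coalesced_bytes=100, uncoalesced_bytes=0,
                       total_bytes=10_000)
# A dominant load: tiling pays off despite some uncoalesced accesses.
assert should_tile(coalesced_bytes=8_000, uncoalesced_bytes=500,
                   total_bytes=10_000)
```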
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben