
[DataTiling] Switch default to start from the DispatchCreation phase. #21441

Merged
hanhanW merged 6 commits into iree-org:main from hanhanW:dt-fusion-default
Nov 14, 2025
Conversation

@hanhanW (Contributor) commented Jul 21, 2025

The revision updates the default behavior of data-tiling (i.e., the `--iree-opt-data-tiling` flag) to start from the DispatchCreation phase instead of the GlobalOptimization phase. The main benefit of this path is that it enables more fusion opportunities and supports multiple devices.

The path that starts from the GlobalOptimization phase will remain available for a while. Users can switch back to the legacy path with the `--iree-global-opt-data-tiling` flag.

Feel free to try the new path and file issues. We've seen improvements in the SDXL CLIP model on the new path. We're happy to provide guidance if users have any questions about the new standard data-tiling optimization.
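For reference, hypothetical invocations selecting each path might look like the following (the model file and any target-backend flags are placeholders; only the two data-tiling flags are taken from the description above):

```shell
# New default: data-tiling starts from the DispatchCreation phase.
iree-compile --iree-opt-data-tiling model.mlir -o model.vmfb

# Opt back into the legacy path that starts from GlobalOptimization.
iree-compile --iree-global-opt-data-tiling model.mlir -o model.vmfb
```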

Website Doc Preview: https://iree.dev/reference/optimization-options/#global-optimization-iree-global-optimization-opt-level

ci-extra: test_torch

hanhanW added a commit that referenced this pull request Jul 23, 2025
The compiler's static shape inference is smart enough to produce partially dynamic shapes during lowering. This makes data-tiling fusion struggle: the shapes are expected to be dynamic, but some dimensions are inferred to static values in the Stream AnnotateDispatchAssumptions pass. That leads to a `tensor.cast -> set_encoding -> tensor.cast` sequence in a dispatch, while we expect the bindings to have encoded tensor types. E.g.,

Input IR:

```mlir
%0 = iree_tensor_ext.dispatch_load ... tensor<?x?xi8>
%1 = set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #encoding>
iree_tensor_ext.dispatch_store %1, ... tensor<?x?xi8, #encoding> ->
  ... tensor<?x?xi8, #encoding>
```

After annotation:

```mlir
%0 = iree_tensor_ext.dispatch_load ... tensor<4x?xi8>
%cast = tensor.cast %0 : tensor<4x?xi8> -> tensor<?x?xi8>
%1 = set_encoding %cast : tensor<?x?xi8> -> tensor<?x?xi8, #encoding>
%cast_0 = tensor.cast %1 : tensor<?x?xi8, #encoding> to tensor<4x5xi8>
iree_tensor_ext.dispatch_store %cast_0, ... tensor<4x5xi8> ->
  ... tensor<?x?xi8, #encoding>
```

It is hard to materialize the encodings when a cast op is present.

Given that the original goal is testing dynamic shape, modifying the
input program is an easier fix.
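For illustration (hypothetical IR, following the shapes in the example above): with a fully static test input, the annotated dispatch carries encoded static tensor types directly and no casts appear, so materialization can proceed:

```mlir
%0 = iree_tensor_ext.dispatch_load ... tensor<4x5xi8>
%1 = set_encoding %0 : tensor<4x5xi8> -> tensor<4x5xi8, #encoding>
iree_tensor_ext.dispatch_store %1, ... tensor<4x5xi8, #encoding> ->
  ... tensor<4x5xi8, #encoding>
```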

The issue was observed in #21441.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
@hanhanW force-pushed the dt-fusion-default branch 4 times, most recently from efb04d1 to 8932f5e on July 29, 2025 22:13
@hanhanW force-pushed the dt-fusion-default branch 6 times, most recently from c8f64d5 to 510440c on August 6, 2025 18:13
@hanhanW force-pushed the dt-fusion-default branch 5 times, most recently from 370c5d9 to 9b48bf9 on August 11, 2025 22:44
@hanhanW (Contributor, Author) commented Aug 11, 2025

Here is the failing dispatch on the GPU side: https://gist.github.com/hanhanW/af6eb4b01ef2be2037e929be223a3fed

I think we need to replace the read-only destination tensor with `tensor.empty` for the elementwise op.
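A minimal sketch of the idea (hypothetical ops and shapes; the real failing dispatch is in the gist above): instead of using a read-only binding as the destination, the elementwise op would write into a fresh destination tensor:

```mlir
// Before: a read-only tensor is used as the destination (problematic).
%out = linalg.generic ... outs(%readonly : tensor<4x5xf32>) { ... }

// After: a fresh tensor.empty serves as the init/destination.
%empty = tensor.empty() : tensor<4x5xf32>
%out = linalg.generic ... ins(%readonly : tensor<4x5xf32>)
         outs(%empty : tensor<4x5xf32>) { ... }
```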

@hanhanW force-pushed the dt-fusion-default branch 2 times, most recently from 9ce9a06 to 7c370dd on August 12, 2025 23:55
@hanhanW force-pushed the dt-fusion-default branch 5 times, most recently from 946c2d8 to 02a3f01 on August 14, 2025 21:09
@hanhanW force-pushed the dt-fusion-default branch 2 times, most recently from b15c16f to 7f03dc3 on August 22, 2025 17:28
@hanhanW (Contributor, Author) commented Oct 23, 2025

Folks, I plan to land the PR by the end of the week. @MaheshRavishankar can you help review the PR?

@egebeysel (Contributor) commented:

I checked how DT fusion looks with SVE and unfortunately it currently fails; I'll take a deeper dive into the issues later. But as mentioned, that's WIP and obviously not a blocker. I'd be curious to see a performance chart with DT vs. DT-fusion though, if you have one :)

@hanhanW (Contributor, Author) commented Oct 23, 2025

> I checked how DT fusion looks with SVE and unfortunately it currently fails; I'll take a deeper dive into the issues later. But as mentioned, that's WIP and obviously not a blocker. I'd be curious to see a performance chart with DT vs. DT-fusion though, if you have one :)

Does it work with the original path? I thought it wasn't working because of the vscale ops.

@hanhanW (Contributor, Author) commented Oct 23, 2025

The only perf data I have is from CI: https://github.com/iree-org/iree/actions/runs/18733272191?pr=21441

Without the patch:

[screenshot: benchmark results without the patch]

With the patch:

[screenshot: benchmark results with the patch]

The binary size drops a lot after the rebase; I'll update it.

@JerryShih (Contributor) commented:

@hanhanW
Will const-eval still work when data-tiling moves to the dispatch-creation phase (especially for mmt4d RHS/LHS packing)?

@Max191 (Contributor) commented Oct 27, 2025

> @hanhanW Will const-eval still work when data-tiling moves to the dispatch-creation phase (especially for mmt4d RHS/LHS packing)?

Data tiling starting from dispatch creation uses the late-materialization path. Const-eval is not yet implemented for late materialization, but const-expr hoisting should still work.

@hanhanW (Contributor, Author) commented Oct 27, 2025

I may start the const-eval work after I finish the memory footprint issue. The two are not directly related, but both are important for data-tiling; otherwise, things may crash when data-tiling is enabled.

Here is the initial breakdown for the const-eval work: https://discord.com/channels/689900678990135345/1428131175528136764. The intention was to have a broader discussion in that thread, but it hasn't happened. I will open an issue for it when I get cycles.

@hanhanW (Contributor, Author) commented Nov 5, 2025

Rebased to trigger the new torch test suites; I'll update the golden values in a follow-up commit (within this PR).

@hanhanW (Contributor, Author) commented Nov 7, 2025

We recently improved the CPU backends. The golden value for the CLIP model is 141 ms (old path) vs. 133 ms (new path), FYI.

@hanhanW (Contributor, Author) commented Nov 8, 2025

I'll land this next Wednesday.

@hanhanW (Contributor, Author) commented Nov 13, 2025

I can't land this because shark10 did not pick up the job today... I think it is dead, and I have already reached out to the POC who can help with the machine issue.

@hanhanW hanhanW merged commit 392931d into iree-org:main Nov 14, 2025
50 of 52 checks passed
@hanhanW hanhanW deleted the dt-fusion-default branch November 14, 2025 18:32
lialan added a commit that referenced this pull request Nov 17, 2025
bangtianliu pushed a commit to bangtianliu/iree that referenced this pull request Nov 19, 2025
pstarkcdpr pushed a commit to pstarkcdpr/iree that referenced this pull request Nov 28, 2025
AWoloszyn pushed a commit that referenced this pull request Dec 1, 2025