[DataTiling] Switch default to start from the DispatchCreation phase. #21441
hanhanW merged 6 commits into iree-org:main
Conversation
The compiler is very smart about static shape inference and can produce partially dynamic shapes during lowering. This makes data-tiling fusion struggle: the tensors are expected to be fully dynamic, but some dimensions are inferred to static values in the Stream `AnnotateDispatchAssumptions` pass. That leads to a `tensor.cast -> set_encoding -> tensor.cast` sequence in a dispatch, while we expect the bindings to have encoded tensor types. E.g.,

Input IR:

```mlir
%0 = iree_tensor_ext.dispatch_load ... tensor<?x?xi8>
%1 = set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #encoding>
iree_tensor_ext.dispatch_store %1, ... tensor<?x?xi8, #encoding> -> ... tensor<?x?xi8, #encoding>
```

After annotation:

```mlir
%0 = iree_tensor_ext.dispatch_load ... tensor<4x?xi8>
%cast = tensor.cast %0 : tensor<4x?xi8> to tensor<?x?xi8>
%1 = set_encoding %cast : tensor<?x?xi8> -> tensor<?x?xi8, #encoding>
%cast_0 = tensor.cast %1 : tensor<?x?xi8, #encoding> to tensor<4x5xi8>
iree_tensor_ext.dispatch_store %cast_0, ... tensor<4x5xi8> -> ... tensor<?x?xi8, #encoding>
```

It is hard to materialize the encodings when a cast op is present. Given that the original goal is testing dynamic shapes, modifying the input program is the easier fix. The issue was observed in #21441.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
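One way to keep test dimensions truly dynamic (so static shape inference cannot fold them back to constants) is to route the sizes through `util.optimization_barrier`. The snippet below is a hand-written sketch of that idea, not the actual change in the PR; the function and value names are made up for illustration:

```mlir
// Sketch only: the barriers hide %d0/%d1 from constant folding and
// static shape inference, so downstream passes keep both dims dynamic
// instead of refining them to values like 4x5.
func.func @keep_dynamic(%d0: index, %d1: index) -> tensor<?x?xi8> {
  %b0 = util.optimization_barrier %d0 : index
  %b1 = util.optimization_barrier %d1 : index
  %empty = tensor.empty(%b0, %b1) : tensor<?x?xi8>
  return %empty : tensor<?x?xi8>
}
```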
Here is the failing dispatch on the GPU side: https://gist.github.com/hanhanW/af6eb4b01ef2be2037e929be223a3fed I think we need to replace the readonly dest tensor with
Folks, I plan to land the PR by the end of the week. @MaheshRavishankar can you help review the PR?
I checked how DT fusion looks with SVE and unfortunately it currently fails; I'll take a deeper dive into the issues later. But as mentioned, that's WIP and obviously not a blocker. I'd be curious to see a performance chart with DT vs DT-fusion though, if you have that :)

Does it work with the original path? I thought that there are vscale ops, so it is not working.
The only perf data I have is from CI: https://github.com/iree-org/iree/actions/runs/18733272191?pr=21441

Without the patch:

With the patch:

The binary size drops a lot after the rebase; I'll update it.
@hanhanW

Data tiling starting from dispatch creation will use the late materialization path. Const-eval is not yet implemented for late materialization, but const-expr hoisting should still work.
I may start the const-eval work after I finish the memory footprint issue. They are not directly related, but both are important for data-tiling; otherwise, it may crash when data-tiling is enabled. Here is the initial breakdown for the const-eval work: https://discord.com/channels/689900678990135345/1428131175528136764 The intention was to have a broader discussion in the thread, but that has not happened. I will open an issue for it when I get cycles.
Rebase to trigger the new torch test suites; I'll update the golden values in a follow-up commit (within the PR).
We recently improved the CPU backends. The golden value for the clip model is 141 ms (old path) vs. 133 ms (new path), FYI.
I'll land this next Wednesday.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
I can't land this because shark10 did not pick up the job today... I think it is dead, and I have already reached out to the POC who can help with the machine issue.
…iree-org#21441) The revision updates the default behavior of data-tiling (i.e., the `--iree-opt-data-tiling` flag) to start from the **DispatchCreation** phase instead of the **GlobalOptimization** phase. The main benefit of the new path is that it enables more fusion opportunities and supports multiple devices. The path that starts from the global optimization phase will remain for a while; users can get back to the legacy path by using the `--iree-global-opt-data-tiling` flag. Feel free to try the new path and file issues. We've seen improvements in the SDXL clip model on the new path. We're happy to provide guidance if users have any questions about the new standard data-tiling optimization. Website Doc Preview: https://iree.dev/reference/optimization-options/#global-optimization-iree-global-optimization-opt-level ci-extra: test_torch --------- Signed-off-by: hanhanW <hanhan0912@gmail.com>


The revision updates the default behavior of data-tiling (i.e., the `--iree-opt-data-tiling` flag) to start from the DispatchCreation phase instead of the GlobalOptimization phase. The main benefit of the new path is that it enables more fusion opportunities and supports multiple devices.

The path that starts from the global optimization phase will remain for a while. Users can get back to the legacy path by using the `--iree-global-opt-data-tiling` flag.

Feel free to try the new path and file issues. We've seen improvements in the SDXL clip model on the new path. We're happy to provide guidance if users have any questions about the new standard data-tiling optimization.

Website Doc Preview: https://iree.dev/reference/optimization-options/#global-optimization-iree-global-optimization-opt-level

ci-extra: test_torch
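As a usage sketch of the two flags named above (file names are placeholders, and the exact interaction between the flags across versions is not spelled out here, so treat this as illustrative rather than authoritative):

```
# New default: data-tiling starts at the DispatchCreation phase.
iree-compile model.mlir -o model.vmfb --iree-opt-data-tiling

# Legacy path: start data-tiling from the GlobalOptimization phase.
iree-compile model.mlir -o model.vmfb --iree-global-opt-data-tiling
```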