[mlir][linalg] Fix UBSan division-by-zero in PackOp folding#186271
Conversation
|
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-linalg Author: Mehdi Amini (joker-eph) ChangesWhen tensor-cast folding propagates a zero constant into a dynamic tile size, Two fixes:
Add a regression test that ensures a pack with a zero tile size is not folded into Fixes #185352 Assisted-by: Claude Code Full diff: https://github.com/llvm/llvm-project/pull/186271.diff 2 Files Affected:
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index ad2909f656eea..f908827b255f2 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -5112,7 +5112,9 @@ static LogicalResult commonVerifierPackAndUnPackOp(OpTy packOrUnPack) {
// Return true if we have a zero-value tile.
auto hasZeros = [&](ArrayRef<OpFoldResult> tiles) {
- return llvm::any_of(tiles, isZeroInteger);
+ return llvm::any_of(tiles, [](OpFoldResult tile) {
+ return isa<Attribute>(tile) && isZeroInteger(tile);
+ });
};
// Verify that the source and destination are ranked types.
@@ -5997,6 +5999,8 @@ struct FoldTensorCastPackOp : public OpRewritePattern<PackOp> {
// Get the updated mixed-tile-sizes attribute.
SmallVector<OpFoldResult> newMixedTileSizes =
getNewMixedTileSizes(rewriter, newResultTypes[0], op.getMixedTiles());
+ if (llvm::any_of(newMixedTileSizes, isZeroInteger))
+ return failure();
// Clone op.
// TODO: Strictly speaking, discardable attributes should be _discarded_ at
diff --git a/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir b/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
index 6979770154bab..afd3a04a8996f 100644
--- a/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
+++ b/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
@@ -134,6 +134,24 @@ func.func @pack_32x1_to_16x1x1x2(%arg0 : tensor<32x1xf32>) -> tensor<16x1x1x2xf3
// -----
+// Zero tile size: pack must not be folded into expand_shape.
+// CHECK-LABEL: func.func @pack_zero_tile_not_folded
+// CHECK-NOT: tensor.expand_shape
+// CHECK: linalg.pack
+func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {
+ %pad_val = arith.constant 123 : i32
+ %tile_size = arith.constant 0 : index
+ %empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32>
+ %pack = linalg.pack %A
+ padding_value(%pad_val : i32)
+ inner_dims_pos = [0, 1]
+ inner_tiles = [%tile_size, 1]
+ into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32>
+ return %pack : tensor<1x16x?x1xi32>
+}
+
+// -----
+
// CHECK-LABEL: func.func @unpack_1d_to_collapse
// CHECK-SAME: %[[ARG0:.+]]: tensor<8x32xf32>)
// CHECK: %[[COLLAPSED:.+]] = tensor.collapse_shape %[[ARG0]] {{\[}}[0, 1]] : tensor<8x32xf32> into tensor<256xf32>
|
|
Ping linalg folks? |
|
Ping @banach-space and @llvm/pr-subscribers-mlir-linalg ? |
|
Thanks for the PR, Mehdi, and sorry for the delay!
In the discussion for that issue I mentioned that I wasn't able to reproduce the issue. How can I reproduce it? Thanks! |
| // Zero tile size: pack must not be folded into expand_shape. | ||
| // CHECK-LABEL: func.func @pack_zero_tile_not_folded | ||
| // CHECK-NOT: tensor.expand_shape | ||
| // CHECK: linalg.pack | ||
| func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> { | ||
| %pad_val = arith.constant 123 : i32 | ||
| %tile_size = arith.constant 0 : index | ||
| %empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32> | ||
| %pack = linalg.pack %A | ||
| padding_value(%pad_val : i32) | ||
| inner_dims_pos = [0, 1] | ||
| inner_tiles = [%tile_size, 1] | ||
| into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32> | ||
| return %pack : tensor<1x16x?x1xi32> | ||
| } |
There was a problem hiding this comment.
ATM, this example fails verification (due to 0 tile size). This PR removes that verification, but then SimplifyPackToExpandShape bails out due to:
** Match Failure : expects no padding value
Out of two fixes in this PR:
Two fixes:
Guard FoldTensorCastPackOp: bail out early if any of the resolved tile sizes is zero, preventing the invalid fold entirely.
Restrict the hasZeros check in commonVerifierPackAndUnPackOp to only inspect Attribute operands (statically-known zeros), not dynamic Value operands. The verifier can only meaningfully reject zero tiles that are statically visible; dynamic zeros are an inherently runtime condition.
this seems to only "verify" the 2nd one, no?
There was a problem hiding this comment.
Sorry: I put the test in the wrong file. Locally I worked on the fix with the test in its own file, and running mlir-opt --canonicalize.
I just moved it to the canonicalize.mlir file, where it crashes without the folding fix.
There was a problem hiding this comment.
By the crash for me without UBSAN is a "floating point exception" with this trace (top only here):
0. Program arguments: bin/mlir-opt --canonicalize c.mlir
#0 0x00005699a6d528b8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (bin/mlir-opt+0x50808b8)
#1 0x00005699a6d4f6d1 llvm::sys::RunSignalHandlers() (bin/mlir-opt+0x507d6d1)
#2 0x00005699a6d53711 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
#3 0x00007c5f41642520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00005699acc2a695 mlir::linalg::PackOp::requirePaddingValue(llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<mlir::OpFoldResult>) (bin/mlir-opt+0xaf58695)
#5 0x00005699acc1352f mlir::linalg::PackOp::canonicalize(mlir::linalg::PackOp, mlir::PatternRewriter&) (bin/mlir-opt+0xaf4152f)
Build with clang and sanitizers? It reproduced for me. |
9f6a6f3 to
1e955bf
Compare
Thanks, that's what I was missing. The fix in the verifier makes sense, but I am not sure about In fact, what's MLIR's opinion on dimensions == 0? I've never hit this myself, but I guess MLIR does have an opinion? |
|
0-dim is a thing for the tensor type, then it's a matter of dialect decision to accept it or not.
I'm not sure I follow: what would you do in the folding here? Right now the problem is that the folder produced invalid IR, leading to the crash. This op is rejected by the verifier, because of the static dimension being 0. Basically this is the check I'm modifying to only check the static case and not the dynamic one. We were folding the dynamic into static, which we need to stop doing to align with the verifier. |
+1
This is a bit unclear. The proposed fix will only prevent folding when the size is effectively The current fold seems correct to me. It's replacing IR that's invalid (dynamic shapes mean that this can only be verified at runtime) with other invalid IR (with static shapes, so verification can kick in at compile time). Why not let Btw,
I don't quite see how |
This is not invalid IR, this is undefined behavior. That's quite different...
It is not OK for any transformation to generate invalid IR from valid IR: that's always a bug. They need to bail out before doing so.
The crash happens in the builder of the op: we never get to this point. |
I know what you mean, but this particular case raises a wider question for me. The following Op is already invalid, but we don't let the verify to catch that (it could and does ATM, but by design MLIR verifiers should only check "locally"): %tile_size = arith.constant 0 : index
%empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32>
%pack = linalg.pack %A
padding_value(%pad_val : i32)
inner_dims_pos = [0, 1]
inner_tiles = [%tile_size, 1]
into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32>The folder turns that into something that the verifier can and indeed does reject: %2 = "tensor.empty"() : () -> tensor<1x16x0x1xi32>
%3 = "tensor.cast"(%2) : (tensor<1x16x0x1xi32>) -> tensor<1x16x?x1xi32>
%4 = "linalg.pack"(%arg0, %2, %0) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 0, 1>}> : (tensor<7x16xi32>, tensor<1x16x0x1xi32>, i32) -> tensor<1x16x0x1xi32>So, in the context of the Op verifier, The approach that you are taking here is consistent with how MLIR has been approaching this and my comment is tangential.
Hm, the bulider seems to run fine. It's All in all LGTM % comments
Btw, IMO we are missing a wider discussion on 0-dim in Linalg. That's outside the scope here and I am only mentioning so that we refrain from special-casing for dim-0 if there are more issues like this. Thanks for digging into this and proposing a fix! |
| // Zero tile size: pack must not be folded into expand_shape. | ||
| // CHECK-LABEL: func.func @pack_zero_tile_not_folded | ||
| // CHECK-NOT: tensor.expand_shape | ||
| // CHECK: linalg.pack | ||
| func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> { |
There was a problem hiding this comment.
| // Zero tile size: pack must not be folded into expand_shape. | |
| // CHECK-LABEL: func.func @pack_zero_tile_not_folded | |
| // CHECK-NOT: tensor.expand_shape | |
| // CHECK: linalg.pack | |
| func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> { | |
| // Zero tile size: folding would create a pack op with one static tile equal to zero, i.e. invalid IR. | |
| // CHECK-LABEL: func.func @pack_zero_tile_not_folded | |
| // CHECK-NOT: tensor.expand_shape | |
| // CHECK: linalg.pack | |
| func.func @no_fold_pack_zero_tile(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> { |
Also suggesting to update the test function name to use prefixes to mark negative cases: https://mlir.llvm.org/getting_started/TestingGuide/#test-naming-convention
There was a problem hiding this comment.
The other one in this file are suffixed with negative, should I update them to be prefixed instead?
There was a problem hiding this comment.
Also other are prefixed no, like @no_infer_pack_shape: should it be @negative_infer_pack_shape?
There was a problem hiding this comment.
I did some fixed, PTAL?
There was a problem hiding this comment.
Also other are prefixed no, like @no_infer_pack_shape: should it be @negative_infer_pack_shape?
All of the above would be fine, I only ask for consistency within a particular file. What you did is perfect, thanks for tidying that up!
banach-space
left a comment
There was a problem hiding this comment.
LGTM % comments
Thanks!
I rather keep terminology strict: the op is not "invalid", it may have undefined behavior or be documented as having runtime checks with abort behavior, or anything you can imagine. But "invalid IR" is strictly "what we reject with the verifier".
Can you clarify what you mean by "this sort of issues"? Note also that we do have support in MLIR for turning undefined behavior into "runtime checks" (think: UBSAN): https://mlir.llvm.org/docs/Passes/#-generate-runtime-verification ; any op can hook there with an interface.
Oh right, I misread the backtrace and thought the builder was calling
Sure, will do.
Sure: that's a call Linalg needs to do on whether 0-dim should be well defined behavior: the alternative to my patch here is to relax the verifier further to accept 0-dim and update the client code to accept it. |
👍🏻
I was referring to the Op "having UB behaviour" (using your terminology), which should be easy to capture, yet we don't do that. Runtime verification sounds like the way to go.
That alternative would allow us to avoid special-casing for 0-dim, but it would also imply that Linalg supports 0-dim. IMO we should avoid that, your current fix is fine with me. |
2652705 to
75cf22a
Compare
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed. |
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
75cf22a to
595f48e
Compare
When tensor-cast folding propagates a zero constant into a dynamic tile size, `FoldTensorCastPackOp` would proceed to fold the pack into an `expand_shape` using that zero tile — causing undefined behaviour (integer division/modulo by zero, caught by UBSan as SIGFPE). Changes: 1. Guard `FoldTensorCastPackOp`: bail out early if any of the resolved tile sizes is zero, preventing the invalid fold entirely. 2. Restrict the `hasZeros` check in `commonVerifierPackAndUnPackOp` to only inspect `Attribute` operands (statically-known zeros), not dynamic `Value` operands. The verifier can only meaningfully reject zero tiles that are statically visible; dynamic zeros are an inherently runtime condition. 3. Add `assert(*constantTile \!= 0)` guards in `requirePaddingValue` and `requirePaddingValueStrict` to document and enforce the precondition that static tile sizes passed to those helpers are non-zero. Also rename existing negative test cases from `no_fold_*`/`nofold_*` to `negative_*` for consistency with MLIR testing conventions, and add a regression test that ensures a pack with a zero tile size is not folded into `tensor.expand_shape`. Fixes llvm#185352 Assisted-by: Claude Code
595f48e to
2004be7
Compare
) When tensor-cast folding propagates a zero constant into a dynamic tile size, `FoldTensorCastPackOp` would proceed to fold the pack into an `expand_shape` using that zero tile — causing undefined behaviour (integer division/modulo by zero, caught by UBSan as SIGFPE). 1. Guard `FoldTensorCastPackOp`: bail out early if any of the resolved tile sizes is zero, preventing the invalid fold entirely. 2. Restrict the `hasZeros` check in `commonVerifierPackAndUnPackOp` to only inspect `Attribute` operands (statically-known zeros), not dynamic `Value` operands. The verifier can only meaningfully reject zero tiles that are statically visible; dynamic zeros are an inherently runtime condition. 3. Add `assert(*constantTile \!= 0)` guards in `requirePaddingValue` and `requirePaddingValueStrict` to document and enforce the precondition that static tile sizes passed to those helpers are non-zero. Also rename existing negative test cases from `no_fold_*`/`nofold_*` to `negative_*` for consistency with MLIR testing conventions, and add a regression test that ensures a pack with a zero tile size is not folded into `tensor.expand_shape`. Fixes llvm#185352 Assisted-by: Claude Code
When tensor-cast folding propagates a zero constant into a dynamic tile
size,
FoldTensorCastPackOpwould proceed to fold the pack into anexpand_shapeusing that zero tile — causing undefined behaviour (integerdivision/modulo by zero, caught by UBSan as SIGFPE).
Guard
FoldTensorCastPackOp: bail out early if any of the resolvedtile sizes is zero, preventing the invalid fold entirely.
Restrict the
hasZeroscheck incommonVerifierPackAndUnPackOptoonly inspect
Attributeoperands (statically-known zeros), not dynamicValueoperands. The verifier can only meaningfully reject zero tilesthat are statically visible; dynamic zeros are an inherently runtime
condition.
Add
assert(*constantTile \!= 0)guards inrequirePaddingValueandrequirePaddingValueStrictto document and enforce the preconditionthat static tile sizes passed to those helpers are non-zero.
Also rename existing negative test cases from
no_fold_*/nofold_*tonegative_*for consistency with MLIR testing conventions, and add aregression test that ensures a pack with a zero tile size is not folded
into
tensor.expand_shape.Fixes #185352
Assisted-by: Claude Code