Skip to content

[mlir][linalg] Fix UBSan division-by-zero in PackOp folding#186271

Merged
joker-eph merged 1 commit into
llvm:mainfrom
joker-eph:fix/issue-185352
Mar 25, 2026
Merged

[mlir][linalg] Fix UBSan division-by-zero in PackOp folding#186271
joker-eph merged 1 commit into
llvm:mainfrom
joker-eph:fix/issue-185352

Conversation

@joker-eph

@joker-eph joker-eph commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

When tensor-cast folding propagates a zero constant into a dynamic tile
size, FoldTensorCastPackOp would proceed to fold the pack into an
expand_shape using that zero tile — causing undefined behaviour (integer
division/modulo by zero, caught by UBSan as SIGFPE).

  1. Guard FoldTensorCastPackOp: bail out early if any of the resolved
    tile sizes is zero, preventing the invalid fold entirely.

  2. Restrict the hasZeros check in commonVerifierPackAndUnPackOp to
    only inspect Attribute operands (statically-known zeros), not dynamic
    Value operands. The verifier can only meaningfully reject zero tiles
    that are statically visible; dynamic zeros are an inherently runtime
    condition.

  3. Add assert(*constantTile \!= 0) guards in requirePaddingValue and
    requirePaddingValueStrict to document and enforce the precondition
    that static tile sizes passed to those helpers are non-zero.

Also rename existing negative test cases from no_fold_*/nofold_* to
negative_* for consistency with MLIR testing conventions, and add a
regression test that ensures a pack with a zero tile size is not folded
into tensor.expand_shape.

Fixes #185352

Assisted-by: Claude Code

@llvmbot

llvmbot commented Mar 12, 2026

Copy link
Copy Markdown
Member

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-linalg

Author: Mehdi Amini (joker-eph)

Changes

When tensor-cast folding propagates a zero constant into a dynamic tile size, FoldTensorCastPackOp would proceed to fold the pack into an expand_shape using that zero tile — causing undefined behaviour (integer division/modulo by zero, caught by UBSan as SIGFPE).

Two fixes:

  1. Guard FoldTensorCastPackOp: bail out early if any of the resolved tile sizes is zero, preventing the invalid fold entirely.

  2. Restrict the hasZeros check in commonVerifierPackAndUnPackOp to only inspect Attribute operands (statically-known zeros), not dynamic Value operands. The verifier can only meaningfully reject zero tiles that are statically visible; dynamic zeros are an inherently runtime condition.

Add a regression test that ensures a pack with a zero tile size is not folded into tensor.expand_shape.

Fixes #185352

Assisted-by: Claude Code


Full diff: https://github.com/llvm/llvm-project/pull/186271.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp (+5-1)
  • (modified) mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir (+18)
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index ad2909f656eea..f908827b255f2 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -5112,7 +5112,9 @@ static LogicalResult commonVerifierPackAndUnPackOp(OpTy packOrUnPack) {
 
   // Return true if we have a zero-value tile.
   auto hasZeros = [&](ArrayRef<OpFoldResult> tiles) {
-    return llvm::any_of(tiles, isZeroInteger);
+    return llvm::any_of(tiles, [](OpFoldResult tile) {
+      return isa<Attribute>(tile) && isZeroInteger(tile);
+    });
   };
 
   // Verify that the source and destination are ranked types.
@@ -5997,6 +5999,8 @@ struct FoldTensorCastPackOp : public OpRewritePattern<PackOp> {
     // Get the updated mixed-tile-sizes attribute.
     SmallVector<OpFoldResult> newMixedTileSizes =
         getNewMixedTileSizes(rewriter, newResultTypes[0], op.getMixedTiles());
+    if (llvm::any_of(newMixedTileSizes, isZeroInteger))
+      return failure();
 
     // Clone op.
     // TODO: Strictly speaking, discardable attributes should be _discarded_ at
diff --git a/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir b/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
index 6979770154bab..afd3a04a8996f 100644
--- a/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
+++ b/mlir/test/Dialect/Linalg/simplify-pack-unpack.mlir
@@ -134,6 +134,24 @@ func.func @pack_32x1_to_16x1x1x2(%arg0 : tensor<32x1xf32>) -> tensor<16x1x1x2xf3
 
 // -----
 
+// Zero tile size: pack must not be folded into expand_shape.
+// CHECK-LABEL: func.func @pack_zero_tile_not_folded
+// CHECK-NOT:     tensor.expand_shape
+// CHECK:         linalg.pack
+func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {
+  %pad_val = arith.constant 123 : i32
+  %tile_size = arith.constant 0 : index
+  %empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32>
+  %pack = linalg.pack %A
+    padding_value(%pad_val : i32)
+    inner_dims_pos = [0, 1]
+    inner_tiles = [%tile_size, 1]
+    into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32>
+  return %pack : tensor<1x16x?x1xi32>
+}
+
+// -----
+
 // CHECK-LABEL: func.func @unpack_1d_to_collapse
 // CHECK-SAME:    %[[ARG0:.+]]: tensor<8x32xf32>)
 // CHECK:         %[[COLLAPSED:.+]] = tensor.collapse_shape %[[ARG0]] {{\[}}[0, 1]] : tensor<8x32xf32> into tensor<256xf32>

@joker-eph

Copy link
Copy Markdown
Contributor Author

Ping linalg folks?

@joker-eph

Copy link
Copy Markdown
Contributor Author

Ping @banach-space and @llvm/pr-subscribers-mlir-linalg ?

@banach-space

Copy link
Copy Markdown
Contributor

Thanks for the PR, Mehdi, and sorry for the delay!

Fixes #185352

In the discussion for that issue I mentioned that I wasn't able to reproduce the issue. How can I reproduce it? Thanks!

Comment on lines +137 to +151
// Zero tile size: pack must not be folded into expand_shape.
// CHECK-LABEL: func.func @pack_zero_tile_not_folded
// CHECK-NOT: tensor.expand_shape
// CHECK: linalg.pack
func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {
%pad_val = arith.constant 123 : i32
%tile_size = arith.constant 0 : index
%empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32>
%pack = linalg.pack %A
padding_value(%pad_val : i32)
inner_dims_pos = [0, 1]
inner_tiles = [%tile_size, 1]
into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32>
return %pack : tensor<1x16x?x1xi32>
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM, this example fails verification (due to 0 tile size). This PR removes that verification, but then SimplifyPackToExpandShape bails out due to:

** Match Failure : expects no padding value

Out of two fixes in this PR:

Two fixes:

Guard FoldTensorCastPackOp: bail out early if any of the resolved tile sizes is zero, preventing the invalid fold entirely.

Restrict the hasZeros check in commonVerifierPackAndUnPackOp to only inspect Attribute operands (statically-known zeros), not dynamic Value operands. The verifier can only meaningfully reject zero tiles that are statically visible; dynamic zeros are an inherently runtime condition.

this seems to only "verify" the 2nd one, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry: I put the test in the wrong file. Locally I worked on the fix with the test in its own file, and running mlir-opt --canonicalize.

I just moved it to the canonicalize.mlir file, where it crashes without the folding fix.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the crash for me without UBSAN is a "floating point exception" with this trace (top only here):

0.      Program arguments: bin/mlir-opt --canonicalize c.mlir
 #0 0x00005699a6d528b8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (bin/mlir-opt+0x50808b8)
 #1 0x00005699a6d4f6d1 llvm::sys::RunSignalHandlers() (bin/mlir-opt+0x507d6d1)
 #2 0x00005699a6d53711 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007c5f41642520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00005699acc2a695 mlir::linalg::PackOp::requirePaddingValue(llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<mlir::OpFoldResult>) (bin/mlir-opt+0xaf58695)
 #5 0x00005699acc1352f mlir::linalg::PackOp::canonicalize(mlir::linalg::PackOp, mlir::PatternRewriter&) (bin/mlir-opt+0xaf4152f)

@joker-eph

Copy link
Copy Markdown
Contributor Author

In the discussion for that issue I mentioned that I wasn't able to reproduce the issue. How can I reproduce it? Thanks!

Build with clang and sanitizers? It reproduced for me.

@banach-space

Copy link
Copy Markdown
Contributor

In the discussion for that issue I mentioned that I wasn't able to reproduce the issue. How can I reproduce it? Thanks!

Build with clang and sanitizers? It reproduced for me.

Thanks, that's what I was missing.

The fix in the verifier makes sense, but I am not sure about FoldTensorCastPackOp. Isn't the underlying issue the fact that we end-up with dimensions that are 0 and patching FoldTensorCastPackOp merely hides the issue?

In fact, what's MLIR's opinion on dimensions == 0? I've never hit this myself, but I guess MLIR does have an opinion?

@joker-eph

Copy link
Copy Markdown
Contributor Author

0-dim is a thing for the tensor type, then it's a matter of dialect decision to accept it or not.

Isn't the underlying issue the fact that we end-up with dimensions that are 0 and patching FoldTensorCastPackOp merely hides the issue?

I'm not sure I follow: what would you do in the folding here?

Right now the problem is that the folder produced invalid IR, leading to the crash.

%2 = "linalg.pack"(%arg0, %1, %0) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 0, 1>}> : (tensor<7x16xi32>, tensor<1x16x0x1xi32>, i32) -> tensor<1x16x0x1xi32>

This op is rejected by the verifier, because of the static dimension being 0. Basically this is the check I'm modifying to only check the static case and not the dynamic one. We were folding the dynamic into static, which we need to stop doing to align with the verifier.

@banach-space

Copy link
Copy Markdown
Contributor

Right now the problem is that the folder produced invalid IR, leading to the crash.

%2 = "linalg.pack"(%arg0, %1, %0) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 0, 1>}> : (tensor<7x16xi32>, tensor<1x16x0x1xi32>, i32) -> tensor<1x16x0x1xi32>

This op is rejected by the verifier, because of the static dimension being 0. Basically this is the check I'm modifying to only check the static case and not the dynamic one.

+1

We were folding the dynamic into static, which we need to stop doing to align with the verifier.

This is a bit unclear. The proposed fix will only prevent folding when the size is effectively 0 (isZeroInteger), right? As opposed to folding any dynamic size into a static one? That feels arbitrary.

The current fold seems correct to me. It's replacing IR that's invalid (dynamic shapes mean that this can only be verified at runtime) with other invalid IR (with static shapes, so verification can kick in at compile time). Why not let PackOp::verify do its job post-folding? Wouldn't the verifier fail before hitting UB and wouldn't that be sufficient to fix the original issue.

Btw,

Add a regression test that ensures a pack with a zero tile size is not folded into tensor.expand_shape.

I don't quite see how tensor.expand_shape is involved here. I run your example and didn't see any relevant patterns being run.

@joker-eph

Copy link
Copy Markdown
Contributor Author

The current fold seems correct to me. It's replacing IR that's invalid (dynamic shapes mean that this can only be verified at runtime)

This is not invalid IR, this is undefined behavior. That's quite different...

Why not let PackOp::verify do its job post-folding?

It is not OK for any transformation to generate invalid IR from valid IR: that's always a bug. They need to bail out before doing so.

Wouldn't the verifier fail before hitting UB and wouldn't that be sufficient to fix the original issue.

The crash happens in the builder of the op: we never get to this point.

@banach-space

Copy link
Copy Markdown
Contributor

It is not OK for any transformation to generate invalid IR from valid IR: that's always a bug.

I know what you mean, but this particular case raises a wider question for me. The following Op is already invalid, but we don't let the verify to catch that (it could and does ATM, but by design MLIR verifiers should only check "locally"):

  %tile_size = arith.constant 0 : index
  %empty = tensor.empty(%tile_size) : tensor<1x16x?x1xi32>
  %pack = linalg.pack %A
    padding_value(%pad_val : i32)
    inner_dims_pos = [0, 1]
    inner_tiles = [%tile_size, 1]
    into %empty : tensor<7x16xi32> -> tensor<1x16x?x1xi32>

The folder turns that into something that the verifier can and indeed does reject:

    %2 = "tensor.empty"() : () -> tensor<1x16x0x1xi32>
    %3 = "tensor.cast"(%2) : (tensor<1x16x0x1xi32>) -> tensor<1x16x?x1xi32>
    %4 = "linalg.pack"(%arg0, %2, %0) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 0, 1>}> : (tensor<7x16xi32>, tensor<1x16x0x1xi32>, i32) -> tensor<1x16x0x1xi32>

So, in the context of the Op verifier, FoldTensorCastPackOp turns something "valid" into something "invalid", agreed. However, I feel that we are missing something at the framework level that would prevent/capture this sort of issues (e.g. runtime Op verification).

The approach that you are taking here is consistent with how MLIR has been approaching this and my comment is tangential.

The crash happens in the builder of the op: we never get to this point.

Hm, the bulider seems to run fine. It's PackOp::canonicalize() -> paddingIsNotNeeded() -> requirePaddingValue() that leads to UB, right? In fact, how about an assert in requirePaddingValue? That method effectively assumes that tile sizes are non-zero.


All in all LGTM % comments

  • Remove references to expand_shape - the relation is not clear to me.
  • [optional] Comments + assert re assumption on tile sizes in paddingIsNotNeeded().

Btw, IMO we are missing a wider discussion on 0-dim in Linalg. That's outside the scope here and I am only mentioning so that we refrain from special-casing for dim-0 if there are more issues like this.

Thanks for digging into this and proposing a fix!

Comment on lines +1454 to +1458
// Zero tile size: pack must not be folded into expand_shape.
// CHECK-LABEL: func.func @pack_zero_tile_not_folded
// CHECK-NOT: tensor.expand_shape
// CHECK: linalg.pack
func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Zero tile size: pack must not be folded into expand_shape.
// CHECK-LABEL: func.func @pack_zero_tile_not_folded
// CHECK-NOT: tensor.expand_shape
// CHECK: linalg.pack
func.func @pack_zero_tile_not_folded(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {
// Zero tile size: folding would create a pack op with one static tile equal to zero, i.e. invalid IR.
// CHECK-LABEL: func.func @pack_zero_tile_not_folded
// CHECK-NOT: tensor.expand_shape
// CHECK: linalg.pack
func.func @no_fold_pack_zero_tile(%A: tensor<7x16xi32>) -> tensor<1x16x?x1xi32> {

Also suggesting to update the test function name to use prefixes to mark negative cases: https://mlir.llvm.org/getting_started/TestingGuide/#test-naming-convention

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other one in this file are suffixed with negative, should I update them to be prefixed instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also other are prefixed no, like @no_infer_pack_shape: should it be @negative_infer_pack_shape?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some fixed, PTAL?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also other are prefixed no, like @no_infer_pack_shape: should it be @negative_infer_pack_shape?

All of the above would be fine, I only ask for consistency within a particular file. What you did is perfect, thanks for tidying that up!

@banach-space banach-space left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

Thanks!

@joker-eph

Copy link
Copy Markdown
Contributor Author

It is not OK for any transformation to generate invalid IR from valid IR: that's always a bug.

I know what you mean, but this particular case raises a wider question for me. The following Op is already invalid

I rather keep terminology strict: the op is not "invalid", it may have undefined behavior or be documented as having runtime checks with abort behavior, or anything you can imagine. But "invalid IR" is strictly "what we reject with the verifier".

However, I feel that we are missing something at the framework level that would prevent/capture this sort of issues (e.g. runtime Op verification).

Can you clarify what you mean by "this sort of issues"?

Note also that we do have support in MLIR for turning undefined behavior into "runtime checks" (think: UBSAN): https://mlir.llvm.org/docs/Passes/#-generate-runtime-verification ; any op can hook there with an interface.

Hm, the bulider seems to run fine. It's PackOp::canonicalize() -> paddingIsNotNeeded() -> requirePaddingValue() that leads to UB, right?

Oh right, I misread the backtrace and thought the builder was calling requirePaddingValue()...

In fact, how about an assert in requirePaddingValue? That method effectively assumes that tile sizes are non-zero.

  • Remove references to expand_shape - the relation is not clear to me.
  • [optional] Comments + assert re assumption on tile sizes in paddingIsNotNeeded().

Sure, will do.

Btw, IMO we are missing a wider discussion on 0-dim in Linalg. That's outside the scope here and I am only mentioning so that we refrain from special-casing for dim-0 if there are more issues like this.

Sure: that's a call Linalg needs to do on whether 0-dim should be well defined behavior: the alternative to my patch here is to relax the verifier further to accept 0-dim and update the client code to accept it.

@banach-space

Copy link
Copy Markdown
Contributor

I rather keep terminology strict: the op is not "invalid", it may have undefined behavior or be documented as having runtime checks with abort behavior, or anything you can imagine. But "invalid IR" is strictly "what we reject with the verifier".

👍🏻

Can you clarify what you mean by "this sort of issues"?

Note also that we do have support in MLIR for turning undefined behavior into "runtime checks" (think: UBSAN): https://mlir.llvm.org/docs/Passes/#-generate-runtime-verification ; any op can hook there with an interface.

I was referring to the Op "having UB behaviour" (using your terminology), which should be easy to capture, yet we don't do that. Runtime verification sounds like the way to go.

(...) the alternative to my patch here is to relax the verifier further to accept 0-dim and update the client code to accept it.

That alternative would allow us to avoid special-casing for 0-dim, but it would also imply that Linalg supports 0-dim. IMO we should avoid that, your current fix is fine with me.

@joker-eph joker-eph force-pushed the fix/issue-185352 branch 2 times, most recently from 2652705 to 75cf22a Compare March 25, 2026 12:46
@github-actions

github-actions Bot commented Mar 25, 2026

Copy link
Copy Markdown

🪟 Windows x64 Test Results

  • 3577 tests passed
  • 415 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions Bot commented Mar 25, 2026

Copy link
Copy Markdown

🐧 Linux x64 Test Results

  • 7722 tests passed
  • 603 tests skipped

✅ The build succeeded and all tests passed.

@joker-eph joker-eph enabled auto-merge (squash) March 25, 2026 15:06
When tensor-cast folding propagates a zero constant into a dynamic tile
size, `FoldTensorCastPackOp` would proceed to fold the pack into an
`expand_shape` using that zero tile — causing undefined behaviour (integer
division/modulo by zero, caught by UBSan as SIGFPE).

Changes:

1. Guard `FoldTensorCastPackOp`: bail out early if any of the resolved
   tile sizes is zero, preventing the invalid fold entirely.

2. Restrict the `hasZeros` check in `commonVerifierPackAndUnPackOp` to
   only inspect `Attribute` operands (statically-known zeros), not dynamic
   `Value` operands. The verifier can only meaningfully reject zero tiles
   that are statically visible; dynamic zeros are an inherently runtime
   condition.

3. Add `assert(*constantTile \!= 0)` guards in `requirePaddingValue` and
   `requirePaddingValueStrict` to document and enforce the precondition
   that static tile sizes passed to those helpers are non-zero.

Also rename existing negative test cases from `no_fold_*`/`nofold_*` to
`negative_*` for consistency with MLIR testing conventions, and add a
regression test that ensures a pack with a zero tile size is not folded
into `tensor.expand_shape`.

Fixes llvm#185352

Assisted-by: Claude Code
@joker-eph joker-eph merged commit 7398eb0 into llvm:main Mar 25, 2026
10 checks passed
Aadarsh-Keshri pushed a commit to Aadarsh-Keshri/llvm-project that referenced this pull request Mar 28, 2026
)

When tensor-cast folding propagates a zero constant into a dynamic tile
size, `FoldTensorCastPackOp` would proceed to fold the pack into an
`expand_shape` using that zero tile — causing undefined behaviour
(integer
division/modulo by zero, caught by UBSan as SIGFPE).

1. Guard `FoldTensorCastPackOp`: bail out early if any of the resolved
   tile sizes is zero, preventing the invalid fold entirely.

2. Restrict the `hasZeros` check in `commonVerifierPackAndUnPackOp` to
only inspect `Attribute` operands (statically-known zeros), not dynamic
`Value` operands. The verifier can only meaningfully reject zero tiles
   that are statically visible; dynamic zeros are an inherently runtime
   condition.

3. Add `assert(*constantTile \!= 0)` guards in `requirePaddingValue` and
   `requirePaddingValueStrict` to document and enforce the precondition
   that static tile sizes passed to those helpers are non-zero.

Also rename existing negative test cases from `no_fold_*`/`nofold_*` to
`negative_*` for consistency with MLIR testing conventions, and add a
regression test that ensures a pack with a zero tile size is not folded
into `tensor.expand_shape`.

Fixes llvm#185352

Assisted-by: Claude Code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MLIR][Linalg] UBSan division-by-zero in PackOp::requirePaddingValue when dynamic tile size is propagated as zero via SCCP

3 participants