Summary
When the loop unroller creates parallel reduction phis (the
unroll-add-parallel-reductions path, introduced in #149470) for a reduction
that accumulates via subtraction (acc -= x), the final recombination of
the partial accumulators is incorrect. The combine chains the partial results
with the opcode of the recurrence kind — sub for RecurKind::Sub, fsub for
RecurKind::FSub — computing pN - (... - (p1 - p0)), which flips the sign of
every other partial result. With unroll_count(4) the result is
p3 - p2 + p1 - p0 instead of p0 + p1 + p2 + p3.
Each partial accumulator starts at the recurrence's identity value and applies
its subtractions independently, so each partial already carries the correct
sign; the partials must always be summed.
Add-form reductions (acc += x) over the same loop are recombined correctly.
Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO
and PGO: a hot hand-written NEON sub-form reduction (counting bytes >= 0x80,
the well-known Latin-1→UTF-8 length pattern
used e.g. by simdutf's arm64 utf8_length_from_latin1) gets unrolled with
parallel reduction phis (enabled for Apple CPUs by #149699) and silently
returns wrong lengths. Since callers commonly size buffers from such results,
this class of miscompile leads to memory-safety failures downstream.
Reproducer (no PGO/LTO required)
// clang++ -O2 repro.cpp -o repro && ./repro # any aarch64 target
#include <arm_neon.h>
#include <cstdio>
#include <cstring>
__attribute__((noinline))
size_t count_high(const uint8_t* data, size_t length) {
uint64_t result = 0;
const uint8_t* simd_end = data + (length / 16) * 16;
#pragma clang loop unroll_count(4)
for (; data < simd_end; data += 16) {
uint8x16_t input_vec = vld1q_u8(data);
uint8x16_t withhighbit = vcgeq_u8(input_vec, vdupq_n_u8(0x80));
result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
}
return result;
}
int main() {
static uint8_t buf[2048];
int fails = 0;
for (size_t pos = 0; pos < 256; pos += 16) {
memset(buf, 'a', sizeof buf);
buf[pos] = 0xB1; // one byte >= 0x80
size_t got = count_high(buf, sizeof buf);
if (got != 1) { fails++; printf("pos=%zu got=%zd want=1\n", pos, (ssize_t)got); }
}
printf(fails ? "MISCOMPILED (%d/16 positions wrong)\n" : "clean\n", fails);
}
Observed output (clang 23.0.0git, -O2, aarch64):
pos=0 got=-1 want=1 # 16-byte block 0 (block % 4 == 0 -> subtracted)
pos=32 got=-1 want=1 # block 2 (block % 4 == 2 -> subtracted)
...
MISCOMPILED (8/16 positions wrong)
Bytes whose 16-byte block index is ≡ 1 or 3 (mod 4) are counted correctly (+1);
blocks ≡ 0 or 2 (mod 4) are counted with the wrong sign (−1).
-mllvm -unroll-add-parallel-reductions=false produces correct code (the loop
still unrolls 4×, with a single serial accumulator). The same loop rewritten
add-form (result += vaddvq_u8(vshrq_n_u8(input_vec, 7));) is compiled
correctly under identical flags.
Mid-end IR (after loop-unroll + instcombine)
All four partial accumulators subtract in the same direction; the exit-block
combine alternates signs:
; loop body: 4 parallel accumulators, all "acc - sext(reduce.add(...))"
%29 = sub i64 %23, %28 ; p0
%35 = sub i64 %20, %34 ; p1
%41 = sub i64 %21, %40 ; p2
%47 = sub i64 %22, %46 ; p3
...
; exit block:
%52 = sub i64 %29, %35 ; p0 - p1
%53 = add i64 %52, %41 ; p0 - p1 + p2
%54 = sub i64 %47, %53 ; p3 - p0 + p1 - p2 <<< should be p0+p1+p2+p3
Code path
UnrollLoop() in llvm/lib/Transforms/Utils/LoopUnroll.cpp:
RecurKind RK = Reductions.begin()->second.getRecurrenceKind();
for (Instruction *RdxPart : drop_begin(PartialReductions)) {
RdxResult = Builder.CreateBinOp(
(Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK),
RdxPart, RdxResult, "bin.rdx");
}
canParallelizeReductionWhenUnrolling() only rejects AnyOf/Find/MinMax kinds,
so RecurKind::Sub (classifiable since #147026) flows into the parallel-phi
path, and RecurrenceDescriptor::getOpcode(RecurKind::Sub) returns
Instruction::Sub. RecurKind::FSub is reachable the same way via the FP
variant (#166630) under reassoc, with the same problem (confirmed by test).
Note: llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll currently
only covers add/fadd reductions, which is how this shipped untested.
Environment
- clang 23.0.0git (
llvmorg-23-init-10931-g20b6ec66); also reproduced with
earlier llvmorg-22-era snapshots
- Target: aarch64 (the reassociation bug is target-independent in the IR-level
unroller; AArch64/Apple CPUs are where TTI enables the feature organically)
- Trigger: any path that applies parallel reduction phis to a sub-form
reduction: #pragma clang loop unroll_count(N), or organic PGO-hot
unrolling on Apple CPUs
Summary
When the loop unroller creates parallel reduction phis (the
unroll-add-parallel-reductionspath, introduced in #149470) for a reductionthat accumulates via subtraction (
acc -= x), the final recombination ofthe partial accumulators is incorrect. The combine chains the partial results
with the opcode of the recurrence kind —
subforRecurKind::Sub,fsubforRecurKind::FSub— computingpN - (... - (p1 - p0)), which flips the sign ofevery other partial result. With
unroll_count(4)the result isp3 - p2 + p1 - p0instead ofp0 + p1 + p2 + p3.Each partial accumulator starts at the recurrence's identity value and applies
its subtractions independently, so each partial already carries the correct
sign; the partials must always be summed.
Add-form reductions (
acc += x) over the same loop are recombined correctly.Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO
and PGO: a hot hand-written NEON sub-form reduction (counting bytes >= 0x80,
the well-known Latin-1→UTF-8 length pattern
used e.g. by simdutf's arm64
utf8_length_from_latin1) gets unrolled withparallel reduction phis (enabled for Apple CPUs by #149699) and silently
returns wrong lengths. Since callers commonly size buffers from such results,
this class of miscompile leads to memory-safety failures downstream.
Reproducer (no PGO/LTO required)
Observed output (clang 23.0.0git,
-O2, aarch64):Bytes whose 16-byte block index is ≡ 1 or 3 (mod 4) are counted correctly (+1);
blocks ≡ 0 or 2 (mod 4) are counted with the wrong sign (−1).
-mllvm -unroll-add-parallel-reductions=falseproduces correct code (the loopstill unrolls 4×, with a single serial accumulator). The same loop rewritten
add-form (
result += vaddvq_u8(vshrq_n_u8(input_vec, 7));) is compiledcorrectly under identical flags.
Mid-end IR (after loop-unroll + instcombine)
All four partial accumulators subtract in the same direction; the exit-block
combine alternates signs:
Code path
UnrollLoop()inllvm/lib/Transforms/Utils/LoopUnroll.cpp:canParallelizeReductionWhenUnrolling()only rejects AnyOf/Find/MinMax kinds,so
RecurKind::Sub(classifiable since #147026) flows into the parallel-phipath, and
RecurrenceDescriptor::getOpcode(RecurKind::Sub)returnsInstruction::Sub.RecurKind::FSubis reachable the same way via the FPvariant (#166630) under reassoc, with the same problem (confirmed by test).
Note:
llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.llcurrentlyonly covers add/fadd reductions, which is how this shipped untested.
Environment
llvmorg-23-init-10931-g20b6ec66); also reproduced withearlier llvmorg-22-era snapshots
unroller; AArch64/Apple CPUs are where TTI enables the feature organically)
reduction:
#pragma clang loop unroll_count(N), or organic PGO-hotunrolling on Apple CPUs