[LoopUnroll] Parallel reduction phis miscompile sub-form reductions: partials recombined with alternating signs

## Summary

When the loop unroller creates parallel reduction phis (the
`unroll-add-parallel-reductions` path, introduced in #149470) for a reduction
that accumulates via **subtraction** (`acc -= x`), the final recombination of
the partial accumulators is incorrect. The combine chains the partial results
with the opcode of the recurrence kind — `sub` for `RecurKind::Sub`, `fsub` for
`RecurKind::FSub` — computing `pN - (... - (p1 - p0))`, which flips the sign of
every other partial result. With `unroll_count(4)` the result is
`p3 - p2 + p1 - p0` instead of `p0 + p1 + p2 + p3`.

Each partial accumulator starts at the recurrence's identity value and applies
its subtractions independently, so each partial already carries the correct
sign; the partials must always be summed.

Add-form reductions (`acc += x`) over the same loop are recombined correctly.

Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO
and PGO: a hot hand-written NEON sub-form reduction (counting bytes >= 0x80,
the [well-known Latin-1→UTF-8 length pattern](https://lemire.me/blog/2023/05/15/computing-the-utf-8-size-of-a-latin-1-string-quickly-arm-neon-edition/)
used e.g. by simdutf's arm64 `utf8_length_from_latin1`) gets unrolled with
parallel reduction phis (enabled for Apple CPUs by #149699) and silently
returns wrong lengths. Since callers commonly size buffers from such results,
this class of miscompile leads to memory-safety failures downstream.

## Reproducer (no PGO/LTO required)

```cpp
// clang++ -O2 repro.cpp -o repro && ./repro   # any aarch64 target
#include <arm_neon.h>
#include <cstdio>
#include <cstring>
__attribute__((noinline))
size_t count_high(const uint8_t* data, size_t length) {
  uint64_t result = 0;
  const uint8_t* simd_end = data + (length / 16) * 16;
#pragma clang loop unroll_count(4)
  for (; data < simd_end; data += 16) {
    uint8x16_t input_vec = vld1q_u8(data);
    uint8x16_t withhighbit = vcgeq_u8(input_vec, vdupq_n_u8(0x80));
    result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
  }
  return result;
}
int main() {
  static uint8_t buf[2048];
  int fails = 0;
  for (size_t pos = 0; pos < 256; pos += 16) {
    memset(buf, 'a', sizeof buf);
    buf[pos] = 0xB1;                       // one byte >= 0x80
    size_t got = count_high(buf, sizeof buf);
    if (got != 1) { fails++; printf("pos=%zu got=%zd want=1\n", pos, (ssize_t)got); }
  }
  printf(fails ? "MISCOMPILED (%d/16 positions wrong)\n" : "clean\n", fails);
}
```

Observed output (clang 23.0.0git, `-O2`, aarch64):

```
pos=0 got=-1 want=1      # 16-byte block 0  (block % 4 == 0 -> subtracted)
pos=32 got=-1 want=1     # block 2          (block % 4 == 2 -> subtracted)
...
MISCOMPILED (8/16 positions wrong)
```

Bytes whose 16-byte block index is ≡ 1 or 3 (mod 4) are counted correctly (+1);
blocks ≡ 0 or 2 (mod 4) are counted with the wrong sign (−1).

`-mllvm -unroll-add-parallel-reductions=false` produces correct code (the loop
still unrolls 4×, with a single serial accumulator). The same loop rewritten
add-form (`result += vaddvq_u8(vshrq_n_u8(input_vec, 7));`) is compiled
correctly under identical flags.

## Mid-end IR (after loop-unroll + instcombine)

All four partial accumulators subtract in the same direction; the exit-block
combine alternates signs:

```llvm
; loop body: 4 parallel accumulators, all "acc - sext(reduce.add(...))"
%29 = sub i64 %23, %28      ; p0
%35 = sub i64 %20, %34      ; p1
%41 = sub i64 %21, %40      ; p2
%47 = sub i64 %22, %46      ; p3
...
; exit block:
%52 = sub i64 %29, %35      ; p0 - p1
%53 = add i64 %52, %41      ; p0 - p1 + p2
%54 = sub i64 %47, %53      ; p3 - p0 + p1 - p2   <<< should be p0+p1+p2+p3
```

## Code path

`UnrollLoop()` in `llvm/lib/Transforms/Utils/LoopUnroll.cpp`:

```cpp
RecurKind RK = Reductions.begin()->second.getRecurrenceKind();
for (Instruction *RdxPart : drop_begin(PartialReductions)) {
  RdxResult = Builder.CreateBinOp(
      (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK),
      RdxPart, RdxResult, "bin.rdx");
}
```

`canParallelizeReductionWhenUnrolling()` only rejects AnyOf/Find/MinMax kinds,
so `RecurKind::Sub` (classifiable since #147026) flows into the parallel-phi
path, and `RecurrenceDescriptor::getOpcode(RecurKind::Sub)` returns
`Instruction::Sub`. `RecurKind::FSub` is reachable the same way via the FP
variant (#166630) under reassoc, with the same problem (confirmed by test).

Note: `llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll` currently
only covers add/fadd reductions, which is how this shipped untested.

## Environment

- clang 23.0.0git (`llvmorg-23-init-10931-g20b6ec66`); also reproduced with
  earlier llvmorg-22-era snapshots
- Target: aarch64 (the reassociation bug is target-independent in the IR-level
  unroller; AArch64/Apple CPUs are where TTI enables the feature organically)
- Trigger: any path that applies parallel reduction phis to a sub-form
  reduction: `#pragma clang loop unroll_count(N)`, or organic PGO-hot
  unrolling on Apple CPUs


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LoopUnroll] Parallel reduction phis miscompile sub-form reductions: partials recombined with alternating signs #201065

Summary

Reproducer (no PGO/LTO required)

Mid-end IR (after loop-unroll + instcombine)

Code path

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[LoopUnroll] Parallel reduction phis miscompile sub-form reductions: partials recombined with alternating signs #201065

Description

Summary

Reproducer (no PGO/LTO required)

Mid-end IR (after loop-unroll + instcombine)

Code path

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions