Skip to content

[LoopUnroll] Parallel reduction phis miscompile sub-form reductions: partials recombined with alternating signs #201065

@MarshallOfSound

Description

@MarshallOfSound

Summary

When the loop unroller creates parallel reduction phis (the
unroll-add-parallel-reductions path, introduced in #149470) for a reduction
that accumulates via subtraction (acc -= x), the final recombination of
the partial accumulators is incorrect. The combine chains the partial results
with the opcode of the recurrence kind — sub for RecurKind::Sub, fsub for
RecurKind::FSub — computing pN - (... - (p1 - p0)), which flips the sign of
every other partial result. With unroll_count(4) the result is
p3 - p2 + p1 - p0 instead of p0 + p1 + p2 + p3.

Each partial accumulator starts at the recurrence's identity value and applies
its subtractions independently, so each partial already carries the correct
sign; the partials must always be summed.

Add-form reductions (acc += x) over the same loop are recombined correctly.

Observed in the wild on AArch64 (Apple Silicon) release builds using ThinLTO
and PGO: a hot hand-written NEON sub-form reduction (counting bytes >= 0x80,
the well-known Latin-1→UTF-8 length pattern
used e.g. by simdutf's arm64 utf8_length_from_latin1) gets unrolled with
parallel reduction phis (enabled for Apple CPUs by #149699) and silently
returns wrong lengths. Since callers commonly size buffers from such results,
this class of miscompile leads to memory-safety failures downstream.

Reproducer (no PGO/LTO required)

// clang++ -O2 repro.cpp -o repro && ./repro   # any aarch64 target
#include <arm_neon.h>
#include <cstdio>
#include <cstring>
__attribute__((noinline))
size_t count_high(const uint8_t* data, size_t length) {
  uint64_t result = 0;
  const uint8_t* simd_end = data + (length / 16) * 16;
#pragma clang loop unroll_count(4)
  for (; data < simd_end; data += 16) {
    uint8x16_t input_vec = vld1q_u8(data);
    uint8x16_t withhighbit = vcgeq_u8(input_vec, vdupq_n_u8(0x80));
    result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
  }
  return result;
}
int main() {
  static uint8_t buf[2048];
  int fails = 0;
  for (size_t pos = 0; pos < 256; pos += 16) {
    memset(buf, 'a', sizeof buf);
    buf[pos] = 0xB1;                       // one byte >= 0x80
    size_t got = count_high(buf, sizeof buf);
    if (got != 1) { fails++; printf("pos=%zu got=%zd want=1\n", pos, (ssize_t)got); }
  }
  printf(fails ? "MISCOMPILED (%d/16 positions wrong)\n" : "clean\n", fails);
}

Observed output (clang 23.0.0git, -O2, aarch64):

pos=0 got=-1 want=1      # 16-byte block 0  (block % 4 == 0 -> subtracted)
pos=32 got=-1 want=1     # block 2          (block % 4 == 2 -> subtracted)
...
MISCOMPILED (8/16 positions wrong)

Bytes whose 16-byte block index is ≡ 1 or 3 (mod 4) are counted correctly (+1);
blocks ≡ 0 or 2 (mod 4) are counted with the wrong sign (−1).

-mllvm -unroll-add-parallel-reductions=false produces correct code (the loop
still unrolls 4×, with a single serial accumulator). The same loop rewritten
add-form (result += vaddvq_u8(vshrq_n_u8(input_vec, 7));) is compiled
correctly under identical flags.

Mid-end IR (after loop-unroll + instcombine)

All four partial accumulators subtract in the same direction; the exit-block
combine alternates signs:

; loop body: 4 parallel accumulators, all "acc - sext(reduce.add(...))"
%29 = sub i64 %23, %28      ; p0
%35 = sub i64 %20, %34      ; p1
%41 = sub i64 %21, %40      ; p2
%47 = sub i64 %22, %46      ; p3
...
; exit block:
%52 = sub i64 %29, %35      ; p0 - p1
%53 = add i64 %52, %41      ; p0 - p1 + p2
%54 = sub i64 %47, %53      ; p3 - p0 + p1 - p2   <<< should be p0+p1+p2+p3

Code path

UnrollLoop() in llvm/lib/Transforms/Utils/LoopUnroll.cpp:

RecurKind RK = Reductions.begin()->second.getRecurrenceKind();
for (Instruction *RdxPart : drop_begin(PartialReductions)) {
  RdxResult = Builder.CreateBinOp(
      (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK),
      RdxPart, RdxResult, "bin.rdx");
}

canParallelizeReductionWhenUnrolling() only rejects AnyOf/Find/MinMax kinds,
so RecurKind::Sub (classifiable since #147026) flows into the parallel-phi
path, and RecurrenceDescriptor::getOpcode(RecurKind::Sub) returns
Instruction::Sub. RecurKind::FSub is reachable the same way via the FP
variant (#166630) under reassoc, with the same problem (confirmed by test).

Note: llvm/test/Transforms/LoopUnroll/runtime-unroll-reductions.ll currently
only covers add/fadd reductions, which is how this shipped untested.

Environment

  • clang 23.0.0git (llvmorg-23-init-10931-g20b6ec66); also reproduced with
    earlier llvmorg-22-era snapshots
  • Target: aarch64 (the reassociation bug is target-independent in the IR-level
    unroller; AArch64/Apple CPUs are where TTI enables the feature organically)
  • Trigger: any path that applies parallel reduction phis to a sub-form
    reduction: #pragma clang loop unroll_count(N), or organic PGO-hot
    unrolling on Apple CPUs

Metadata

Metadata

Assignees

No one assigned

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions