First SIMD program

Question

I am learning SIMD programming and have chosen to write a program that sums up a randomly generated array of signed 8-bit integers. The program is written in aarch64 assembly, ran with QEMU userspace emulation.

The routine expects the address of the memory containing the numbers in x0 and the number of elements (number of bytes) in x1; it returns the total signed sum of all bytes in x0, assume that the running sum never overflows 64-bits.

.type sum_vector, %function
.set frame_size, 0x10
sum_vector:
    stp fp, lr, [sp, #-frame_size]!
    mov fp, sp

    // x2 is index into buffer; x3 is running total
    mov x2, xzr
    mov x3, xzr

.L_sum_vector_vec_loop:
    mov x4, x2
    add x4, x4, #16
    cmp x4, x1
    b.ge .L_sum_vector_vec_done

    ldr q0, [x0, x2]
    saddlv h0, v0.16b

    smov x4, v0.h[0]
    add x3, x3, x4
    add x2, x2, #16
    b .L_sum_vector_vec_loop
.L_sum_vector_vec_done:

.L_sum_vector_scalar_loop:
    cmp x2, x1
    b.ge .L_sum_vector_scalar_done

    ldrsb x4, [x0, x2]
    add x3, x3, x4

    add x2, x2, #1
    b .L_sum_vector_scalar_loop
.L_sum_vector_scalar_done:

    mov x0, x3
    ldp fp, lr, [sp], #frame_size
    ret

My understanding is that 'horizontal summation' is the preferred approach to summing arrays with SIMD, but I could not figure out how to make it work while avoiding overflow. Thus, my approach simply loads 16-bytes of memory in a loop (.L_sum_vector_vec_loop) and does a saddlv reduction, adding to a 64-bit total, with a later pass to handle any leftover values (.L_sum_vector_scalar_loop).

There are some optimizations I could do like unrolling the loop and using a padded-array to avoid the second loop, but I am hoping there is a more idiomatic or correct SIMD way of doing this. General comments on style, calling conventions, idiomaticity, and performance are also welcome.

@J_H Speaking of putting information in the correct place: the answer box is down there. ↓ — wizzwizz4
– wizzwizz4, Commented Apr 3 at 1:24

user555045 · Accepted Answer · 2026-04-01 15:02:00Z

My understanding is that 'horizontal summation' is the preferred approach to summing arrays with SIMD, but I could not figure out how to make it work while avoiding overflow.

Usually we call that vertical (operations between vectors), and your approach horizontal (operations across a vector).

Anyway, one approach is to make 16-bit sums. As you know, that can overflow. Instead of fixing the overflow problem in the inner loop (which you can also do but it'll cost more), limit how many iterations that loop can run for in one go (to 256 iterations at most), and put an outer loop around it that repeats the inner loop until all the data has been processed. That way you don't pay for extra widening in the part of the code the runs the most, extra widening is only triggered occasionally, once per iteration of the outer loop. Cascading this through a layer of 32-bit sums seems like overkill but you could do that, or you could just sum the 16-bit intermediate sums into the 64-bit running total.

You could use ldp to load more data at once, and correspondingly do more additions per iteration, that's unrolling but not the boring kind of unrolling where the loop body is simply duplicated.

For the loop you can use a different loop pattern, with only one branch per iteration (plus one extra rarely-taken branch). Just put the conditional branch at the bottom. To avoid trouble with what should be a zero-iteration loop, you can branch entirely around the loop in that case (the aforementioned rarely-taken branch).

An example of nested loops to avoid overflow in a related problem: stackoverflow.com/questions/54541129/… (that's AVX2 with intrinsics, and hsum from 8 to 64-bit is the most efficient thing for x86, and it's summing 0/1 values instead of arbitrary 8-bit, but still needs a nested loop, in that case 255 iters at most.) — Peter Cordes
– Peter Cordes, Commented Apr 2 at 11:09

Stack Exchange Network

First SIMD program

1 Answer 1

You must log in to answer this question.

Hot Network Questions

First SIMD program

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions