I am learning SIMD programming and have chosen to write a program that sums a randomly generated array of signed 8-bit integers. The program is written in AArch64 assembly and run under QEMU user-mode emulation.
The routine expects the address of the buffer containing the numbers in x0 and the number of elements (equivalently, the number of bytes) in x1. It returns the signed sum of all bytes in x0. Assume that the running sum never overflows 64 bits.
.type sum_vector, %function
.set frame_size, 0x10
sum_vector:
stp fp, lr, [sp, #-frame_size]!
mov fp, sp
// x2 is index into buffer; x3 is running total
mov x2, xzr
mov x3, xzr
.L_sum_vector_vec_loop:
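add x4, x2, #16 // x4 = index just past the next 16-byte block
cmp x4, x1
b.hi .L_sum_vector_vec_done // fewer than 16 bytes remain (unsigned compare)
ldr q0, [x0, x2] // load 16 signed bytes
saddlv h0, v0.16b // widen and sum all 16 lanes into a 16-bit result
smov x4, v0.h[0] // sign-extend the 16-bit sum to 64 bits
add x3, x3, x4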
add x2, x2, #16
b .L_sum_vector_vec_loop
.L_sum_vector_vec_done:
.L_sum_vector_scalar_loop:
cmp x2, x1
b.hs .L_sum_vector_scalar_done // index and count are unsigned
ldrsb x4, [x0, x2] // load one byte, sign-extended to 64 bits
add x3, x3, x4
add x2, x2, #1
b .L_sum_vector_scalar_loop
.L_sum_vector_scalar_done:
mov x0, x3
ldp fp, lr, [sp], #frame_size
ret
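For checking the routine's output, a scalar C reference with the same contract could be used. This is only a sketch: the name sum_vector_ref is hypothetical, and it assumes the AAPCS64 mapping of the first pointer argument to x0, the count to x1, and the 64-bit return value to x0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference implementation (not part of the assembly above):
 * takes a pointer and a byte count, returns the signed 64-bit sum. */
int64_t sum_vector_ref(const int8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i]; /* int8_t is promoted with its sign preserved */
    return total;
}
```

Comparing this against the assembly on the same random buffer should expose any sign-extension or tail-handling mistakes.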
My understanding is that 'horizontal summation' is the preferred approach to summing arrays with SIMD, but I could not figure out how to make it work without risking overflow. My approach therefore loads 16 bytes per iteration (.L_sum_vector_vec_loop), reduces them with saddlv, and adds the result to a 64-bit total; a second loop (.L_sum_vector_scalar_loop) then handles any leftover bytes.
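The structure described above can be modelled in plain C, which also makes the overflow argument explicit: 16 signed bytes sum to at least 16 × (-128) = -2048 and at most 16 × 127 = 2032, so each per-block reduction fits in 16 bits. The sketch below is a scalar model of that structure rather than real SIMD code, and the names sum_blocks_model and BLOCK are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 }; /* bytes per 128-bit vector register */

/* Scalar model of the block-wise reduction (hypothetical name). */
int64_t sum_blocks_model(const int8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    int64_t total = 0;
    size_t i = 0;

    /* Models the vector loop: runs while a full 16-byte block remains. */
    for (; n - i >= BLOCK; i += BLOCK) {
        int16_t partial = 0; /* models the 16-bit saddlv result */
        for (size_t lane = 0; lane < BLOCK; lane++)
            partial = (int16_t)(partial + buf[i + lane]);
        total += partial; /* widened to 64 bits, as smov + add would do */
    }

    /* Models the scalar loop: at most 15 leftover bytes. */
    for (; i < n; i++)
        total += buf[i];

    return total;
}
```

Because the 16-bit partial sum is folded into the 64-bit total on every iteration, the per-block result never has a chance to accumulate past its range.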
There are some optimizations I could make, such as unrolling the loop and padding the array to avoid the second loop, but I am hoping there is a more idiomatic or correct SIMD way of doing this. General comments on style, calling conventions, idiomatic usage, and performance are also welcome.
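For the padded-array idea, one possible host-side setup is sketched below. It relies on the fact that zero bytes do not change the sum, so rounding the buffer up to a multiple of 16 lets the vector loop cover everything. The helpers padded_len and make_padded are hypothetical, and the sketch assumes n is small enough that rounding up does not wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK = 16 }; /* bytes per 128-bit vector register */

/* Round n up to the next multiple of BLOCK (hypothetical helper). */
size_t padded_len(size_t n)
{
    return (n + (BLOCK - 1)) / BLOCK * BLOCK;
}

/* Copy n bytes into a zero-filled buffer of padded_len(n) bytes.
 * Returns NULL on allocation failure; the caller frees the result. */
int8_t *make_padded(const int8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    size_t len = padded_len(n);
    int8_t *dst = calloc(len ? len : BLOCK, 1); /* never request 0 bytes */
    if (dst != NULL && n != 0)
        memcpy(dst, src, n);
    return dst;
}
```

The routine would then be called with padded_len(n) as its count, at the cost of an extra copy unless the data is generated directly into a padded buffer.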