
PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 64-bit comparisons, so that we can include them in the WebAssembly SIMD standard.

This is a sister question to this one for ARMv7+NEON: What is the most efficient way to do SIMD unsigned 64 bit comparison (CMHS) on ARMv7 with NEON?

It is also related to these questions about signed comparison variants, already answered for SSE2 and NEON:

How to simulate pcmpgtq on sse2?

What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?

  • If you found a solution for signed numbers: you might invert the upper bits (bit 63) of both unsigned values and then perform a signed comparison. Commented Dec 24, 2020 at 20:23
  • Presumably you're going to want to design this primitive so it can use AVX512VL VPCMPUQ k1 {k2}, xmm2, xmm3/m128/m64bcst, imm8 (felixcloutier.com/x86/vpcmpq:vpcmpuq) if available. That compares into a mask register, not a vector. But if SSE4.2 is available, you'd xor to flip the signs of both inputs for pcmpgtq. It's obviously possible to implement with SSE2, and some HW has it fairly efficiently; does the level of inefficiency for SSE2 really determine whether you choose to include it or not in WebAssembly? Commented Dec 24, 2020 at 21:36
  • @PeterCordes is flipping bit 63 with xor (or adjusting both inputs with a subtraction) the only way? Or is there another comparison approach that works on unsigned numbers but is different from how we did it last time? Commented Dec 25, 2020 at 0:39
  • @njuffa I'm looking to see if there is an approach for SSE2 that wasn't merely an XOR to the inputs. Commented Dec 25, 2020 at 0:40
  • If you have pcmpgtq, then flipping the high bits of both is almost certainly best. With just SSE2, IDK, likely there is something more clever than adding more steps to feed your existing pcmpgtq emulation. My point was that a naive starting point like that is totally fine, as long as users of WebAssembly understand that it might not be efficient everywhere. (Unless there's zero hardware in the world that can do it in one instruction, not even AArch64 with or without SVE. And even if your model would require AVX512VL to get the result back into a vector, that could be optimized out.) Commented Dec 25, 2020 at 1:33

2 Answers


The carry-out from a subtraction is an unsigned comparison predicate.

We can capture the carry-out using the same trick that is used when averaging two numbers without overflow. Only here we'll base it on a half-subtractor instead of a half-adder.

__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    b = _mm_xor_si128(b, a); // diff = a ^ b
    a = _mm_and_si128(a, b); // borrow: a & (a ^ b) == a & ~b
    b = _mm_sub_epi64(_mm_srli_epi64(b, 1), a); // (diff >> 1) - borrow; MSB = carry out of b - a
    return _mm_shuffle_epi32(_mm_srai_epi32(b, 31), _MM_SHUFFLE(3,3,1,1)); // broadcast MSB to whole qword
}

Translated from Hacker's Delight:

static
__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a));
    r = _mm_or_si128(r, _mm_andnot_si128(b, a));
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3,3,1,1));
}

Concept: if the "signs" (unsigned MSBs) are mixed, return a; otherwise return b - a:

(MSB(a) ^ MSB(b)) ? a : b - a; // result in MSB

This makes sense:

  • if a's MSB is set and b's isn't, a is unsigned above (so MSB(a) is our result)
  • if b's MSB is set and a's isn't, a is unsigned below (so MSB(a) is our result)
  • if their MSBs are the same, the values are in the same half of the unsigned range, so b-a is effectively a 63-bit subtraction. The MSBs cancel, and the MSB of b-a equals the "borrow" output, which tells you whether a is strictly above b. (Like the CF flag for scalar sub; jb is jc.) So MSB(b-a) is our result.

Note that the SIMD andnot/and/or is a bit-blend, but we only care about the MSB. We broadcast it with srai -> shuffle_epi32, discarding the garbage in lower bits. (Or with SSE3, movshdup as described in @Soont's answer.)


It differs from signed compare:

(MSB(a) ^ MSB(b)) ? ~a : b - a; // result in MSB

If signs are mixed then the sign of ~a is also the sign of b, of course.


3 Comments

(a ^ b) ? in C is a logical test of the whole result for being non-zero. Using an _mm_xor result as a blend control will instead bit-blend, not select between either of two whole results. Is that what this notation means in Hacker's Delight? Anyway, are you sure that works, and you don't need to pcmpeqq (or equivalent) against zero first?
Oh right, you're broadcasting the sign bit after selecting, so you only care about blending the top bit. And your a^b condition is actually MSB(a) ^ MSB(b), not whether the whole numbers differ at any bit position (which is what a nonzero 64-bit xor would mean). This is not what the expression would mean in C.
Edited to include a more detailed explanation of how/why it works.

Here you go.

__m128i cmpgt_epu64_sse2( __m128i a, __m128i b )
{
    // Compare uint32_t lanes for a > b and a < b
    const __m128i signBits = _mm_set1_epi32( 0x80000000 );
    a = _mm_xor_si128( a, signBits );
    b = _mm_xor_si128( b, signBits );
    __m128i gt = _mm_cmpgt_epi32( a, b );
    __m128i lt = _mm_cmpgt_epi32( b, a );

    // It's too long to explain why, but the result we're after is equal to ( gt > lt ) for uint64_t lanes of these vectors.
    // Unlike the source numbers, lt and gt vectors contain a single bit of information per 32-bit lane.
    // This way it's much easier to compare them with SSE2.

    // Clear the highest bit to avoid overflows of _mm_sub_epi64.
    // _mm_srli_epi32 by any number of bits in [ 1 .. 31 ] would work too, only slightly slower.
    gt = _mm_andnot_si128( signBits, gt );
    lt = _mm_andnot_si128( signBits, lt );

    // Subtract 64-bit integers; we're after the sign bit of the result.
    // ( gt > lt ) is equal to extractSignBit( lt - gt )
    // The above is only true when ( lt - gt ) does not overflow, that's why we can't use it on the source numbers.
    __m128i res = _mm_sub_epi64( lt, gt );

    // Arithmetic shift to broadcast the sign bit into higher halves of uint64_t lanes
    res = _mm_srai_epi32( res, 31 );
    // Broadcast higher 32-bit lanes into the final result.
    return _mm_shuffle_epi32( res, _MM_SHUFFLE( 3, 3, 1, 1 ) );
}

Here’s a test app.

If SSE3 is available, movshdup is also a good option instead of the pshufd (_mm_shuffle_epi32) to copy the srai result to the low dword in each element. (Or optimize that away if the next use is movmskpd or something else which only depends on the high part of each qword).

For example, on Conroe/Merom (first gen Core 2, SSSE3 and most SIMD execution units are 128-bit wide but the shuffle unit has limitations), pshufd is 2 uops, 3 cycle latency (flt->int domain). movshdup is only 1 uop, 1 cycle latency because its hard-wired shuffle is only within each 64-bit half of a register. movshdup runs in the "SIMD-int" domain so it doesn't cause any extra bypass latency between integer shift and whatever integer thing you do next, unlike pshufd. (https://agner.org/optimize/)

If you're JITing, you'd only ever be using this on CPUs without SSE4.2, which means Intel before Nehalem and AMD before Bulldozer. Note that psubq (_mm_sub_epi64) is somewhat slower than the narrower psub variants on some of those CPUs, but it's still the best option.

For completeness, here's an SSSE3 version (not quite the same thing as SSE3) that saves a few instructions at the cost of a constant load. The only way to find out whether it's faster or slower is to test on old computers.

__m128i cmpgt_epu64_ssse3( __m128i a, __m128i b )
{
    // Compare uint32_t lanes for a > b and a < b
    const __m128i signBits = _mm_set1_epi32( 0x80000000 );
    a = _mm_xor_si128( a, signBits );
    b = _mm_xor_si128( b, signBits );
    __m128i gt = _mm_cmpgt_epi32( a, b );
    __m128i lt = _mm_cmpgt_epi32( b, a );

    // Shuffle bytes making two pairs of equal uint32_t values to compare.
    // Each uint32_t combines two bytes from lower and higher parts of the vectors.
    const __m128i shuffleIndices = _mm_setr_epi8(
        0, 4, -1, -1,
        0, 4, -1, -1,
        8, 12, -1, -1,
        8, 12, -1, -1 );
    gt = _mm_shuffle_epi8( gt, shuffleIndices );
    lt = _mm_shuffle_epi8( lt, shuffleIndices );
    // Make the result
    return _mm_cmpgt_epi32( gt, lt );
}

8 Comments

Is this just How to simulate pcmpgtq on sse2? with sign bits flipped? I guess not quite; you're flipping the sign bits of each 32-bit chunk, not 64. (The 64-bit signed compare would have to range-shift the low halves of each qword from unsigned to signed.) So there is some saving vs. flipping just the high bits and feeding a signed compare, which compilers might or might not have optimized away for us.
psrai res,31 => pshufd might be better than srli / psubq. pshufd is a copy-and-shuffle and you wouldn't need a zero to sub from. (OTOH, some old CPUs without SSE4.2 have slow shuffle units as described in Fastest way to do horizontal SSE vector sum (or other reduction) making pshufb slower. But some old CPUs also have slow psubq, IIRC; have to check on that.)
@PeterCordes Good idea, updated. pshufd is a 1-cycle latency instruction on all recent CPUs. That 32-bit arithmetic shift, psrad, is not; before Skylake and Ryzen it had 2-cycle latency. However, it still makes sense to use it, because what I wrote about psrad applies equally to psrlq, and psrad saves the pxor used by _mm_setzero_si128
Remember the real goal of this question is a JIT fallback when SSE4.2 is not available, otherwise you do 2x pxor / pcmpgtq. Given that WebAssembly is JITed, there's no chance of this being used on Sandybridge-family or Bulldozer / Zen. (Unless someone else uses it for ahead-of-time compilation without dynamic dispatching, or someone configures a VM badly to not advertise SSE4 to a guest...)
Also, for completeness, here’s SSSE3 version, might be slightly faster, I don’t know if that’s the case. gist.github.com/Const-me/5999328a743128a86f7f5a93d07f2463
