Closed
Relates to #2379
For contiguous ranges of simple types (1-, 2-, 4-, and 8-byte integers, and possibly 4- and 8-byte floats in fast mode), the following vector algorithm is possible (described here for SSE2 and an 8-bit type, but applicable to other element sizes and vector sizes):
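Broadcast the value to every lane of a vector register (_mm_set1 intrinsics, e.g. _mm_set1_epi8)
Compare each chunk of the range against it to obtain a per-element match mask (_mm_cmpeq_epi8 intrinsic)
Extract the mask as bits (_mm_movemask_epi8) and count the set bits (popcnt)
Accumulate this count into the running result.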
A hand-coded popcount would probably be inefficient, so this could be applied only starting with SSE4.2, for which we assume popcnt is available.
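The steps above can be sketched roughly as follows for the 8-bit case. This is only an illustration under stated assumptions, not a proposed implementation: the names count_bytes_sse2 and popcount16 are hypothetical, the compiler popcount builtins stand in for whatever popcnt intrinsic would actually be used, and the scalar tail loop is one of several ways to handle the last few elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics
#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h> // __popcnt
#endif

// Hypothetical helper: number of set bits in a movemask result.
// Whether this becomes a single popcnt instruction depends on the
// compiler and target options (assumption: popcnt is available).
inline unsigned popcount16(unsigned mask) {
#if defined(_MSC_VER) && !defined(__clang__)
    return __popcnt(mask);
#else
    return static_cast<unsigned>(__builtin_popcount(mask));
#endif
}

// Hypothetical name: counts the bytes equal to `value` in [first, first + size).
inline std::size_t count_bytes_sse2(const std::uint8_t* first, std::size_t size, std::uint8_t value) {
    std::size_t result = 0;
    std::size_t i = 0;

    // Step 1: broadcast the value to all 16 byte lanes.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));

    for (; i + 16 <= size; i += 16) {
        // Unaligned load, since the range has no alignment guarantee.
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first + i));
        // Step 2: each matching lane becomes 0xFF, each other lane 0x00.
        const __m128i eq = _mm_cmpeq_epi8(chunk, needle);
        // Step 3: one bit per lane, then count the set bits.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
        // Step 4: accumulate.
        result += popcount16(mask);
    }

    // Scalar tail for the remaining (fewer than 16) elements.
    for (; i < size; ++i) {
        result += (first[i] == value) ? 1u : 0u;
    }
    return result;
}
```

For wider elements the same shape applies with the matching compare intrinsic; for example, with _mm_cmpeq_epi16 followed by _mm_movemask_epi8, each matching element sets two mask bits, so the popcount would need to be halved.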