-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Closed
Labels
Description
With contiguous memory range and the default predicate (a flavor of less) or no predicate it can be vectorized.
SSE4.1 max_element algorithm.
Vertical part:
_mm_cmpgt_epito compute mask of greater value of current against previous_mm_add_epito count indices of the current value (add precomputed vector_mm_set1_epiwith 1)_mm_blendv_epi8to select indices of greater value into_Cur_idx_max_mm_max_epito update previous maximum, or alternatively can use_mm_blendv_epi8again (for 64-bit where there's no max)- loop until the end or until some number to make sure accumulated
_Cur_idx_maxdoes not overflow
Horizontal part:
- out of the vector, select greatest value by shuffling and comparing with itself with
_mm_max_epi - find mask of greatest value by
_mm_cmpeq_epiwith maximum - use mask of greatest value with
_mm_blendv_epi8on_Cur_idx_maxto get vector of indices of greater value or all FFs for not the greatest indices - out of indices vector find smallest indices by shuffling and comparing with itself with
_mm_min_epu - find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
- Extract the mask to GPR and
_BitScanForwardinto_H_pos - knowing index of greater element in vector, extract its index
_Cur_idx_maxinto_V_pos - the answer is
_V_pos*16 + _Result_indexrelative to the starting pointer in bytes - if loop stopped to prevent
_Result_positionsoverflow, start again with previous maximum already initialized
(For 64-bit vector it takes _mm_cmpgt_epi64 from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)
It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.
Reactions are currently unavailable