Skip to content

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers #2438

@AlexGuteniev

Description

@AlexGuteniev

With contiguous memory range and the default predicate (a flavor of less) or no predicate it can be vectorized.
SSE4.1 max_element algorithm.

Vertical part:

  • _mm_cmpgt_epi to compute mask of greater value of current against previous
  • _mm_add_epi to count indices of the current value (add precomputed vector _mm_set1_epi with 1)
  • _mm_blendv_epi8 to select indices of greater value into _Cur_idx_max
  • _mm_max_epi to update previous maximum, or alternatively can use _mm_blendv_epi8 again (for 64-bit where there's no max)
  • loop until the end or until some number to make sure accumulated _Cur_idx_max does not overflow

Horizontal part:

  • out of the vector, select greatest value by shuffling and comparing with itself with _mm_max_epi
  • find mask of greatest value by _mm_cmpeq_epi with maximum
  • use mask of greatest value with _mm_blendv_epi8 on _Cur_idx_max to get vector of indices of greater value or all FFs for not the greatest indices
  • out of indices vector find smallest indices by shuffling and comparing with itself with _mm_min_epu
  • find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
  • Extract the mask to GPR and _BitScanForward into _H_pos
  • knowing index of greater element in vector, extract its index _Cur_idx_max into _V_pos
  • the answer is _V_pos*16 + _Result_index relative to the starting pointer in bytes
  • if loop stopped to prevent _Result_positions overflow, start again with previous maximum already initialized

(For 64-bit vector it takes _mm_cmpgt_epi64 from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)

It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.

Metadata

Metadata

Assignees

No one assigned

    Labels

    fixedSomething works now, yay!performanceMust go faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions