<xutility>: vectorize std::count #2384

@AlexGuteniev

Description

Relates to #2379

For contiguous ranges of simple types (1-, 2-, 4-, or 8-byte integers, and possibly 4- and 8-byte floats in fast mode), the following vectorized algorithm is possible. It is described here for SSE2 and an 8-bit element type, but it applies to other element sizes and vector sizes as well:

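1. Broadcast the searched value to every lane of a vector register (_mm_set1 intrinsics).
2. Compare each loaded chunk of the range against it to obtain a per-element match mask (_mm_cmpeq_epi8 intrinsic).
3. Convert that mask to bits (_mm_movemask_epi8) and count the set bits (popcnt).
4. Accumulate this count into the result.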
A hand-coded popcount would probably be inefficient. In that case, the vectorized path can apply only starting with SSE4.2, for which we assume popcnt is available.
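
The steps above can be sketched as follows for the 8-bit case. This is only an illustration under stated assumptions, not the STL's implementation: the function name `count_bytes_sketch` is hypothetical, the input is assumed to be a contiguous byte buffer, and the bit count uses `__popcnt` on MSVC (which assumes the CPU supports the popcnt instruction) or `__builtin_popcount` elsewhere. On targets without SSE2 the sketch falls back to the scalar loop.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define COUNT_SKETCH_HAS_SSE2 1
#include <emmintrin.h> // SSE2 intrinsics
#if defined(_MSC_VER)
#include <intrin.h> // __popcnt
#endif
#else
#define COUNT_SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper for illustration; not part of the STL.
inline std::size_t count_bytes_sketch(const std::uint8_t* first, std::size_t size, std::uint8_t value) {
    std::size_t result = 0;
    std::size_t i      = 0;
#if COUNT_SKETCH_HAS_SSE2
    // Step 1: broadcast the searched value to all 16 byte lanes.
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
    for (; i + 16 <= size; i += 16) {
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first + i));
        // Step 2: matching lanes become 0xFF, all others 0x00.
        const __m128i eq = _mm_cmpeq_epi8(chunk, needle);
        // Step 3: collapse to one bit per lane, then count the set bits.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
#if defined(_MSC_VER)
        result += __popcnt(mask); // assumes popcnt is available on the CPU
#else
        result += static_cast<unsigned>(__builtin_popcount(mask));
#endif
        // Step 4: the additions above accumulate the per-chunk counts.
    }
#endif
    // Scalar tail (and the whole range when SSE2 is unavailable).
    for (; i < size; ++i) {
        result += (first[i] == value) ? 1u : 0u;
    }
    return result;
}
```

Wider element types would use the matching compare intrinsic (for example `_mm_cmpeq_epi16`), and since `_mm_movemask_epi8` still yields one bit per byte, the bit count would need to be divided by the element size.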

Labels: fixed (Something works now, yay!), performance (Must go faster)
