Skip to content

Improve std::count vectorization #4456

@AlexGuteniev

Description

@AlexGuteniev

I noticed from DevCom-10610716 and DevCom-10553384 that MSVC tries to auto-vectorize element counting.

I observe that the following is auto-vectorized for every int_type given that size_type is of the same size:

using int_type = int;
using size_type = int_type;

size_type count(int_type* b, int_type* e, int_type v)
{
    size_type r = 0;
    for (;b!=e;++b){
        if (*b == v){
            ++r;
        }
    }
    return r;
}

See https://godbolt.org/z/Yxx5e3foz

The algorithm is not the same as in vector_algorithms.cpp, specifically, it counts in vector registers, rather than general-purpose registers, hence the limitation of same elements size.

This leads to std::count and ranges::std::count being auto-vectorized with /D_USE_STD_VECTOR_ALGORITHMS=0

  • On x64 for element size 8, beating __std_count_trivial_8 slightly
  • On x86 for element size 4, beating __std_count_trivial_4 almost twice

The current compiler-generated code is likely to be incorrect, as the DevCom-10610716 and DevCom-10553384 are respectively about 8 and 4 element size implementation of this, and the issues are only Fixed - Pending Release. Still the whole approach looks like valid.

The approach of auto-vectorization can potentially scale towards supporting standard size type for each element size.
So we have few options to improve over existing std::count vectorization:

  • Wait for the compiler to auto-vectorize even mismatching element size, then drop std::count manual vectorization. Can wait passively, or create yet another DevCom issue. It may be too much to expect.
  • Reimplement the approach from the complier-generated code in vector_algorithm.cpp. Either same-element-size, or generalized version. Sure this would make the algorithm more complex. Not like minmax_element, but still more complex than the current count.
  • Reimplement count in headers, so that it counts portions to to the same size counter as element size, and then accumulates to destination count. So that compiler auto-vectorizes the inner counting. "This is gross".

(I'm not inclined to do anything about this in the near time, especially as there are some recent changes in the compiler in this area, and also because there are some more useful optimization and non-optimization things to do. Just recording the observation).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions