Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745
StephanTLavavej merged 49 commits into microsoft:main
Who's a good search? You are! Yes you!
…pred`. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`. `_Equal_rev_pred` is called by ranges `search`/`find_end`. This doesn't affect `equal` etc.
This reverts commit 72a0d29.
might restore one or both later
The previous attempt was #4654, and it ended up being just …
Resolved conflicts in xutility.
Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:
Aside from … I am mildly confused as to why performance for …
I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32- or 64-byte boundary is what matters. You may try /QIntel-jcc-erratum (yes, even though you run on AMD!) for both … I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment will change again.
Thanks, makes sense!
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
🔍 🕵️ 🔎
This takes a different approach for both the search and the inner comparison (SSE4.2 instead of AVX2). This time the results are better.
For now, this covers 1- and 2-byte elements only. The same approach, slightly modified, could be used for 4- and 8-byte elements, but it still needs testing to see whether there would be a performance gain.
In the benchmark results, 0 is the small needle and 1 is the large needle.
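For readers unfamiliar with `pcmpestri`, here is a minimal sketch of how its "equal ordered" mode can drive a substring search over 1-byte elements. It is not the code from this PR: `search_bytes_sketch` and the `SKETCH_*` macros are hypothetical names, the sketch only uses the instruction for needles of at most 16 bytes (falling back to `std::search` otherwise), and it assumes an x86 CPU with SSE4.2 with no runtime feature detection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#define SKETCH_HAS_SSE42 1
#endif

#if defined(SKETCH_HAS_SSE42) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_TARGET __attribute__((target("sse4.2"))) // allows building without -msse4.2
#else
#define SKETCH_TARGET
#endif

// Hypothetical helper, not an STL internal. Returns the offset of the first
// occurrence of needle in hay, or hay_len if there is none.
SKETCH_TARGET std::size_t search_bytes_sketch(
    const char* hay, std::size_t hay_len, const char* needle, std::size_t needle_len) {
    assert(hay != nullptr && needle != nullptr);
    if (needle_len == 0) {
        return 0; // an empty needle matches at the start
    }
#ifdef SKETCH_HAS_SSE42
    if (needle_len <= 16) {
        // Equal-ordered mode: report the lowest index in the haystack chunk where
        // the needle matches fully, or matches up to the end of the 16-byte chunk.
        constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT;

        alignas(16) char needle_buf[16] = {};
        std::memcpy(needle_buf, needle, needle_len);
        const __m128i needle_vec = _mm_load_si128(reinterpret_cast<const __m128i*>(needle_buf));

        std::size_t pos = 0;
        while (hay_len - pos >= needle_len) {
            const std::size_t chunk_len = std::min<std::size_t>(16, hay_len - pos);
            __m128i chunk;
            if (chunk_len == 16) {
                chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hay + pos));
            } else {
                // Copy the tail so we never read past the end of the haystack.
                alignas(16) char tail[16] = {};
                std::memcpy(tail, hay + pos, chunk_len);
                chunk = _mm_load_si128(reinterpret_cast<const __m128i*>(tail));
            }

            const int idx = _mm_cmpestri(needle_vec, static_cast<int>(needle_len),
                                         chunk, static_cast<int>(chunk_len), mode);
            if (idx == 16) {
                pos += chunk_len; // no candidate anywhere in this chunk
            } else if (static_cast<std::size_t>(idx) + needle_len <= chunk_len) {
                return pos + static_cast<std::size_t>(idx); // whole needle seen in the chunk
            } else {
                pos += static_cast<std::size_t>(idx); // partial match at chunk end: realign and retry
            }
        }
        return hay_len;
    }
#endif
    // Fallback for long needles or non-x86 targets.
    const char* const it = std::search(hay, hay + hay_len, needle, needle + needle_len);
    return static_cast<std::size_t>(it - hay);
}
```

The key design point is that a candidate near the end of a chunk may be only a prefix match, so the sketch realigns the next load to the candidate instead of verifying it with a separate scalar comparison. For 2-byte elements the same idea would use `_SIDD_UWORD_OPS` with lengths counted in elements, which this sketch does not cover.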