Make bitset `would_modify_words` more vectorizer-friendly #153640
Zalathar wants to merge 3 commits into rust-lang:main
|
@bors try @rust-timer queue |
|
Finished benchmarking commit (af612eb): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.
@bors rollup=never
Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage): This benchmark run did not return any relevant results for this metric.
Cycles: Results (primary -2.3%, secondary 0.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 479.112s -> 477.505s (-0.34%)
|
Let's see what happens if we double the subchunk length from 32 bytes (4 words) to 64 bytes (8 words).
@bors try @rust-timer queue
|
Finished benchmarking commit (b3e83b4): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.
@bors rollup=never
Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage): Results (secondary 7.5%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles: Results (secondary -2.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 480.034s -> 484.684s (0.97%)
|
I was initially disappointed to see this only affect […]. I wonder what code patterns cause these paths to be relevant.
|
Probably it's huge functions with a bunch of locals, exercising the move/init dataflow a lot?
|
If it has large functions with thousands of locals, then yeah I can imagine that stressing MixedBitSet in ways that most crates never come close to. |
|
My recollection is that it does indeed, the usual suspect being machine-generated code.
Currently this function compares a single pair of `u64` at a time, which is potentially slower than comparing multiple words before each early-exit check, especially for the large chunks used by ChunkedBitSet.
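To illustrate the idea, here is a minimal sketch, not the PR's actual diff. It contrasts a per-word early exit with a version that accumulates differences over a fixed-size subchunk before checking once. The function names, the `Word` alias, and the `SUBCHUNK_WORDS` constant are assumptions made for this example; the value 4 is taken from the "32 bytes (4 words)" figure mentioned in the thread.

```rust
/// Assumed word type; the description only says the words are `u64`.
type Word = u64;

/// Assumed subchunk length in words (4 words = 32 bytes).
const SUBCHUNK_WORDS: usize = 4;

/// Baseline sketch: test one pair of words at a time, exiting on the first change.
fn would_modify_words_scalar(
    out: &[Word],
    inp: &[Word],
    op: impl Fn(Word, Word) -> Word,
) -> bool {
    assert_eq!(out.len(), inp.len());
    out.iter().zip(inp).any(|(&a, &b)| op(a, b) != a)
}

/// Subchunked sketch: OR together the changed bits of a whole subchunk,
/// then do a single early-exit check per subchunk. The inner loop has a
/// fixed trip count and no branches, which gives the optimizer a better
/// chance to vectorize it.
fn would_modify_words_chunked(
    out: &[Word],
    inp: &[Word],
    op: impl Fn(Word, Word) -> Word,
) -> bool {
    assert_eq!(out.len(), inp.len());
    let mut out_chunks = out.chunks_exact(SUBCHUNK_WORDS);
    let mut in_chunks = inp.chunks_exact(SUBCHUNK_WORDS);

    for (o, i) in (&mut out_chunks).zip(&mut in_chunks) {
        let mut diff: Word = 0;
        for k in 0..SUBCHUNK_WORDS {
            // Bits that `op` would change in this word.
            diff |= o[k] ^ op(o[k], i[k]);
        }
        if diff != 0 {
            return true;
        }
    }

    // Leftover words that do not fill a whole subchunk are checked one at a time.
    out_chunks
        .remainder()
        .iter()
        .zip(in_chunks.remainder())
        .any(|(&a, &b)| op(a, b) != a)
}
```

For a union, a call would look like `would_modify_words_chunked(&dst, &src, |a, b| a | b)`. Both versions return the same answer; whether the subchunked loop is actually faster depends on the target and on what the optimizer does with it, which is what the perf runs above are measuring.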