Optimize BitArray APIs using Sse2 #33781
Conversation
- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
+ if (Sse2.IsSupported && m_array.Length >= 4)
Do we have any metrics for the average length of a BitArray?
I tried to gain some insights here that might be relevant: #33367 (comment)
What is the break-even point? Also, is that bit length or byte length or backing integer array length? Edit: It's integer length.
@ahsonkhan updated the results for all small values ^
Hmm. So 8 integers is the break-even, i.e. 256 bits. In some places 1-4 ints was common. If the break-even was something like 3 or 4, I think the trade-off would have made more sense. Here, I am not too sure. But the gains above 8 are tempting. On the other hand, 30-50% regression for small lengths is quite significant. The regression for lengths between 4-8 is tolerable. We made the trade-off here because the regression was small (and the break-even point was 4 ints): #33367 (comment)
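For reference, the shape of the trade-off being discussed can be sketched as follows (illustrative only; the class name and exact loop bounds are not the PR's actual code). The SSE2 path processes four ints (128 bits) per operation, and a scalar loop handles both small arrays and the tail, which is where the small-length regression comes from:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BitArrayAndSketch
{
    // ANDs 'right' into 'left' element-wise, mirroring BitArray.And.
    public static unsafe void And(int[] left, int[] right)
    {
        int i = 0;
        if (Sse2.IsSupported && left.Length >= Vector128<int>.Count)
        {
            fixed (int* leftPtr = left)
            fixed (int* rightPtr = right)
            {
                // Process 4 ints (128 bits) per iteration.
                for (; i <= left.Length - Vector128<int>.Count; i += Vector128<int>.Count)
                {
                    Vector128<int> result = Sse2.And(
                        Sse2.LoadVector128(leftPtr + i),
                        Sse2.LoadVector128(rightPtr + i));
                    Sse2.Store(leftPtr + i, result);
                }
            }
        }
        // Scalar path: small arrays, or the at-most-3-element tail.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```

The extra `Sse2.IsSupported` check and the pointer setup are what small arrays pay for without getting any vector work in return.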
throw new ArgumentException(SR.Arg_ArrayLengthsDiffer);
- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
Can we get rid of some of the regression for small numbers by special-casing and manually unrolling the 1 through 8 case?
Something like:
var count = m_array.Length;
switch (count)
{
    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: m_array[3] &= value.m_array[3]; goto case 3;
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; break;
}
or maybe something like:
var count = m_array.Length;
switch (count)
{
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; goto Done;
    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: result = vector & vector; goto Done;   // pseudocode: one 128-bit AND of elements 0..3
    case 8: result = vector & vector; goto case 4; // pseudocode: one 128-bit AND of elements 4..7
}
Done: ;
The performance of the switch vs. the for-loop for patterns like these depends a lot on the predictability of count. If it is predictable (e.g. always the same), the switch is somewhat better. If it is non-predictable (e.g. random), the switch tends to be much worse.
I think the current code is better because it is smaller.
> The indirect jumps tend to get poorly predicted.

Not sure what you mean here. There should only be a single jump generated....
For switch statements that have linear cases covered, you can store the targets in a constant array and do a single jump:
mov r8d, DWORD PTR $LN14@And[rcx+rax*4]
add r8, rcx
jmp r8
This is obviously not 100% equivalent, but this shows how native compilers do it:
https://godbolt.org/z/MWX7QC
(and since the jump is unconditional, there should be no mis-prediction)
The processor needs to know where the branch is going in order to keep decoding the instruction stream. For indirect jumps, it cannot even guess taken/not-taken as it can for conditional jumps when it has no data for the given branch in the predictor. It needs to wait for execution to catch up.
> There should only be a single jump generated....
> (and since the jump is unconditional, there should be no mis-prediction)
Agree with @jkotas; branch prediction is not a matter of how few jumps you have, or of unconditional vs. conditional jumps. In your example, jmp r8 is an indirect jump, but the CPU cannot predict its target if the data (DWORD PTR $LN14@And[rcx+rax*4]) does not have an obvious pattern.
> The indirect jumps tend to get poorly predicted.
Another reason for poor predictability is that a switch has only one dispatch node for all cases, so the branch history table has to remember the recently seen targets for every case at that one site. But if-sequences (or other equivalents, like threaded code) have a dispatch node for each "case", so each branch history table entry only needs to remember its own recent targets, which gives a much better branch prediction rate.
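The contrast described above can be sketched like this (illustrative only, not the PR's code): the switch funnels every count through one indirect branch, while the if-sequence gives each case its own conditional branch that the predictor tracks separately.

```csharp
public static class DispatchSketch
{
    // One shared dispatch site: the jump-table branch must predict
    // a different target every time 'count' changes.
    public static void AndViaSwitch(int[] a, int[] b, int count)
    {
        switch (count)
        {
            case 3: a[2] &= b[2]; goto case 2;
            case 2: a[1] &= b[1]; goto case 1;
            case 1: a[0] &= b[0]; break;
        }
    }

    // One conditional branch per case: each has its own predictor
    // history, so a skewed distribution of 'count' predicts well.
    public static void AndViaIfSequence(int[] a, int[] b, int count)
    {
        if (count >= 3) a[2] &= b[2];
        if (count >= 2) a[1] &= b[1];
        if (count >= 1) a[0] &= b[0];
    }
}
```

Both methods compute the same result; only the branch structure differs.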
Just benchmarked the unroll-switch algorithm (see https://gist.github.com/EgorBo/66fad0c6b50775ec0ec022a14437bd77) and got these results:
[chart: unroll-switch benchmark results]
for (; i < m_array.Length; i++)
We could probably unroll this as well, since we should only have (at most) 3 remaining.
@tannergooding I wish the JIT could unroll it for me 😄 (dotnet/coreclr#19594?). I'll try and update the benchmarks.
It will not be unrolled unless the limit of the induction variable (IV) is constant; you have to unroll it manually if you want that.
The partial unrolling the JIT supports does not handle a dynamic IV, which would cost a lot to analyze.
And it will not really affect performance even after it is unrolled.
The reason is that this is just normal ALU work, which only occupies one port.
If you want unrolling to pay off, you have to make it use SIMD (two different vector registers) and unroll it twice; that would make a significant change in performance.
@ArtBlnd well, anyway I don't think it makes sense to unroll it by hand and make the code uglier here; it's invoked only when Length >= 5
Yes, it really doesn't matter whether it's unrolled or not, so it's good code :>
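For completeness, the two-vector-register unrolling ArtBlnd describes could be sketched like this (illustrative only; the class name is hypothetical and this is not the PR's code). Two independent 128-bit ANDs per iteration can issue on different execution ports rather than serializing on one:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class UnrollSketch
{
    // ANDs 'right' into 'left'; the vector loop is unrolled by two,
    // keeping two independent Vector128 operations in flight.
    public static unsafe void And(int[] left, int[] right)
    {
        int i = 0;
        if (Sse2.IsSupported)
        {
            fixed (int* l = left)
            fixed (int* r = right)
            {
                for (; i <= left.Length - 8; i += 8)
                {
                    Vector128<int> r0 = Sse2.And(Sse2.LoadVector128(l + i),
                                                 Sse2.LoadVector128(r + i));
                    Vector128<int> r1 = Sse2.And(Sse2.LoadVector128(l + i + 4),
                                                 Sse2.LoadVector128(r + i + 4));
                    Sse2.Store(l + i, r0);
                    Sse2.Store(l + i + 4, r1);
                }
            }
        }
        // Scalar tail: up to 7 remaining elements.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```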
According to my benchmarks it looks like unroll-switch is the fastest solution for Length <= 8
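The benchmarked "unroll-switch" shape looks roughly like this (a sketch under assumptions: the class name is hypothetical, and a plain loop stands in for the PR's SSE2 path in the default case):

```csharp
public static class UnrollSwitchSketch
{
    // Lengths 0..8 dispatch through a fully unrolled switch;
    // anything larger falls through to the loop (Sse2 in the PR).
    public static void And(int[] a, int[] b)
    {
        switch (a.Length)
        {
            case 8: a[7] &= b[7]; goto case 7;
            case 7: a[6] &= b[6]; goto case 6;
            case 6: a[5] &= b[5]; goto case 5;
            case 5: a[4] &= b[4]; goto case 4;
            case 4: a[3] &= b[3]; goto case 3;
            case 3: a[2] &= b[2]; goto case 2;
            case 2: a[1] &= b[1]; goto case 1;
            case 1: a[0] &= b[0]; break;
            case 0: break;
            default:
                // The actual PR uses the vectorized path here.
                for (int i = 0; i < a.Length; i++)
                    a[i] &= b[i];
                break;
        }
    }
}
```

This avoids both the loop overhead and the `Sse2.IsSupported` setup cost for the small lengths where the regression was measured.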
@dotnet-bot test this please
@dotnet-bot test Windows x64 Debug Build
* Vectorize BitArray
* add braces after 'fixed'
* expand 'var'
* fix typo (don't worry, the benchmarks were made without it)
* unroll-switch
* simplify unroll-switch
* bring braces after 'fixed' back
* use Vector128<int>.Count instead of `4`
* CamelCase for labels

Commit migrated from dotnet/corefx@372b575
A graph for small values where there is a small perf regression:

^ the benchmark looks like this: