This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Optimize BitArray APIs using Sse2 #33781

Merged
stephentoub merged 11 commits into dotnet:master from EgorBo:intrinsify-bitarray on Dec 15, 2018

Conversation

@EgorBo
Member

@EgorBo EgorBo commented Nov 30, 2018

[image: benchmark results across BitArray lengths]

A graph for small values where there is a small perf regression:
[image: benchmark results for small lengths]

^ the benchmark looks like this:

static void Main(string[] args)
{
    if (Environment.GetEnvironmentVariable("COMPlus_TieredCompilation") != "0")
        throw new Exception();
    BenchmarkRunner.Run<Program>();
}

[Benchmark(Baseline = true)]
[ArgumentsSource(nameof(Data))]
public BitArray Or(BitArray b1, BitArray b2, int length)
{
    return b1.Or(b2);
}

public IEnumerable<object[]> Data()
{
    for (int i = 1; i < 20; i++)
        yield return new object[] { GenerateBitArray(i, 2), GenerateBitArray(i, 3), i };
    yield return new object[] { GenerateBitArray(1000, 2), GenerateBitArray(1000, 3), 1000 };
}

static BitArray GenerateBitArray(int len, int d) => new BitArray(Enumerable.Range(0, len).Select(b => b % d).ToArray());


- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
+ if (Sse2.IsSupported && m_array.Length >= 4)
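For context, here is a hedged, self-contained sketch of what the vectorized path looks like (illustrative names such as `BitArrayOps.Or` — not the PR's exact code): process `Vector128<int>.Count` ints per iteration with SSE2, then finish the remainder with the scalar loop. Requires compiling with /unsafe.

```csharp
// Hedged sketch only (illustrative names, not the PR's exact code):
// Or the backing int[] four elements at a time with SSE2, then finish
// the remainder with the scalar loop.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BitArrayOps
{
    public static unsafe void Or(int[] left, int[] right)
    {
        int i = 0;
        if (Sse2.IsSupported && left.Length >= Vector128<int>.Count)
        {
            fixed (int* l = left)
            fixed (int* r = right)
            {
                // Vector loop: 4 ints (128 bits) per iteration.
                for (; i <= left.Length - Vector128<int>.Count; i += Vector128<int>.Count)
                {
                    Sse2.Store(l + i, Sse2.Or(Sse2.LoadVector128(l + i),
                                              Sse2.LoadVector128(r + i)));
                }
            }
        }
        // Scalar tail: at most 3 elements remain.
        for (; i < left.Length; i++)
            left[i] |= right[i];
    }
}
```

The real PR operates on BitArray's private m_array field, and per the commit list it uses `Vector128<int>.Count` rather than a literal 4.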
Member

Do we have any metrics for the average length of a BitArray?


I tried to gain some insights here that might be relevant: #33367 (comment)

@ahsonkhan

ahsonkhan commented Dec 1, 2018

the benchmark looks like this

What is the break-even point?
Length = 3 regresses a bit, and Length = 9+ improves (at what length do they cross?). What about Lengths 1 and 2? How bad is the regression for those?

Also, is that bit length or byte length or backing integer array length?

Edit: It's integer length.

@EgorBo
Member Author

EgorBo commented Dec 1, 2018

@ahsonkhan updated the results for all small values ^

@ahsonkhan

ahsonkhan commented Dec 1, 2018

updated the results for all small values ^

Hmm. So 8 integers is the break-even, i.e. 256 bits. In some places 1-4 ints was common. If the break-even was something like 3 or 4, I think the trade-off would have made more sense. Here, I am not too sure. But the gains above 8 are tempting.

On the other hand, 30-50% regression for small lengths is quite significant. The regression for lengths between 4-8 is tolerable.

We made the trade-off here because the regression was small (and the break-even point was 4 ints): #33367 (comment)

@ahsonkhan ahsonkhan requested a review from stephentoub December 1, 2018 01:48
  throw new ArgumentException(SR.Arg_ArrayLengthsDiffer);

- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
Member

Can we get rid of some of the regression for small numbers by special-casing and manually unrolling the 1 through 8 case?


Worth a shot :)

Member

Something like:

var count = m_array.Length;

switch (count)
{
    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: m_array[3] &= value.m_array[3]; goto case 3;
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; break;
}

Member

or maybe something like:

var count = m_array.Length;

switch (count)
{
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; break;

    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: result = vector & vector; break;          // pseudocode: And ints 0..3 as one Vector128

    case 8: result = vector & vector; goto case 4;    // pseudocode: And ints 4..7, then fall into case 4
}

Member

The performance of the switch vs. for-loop for patterns like these depends a lot on the predictability of count. If it is predictable (e.g. always the same), switch is somewhat better. If it is non-predictable (e.g. random), switch tends to be much worse.

I think the current code is better because it is smaller.

Member

The indirect jumps tend to get poorly predicted.

Not sure what you mean here. There should only be a single jump generated....

For switch statements whose cases cover a contiguous range, the compiler can store the targets in a constant array and do a single indirect jump:

mov     r8d, DWORD PTR $LN14@And[rcx+rax*4]
add     r8, rcx
jmp     r8

This is obviously not 100% equivalent, but this shows how native compilers do it:
https://godbolt.org/z/MWX7QC

Member

(and since the jump is unconditional, there should be no misprediction)

Member

The processor needs to know where the branch is going to go to keep decoding instruction stream. For indirect jumps, it cannot even guess taken/not-taken like for conditional jumps when it has no data for the given branch in the predictor. It needs to wait for the execution to catch up.

Contributor

@fiigii fiigii Dec 1, 2018

There should only be a single jump generated....
(and since the jump is unconditional, there should be no misprediction)

Agree with @jkotas: branch prediction is not about how few jumps you have, nor about whether a jump is conditional or unconditional. In your example, jmp r8 is an indirect jump, but the CPU cannot predict its target if the data (DWORD PTR $LN14@And[rcx+rax*4]) does not follow an obvious pattern.

The indirect jumps tend to get poorly predicted.

Another reason for poor predictability is that a switch has only one dispatch node for all cases, so the branch history table has to remember the recently seen targets of all cases at a single site. If-sequences (or equivalents, like threaded code) have a dispatch node for each "case", so each entry in the branch history table only needs to remember its own recent targets, which gives a much better branch prediction rate.

Member Author

@EgorBo EgorBo Dec 1, 2018

Just benchmarked the unroll-switch algorithm (see https://gist.github.com/EgorBo/66fad0c6b50775ec0ec022a14437bd77) and got these results:
[image: unroll-switch benchmark results]

}
}

for (; i < m_array.Length; i++)
Member

We could probably unroll this as well, since we should only have (at most) 3 remaining.

Member Author

@tannergooding I wish the JIT could unroll it for me 😄 (dotnet/coreclr#19594?). I'll try and update the benchmarks.

@ArtBlnd ArtBlnd Dec 14, 2018

It will not be unrolled unless the limit of the induction variable is a constant; you have to unroll it manually if you want that. The partial-unrolling support does not handle a dynamic IV limit, which would be costly to analyze.

And it will not really affect performance even after it is unrolled.

@ArtBlnd ArtBlnd Dec 14, 2018

And the reason is that this is plain scalar ALU work, which goes through a single port. If you want unrolling to pay off, you have to make it use SIMD (two different vector registers) and unroll it twice; that would make a significant change in performance.

Member Author

@ArtBlnd well, anyway I don't think it makes sense to unroll it by hand and make the code uglier here - it's only invoked when Length >= 5

@ArtBlnd ArtBlnd Dec 14, 2018

Yes, it really doesn't matter whether it's unrolled or not, so it's good code :>

@EgorBo
Member Author

EgorBo commented Dec 1, 2018

According to my benchmarks, it looks like the unroll-switch is the fastest solution for Length <= 8.
I know benchmarks are not real-world scenarios, so let me know if there is a better solution.
I guess it doesn't make sense to unroll the loop at the end of the methods, since it's only executed when Length > 8, where performance is already good.
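The unroll-switch idea, reduced to a self-contained sketch (illustrative names such as `BitArraySmall.And`; a plain loop stands in for the SIMD path the PR takes for longer arrays):

```csharp
public static class BitArraySmall
{
    // Duff's-device-style fall-through switch for small backing arrays;
    // larger lengths fall back to the loop (the PR uses SSE2 there).
    public static void And(int[] a, int[] b)
    {
        switch (a.Length)
        {
            case 3: a[2] &= b[2]; goto case 2;   // fall through down to one element
            case 2: a[1] &= b[1]; goto case 1;
            case 1: a[0] &= b[0]; break;
            case 0: break;
            default: // Length > 3: plain loop stands in for the vector path
                for (int i = 0; i < a.Length; i++)
                    a[i] &= b[i];
                break;
        }
    }
}
```

The fall-through cases give the JIT straight-line code with no loop counter for the common small lengths, which is what the benchmarks above measure.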

@ahsonkhan

@dotnet-bot test this please

@EgorBo
Member Author

EgorBo commented Dec 14, 2018

@dotnet-bot test Windows x64 Debug Build
@dotnet-bot test OSX x64 Debug Build
@dotnet-bot test Packaging All Configurations x64 Debug Build

Member

@stephentoub stephentoub left a comment

Thanks, @EgorBo!

@stephentoub stephentoub merged commit 372b575 into dotnet:master Dec 15, 2018
jlennox pushed a commit to jlennox/corefx that referenced this pull request Dec 16, 2018
* Vectorize BitArray

* add braces after 'fixed'

* expand 'var'

* fix typo (don't worry, the benchmarks were made without it)

* unroll-switch

* simplify unroll-switch

* bring braces after 'fixed' back

* use Vector128<int>.Count instead of `4`

* CamelCase for labels
@karelz karelz added this to the 3.0 milestone Dec 21, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Commit migrated from dotnet/corefx@372b575