Optimize BitArray APIs using Sse2 #33781
Conversation
- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
+ if (Sse2.IsSupported && m_array.Length >= 4)
Do we have any metrics for the average length of a BitArray?
I tried to gain some insights here that might be relevant: #33367 (comment)
What is the break-even point? Also, is that bit length or byte length or backing integer array length? Edit: It's integer length.
@ahsonkhan updated the results for all small values ^
Hmm. So 8 integers is the break-even, i.e. 256 bits. In some places 1-4 ints was common. If the break-even was something like 3 or 4, I think the trade-off would have made more sense. Here, I am not too sure. But the gains above 8 are tempting. On the other hand, 30-50% regression for small lengths is quite significant. The regression for lengths between 4-8 is tolerable. We made the trade-off here because the regression was small (and the break-even point was 4 ints): #33367 (comment)
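For reference, the shape of the trade-off being discussed can be sketched as follows (illustrative only; the class name and exact loop bounds are not the PR's actual code). The SSE2 path processes four ints (128 bits) per operation, and a scalar loop handles both small arrays and the tail, which is where the small-length regression comes from:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BitArrayAndSketch
{
    // ANDs 'right' into 'left' element-wise, mirroring BitArray.And.
    public static unsafe void And(int[] left, int[] right)
    {
        int i = 0;
        if (Sse2.IsSupported && left.Length >= Vector128<int>.Count)
        {
            fixed (int* leftPtr = left)
            fixed (int* rightPtr = right)
            {
                // Process 4 ints (128 bits) per iteration.
                for (; i <= left.Length - Vector128<int>.Count; i += Vector128<int>.Count)
                {
                    Vector128<int> result = Sse2.And(
                        Sse2.LoadVector128(leftPtr + i),
                        Sse2.LoadVector128(rightPtr + i));
                    Sse2.Store(leftPtr + i, result);
                }
            }
        }
        // Scalar path: small arrays, or the at-most-3-element tail.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```

The extra `Sse2.IsSupported` check and the pointer setup are what small arrays pay for without getting any vector work in return.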
throw new ArgumentException(SR.Arg_ArrayLengthsDiffer);
- for (int i = 0; i < m_array.Length; i++)
+ int i = 0;
Can we get rid of some of the regression for small numbers by special-casing and manually unrolling the 1 through 8 case?
Something like:
var count = m_array.Length;
switch (count)
{
    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: m_array[3] &= value.m_array[3]; goto case 3;
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; break;
}
or maybe something like:
var count = m_array.Length;
switch (count)
{
    case 3: m_array[2] &= value.m_array[2]; goto case 2;
    case 2: m_array[1] &= value.m_array[1]; goto case 1;
    case 1: m_array[0] &= value.m_array[0]; goto Done;
    case 7: m_array[6] &= value.m_array[6]; goto case 6;
    case 6: m_array[5] &= value.m_array[5]; goto case 5;
    case 5: m_array[4] &= value.m_array[4]; goto case 4;
    case 4: result = vector & vector; goto Done;   // pseudocode: one 128-bit AND of elements 0..3
    case 8: result = vector & vector; goto case 4; // pseudocode: one 128-bit AND of elements 4..7
}
Done: ;
The performance of the switch vs. the for-loop for patterns like these depends a lot on the predictability of count. If it is predictable (e.g. always the same), the switch is somewhat better. If it is non-predictable (e.g. random), the switch tends to be much worse.
I think the current code is better because it is smaller.
> The indirect jumps tend to get poorly predicted.

Not sure what you mean here. There should only be a single jump generated....
For switch statements that have linear cases covered, you can store the targets in a constant array and do a single jump:
mov r8d, DWORD PTR $LN14@And[rcx+rax*4]
add r8, rcx
jmp r8
This is obviously not 100% equivalent, but this shows how native compilers do it:
https://godbolt.org/z/MWX7QC
(and since the jump is unconditional, there should be no mis-prediction)
The processor needs to know where the branch is going in order to keep decoding the instruction stream. For indirect jumps, it cannot even guess taken/not-taken as it can for conditional jumps when it has no data for the given branch in the predictor. It needs to wait for execution to catch up.
> There should only be a single jump generated....
> (and since the jump is unconditional, there should be no mis-prediction)
Agree with @jkotas; branch prediction is not a matter of how few jumps you have, or of unconditional vs. conditional jumps. In your example, jmp r8 is an indirect jump, but the CPU cannot predict its target if the data (DWORD PTR $LN14@And[rcx+rax*4]) does not have an obvious pattern.
> The indirect jumps tend to get poorly predicted.
Another reason for poor predictability is that a switch has only one dispatch node for all cases, so the branch history table has to remember the recently seen targets for every case at that one site. But if-sequences (or other equivalents, like threaded code) have a dispatch node for each "case", so each branch history table entry only needs to remember its own recent targets, which gives a much better branch prediction rate.
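The contrast described above can be sketched like this (illustrative only, not the PR's code): the switch funnels every count through one indirect branch, while the if-sequence gives each case its own conditional branch that the predictor tracks separately.

```csharp
public static class DispatchSketch
{
    // One shared dispatch site: the jump-table branch must predict
    // a different target every time 'count' changes.
    public static void AndViaSwitch(int[] a, int[] b, int count)
    {
        switch (count)
        {
            case 3: a[2] &= b[2]; goto case 2;
            case 2: a[1] &= b[1]; goto case 1;
            case 1: a[0] &= b[0]; break;
        }
    }

    // One conditional branch per case: each has its own predictor
    // history, so a skewed distribution of 'count' predicts well.
    public static void AndViaIfSequence(int[] a, int[] b, int count)
    {
        if (count >= 3) a[2] &= b[2];
        if (count >= 2) a[1] &= b[1];
        if (count >= 1) a[0] &= b[0];
    }
}
```

Both methods compute the same result; only the branch structure differs.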
Just benchmarked the unroll-switch algorithm (see https://gist.github.com/EgorBo/66fad0c6b50775ec0ec022a14437bd77) and got these results:
[chart: unroll-switch benchmark results]
for (; i < m_array.Length; i++)
We could probably unroll this as well, since we should only have (at most) 3 remaining.
@tannergooding I wish the JIT could unroll it for me 😄 (dotnet/coreclr#19594?). I'll try and update the benchmarks.
It will not be unrolled unless the limit of the induction variable (IV) is constant; you have to unroll it manually if you want that.
The partial unrolling the JIT supports does not handle a dynamic IV, which would cost a lot to analyze.
And it will not really affect performance even after it is unrolled.
The reason is that this is just normal ALU work, which only occupies one port.
If you want unrolling to pay off, you have to make it use SIMD (two different vector registers) and unroll it twice; that would make a significant change in performance.
@ArtBlnd well, anyway I don't think it makes sense to unroll it by hand and make the code uglier here; it's invoked only when Length >= 5
Yes, it really doesn't matter whether it's unrolled or not, so it's good code :>
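For completeness, the two-vector-register unrolling ArtBlnd describes could be sketched like this (illustrative only; the class name is hypothetical and this is not the PR's code). Two independent 128-bit ANDs per iteration can issue on different execution ports rather than serializing on one:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class UnrollSketch
{
    // ANDs 'right' into 'left'; the vector loop is unrolled by two,
    // keeping two independent Vector128 operations in flight.
    public static unsafe void And(int[] left, int[] right)
    {
        int i = 0;
        if (Sse2.IsSupported)
        {
            fixed (int* l = left)
            fixed (int* r = right)
            {
                for (; i <= left.Length - 8; i += 8)
                {
                    Vector128<int> r0 = Sse2.And(Sse2.LoadVector128(l + i),
                                                 Sse2.LoadVector128(r + i));
                    Vector128<int> r1 = Sse2.And(Sse2.LoadVector128(l + i + 4),
                                                 Sse2.LoadVector128(r + i + 4));
                    Sse2.Store(l + i, r0);
                    Sse2.Store(l + i + 4, r1);
                }
            }
        }
        // Scalar tail: up to 7 remaining elements.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```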
According to my benchmarks it looks like unroll-switch is the fastest solution for Length <= 8
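The benchmarked "unroll-switch" shape looks roughly like this (a sketch under assumptions: the class name is hypothetical, and a plain loop stands in for the PR's SSE2 path in the default case):

```csharp
public static class UnrollSwitchSketch
{
    // Lengths 0..8 dispatch through a fully unrolled switch;
    // anything larger falls through to the loop (Sse2 in the PR).
    public static void And(int[] a, int[] b)
    {
        switch (a.Length)
        {
            case 8: a[7] &= b[7]; goto case 7;
            case 7: a[6] &= b[6]; goto case 6;
            case 6: a[5] &= b[5]; goto case 5;
            case 5: a[4] &= b[4]; goto case 4;
            case 4: a[3] &= b[3]; goto case 3;
            case 3: a[2] &= b[2]; goto case 2;
            case 2: a[1] &= b[1]; goto case 1;
            case 1: a[0] &= b[0]; break;
            case 0: break;
            default:
                // The actual PR uses the vectorized path here.
                for (int i = 0; i < a.Length; i++)
                    a[i] &= b[i];
                break;
        }
    }
}
```

This avoids both the loop overhead and the `Sse2.IsSupported` setup cost for the small lengths where the regression was measured.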
@dotnet-bot test this please
@dotnet-bot test Windows x64 Debug Build
* Vectorize BitArray
* add braces after 'fixed'
* expand 'var'
* fix typo (don't worry, the benchmarks were made without it)
* unroll-switch
* simplify unroll-switch
* bring braces after 'fixed' back
* use Vector128<int>.Count instead of `4`
* CamelCase for labels

Commit migrated from dotnet/corefx@372b575
A graph for small values where there is a small perf regression:

^ the benchmark looks like this: