Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Improve get/set performance in HashSet<T>#40106

Merged
danmoseley merged 11 commits intodotnet:masterfrom
JeffreyZhao:issue-36448
Nov 6, 2019
Merged

Improve get/set performance in HashSet<T>#40106
danmoseley merged 11 commits intodotnet:masterfrom
JeffreyZhao:issue-36448

Conversation

@JeffreyZhao
Copy link
Contributor

Similar optimization in Dictionary<,> has been ported to HashSet<>.

_comparer has been made nullable, and default by null. This
makes us possible to check if it's null for faster branches, instead
of comparing it to EqualityComparer<T>.Default. Since _comparer
is null by default, the constructors and the OnDeserialization
method have been modified accordingly to save null instead of the
EqualityComparer<T>.Default if it's been passed/saved. Also, the
places that use _comparer directly much do a null check and switch
to EqualityComparer<T>.Default instead.

The major optimization is under Contains and AddIfNotPresent
methods. Basically the code was:

// Code uses _comparer

Now it becomes:

var comparer = _comparer;
if (comparer == null)
{
  if (default(T) != null)
  {
    // Code uses EqualityComparer<T>.Default for de-virtualizing.
  }
  else
  {
    var defaultComparer = EqualityComparer<T>.Default;
    // Code uses cached defaultComparer
  }
}
else
{
  // Code uses comparer
}

I've intentionally kept the origin implementation and comments as
much as possible, and the optimized code is not exactly the same as
in Dictionary<T>. Further micro-optimizations might be possible
but I think it's cleaner to keep it that way in the first step.

The Remove method hasn't been optimized since it's complicated,
and not been done in Dictionary either. Some other methods like
AddOrGetLocation or InternalIndexOf are not optimized since
they're only used in less-common scenarios, and not in the scope
of the current issue.

The benchmark shows there're obvious improvements in get and set
operations for all cases, especialy if the item is a value type.
I was actually expecting to see some pull back for "slower" case,
like if we use a custom comparer, but it turns out it's getting
faster too, since we have cached the comparer instead of using
_comparer directly in the loop.

The benchmark also shows the optimized HashSet has close or
better performance comparing to Dictionary in almost all cases,
except for HashSet<string> with default comparer. The default
comparer in Dictionary<string> is a "non-random" implementation
that performs better. Unfortunately the comparer is internal to
the CoreLib.

The optimized HashSet<string> is still faster since it's using
cached comparer of type EqualityComparer<string>, which uses
virtual method dispatching instead of interface dispatching. It's
just not as faster as Dictionary<string, T> yet.

Fix #36448

Similar optimization in Dictionary<,> has been ported to HashSet<>.

`_comparer` has been made nullable, and default by `null`. This
makes us possible to check if it's null for faster branches, instead
of comparing it to `EqualityComparer<T>.Default`. Since `_comparer`
is `null` by default, the constructors and the `OnDeserialization`
method have been modified accordingly to save `null` instead of the
`EqualityComparer<T>.Default` if it's been passed/saved. Also, the
places that use `_comparer` directly much do a null check and switch
to `EqualityComparer<T>.Default` instead.

The major optimization is under `Contains` and `AddIfNotPresent`
methods. Basically the code was:

``
// Code uses _comparer
``

Now it becomes:

```
var comparer = _comparer;
if (comparer == null)
{
  if (default(T) != null)
  {
    // Code uses EqualityComparer<T>.Default for de-virtualizing.
  }
  else
  {
    var defaultComparer = EqualityComparer<T>.Default;
    // Code uses cached defaultComparer
  }
}
else
{
  // Code uses comparer
}
```

I've intentionally kept the origin implementation and comments as
much as possible, and the optimized code is not exactly the same as
in `Dictionary<T>`. Further micro-optimizations might be possible
but I think it's cleaner to keep it that way in the first step.

The `Remove` method hasn't been optimized since it's complicated,
and not been done in `Dictionary` either. Some other methods like
`AddOrGetLocation` or `InternalIndexOf` are not optimized since
they're only used in less-common scenarios, and not in the scope
of the current issue.

The benchmark shows there're obvious improvements in get and set
operations for all cases, especialy if the item is a value type.
I was actually expecting to see some pull back for "slower" case,
like if we use a custom comparer, but it turns out it's getting
faster too, since we have cached the comparer instead of using
`_comparer` directly in the loop.

The benchmark also shows the optimized `HashSet` has close or
better performance comparing to `Dictionary` in almost all cases,
except for `HashSet<string>` with default comparer. The default
comparer in `Dictionary<string>` is a "non-random" implementation
that performs better. Unfortunately the comparer is internal to
the CoreLib.

The optimized `HashSet<string>` is still faster since it's using
cached comparer of type `EqualityComparer<string>`, which uses
virtual method dispatching instead of interface dispatching. It's
just not as faster as `Dictionary<string, T>` yet.

Fix #36448
@JeffreyZhao
Copy link
Contributor Author

The benchmark code can be found in https://github.com/JeffreyZhao/HashSetPerf, in which the FastHashSet is the optimized version. The benchmark compares performance among Dictionary<TKey, TValue>, current HashSet<T> and optimized FastHashSet<T>, and use HashSet<T> as the baseline.

Result:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.615 (1809/October2018Update/Redstone5)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview7-012821
  [Host]     : .NET Core 3.0.0-preview7-27912-14 (CoreCLR 4.700.19.32702, CoreFX 4.700.19.36209), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview7-27912-14 (CoreCLR 4.700.19.32702, CoreFX 4.700.19.36209), 64bit RyuJIT
Method Mean Error StdDev Median Ratio RatioSD
Get_StringDict 21.053 ns 0.2522 ns 0.2359 ns 20.999 ns 0.84 0.01
Get_StringSet 25.051 ns 0.2697 ns 0.2523 ns 24.971 ns 1.00 0.00
Get_StringFastSet 23.986 ns 0.1639 ns 0.1533 ns 23.997 ns 0.96 0.01
Get_IntDict 6.596 ns 0.0388 ns 0.0363 ns 6.597 ns 0.63 0.01
Get_IntSet 10.430 ns 0.2240 ns 0.2200 ns 10.377 ns 1.00 0.00
Get_IntFastSet 6.632 ns 0.0335 ns 0.0297 ns 6.638 ns 0.64 0.01
Get_ObjDict 16.235 ns 0.0976 ns 0.0865 ns 16.201 ns 0.82 0.02
Get_ObjSet 19.278 ns 0.3845 ns 0.7222 ns 19.200 ns 1.00 0.00
Get_ObjFastSet 17.266 ns 0.3439 ns 0.7178 ns 16.835 ns 0.90 0.05
Get_CustomDict 11.809 ns 0.5410 ns 0.5556 ns 11.587 ns 0.87 0.04
Get_CustomSet 13.536 ns 0.1752 ns 0.1638 ns 13.540 ns 1.00 0.00
Get_CustomFastSet 11.547 ns 0.1316 ns 0.1231 ns 11.573 ns 0.85 0.01
Set_StringDict 24.956 ns 0.4348 ns 0.4067 ns 24.889 ns 0.97 0.02
Set_StringSet 25.757 ns 0.2293 ns 0.2145 ns 25.667 ns 1.00 0.00
Set_StringFastSet 25.145 ns 0.3634 ns 0.3399 ns 25.057 ns 0.98 0.01
Set_IntDict 8.010 ns 0.0707 ns 0.0661 ns 7.988 ns 0.76 0.01
Set_IntSet 10.552 ns 0.0894 ns 0.0792 ns 10.537 ns 1.00 0.00
Set_IntFastSet 8.005 ns 0.1733 ns 0.5109 ns 8.203 ns 0.66 0.03
Set_ObjDict 19.934 ns 0.1394 ns 0.1236 ns 19.939 ns 1.08 0.02
Set_ObjSet 18.538 ns 0.2455 ns 0.2297 ns 18.520 ns 1.00 0.00
Set_ObjFastSet 16.306 ns 0.0373 ns 0.0291 ns 16.315 ns 0.88 0.01
Set_CustomDict 14.611 ns 0.0883 ns 0.0783 ns 14.610 ns 1.09 0.02
Set_CustomSet 13.434 ns 0.2412 ns 0.2138 ns 13.353 ns 1.00 0.00
Set_CustomFastSet 11.443 ns 0.1962 ns 0.1835 ns 11.479 ns 0.85 0.01

@JeffreyZhao
Copy link
Contributor Author

The check failed at "Send to Helix". Is it a glitch on check server? Can we re-run the checks?

@danmoseley
Copy link
Member

You should be able to see the test issue by clicking through the log. It's apparently unrelated Http failure. I'll open issue.

@danmoseley
Copy link
Member

@danmoseley
Copy link
Member

Thanks for hte contribution. Do we have sufficient micro benchmark coverage in https://github.com/dotnet/performance ? If not would you consider improving coverage there? This is how we typically measure and prevent regression.

cc @adamsitnik

@danmoseley
Copy link
Member

It would not be part of this change, but another pending backport to HashSet is dotnet/coreclr#23591.

@eiriktsarpalis eiriktsarpalis requested a review from safern August 8, 2019 13:57
@eiriktsarpalis eiriktsarpalis added this to the 5.0 milestone Aug 8, 2019
@JeffreyZhao
Copy link
Contributor Author

JeffreyZhao commented Aug 8, 2019

@danmosemsft: You should be able to see the test issue by clicking through the log. It's apparently unrelated Http failure. I'll open issue.

Thanks, I've noticed that now. BTW why it's not part of previous step (compile and test) but "Send to Helix", and what's that step?

Thanks for hte contribution. Do we have sufficient micro benchmark coverage in https://github.com/dotnet/performance ? If not would you consider improving coverage there? This is how we typically measure and prevent regression.

It seems we don't have any coverage for HashSet<> and there's only one case for Dictionary<int, int>.ContainsValue, which is a bit strange to me since it's one of the least used methods. Yes, I can add some for HashSet<> and probably Dictionary<,> too.

It would not be part of this change, but another pending backport to HashSet is dotnet/coreclr#23591.

Yes, I've noticed that too. That's what I mean by this:

I've intentionally kept the origin implementation and comments as
much as possible, and the optimized code is not exactly the same as
in Dictionary. Further micro-optimizations might be possible
but I think it's cleaner to keep it that way in the first step.

There're actually more optimizations can be put in HashSet<> like the one you mentioned, and also including the changes to other methods. My thought is to do it gradually so it's easier to review, test and merge. Those optimizations could be made later. Now I think it could be made after having more performance coverage.

private int _lastIndex;
private int _freeList;
private IEqualityComparer<T> _comparer = default!;
private IEqualityComparer<T>? _comparer = default!;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need the !? I don't think so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Actually the whole default! isn't required. Do you think I should fix it now, or commit with other changes (if any) when the review is complete? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it in a separate PR 😄 -- feel free to start preparing it.

@safern
Copy link
Member

safern commented Aug 8, 2019

the constructors and the OnDeserialization
method have been modified accordingly to save null instead of the
EqualityComparer.Default if it's been passed/saved.

@ViktorHofer would this break serialization scenarios?

@ViktorHofer
Copy link
Member

@ViktorHofer would this break serialization scenarios?

No it doesn't break serialization as during serialization the default equality comparer is used if the comparer is null. The description is a bit misleading but the change is fine.

After deserialization the `_comparer` will be set to a different
instance of default comparer, so we need to use `Equals` to compare
`_comparer` with `EqualityComparer<T>.Default` instead of reference
comparison by `==`.
Previously we tried to reset `_comparer` in `OnDeserialization` if
it's `EqualityComparer<T>.Default`. But after discussion in #40150
we decide to follow the same approach as in `Dictionary<,>` to make
it simple.
Slot[] slots = _slots;
// see note at "HashSet" level describing why "- 1" appears in for loop
for (int i = _buckets[hashCode % _buckets.Length] - 1; i >= 0; i = slots[i].next)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove extra line

if (slots[i].hashCode == hashCode && _comparer.Equals(slots[i].value, item))
int hashCode = item == null ? 0 : InternalGetHashCode(item.GetHashCode());

if (default(T)! != null) // TODO-NULLABLE: default(T) == null warning (https://github.com/dotnet/roslyn/issues/34757)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this trailing comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@safern, I don't have a strong opinion, but why do you want to remove the comment? We still have this in a bunch of places in coreclr... do you want to remove it from all of those (separate from this PR)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a new line added in this PR, I was suggesting to not include a comment based on past discussion here: #40651 (comment) -- but I don't have a strong opinion on this either.

If we'd like to remove all the comments and then address the removal of ! later I can happily submit a PR.

if (_buckets != null)
{
int hashCode = InternalGetHashCode(item);
IEqualityComparer<T> comparer = _comparer ?? EqualityComparer<T>.Default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these null checks introduced, impact perf?

Copy link
Contributor Author

@JeffreyZhao JeffreyZhao Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realize that the code here may not be optimal.

In Dictionary<T> it uses code like below to get hash code and compare two keys to leverage EqualityComparer<TKey>.Default intrinsic some other methods (e.g. Remove):

// get hash code
comparer?.GetHashCode(key) ?? key.GetHashCode()
// compare keys
comparer?.Equals(entry.key, key) ?? EqualityComparer<TKey>.Default.Equals(entry.key, key)

It seems like having saved one more indirection by checking if the comparer is null, but it also requires null check for each call. I'll do a quick benchmark to confirm this micro-optimization and submit another commit.

Copy link
Contributor Author

@JeffreyZhao JeffreyZhao Oct 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark calls GetHashCode once and Equals twice to simulate a lookup in methods like Remove.

[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(object))]
public class NullComparerPerf<T>
{
    private IEqualityComparer<T> _comparer;
    public T Value1;
    public T Value2;

    public T Key = Activator.CreateInstance<T>();

    [Benchmark]
    public int AssignDefault()
    {
        var comparer = _comparer ?? EqualityComparer<T>.Default;
        var hashCode = comparer.GetHashCode(Key);
        var b1 = comparer.Equals(Value1, Key);
        var b2 = comparer.Equals(Value2, Key);
        return hashCode;
    }

    [Benchmark]
    public int NullCheck()
    {
        var comparer = _comparer;
        var hashCode = comparer?.GetHashCode(Key) ?? Key.GetHashCode();
        var b1 = comparer?.Equals(Value1, Key) ?? EqualityComparer<T>.Default.Equals(Value1, Key);
        var b2 = comparer?.Equals(Value2, Key) ?? EqualityComparer<T>.Default.Equals(Value2, Key);
        return hashCode;
    }
}

Benchmark shows for int it takes 75% less time, but for object it takes 25% more:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
Type Method Mean Error StdDev
NullComparerPerf<Int32> AssignDefault 6.536 ns 0.0768 ns 0.0680 ns
NullComparerPerf<Int32> NullCheck 1.450 ns 0.0439 ns 0.0411 ns
NullComparerPerf<Object> AssignDefault 27.053 ns 0.2639 ns 0.2469 ns
NullComparerPerf<Object> NullCheck 33.090 ns 0.5051 ns 0.4477 ns

In theory we can do the same thing for Remove as in lookup but that's what Dictionary<,> currently doing so I just keep it the same.

Copy link
Member

@safern safern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. Are you including benchmark tests into https://github.com/dotnet/performance/tree/master/src/benchmarks?

I'd like to get @stephentoub input on this as well.

@safern
Copy link
Member

safern commented Oct 15, 2019

@JeffreyZhao any update on this?

@JeffreyZhao
Copy link
Contributor Author

Overall, looks good. Are you including benchmark tests into https://github.com/dotnet/performance/tree/master/src/benchmarks?
I'd like to get @stephentoub input on this as well.

@safern I'll move my benchmark to performance project when I finish learning how to do it properly.

@danmoseley
Copy link
Member

@JeffreyZhao thank you for continuing to drive this along, it is likely to be a significant customer benefit.

@danmoseley
Copy link
Member

danmoseley commented Oct 17, 2019

@adamsitnik for that question.

@JeffreyZhao
Copy link
Contributor Author

JeffreyZhao commented Oct 18, 2019

@danmosemsft @safern Find some more information regarding [IterationSetup] here: dotnet/BenchmarkDotNet#1003 (comment)

When you use [IterationSetup] BDN executes the benchmark once per iteration. If this a very short operation, the result is unstable. This is why we produce this warning.

Anyway, I've added some more benchmarks in my fork of dotnet/performance: https://github.com/JeffreyZhao/performance/tree/Collection-Remove-Benchmarks

I've added two new set of benchmarks (RemoveFalse and CopyAndRemove), mainly for HashSet<> and Dictionary<>, and I've pasted the result below.

I've also found the method call of InternalGetHashCode hasn't been inlined in Remove (thanks SharpLab!) so I've submitted another commit to put AggressiveInlining on InternalGetHashCode and InternalEquals methods. In Dictionary<T> it's manually inlined but I've extracted a helper method for less code duplications.

I've made three builds of CoreFx for benchmark:

  1. netcoreapp-original: Original implementation without this pull request.
  2. netcoreapp-hashset: HashSet optimization without InternalGetHashCode inlining.
  3. netcoreapp-hashset-inline: HashSet optimization with InternalGetHashCode inlining.
dotnet run -c Release -f netcoreapp3.0 --filter "*Remove*.HashSet" --join --coreRun \
    C:\Projects\dotnet\netcoreapp-original\coreRun.exe \
    C:\Projects\dotnet\netcoreapp-hashset\coreRun.exe \
    C:\Projects\dotnet\netcoreapp-hashset-inline\coreRun.exe
Expand to see benchmark results.
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  Job-QFCKJK : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.51701), X64 RyuJIT
  Job-TMFJJO : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.51001), X64 RyuJIT
  Job-FUBAAA : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.51701), X64 RyuJIT
Type Method Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
CopyAndRemove<Int32> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 8.050 us 0.1281 us 0.1198 us 8.004 us 7.881 us 8.288 us 0.78 0.01 4.5778 - - 8456 B
CopyAndRemove<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 9.102 us 0.0761 us 0.0674 us 9.102 us 8.999 us 9.198 us 0.88 0.01 4.5872 - - 8456 B
CopyAndRemove<Int32> HashSet \netcoreapp-original\coreRun.exe 512 10.296 us 0.1012 us 0.0897 us 10.303 us 10.154 us 10.449 us 1.00 0.00 4.5569 - - 8456 B
CopyAndRemove<String> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 32.356 us 0.3855 us 0.3606 us 32.289 us 31.874 us 33.163 us 1.12 0.01 3.4022 - - 10536 B
CopyAndRemove<String> HashSet \netcoreapp-hashset\coreRun.exe 512 31.714 us 0.2834 us 0.2512 us 31.795 us 31.297 us 32.074 us 1.10 0.01 3.3886 - - 10536 B
CopyAndRemove<String> HashSet \netcoreapp-original\coreRun.exe 512 28.788 us 0.3398 us 0.3178 us 28.656 us 28.427 us 29.370 us 1.00 0.00 3.4153 - - 10536 B
RemoveFalse<Int32> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 3.936 us 0.0349 us 0.0326 us 3.928 us 3.890 us 4.005 us 0.61 0.01 - - - -
RemoveFalse<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 5.583 us 0.1057 us 0.0989 us 5.559 us 5.434 us 5.748 us 0.87 0.02 - - - -
RemoveFalse<Int32> HashSet \netcoreapp-original\coreRun.exe 512 6.409 us 0.0622 us 0.0520 us 6.402 us 6.314 us 6.508 us 1.00 0.00 - - - -
RemoveFalse<String> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 18.432 us 0.0971 us 0.0861 us 18.421 us 18.277 us 18.547 us 0.88 0.01 - - - -
RemoveFalse<String> HashSet \netcoreapp-hashset\coreRun.exe 512 18.432 us 0.1500 us 0.1403 us 18.413 us 18.168 us 18.702 us 0.88 0.01 - - - -
RemoveFalse<String> HashSet \netcoreapp-original\coreRun.exe 512 20.980 us 0.1797 us 0.1501 us 20.969 us 20.669 us 21.303 us 1.00 0.00 - - - -
CreateAddAndRemove<Int32> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 18.468 us 0.3007 us 0.2813 us 18.482 us 18.024 us 19.119 us 0.76 0.02 8.8038 - - 27712 B
CreateAddAndRemove<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 20.677 us 0.2535 us 0.2371 us 20.721 us 20.226 us 20.996 us 0.85 0.01 8.8120 - - 27712 B
CreateAddAndRemove<Int32> HashSet \netcoreapp-original\coreRun.exe 512 24.407 us 0.3310 us 0.2934 us 24.458 us 23.820 us 24.916 us 1.00 0.00 8.7461 - - 27712 B
CreateAddAndRemove<String> HashSet \netcoreapp-hashset-inline\coreRun.exe 512 60.367 us 1.1194 us 1.0471 us 59.980 us 59.257 us 62.924 us 0.96 0.02 10.8902 - - 34481 B
CreateAddAndRemove<String> HashSet \netcoreapp-hashset\coreRun.exe 512 59.649 us 0.4124 us 0.3444 us 59.722 us 59.007 us 60.182 us 0.95 0.01 10.8173 - - 34480 B
CreateAddAndRemove<String> HashSet \netcoreapp-original\coreRun.exe 512 62.911 us 0.7728 us 0.7228 us 62.884 us 61.911 us 64.280 us 1.00 0.00 10.7932 - - 34480 B

@danmoseley
Copy link
Member

danmoseley commented Oct 18, 2019

Cc @adamsitnik

@safern
Copy link
Member

safern commented Oct 22, 2019

It seems like CopyAndRemove<String> regressed for both cases ~15-20% which I guess wouldn't be acceptable to take that regression as string is a very common case to use for a HashSet... do you know why is that regression?

@JeffreyZhao
Copy link
Contributor Author

JeffreyZhao commented Oct 24, 2019

@safern do you know why is that regression?

Yes, it's explained by the benchmark I posted above.

Basically the original code is like:

private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default; // not null

int hashCode = _comparer.GetHashCode(key); // GetHashCode
bool equals = _comparer.Equals(key, item); // Equals

And after the "improvement" (with regression for string) it's like:

private readonly IEqualityComparer<T>? _comparer; // nullable

var comparer = _comparer;
int hashCode = comparer?.GetHashCode(key) ?? key.GetHashCode(); // GetHashCode
bool equals = comparer?.Equals(key, item) ?? EqualityComparer<T>.Default.Equals(key, item); // Equals

For value types like int there's a saving about 75%, but for reference types like string there're simply more checks, which makes it 25% slower than original.

This code pattern is actually used by current Dictionary<,> implementation for methods like RemoveKey so I was assuming the trade-off between value types and reference types is acceptable. It also means Dictionary<,> also has the same problem and I would like to create a separated pull request for that.

There's no "quick" way to avoid the trade-off by minimum code changes, and initially this pull request is for get/set operations so the Remove method is out of scope (also Dictionary<,> hasn't done that for RemoveKey either). But I agree it's not good to introduce a regression here so I decided to apply the same optimization for Remove method, including the InternalIndexOf method used by TryGetValue method and others.

More benchmarks

Since I think the dotnet/performance project doesn't have enough benchmarks for Remove method (and none for TryGetValue method) of HashSet, I decided to fork and project and added several more.

https://github.com/JeffreyZhao/performance/tree/Improve-Collection-Benchmarks

I've added RemoveFalse and CopyAndRemove benchmarks for Remove method, and also create CustomObject (a reference type) and CustomValue (a value type) besides the existing int and string.

There're two reasons for adding those two types:

  1. I'd like to confirm the optimization also applies to custom types.
  2. I think string is not a good type for benchmark, as
    1. It's GetHashCode and Equals implementation has larger overhead than regular reference types, so we're not only benchmarking the usage of comparer.
    2. Type likes Dictionary<,> has special treatment for string, so string has a different routine than regular reference types.

Benchmark result for Remove method

In the following benchmarks netcoreapp-original is the build before the change and netcoreapp-hashset is the one with the HashSet optimizations.

dotnet run -c Release -f netcoreapp3.0 --filter *Remove*.HashSet --join --coreRun \
    C:\Projects\dotnet\netcoreapp-original\coreRun.exe \
    C:\Projects\dotnet\netcoreapp-hashset\coreRun.exe
Expand to see the benchmark result for Remove method.
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  Job-CHLVSZ : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.52301), X64 RyuJIT
  Job-EVQEJS : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.51701), X64 RyuJIT
Type Method Toolchain Size Mean Error StdDev Median Min Max Ratio Gen 0 Gen 1 Gen 2 Allocated
CopyAndRemove<CustomObject> HashSet \netcoreapp-hashset\coreRun.exe 512 14.798 us 0.1527 us 0.1428 us 14.774 us 14.552 us 15.149 us 0.98 3.4624 - - 10536 B
CopyAndRemove<CustomObject> HashSet \netcoreapp-original\coreRun.exe 512 15.093 us 0.1118 us 0.1045 us 15.099 us 14.883 us 15.267 us 1.00 3.4524 - - 10536 B
CopyAndRemove<CustomValue> HashSet \netcoreapp-hashset\coreRun.exe 512 6.422 us 0.0536 us 0.0501 us 6.426 us 6.331 us 6.495 us 0.67 4.5900 - - 8456 B
CopyAndRemove<CustomValue> HashSet \netcoreapp-original\coreRun.exe 512 9.660 us 0.0951 us 0.0843 us 9.676 us 9.537 us 9.813 us 1.00 4.5685 - - 8456 B
CopyAndRemove<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 6.655 us 0.0662 us 0.0619 us 6.652 us 6.561 us 6.770 us 0.70 4.5833 - - 8456 B
CopyAndRemove<Int32> HashSet \netcoreapp-original\coreRun.exe 512 9.470 us 0.1155 us 0.1080 us 9.474 us 9.321 us 9.702 us 1.00 4.5945 - - 8456 B
CopyAndRemove<String> HashSet \netcoreapp-hashset\coreRun.exe 512 28.187 us 0.2034 us 0.1903 us 28.169 us 27.834 us 28.509 us 1.04 3.4091 - - 10536 B
CopyAndRemove<String> HashSet \netcoreapp-original\coreRun.exe 512 27.064 us 0.2087 us 0.1952 us 27.031 us 26.824 us 27.377 us 1.00 3.4542 - - 10536 B
CreateAddAndRemove<CustomObject> HashSet \netcoreapp-hashset\coreRun.exe 512 32.130 us 0.3202 us 0.2839 us 32.176 us 31.388 us 32.539 us 0.95 10.9148 - - 34480 B
CreateAddAndRemove<CustomObject> HashSet \netcoreapp-original\coreRun.exe 512 33.922 us 0.3267 us 0.3056 us 34.029 us 33.462 us 34.349 us 1.00 10.9816 - - 34480 B
CreateAddAndRemove<CustomValue> HashSet \netcoreapp-hashset\coreRun.exe 512 16.088 us 0.1815 us 0.1515 us 16.134 us 15.771 us 16.313 us 0.70 8.7617 - - 27712 B
CreateAddAndRemove<CustomValue> HashSet \netcoreapp-original\coreRun.exe 512 23.018 us 0.2179 us 0.2038 us 23.010 us 22.690 us 23.398 us 1.00 8.7719 - - 27712 B
CreateAddAndRemove<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 16.182 us 0.1129 us 0.1056 us 16.178 us 15.981 us 16.388 us 0.71 8.7810 - - 27712 B
CreateAddAndRemove<Int32> HashSet \netcoreapp-original\coreRun.exe 512 22.651 us 0.2656 us 0.2485 us 22.562 us 22.274 us 23.200 us 1.00 8.7356 - - 27712 B
CreateAddAndRemove<String> HashSet \netcoreapp-hashset\coreRun.exe 512 56.573 us 0.5230 us 0.4892 us 56.661 us 55.265 us 57.185 us 1.01 10.8303 - - 34480 B
CreateAddAndRemove<String> HashSet \netcoreapp-original\coreRun.exe 512 55.952 us 0.6334 us 0.5925 us 56.232 us 54.980 us 56.727 us 1.00 10.8599 - - 34480 B
RemoveFalse<CustomObject> HashSet \netcoreapp-hashset\coreRun.exe 512 8.866 us 0.0474 us 0.0444 us 8.867 us 8.791 us 8.938 us 0.87 - - - -
RemoveFalse<CustomObject> HashSet \netcoreapp-original\coreRun.exe 512 10.191 us 0.1157 us 0.1082 us 10.148 us 10.058 us 10.385 us 1.00 - - - -
RemoveFalse<CustomValue> HashSet \netcoreapp-hashset\coreRun.exe 512 3.196 us 0.0306 us 0.0271 us 3.201 us 3.151 us 3.233 us 0.33 - - - -
RemoveFalse<CustomValue> HashSet \netcoreapp-original\coreRun.exe 512 9.779 us 0.1318 us 0.1233 us 9.767 us 9.633 us 10.031 us 1.00 - - - -
RemoveFalse<Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 3.235 us 0.0436 us 0.0408 us 3.243 us 3.165 us 3.294 us 0.47 - - - -
RemoveFalse<Int32> HashSet \netcoreapp-original\coreRun.exe 512 6.937 us 0.0672 us 0.0628 us 6.955 us 6.857 us 7.036 us 1.00 - - - -
RemoveFalse<String> HashSet \netcoreapp-hashset\coreRun.exe 512 20.515 us 0.2362 us 0.2209 us 20.584 us 20.143 us 20.829 us 1.00 0.00 - - -
RemoveFalse<String> HashSet \netcoreapp-original\coreRun.exe 512 20.817 us 0.1720 us 0.1609 us 20.833 us 20.534 us 21.116 us 1.01 0.01 - - -

The benchmark result shows all cases become faster with the optimization, except CopyAndRemove<String>, which is a little bit slower. I haven't had an answer for that, especially CopyAndRemove<CustomObject> is always faster than original. But even for CopyAndRemove<String> it's actually quite close, and I've seen it with smaller gap, or even a little bit faster in some runs. I'll keep looking into this.

Benchmark result for TryGetValue method

There's no benchmarks for TryGetValue for HashSet, although there're some for other collections like Dictionary<,>. I've added benchmarks into TryGetValueTrue and TryGetValueFalse for HashSet.

dotnet run -c Release -f netcoreapp3.0 --filter *TryGetValue*.HashSet --join --coreRun \
    C:\Projects\dotnet\netcoreapp-original\coreRun.exe \
    C:\Projects\dotnet\netcoreapp-hashset\coreRun.exe
Expand to see benchmark result for TryGetValue method.
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  Job-YSRJTI : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.52301), X64 RyuJIT
  Job-JTOUAA : .NET Core ? (CoreCLR 5.0.19.38002, CoreFX 5.0.19.51701), X64 RyuJIT
Type Method Toolchain Size Mean Error StdDev Median Min Max Ratio Gen 0 Gen 1 Gen 2 Allocated
TryGetValueFalse<CustomObject, CustomObject> HashSet \netcoreapp-hashset\coreRun.exe 512 11.983 us 0.0940 us 0.0833 us 11.971 us 11.880 us 12.115 us 0.90 - - - -
TryGetValueFalse<CustomObject, CustomObject> HashSet \netcoreapp-original\coreRun.exe 512 13.324 us 0.1154 us 0.1079 us 13.323 us 13.179 us 13.509 us 1.00 - - - -
TryGetValueFalse<CustomValue, CustomValue> HashSet \netcoreapp-hashset\coreRun.exe 512 6.120 us 0.1153 us 0.1078 us 6.113 us 5.945 us 6.278 us 0.63 - - - -
TryGetValueFalse<CustomValue, CustomValue> HashSet \netcoreapp-original\coreRun.exe 512 9.688 us 0.1215 us 0.1136 us 9.683 us 9.525 us 9.940 us 1.00 - - - -
TryGetValueFalse<Int32, Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 6.190 us 0.0730 us 0.0648 us 6.199 us 6.025 us 6.276 us 0.78 - - - -
TryGetValueFalse<Int32, Int32> HashSet \netcoreapp-original\coreRun.exe 512 7.913 us 0.0741 us 0.0656 us 7.898 us 7.835 us 8.048 us 1.00 - - - -
TryGetValueFalse<String, String> HashSet \netcoreapp-hashset\coreRun.exe 512 22.196 us 0.1752 us 0.1638 us 22.118 us 22.011 us 22.505 us 0.86 - - - -
TryGetValueFalse<String, String> HashSet \netcoreapp-original\coreRun.exe 512 25.755 us 0.1675 us 0.1485 us 25.719 us 25.496 us 26.012 us 1.00 - - - -
TryGetValueTrue<CustomObject, CustomObject> HashSet \netcoreapp-hashset\coreRun.exe 512 15.858 us 0.1389 us 0.1299 us 15.830 us 15.727 us 16.191 us 0.98 - - - -
TryGetValueTrue<CustomObject, CustomObject> HashSet \netcoreapp-original\coreRun.exe 512 16.138 us 0.1624 us 0.1520 us 16.160 us 15.836 us 16.384 us 1.00 - - - -
TryGetValueTrue<CustomValue, CustomValue> HashSet \netcoreapp-hashset\coreRun.exe 512 7.272 us 0.0645 us 0.0603 us 7.268 us 7.198 us 7.415 us 0.73 - - - -
TryGetValueTrue<CustomValue, CustomValue> HashSet \netcoreapp-original\coreRun.exe 512 9.973 us 0.1102 us 0.1031 us 9.955 us 9.817 us 10.141 us 1.00 - - - -
TryGetValueTrue<Int32, Int32> HashSet \netcoreapp-hashset\coreRun.exe 512 7.061 us 0.0953 us 0.0891 us 7.029 us 6.936 us 7.236 us 0.70 - - - -
TryGetValueTrue<Int32, Int32> HashSet \netcoreapp-original\coreRun.exe 512 10.078 us 0.0540 us 0.0505 us 10.074 us 9.998 us 10.161 us 1.00 - - - -
TryGetValueTrue<String, String> HashSet \netcoreapp-hashset\coreRun.exe 512 24.457 us 0.2453 us 0.2295 us 24.478 us 24.053 us 24.884 us 0.86 - - - -
TryGetValueTrue<String, String> HashSet \netcoreapp-original\coreRun.exe 512 28.455 us 0.2050 us 0.1818 us 28.420 us 28.172 us 28.839 us 1.00 - - - -

The benchmark result shows that it's always faster with the optimization, including the cases for string.

Before end

Currently only the SymmetricExceptWith has the non-optimized implementation. Do I need to do the same thing? It doesn't have benchmark and I'm quite sure it's not used a lot.

It a little bit strange to me that why there're several duplications in the code. For example, Contains has it's own implementation, which is so close to InternalIndexOf that it can be implemented as return InternalIndexOf(item) >= 0;. Probably it's a about saving a call instruction, isn't it?

@danmoseley
Copy link
Member

Probably it's a about saving a call instruction, isn't it?

Yes, Dictionary is so perf critical that the perf justifies that code duplication.

cc @adamsitnik for this PR. Thanks for continuing to move it forward.

@JeffreyZhao
Copy link
Contributor Author

JeffreyZhao commented Oct 25, 2019

(Previous comment removed as I realized I was running with a different data set.)

Today's runs of CopyAndRemove<String>.HashSet has different results. Here're the results from three consecutive runs:

dotnet run -c Release -f netcoreapp3.0 --filter "*CopyAndRemove<String>.HashSet" --join --coreRun \
    C:\Projects\dotnet\netcoreapp-original\coreRun.exe \
    C:\Projects\dotnet\netcoreapp-hashset\coreRun.exe
Expand to see benchmark results.
Method Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
HashSet \netcoreapp-hashset\coreRun.exe 512 26.75 us 0.526 us 0.492 us 26.65 us 25.95 us 27.71 us 0.93 0.02 3.4781 - - 10.29 KB
HashSet \netcoreapp-original\coreRun.exe 512 28.63 us 0.357 us 0.334 us 28.65 us 28.11 us 29.37 us 1.00 0.00 3.4722 - - 10.29 KB
Method Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
HashSet \netcoreapp-hashset\coreRun.exe 512 27.02 us 0.398 us 0.372 us 26.99 us 26.58 us 27.74 us 0.96 0.02 3.4722 - - 10.29 KB
HashSet \netcoreapp-original\coreRun.exe 512 28.22 us 0.310 us 0.275 us 28.16 us 27.94 us 28.95 us 1.00 0.00 3.4537 - - 10.29 KB
Method Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
HashSet \netcoreapp-hashset\coreRun.exe 512 26.76 us 0.400 us 0.374 us 26.67 us 26.12 us 27.48 us 0.94 0.02 3.4606 - - 10.29 KB
HashSet \netcoreapp-original\coreRun.exe 512 28.55 us 0.223 us 0.209 us 28.58 us 28.21 us 29.00 us 1.00 0.00 3.4594 - - 10.29 KB

We can see that the optimized version is always faster.

@danmoseley
Copy link
Member

@jkotas any feedback?

return hashCode & Lower31BitMask;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need really need to be marked with aggressive inlining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May not necessary for this one, but just keep it the same with other Internal* methods.

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private bool InternalEquals(T x, T y, IEqualityComparer<T>? comparer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrappers like this add overhead for larger structs; even with AggressiveInlining.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fine-tuned Dictionary implementation avoids them for a good reason, so you may want to avoid them too - if you are shooting for HashSet to be fine-tuned in the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Actually it's only used in SymmetricExceptWith now. All the other usages (in Remove, TryGetValue etc.) have been replace by the optimization. I'll manually inline the method.

@maryamariyan maryamariyan requested a review from safern November 1, 2019 16:46
@danmoseley
Copy link
Member

Thanks @JeffreyZhao for your work and patience here. Let's try to get this merged before 11/13 so we don't have to move it to the new repo: dotnet/announcements#133

Copy link
Member

@safern safern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @JeffreyZhao -- @jkotas do you have any other feedback.

/// <returns>hash code</returns>
private int InternalGetHashCode(T item)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int InternalGetHashCode(T item, IEqualityComparer<T>? comparer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrappers like this add overhead for larger structs; even with AggressiveInlining.

(similar comment as the other one you have fixed already)

@jkotas
Copy link
Member

jkotas commented Nov 6, 2019

Dictionary keeps the hashCode as unsigned to get another bit of performance. Do the same for HashSet?

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int InternalGetHashCode(int hashCode)
{
return hashCode & Lower31BitMask;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. this masking is not necessary once you switch to uint hashCode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas we mentioned that above, I believe @JeffreyZhao was thinking of looking at that next. #40106 (comment)

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me then.

@danmoseley danmoseley merged commit f9564cd into dotnet:master Nov 6, 2019
@danmoseley
Copy link
Member

Merged. Thank you @JeffreyZhao ! Normally PR's do not take so long, but we are specially careful with such core types that are performance sensitive. If you are still interested, it would be good to bring over dotnet/coreclr#23591

@JeffreyZhao
Copy link
Contributor Author

JeffreyZhao commented Nov 6, 2019

Thanks for everyone's help for reviewing!

@danmosemsft Yes, I will do the backport after the repository consolidation. There's actually another improvement on my list is to use ref to reduce boundary checks like in Dictionary, but it requires Unsafe.NullRef<T> as discussed in dotnet/corefx#41783. I'll do it when the new API is settled.

@danmoseley
Copy link
Member

Sounds good. Do you think we have all the right coverage in the perf repo now?

@JeffreyZhao
Copy link
Contributor Author

Sounds good. Do you think we have all the right coverage in the perf repo now?

I think we should add more cases for removal operations. I've added some two benchmarks RemoveFalse and CopyAndRemove, and two custom types CustomValue and CustomObject (which have been applied to other collection benchmarks) as String has additional noises to me for collection benchmarks, although I understand it's important to include those cases since they're fundamental.

I've made some changes into forked repository in Improve-Collection-Benchmarks branch. Do you think it's a good time to create new PR as the repo consolidation coming up? I don't find perf repo being mentioned in the announcement, so I assume it's not affected, is it?

@danmoseley
Copy link
Member

The performance repo is not part of the consolidation and will not be affected. Yes a PR there would be great.

@JeffreyZhao
Copy link
Contributor Author

@danmosemsft A PR (dotnet/performance#1010) has been created. Thanks

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hashset<T>.Contains twice slower than Dictionary<>.ContainsKey windows, 2.2.3

8 participants