
Optimize RedBlackTree-based immutable collections#7284

Merged
retronym merged 7 commits into scala:2.13.x from szeiger:wip/red-black-trees-perf4
Nov 1, 2018

Conversation

@szeiger
Contributor

@szeiger szeiger commented Sep 28, 2018

  • Some improved inlining in RedBlackTree
  • Remove unnecessary contains check before deleting from TreeSet or TreeMap
  • Avoid recreating wrapper objects when appending to and deleting from TreeSets and TreeMaps without a change
  • Bulk building / appending without creating intermediate TreeSets or TreeMaps
  • Building TreeSets and TreeMaps directly without a Builder
  • Use null instead of () as dummy value in TreeSet

Here's some performance data from Java 8 with HotSpot on x64. Getting reliable results was not always easy. I ran the benchmarks with `bench/jmh:run -jvmArgs "-Xms128M -Xmx128M" scala.collection.immutable.RedBlackTreeBenchmark`. Giving the JVM lots of memory led to unreliable data, with optimizations kicking in very late, only under GC pressure, so you'd need a huge number of warmup iterations.

Building from a source in the correct order and in random order, in 2.13.x:

[info] Benchmark                          (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.build             0  avgt   60       15.032 ±     0.084  ns/op
[info] RedBlackTreeBenchmark.build             1  avgt   60       24.781 ±     0.111  ns/op
[info] RedBlackTreeBenchmark.build            10  avgt   60      741.863 ±     4.457  ns/op
[info] RedBlackTreeBenchmark.build           100  avgt   60    11803.651 ±    68.646  ns/op
[info] RedBlackTreeBenchmark.build          1000  avgt   60   178889.209 ±   947.915  ns/op
[info] RedBlackTreeBenchmark.build         10000  avgt   60  2641918.688 ± 13482.549  ns/op
[info] RedBlackTreeBenchmark.buildRandom       0  avgt   60       27.135 ±     0.155  ns/op
[info] RedBlackTreeBenchmark.buildRandom       1  avgt   60       44.238 ±     0.583  ns/op
[info] RedBlackTreeBenchmark.buildRandom      10  avgt   60      706.818 ±    15.026  ns/op
[info] RedBlackTreeBenchmark.buildRandom     100  avgt   60    11097.576 ±    87.879  ns/op
[info] RedBlackTreeBenchmark.buildRandom    1000  avgt   60   242717.013 ±  1577.754  ns/op
[info] RedBlackTreeBenchmark.buildRandom   10000  avgt   60  3932147.592 ± 23255.622  ns/op

With this change (avoiding the unnecessary intermediate TreeSet objects):

[info] RedBlackTreeBenchmark.build             0  avgt   60        6.529 ±     0.036  ns/op
[info] RedBlackTreeBenchmark.build             1  avgt   60       11.644 ±     0.126  ns/op
[info] RedBlackTreeBenchmark.build            10  avgt   60      398.927 ±     2.394  ns/op
[info] RedBlackTreeBenchmark.build           100  avgt   60     8533.978 ±    50.420  ns/op
[info] RedBlackTreeBenchmark.build          1000  avgt   60   164994.754 ±  1869.387  ns/op
[info] RedBlackTreeBenchmark.build         10000  avgt   60  2253731.587 ± 11218.765  ns/op
[info] RedBlackTreeBenchmark.buildRandom       0  avgt   60       25.479 ±     0.160  ns/op
[info] RedBlackTreeBenchmark.buildRandom       1  avgt   60       32.923 ±     0.642  ns/op
[info] RedBlackTreeBenchmark.buildRandom      10  avgt   60      346.081 ±     2.612  ns/op
[info] RedBlackTreeBenchmark.buildRandom     100  avgt   60     8221.890 ±    43.662  ns/op
[info] RedBlackTreeBenchmark.buildRandom    1000  avgt   60   209767.283 ±  1639.279  ns/op
[info] RedBlackTreeBenchmark.buildRandom   10000  avgt   60  3668358.135 ± 51228.253  ns/op

Deleting all elements in random order in 2.13.x:

[info] RedBlackTreeBenchmark.drain             0  avgt   60        4.347 ±     0.032  ns/op
[info] RedBlackTreeBenchmark.drain             1  avgt   60       13.908 ±     0.078  ns/op
[info] RedBlackTreeBenchmark.drain            10  avgt   60      563.643 ±     3.989  ns/op
[info] RedBlackTreeBenchmark.drain           100  avgt   60    10659.737 ±    69.129  ns/op
[info] RedBlackTreeBenchmark.drain          1000  avgt   60   284227.686 ±  2179.154  ns/op
[info] RedBlackTreeBenchmark.drain         10000  avgt   60  4932893.507 ± 34018.351  ns/op

With this change (benefiting mostly from not performing the unnecessary contains checks):

[info] RedBlackTreeBenchmark.drain             0  avgt   60        4.334 ±     0.030  ns/op
[info] RedBlackTreeBenchmark.drain             1  avgt   60       13.726 ±     0.082  ns/op
[info] RedBlackTreeBenchmark.drain            10  avgt   60      486.583 ±     2.400  ns/op
[info] RedBlackTreeBenchmark.drain           100  avgt   60     9553.461 ±    54.927  ns/op
[info] RedBlackTreeBenchmark.drain          1000  avgt   60   220971.471 ±  1502.734  ns/op
[info] RedBlackTreeBenchmark.drain         10000  avgt   60  4115382.806 ± 27286.882  ns/op

Performance of iteration via iterator and foreach is unchanged.

I spent the better part of a week on this and the resulting set of changes is rather small. There's a lot more that I tried that didn't work:

  • Removing null trees and using a proper empty tree object doesn't give you any real advantage. The way the red-black tree algorithms work, you have to treat the empty tree specially in almost all cases, so you may as well check for null. And even if you add an empty tree object you'll still want to check for it by comparing with `eq empty`, because this is much faster than an isEmpty method on the tree itself.

  • Removing polymorphic dispatch of Tree methods by making Tree final is promising. You have to encode the color somehow if it's not encoded in the class anymore. Adding an extra field is not a good option. The basic Tree type has a size of 32 bytes with no wasted space, just like the original RedTree and BlackTree types. Adding an extra field would increase the memory footprint by 25% (assuming 64-bit alignment), so I used the (otherwise unused) sign bit of the count field for the color:

      final class Tree[A, +B](
        @(`inline` @getter) final val key: A,
        @(`inline` @getter) final val value: B,
        @(`inline` @getter) final val left: Tree[A, B],
        @(`inline` @getter) final val right: Tree[A, B],
        @(`inline` @getter) final val _count: Int) // negative for black trees, positive for red trees

    This works rather well in some cases:

    [info] RedBlackTreeBenchmark.buildRandom       0  avgt   16       24.148 ±     0.250  ns/op
    [info] RedBlackTreeBenchmark.buildRandom       1  avgt   16       37.722 ±     0.615  ns/op
    [info] RedBlackTreeBenchmark.buildRandom      10  avgt   16      378.835 ±     7.607  ns/op
    [info] RedBlackTreeBenchmark.buildRandom     100  avgt   16     7688.842 ±    93.652  ns/op
    [info] RedBlackTreeBenchmark.buildRandom    1000  avgt   16   186845.275 ±  3112.246  ns/op
    [info] RedBlackTreeBenchmark.buildRandom   10000  avgt   16  3210497.905 ± 50008.618  ns/op
    

    Other benchmarks, in particular for iteration, showed decreased performance. Looking back now at the results I got when I did these tests, it's possible that they were skewed by the late optimization without limiting the heap size. I'll do another proper test run tonight. Maybe this encoding is better after all.

  • Inlining some of the small methods in TreeIterator. This looks like an obvious win but it's not. Letting HotSpot take care of inlining is actually better in this case.

  • Iteration instead of tail recursion in foreach. We can't enforce tail recursion (marking the method @tailrec is not an option because it also recurses in non-tail position), but scalac still generates better code for the tail recursion than for an explicit loop.

  • Using a Java implementation of the tree classes. The Scala version with the @(`inline` @getter) annotations is clunky, but so is Java. Performance is exactly the same.
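The sign-bit color encoding described above can be sketched in isolation. This is a hypothetical helper (the names `ColorCount`, `black`, `red` are not from the PR), showing only the bit trick, not the tree itself:

```scala
// Sketch of the sign-bit color encoding: the magnitude of the count field
// is the subtree size, and the sign encodes the color (negative = black).
// Assumes size > 0, which holds because the empty tree is represented as null.
object ColorCount {
  def black(size: Int): Int = -size
  def red(size: Int): Int = size
  def isBlack(c: Int): Boolean = c < 0
  def size(c: Int): Int = math.abs(c)
}
```

This keeps the node at the same 32-byte footprint as the original RedTree/BlackTree classes, at the cost of one `math.abs` per size access.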

- Some improved inlining in RedBlackTree
- Remove unnecessary `contains` check before deleting from TreeSet or
  TreeMap
- Avoid recreating wrapper objects when appending to and deleting from
  TreeSets and TreeMaps without a change
- Bulk building / appending without creating intermediate TreeSets or
  TreeMaps
- Building TreeSets and TreeMaps directly without a Builder
- Use null instead of () as dummy value in TreeSet
@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Sep 28, 2018
val it = set1.iterator
var res = 0
while (it.hasNext)
  res += it.next()
Member

I'd be inclined to structure the test without any auto-unboxing, e.g. by using a TreeSet[String] and just feeding the strings into the Blackhole inside the loop. I don't think it will change the results of this benchmark, but it reduces the cognitive burden :)

override def concat(that: collection.IterableOnce[A]): TreeSet[A] = {
  val it = that.iterator
  var t = tree
  while (it.hasNext) t = RB.update(t, it.next(), null, overwrite = false)
Member

Can we do anything clever when `that` is another TreeSet? E.g. could TreeSet((0 to 63): _*) ++ TreeSet((64 to 128): _*) perform some structural sharing?

Contributor Author

Yes, there are more efficient algorithms for merging two trees or building from an ordered sequence (which should be possible to generalize to inserting into an existing tree from an ordered sequence).

Member

These optimizations will need a way to check that the operands share the same Ordering. ord1 eq ord2 is a sure way to compare them, but it will miss many opportunities. We could try to make == work for Ordering instances, e.g. by making TupleOrdering a case class.
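The case-class idea can be sketched like this (a hypothetical `PairOrdering` standing in for TupleOrdering, simplified to pairs of one type; not the library's code):

```scala
// Sketch: a case class gets structural equals/hashCode for free, so two
// independently constructed instances over the same element ordering
// compare equal with ==, even though they fail the eq reference check.
case class PairOrdering[A](ord: Ordering[A]) extends Ordering[(A, A)] {
  def compare(x: (A, A), y: (A, A)): Int = {
    val c = ord.compare(x._1, y._1)
    if (c != 0) c else ord.compare(x._2, y._2)
  }
}
```

Two trees built with independently derived orderings of this shape could then detect compatibility via ==, where eq would have forced the slow element-by-element path.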

Member

Mutable RBTrees will still be beneficial during building, where we only have a single Ordering instance.

Contributor Author

I spent the last 2 days implementing the algorithms from https://www.cs.cmu.edu/~guyb/papers/BFS16.pdf. The current version is very close to the paper without any optimizations for our data structures. Bulk union (or concat) for a few different scenarios combined (overlapping and non-overlapping sets of equal or different sizes) is already up to 20x faster (the larger the sets, the better).
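The structure of the paper's join-based union can be sketched on a plain unbalanced BST (hypothetical simplified types; the real algorithm's `join` additionally rebalances and maintains the red-black invariants):

```scala
sealed trait T
case object E extends T
final case class N(l: T, k: Int, r: T) extends T

def insert(t: T, k: Int): T = t match {
  case E => N(E, k, E)
  case N(l, m, r) =>
    if (k < m) N(insert(l, k), m, r)
    else if (k > m) N(l, m, insert(r, k))
    else t
}

def inorder(t: T): List[Int] = t match {
  case E => Nil
  case N(l, k, r) => inorder(l) ::: k :: inorder(r)
}

// Split t around k into (elements < k, k present?, elements > k).
def split(t: T, k: Int): (T, Boolean, T) = t match {
  case E => (E, false, E)
  case N(l, m, r) =>
    if (k < m) { val (ll, b, lr) = split(l, k); (ll, b, N(lr, m, r)) }
    else if (k > m) { val (rl, b, rr) = split(r, k); (N(l, m, rl), b, rr) }
    else (l, true, r)
}

// Stand-in for the paper's balancing join; here just a node constructor.
def join(l: T, k: Int, r: T): T = N(l, k, r)

// Union by splitting one tree around the other's root and recursing on
// both halves; the two recursive calls are independent, which is why the
// algorithm parallelizes so well.
def union(t1: T, t2: T): T = (t1, t2) match {
  case (E, _) => t2
  case (_, E) => t1
  case (N(l1, k, r1), _) =>
    val (l2, _, r2) = split(t2, k)
    join(union(l1, l2), k, union(r1, r2))
}
```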

Contributor Author

Make that 10x when comparing against this PR with the improved single-element concatenation.

Contributor

Wow, that's amazing! Nice work!

def newBuilder[K : Ordering, V]: Builder[(K, V), TreeMap[K, V]] =
  new ImmutableBuilder[(K, V), TreeMap[K, V]](empty) {
    def addOne(elem: (K, V)): this.type = { elems = elems + elem; this }
    override def addAll(xs: IterableOnce[(K, V)]): this.type = { elems = elems.concat(xs); this }
Member

We could go down the path of HashMapBuilder and defer creation of the collection that wraps the RBTree until result, and make addOne just call t = RB.update(t, k, v, overwrite = true) like you've done above.

Going further, we could even use mutation of the trees during building.

/cc @joshlemer
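The deferred-wrapping idea can be sketched with a hypothetical builder (a `List` stands in for the raw RB.Tree, since the internal tree API isn't available here; this is not the PR's code):

```scala
import scala.collection.immutable.TreeMap
import scala.collection.mutable.Builder

// Sketch: accumulate into a cheap intermediate structure and only create
// the TreeMap wrapper once, in result(), instead of re-wrapping the tree
// after every addOne.
class DeferredTreeMapBuilder[K: Ordering, V] extends Builder[(K, V), TreeMap[K, V]] {
  private var acc: List[(K, V)] = Nil // stand-in for the raw tree
  def addOne(elem: (K, V)): this.type = { acc = elem :: acc; this }
  def clear(): Unit = acc = Nil
  def result(): TreeMap[K, V] = TreeMap.from(acc)
}
```

In the real version, `addOne` would update the raw tree directly (as the comment suggests, t = RB.update(t, k, v, overwrite = true)), so no intermediate wrapper objects are allocated at all.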

Contributor Author

Do we have a good solution for the memory barrier problem in List and Vector yet? Switching to a mutable tree would get us into the same situation here.

Member

Yep, we're now using an explicit memory barrier in the builders of those. @joshlemer did the same recently for the Champ builders with great results.

I'd suggest that we at least pave the way for such a change by making the trees mutable with private[scala] setters in 2.13.0 (package private would be even nicer but would require us to write the data classes in Java).

@szeiger
Contributor Author

szeiger commented Oct 1, 2018

Here are the results from the alternate encoding. The performance regression in iterator that I saw earlier was indeed caused by bad measurement. Iteration with iterator and foreach is basically the same as in 2.13.x and in this PR. drain is a bit slower than the version in this PR:

[info] RedBlackTreeBenchmark.drain             0  avgt   60        4.375 ±     0.035  ns/op
[info] RedBlackTreeBenchmark.drain             1  avgt   60       13.876 ±     0.088  ns/op
[info] RedBlackTreeBenchmark.drain            10  avgt   60      503.655 ±     2.433  ns/op
[info] RedBlackTreeBenchmark.drain           100  avgt   60    10057.843 ±    57.836  ns/op
[info] RedBlackTreeBenchmark.drain          1000  avgt   60   239275.018 ±  1582.677  ns/op
[info] RedBlackTreeBenchmark.drain         10000  avgt   60  4291043.901 ± 25764.321  ns/op

Building in random order is faster than this PR, building in correct order is slower:

[info] RedBlackTreeBenchmark.build             0  avgt   60        6.511 ±     0.031  ns/op
[info] RedBlackTreeBenchmark.build             1  avgt   60       19.576 ±     0.090  ns/op
[info] RedBlackTreeBenchmark.build            10  avgt   60      431.087 ±     3.340  ns/op
[info] RedBlackTreeBenchmark.build           100  avgt   60     9434.790 ±    56.822  ns/op
[info] RedBlackTreeBenchmark.build          1000  avgt   60   186017.401 ±  1021.874  ns/op
[info] RedBlackTreeBenchmark.build         10000  avgt   60  2646389.650 ± 21761.592  ns/op
[info] RedBlackTreeBenchmark.buildRandom       0  avgt   60       25.820 ±     0.109  ns/op
[info] RedBlackTreeBenchmark.buildRandom       1  avgt   60       40.536 ±     0.269  ns/op
[info] RedBlackTreeBenchmark.buildRandom      10  avgt   60      406.591 ±     2.317  ns/op
[info] RedBlackTreeBenchmark.buildRandom     100  avgt   60     8314.624 ±    45.755  ns/op
[info] RedBlackTreeBenchmark.buildRandom    1000  avgt   60   195968.450 ±  1292.088  ns/op
[info] RedBlackTreeBenchmark.buildRandom   10000  avgt   60  3324854.275 ± 22224.404  ns/op

All in all it doesn't look like a worthwhile improvement.

@retronym
Member

retronym commented Oct 1, 2018

Removing polymorphic dispatch of Tree methods by making Tree final is promising.

If there are only two subclasses, JIT's bimorphic inlining will likely be just as good as making this monomorphic.

@szeiger
Contributor Author

szeiger commented Oct 2, 2018

Updated with better Builders (which avoid wrapping the RB.Tree between additions) and efficient bulk operations.

Here are the benchmark results for the bulk union operation. I didn't benchmark intersect and diff but they are also implemented and tested.

Baseline: the old version of this PR minus the concat override. This should be the same as, or slightly faster than, the current 2.13.x:

[info] Benchmark                    (size)  Mode  Cnt         Score       Error  Units
[info] RedBlackTreeBenchmark.union       0  avgt   20       137.777 ±     1.718  ns/op
[info] RedBlackTreeBenchmark.union       1  avgt   20       154.537 ±     3.338  ns/op
[info] RedBlackTreeBenchmark.union      10  avgt   20      1808.558 ±    15.092  ns/op
[info] RedBlackTreeBenchmark.union     100  avgt   20     39168.882 ±   205.147  ns/op
[info] RedBlackTreeBenchmark.union    1000  avgt   20    838237.278 ±  5720.816  ns/op
[info] RedBlackTreeBenchmark.union   10000  avgt   20  10802699.255 ± 90692.569  ns/op

Old version of this PR including the concat override to avoid building intermediate wrappers:

[info] Benchmark                    (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.union       0  avgt   20        9.278 ±     0.058  ns/op
[info] RedBlackTreeBenchmark.union       1  avgt   20       76.321 ±     0.498  ns/op
[info] RedBlackTreeBenchmark.union      10  avgt   20     1032.589 ±     7.778  ns/op
[info] RedBlackTreeBenchmark.union     100  avgt   20    20490.622 ±    74.912  ns/op
[info] RedBlackTreeBenchmark.union    1000  avgt   20   443673.045 ±  1663.212  ns/op
[info] RedBlackTreeBenchmark.union   10000  avgt   20  5678857.481 ± 30469.581  ns/op

Optimized bulk union:

[info] Benchmark                    (size)  Mode  Cnt       Score      Error  Units
[info] RedBlackTreeBenchmark.union       0  avgt   20       6.615 ±    0.058  ns/op
[info] RedBlackTreeBenchmark.union       1  avgt   20      40.921 ±    1.035  ns/op
[info] RedBlackTreeBenchmark.union      10  avgt   20    1274.996 ±    6.919  ns/op
[info] RedBlackTreeBenchmark.union     100  avgt   20   10662.666 ±   73.296  ns/op
[info] RedBlackTreeBenchmark.union    1000  avgt   20   59756.993 ±  685.840  ns/op
[info] RedBlackTreeBenchmark.union   10000  avgt   20  313465.974 ± 3380.362  ns/op

Note that this is single-threaded but the algorithms are highly scalable. This might be a useful addition to parallel-collections.

@retronym
Member

retronym commented Oct 2, 2018

@szeiger Great results!

It would also be useful to show the improvements in the gc.alloc.rate.norm counter by adding -prof gc to the JMH runs.

@retronym
Member

retronym commented Oct 2, 2018

I think it would be worthwhile for us to implement .equals on our standard Ordering instances and change the optimization herein to be based on == rather than eq. This would be needed for Option / Iterable / TupleN.

else new TreeMap(RB.delete(tree, key))
def remove(key: K): TreeMap[K,V] = {
  val t = RB.delete(tree, key)
  if (t eq tree) this else new TreeMap(t)
Member

Extract this to a helper newTreeMapOrSelf.

I think there are a few more places where it could be used: rangeImpl and range.
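The "return this if unchanged" pattern behind the suggested helper can be sketched on a simplified wrapper (hypothetical `IntTreeSet` with a `List` standing in for the underlying tree; not the PR's code):

```scala
// Sketch: a wrapper that only allocates a new instance when the underlying
// structure actually changed, detected via an eq reference check.
final class IntTreeSet private (private val elems: List[Int]) {
  private def newSetOrSelf(e: List[Int]): IntTreeSet =
    if (e eq elems) this else new IntTreeSet(e)

  def remove(x: Int): IntTreeSet =
    newSetOrSelf(if (elems.contains(x)) elems.filterNot(_ == x) else elems)

  def size: Int = elems.size
}

object IntTreeSet {
  def apply(xs: Int*): IntTreeSet = new IntTreeSet(xs.distinct.sorted.toList)
}
```

Removing an absent element returns the original instance, so no wrapper allocation occurs and callers can even detect the no-op with eq.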

@szeiger
Contributor Author

szeiger commented Oct 4, 2018

I was just looking at the possibility of using mutable fields during construction. The caveat is that we'd have to switch to the alternate encoding with a single Tree class; otherwise there is no way to change the color of a node, which is the most essential mutation. Without it, changing any of the other fields is not very useful.

- Optimize building TreeMap from SortedMap
- Optimize building TreeSet from SortedSet and Range
- Static value for Ordering.Int.reverse so we can check for it with
  eq and avoid allocating a new instance for each check
- Bug fix: Check for ordering when reusing existing TreeMap/TreeSet
- Compare Orderings with == instead of eq
@szeiger
Contributor Author

szeiger commented Oct 4, 2018

I just pushed another commit with further algorithmic improvements: we can build red-black trees very efficiently (in linear time, and fully immutably) from ordered sets/maps, including Range, which is effectively an ordered set.

Here are the new results for building from a Range:

[info] Benchmark                          (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.build             0  avgt   20       44.315 ±     0.402  ns/op
[info] RedBlackTreeBenchmark.build             1  avgt   20       52.186 ±     1.208  ns/op
[info] RedBlackTreeBenchmark.build            10  avgt   20      139.171 ±     1.224  ns/op
[info] RedBlackTreeBenchmark.build           100  avgt   20     1040.310 ±     8.710  ns/op
[info] RedBlackTreeBenchmark.build          1000  avgt   20    10074.338 ±   164.073  ns/op
[info] RedBlackTreeBenchmark.build         10000  avgt   20   114782.362 ±  3064.179  ns/op

For the largest size this is a 20x improvement over building by incremental insertion.
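The linear-time construction from an ordered source can be sketched by recursive halving on a plain BST (hypothetical simplified types; the PR's version additionally assigns colors so the result is a valid red-black tree):

```scala
sealed trait T
case object E extends T
final case class N(l: T, k: Int, r: T) extends T

// Build a balanced BST from already-sorted input. Each element is visited
// exactly once and no rebalancing is ever needed, hence linear time,
// versus O(n log n) for incremental insertion.
def fromSorted(a: Vector[Int]): T = {
  def go(lo: Int, hi: Int): T =
    if (lo > hi) E
    else {
      val mid = (lo + hi) >>> 1
      N(go(lo, mid - 1), a(mid), go(mid + 1, hi))
    }
  go(0, a.length - 1)
}

def inorder(t: T): List[Int] = t match {
  case E => Nil
  case N(l, k, r) => inorder(l) ::: k :: inorder(r)
}
```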

- `rebalance`, `compareDepth` and `NList` are no longer needed

- Use existing object where possible in TreeMap and TreeSet
@szeiger
Contributor Author

szeiger commented Oct 5, 2018

Ending week two of red-black tree optimizations with join-based reimplementations of the existing methods that used `rebalance`. This allows us to reduce the complexity a bit by removing the old rebalancing methods. Of course, it's also faster.

take before the latest commit:

[info] Benchmark                   (size)  Mode  Cnt     Score    Error  Units
[info] RedBlackTreeBenchmark.take       0  avgt   20   179.352 ±  3.123  ns/op
[info] RedBlackTreeBenchmark.take       1  avgt   20   207.764 ±  2.091  ns/op
[info] RedBlackTreeBenchmark.take      10  avgt   20   610.635 ±  6.173  ns/op
[info] RedBlackTreeBenchmark.take     100  avgt   20  1424.657 ± 15.368  ns/op
[info] RedBlackTreeBenchmark.take    1000  avgt   20  2902.166 ± 24.368  ns/op
[info] RedBlackTreeBenchmark.take   10000  avgt   20  5284.474 ± 54.396  ns/op

take after:

[info] Benchmark                   (size)  Mode  Cnt     Score    Error  Units
[info] RedBlackTreeBenchmark.take       0  avgt   20   182.284 ±  1.642  ns/op
[info] RedBlackTreeBenchmark.take       1  avgt   20   203.545 ±  4.658  ns/op
[info] RedBlackTreeBenchmark.take      10  avgt   20   499.038 ± 11.583  ns/op
[info] RedBlackTreeBenchmark.take     100  avgt   20   895.800 ± 10.562  ns/op
[info] RedBlackTreeBenchmark.take    1000  avgt   20  1587.725 ± 38.858  ns/op
[info] RedBlackTreeBenchmark.take   10000  avgt   20  2242.233 ± 31.515  ns/op

drop before:

[info] Benchmark                   (size)  Mode  Cnt     Score    Error  Units
[info] RedBlackTreeBenchmark.drop       0  avgt   20     3.399 ±  0.031  ns/op
[info] RedBlackTreeBenchmark.drop       1  avgt   20    29.445 ±  0.354  ns/op
[info] RedBlackTreeBenchmark.drop      10  avgt   20   424.322 ±  5.730  ns/op
[info] RedBlackTreeBenchmark.drop     100  avgt   20  1258.439 ± 13.895  ns/op
[info] RedBlackTreeBenchmark.drop    1000  avgt   20  2610.769 ± 35.346  ns/op
[info] RedBlackTreeBenchmark.drop   10000  avgt   20  4395.576 ± 36.938  ns/op

drop after:

[info] Benchmark                   (size)  Mode  Cnt     Score    Error  Units
[info] RedBlackTreeBenchmark.drop       0  avgt   20     3.382 ±  0.045  ns/op
[info] RedBlackTreeBenchmark.drop       1  avgt   20    40.590 ±  0.875  ns/op
[info] RedBlackTreeBenchmark.drop      10  avgt   20   343.517 ±  2.919  ns/op
[info] RedBlackTreeBenchmark.drop     100  avgt   20   866.352 ± 25.438  ns/op
[info] RedBlackTreeBenchmark.drop    1000  avgt   20  1492.308 ± 20.885  ns/op
[info] RedBlackTreeBenchmark.drop   10000  avgt   20  2142.520 ± 57.901  ns/op

slice before:

[info] Benchmark                    (size)  Mode  Cnt     Score     Error  Units
[info] RedBlackTreeBenchmark.slice       0  avgt   20   237.908 ±   2.343  ns/op
[info] RedBlackTreeBenchmark.slice       1  avgt   20   335.279 ±   3.081  ns/op
[info] RedBlackTreeBenchmark.slice      10  avgt   20  1435.217 ± 120.522  ns/op
[info] RedBlackTreeBenchmark.slice     100  avgt   20  2799.615 ±  29.352  ns/op
[info] RedBlackTreeBenchmark.slice    1000  avgt   20  4488.810 ± 107.237  ns/op
[info] RedBlackTreeBenchmark.slice   10000  avgt   20  6397.044 ±  96.573  ns/op

slice after:

[info] Benchmark                    (size)  Mode  Cnt     Score    Error  Units
[info] RedBlackTreeBenchmark.slice       0  avgt   20   232.865 ±  1.721  ns/op
[info] RedBlackTreeBenchmark.slice       1  avgt   20   342.482 ± 17.017  ns/op
[info] RedBlackTreeBenchmark.slice      10  avgt   20  1252.585 ±  8.490  ns/op
[info] RedBlackTreeBenchmark.slice     100  avgt   20  2372.730 ± 31.706  ns/op
[info] RedBlackTreeBenchmark.slice    1000  avgt   20  4053.049 ± 25.203  ns/op
[info] RedBlackTreeBenchmark.slice   10000  avgt   20  5717.099 ± 50.457  ns/op

range before:

[info] Benchmark                    (size)  Mode  Cnt      Score     Error  Units
[info] RedBlackTreeBenchmark.range       0  avgt   20      6.452 ±   0.041  ns/op
[info] RedBlackTreeBenchmark.range       1  avgt   20    559.005 ±  11.043  ns/op
[info] RedBlackTreeBenchmark.range      10  avgt   20   2259.307 ±  19.393  ns/op
[info] RedBlackTreeBenchmark.range     100  avgt   20   5882.585 ±  24.846  ns/op
[info] RedBlackTreeBenchmark.range    1000  avgt   20  10747.802 ± 202.006  ns/op
[info] RedBlackTreeBenchmark.range   10000  avgt   20  17898.227 ± 160.454  ns/op

range after:

[info] Benchmark                    (size)  Mode  Cnt      Score     Error  Units
[info] RedBlackTreeBenchmark.range       0  avgt   20      6.499 ±   0.049  ns/op
[info] RedBlackTreeBenchmark.range       1  avgt   20    557.199 ±  12.161  ns/op
[info] RedBlackTreeBenchmark.range      10  avgt   20   2247.534 ±  35.147  ns/op
[info] RedBlackTreeBenchmark.range     100  avgt   20   5486.243 ±  47.442  ns/op
[info] RedBlackTreeBenchmark.range    1000  avgt   20   8612.562 ±  82.543  ns/op
[info] RedBlackTreeBenchmark.range   10000  avgt   20  11663.786 ± 221.861  ns/op

- `filter` is based on `join`
- `transform` rebuilds the existing structure (which is never affected
  by the transformation)
@szeiger
Contributor Author

szeiger commented Oct 8, 2018

While looking at another PR that optimizes filter for HashMap, I realized that we don't have a red-black-tree-specific version, so I implemented one based on join. As usual, the performance difference is huge:

Before:

[info] Benchmark                            (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.filterDrop          0  avgt   20       27.744 ±     0.277  ns/op
[info] RedBlackTreeBenchmark.filterDrop          1  avgt   20       43.919 ±     0.301  ns/op
[info] RedBlackTreeBenchmark.filterDrop         10  avgt   20       79.747 ±     1.354  ns/op
[info] RedBlackTreeBenchmark.filterDrop        100  avgt   20      374.714 ±     1.503  ns/op
[info] RedBlackTreeBenchmark.filterDrop       1000  avgt   20     3110.857 ±    23.053  ns/op
[info] RedBlackTreeBenchmark.filterDrop      10000  avgt   20    37306.189 ±  1281.405  ns/op
[info] RedBlackTreeBenchmark.filterKeep          0  avgt   20       27.636 ±     0.352  ns/op
[info] RedBlackTreeBenchmark.filterKeep          1  avgt   20       66.643 ±     0.868  ns/op
[info] RedBlackTreeBenchmark.filterKeep         10  avgt   20      439.471 ±    19.799  ns/op
[info] RedBlackTreeBenchmark.filterKeep        100  avgt   20     8527.503 ±    41.508  ns/op
[info] RedBlackTreeBenchmark.filterKeep       1000  avgt   20   172379.713 ±  1408.782  ns/op
[info] RedBlackTreeBenchmark.filterKeep      10000  avgt   20  2313769.548 ± 10185.289  ns/op
[info] RedBlackTreeBenchmark.filterPartial       0  avgt   20       27.436 ±     0.244  ns/op
[info] RedBlackTreeBenchmark.filterPartial       1  avgt   20       43.689 ±     0.413  ns/op
[info] RedBlackTreeBenchmark.filterPartial      10  avgt   20      197.619 ±     2.691  ns/op
[info] RedBlackTreeBenchmark.filterPartial     100  avgt   20     3804.073 ±    26.264  ns/op
[info] RedBlackTreeBenchmark.filterPartial    1000  avgt   20    78183.333 ±   408.709  ns/op
[info] RedBlackTreeBenchmark.filterPartial   10000  avgt   20  1151320.092 ±  7191.067  ns/op

After:

[info] Benchmark                            (size)  Mode  Cnt       Score      Error  Units
[info] RedBlackTreeBenchmark.filterDrop          0  avgt   20       3.397 ±    0.031  ns/op
[info] RedBlackTreeBenchmark.filterDrop          1  avgt   20       7.898 ±    0.138  ns/op
[info] RedBlackTreeBenchmark.filterDrop         10  avgt   20      36.673 ±    0.325  ns/op
[info] RedBlackTreeBenchmark.filterDrop        100  avgt   20     306.876 ±    2.876  ns/op
[info] RedBlackTreeBenchmark.filterDrop       1000  avgt   20    4180.397 ±   69.837  ns/op
[info] RedBlackTreeBenchmark.filterDrop      10000  avgt   20   35805.440 ±  748.316  ns/op
[info] RedBlackTreeBenchmark.filterKeep          0  avgt   20       3.702 ±    0.321  ns/op
[info] RedBlackTreeBenchmark.filterKeep          1  avgt   20       4.832 ±    0.095  ns/op
[info] RedBlackTreeBenchmark.filterKeep         10  avgt   20      49.131 ±    0.695  ns/op
[info] RedBlackTreeBenchmark.filterKeep        100  avgt   20     385.484 ±    4.185  ns/op
[info] RedBlackTreeBenchmark.filterKeep       1000  avgt   20    5399.643 ±   55.518  ns/op
[info] RedBlackTreeBenchmark.filterKeep      10000  avgt   20   51359.206 ±  609.722  ns/op
[info] RedBlackTreeBenchmark.filterPartial       0  avgt   20       3.505 ±    0.037  ns/op
[info] RedBlackTreeBenchmark.filterPartial       1  avgt   20       8.367 ±    0.183  ns/op
[info] RedBlackTreeBenchmark.filterPartial      10  avgt   20     182.781 ±    1.791  ns/op
[info] RedBlackTreeBenchmark.filterPartial     100  avgt   20    1748.557 ±   21.808  ns/op
[info] RedBlackTreeBenchmark.filterPartial    1000  avgt   20   26803.034 ±  293.217  ns/op
[info] RedBlackTreeBenchmark.filterPartial   10000  avgt   20  255889.476 ± 3419.415  ns/op
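The shape of a join-based filter can be sketched on a plain BST (hypothetical simplified types; in the PR, the merge step is a balancing join that restores the red-black invariants):

```scala
sealed trait T
case object E extends T
final case class N(l: T, k: Int, r: T) extends T

// Stand-in for the balancing join: concatenate two trees where every key
// in l is smaller than every key in r.
def merge(l: T, r: T): T = l match {
  case E => r
  case N(ll, k, lr) => N(ll, k, merge(lr, r))
}

// Filter both subtrees, then either keep the root (reusing its key) or
// join the two filtered halves when the root is dropped.
def filter(t: T)(p: Int => Boolean): T = t match {
  case E => E
  case N(l, k, r) =>
    val fl = filter(l)(p)
    val fr = filter(r)(p)
    if (p(k)) N(fl, k, fr) else merge(fl, fr)
}

def inorder(t: T): List[Int] = t match {
  case E => Nil
  case N(l, k, r) => inorder(l) ::: k :: inorder(r)
}
```

This visits each node once, unlike filtering by repeated insertion into a fresh tree, which pays the full insertion cost per surviving element.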

We didn't have an optimized transform method either, even though it is trivial to implement and offers huge performance benefits, so I added one.

Before:

[info] Benchmark                            (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.transformAll        0  avgt   20       19.394 ±     0.368  ns/op
[info] RedBlackTreeBenchmark.transformAll        1  avgt   20       56.436 ±     0.661  ns/op
[info] RedBlackTreeBenchmark.transformAll       10  avgt   20      466.717 ±     6.555  ns/op
[info] RedBlackTreeBenchmark.transformAll      100  avgt   20     9562.954 ±   100.270  ns/op
[info] RedBlackTreeBenchmark.transformAll     1000  avgt   20   181299.465 ±  1556.040  ns/op
[info] RedBlackTreeBenchmark.transformAll    10000  avgt   20  2543622.330 ± 26071.983  ns/op

After:

[info] RedBlackTreeBenchmark.transformAll        0  avgt   20       3.476 ±    0.083  ns/op
[info] RedBlackTreeBenchmark.transformAll        1  avgt   20      16.578 ±    0.103  ns/op
[info] RedBlackTreeBenchmark.transformAll       10  avgt   20     115.138 ±    1.307  ns/op
[info] RedBlackTreeBenchmark.transformAll      100  avgt   20    1215.023 ±    7.633  ns/op
[info] RedBlackTreeBenchmark.transformAll     1000  avgt   20   14840.987 ±  214.410  ns/op
[info] RedBlackTreeBenchmark.transformAll    10000  avgt   20  149494.743 ± 1293.007  ns/op
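The idea behind the optimized transform can be sketched on a simplified tree (hypothetical types, not the PR's code): since the keys are untouched, the ordering, shape, and balance of the tree are all preserved, so each node is rebuilt directly with no rebalancing or comparisons.

```scala
sealed trait T[+B]
case object E extends T[Nothing]
final case class N[B](l: T[B], k: Int, v: B, r: T[B]) extends T[B]

// Rebuild the same shape, mapping only the values. No comparisons and no
// rebalancing are needed because the keys are unchanged.
def transform[B, C](t: T[B])(f: (Int, B) => C): T[C] = t match {
  case E => E
  case N(l, k, v, r) => N(transform(l)(f), k, f(k, v), transform(r)(f))
}
```

Contrast this with the generic fallback, which inserts every transformed pair into a fresh map and pays the full O(log n) insertion cost per element.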

@szeiger
Contributor Author

szeiger commented Oct 9, 2018

Here's an optimized partition, similar to filter. Before:

[info] Benchmark                        (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.partition       0  avgt   20       25.306 ±     0.395  ns/op
[info] RedBlackTreeBenchmark.partition       1  avgt   20       62.717 ±     1.446  ns/op
[info] RedBlackTreeBenchmark.partition      10  avgt   20      472.450 ±     9.537  ns/op
[info] RedBlackTreeBenchmark.partition     100  avgt   20     9288.349 ±   172.281  ns/op
[info] RedBlackTreeBenchmark.partition    1000  avgt   20   176483.704 ±  2749.917  ns/op
[info] RedBlackTreeBenchmark.partition   10000  avgt   20  2405263.189 ± 81249.379  ns/op

After:

[info] Benchmark                        (size)  Mode  Cnt       Score       Error  Units
[info] RedBlackTreeBenchmark.partition       0  avgt   20       7.508 ±     0.390  ns/op
[info] RedBlackTreeBenchmark.partition       1  avgt   20      12.835 ±     0.231  ns/op
[info] RedBlackTreeBenchmark.partition      10  avgt   20     312.599 ±     7.636  ns/op
[info] RedBlackTreeBenchmark.partition     100  avgt   20    3535.824 ±    30.640  ns/op
[info] RedBlackTreeBenchmark.partition    1000  avgt   20   49380.136 ±   474.097  ns/op
[info] RedBlackTreeBenchmark.partition   10000  avgt   20  521955.313 ± 30313.243  ns/op
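As a usage illustration (not the PR's internal code), the optimized path is exercised by the ordinary `partition` API on `TreeSet`; the speedup comes from building both result trees directly rather than routing every element through generic builders:

```scala
import scala.collection.immutable.TreeSet

val s = TreeSet(1 to 10: _*)

// partition splits the set in a single pass into the elements that
// satisfy the predicate and those that don't, preserving ordering.
val (evens, odds) = s.partition(_ % 2 == 0)

assert(evens == TreeSet(2, 4, 6, 8, 10))
assert(odds == TreeSet(1, 3, 5, 7, 9))
```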

And some smallish improvements to tail and init: instead of finding the minimum/maximum value and then deleting by comparing against it, they now simply drop the first or last element. Tested via tails. Before:

[info] Benchmark                    (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.tails       0  avgt   20       95.931 ±     1.730  ns/op
[info] RedBlackTreeBenchmark.tails       1  avgt   20      137.485 ±     1.197  ns/op
[info] RedBlackTreeBenchmark.tails      10  avgt   20      867.208 ±    11.842  ns/op
[info] RedBlackTreeBenchmark.tails     100  avgt   20    11614.839 ±    98.074  ns/op
[info] RedBlackTreeBenchmark.tails    1000  avgt   20   154580.335 ±  1684.961  ns/op
[info] RedBlackTreeBenchmark.tails   10000  avgt   20  2016243.981 ± 23762.612  ns/op

After:

[info] Benchmark                    (size)  Mode  Cnt        Score       Error  Units
[info] RedBlackTreeBenchmark.tails       0  avgt   20       96.593 ±     1.243  ns/op
[info] RedBlackTreeBenchmark.tails       1  avgt   20      126.222 ±     3.350  ns/op
[info] RedBlackTreeBenchmark.tails      10  avgt   20      760.793 ±    75.406  ns/op
[info] RedBlackTreeBenchmark.tails     100  avgt   20    10523.537 ±   180.894  ns/op
[info] RedBlackTreeBenchmark.tails    1000  avgt   20   141315.335 ±  1535.054  ns/op
[info] RedBlackTreeBenchmark.tails   10000  avgt   20  1768062.043 ± 31771.095  ns/op
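The observable behavior of tail and init is unchanged; only the internal route differs. A small sketch of the contract being preserved (user-level code, not the tree internals):

```scala
import scala.collection.immutable.TreeSet

val s = TreeSet(5, 1, 9, 3)

// tail drops the smallest element; init drops the largest.
// The optimization removes the leftmost/rightmost tree node directly
// instead of computing min/max and then deleting by key comparison.
assert(s.tail == TreeSet(3, 5, 9))
assert(s.init == TreeSet(1, 3, 5))

// equivalent to the old formulation, just cheaper:
assert(s.tail == s - s.min)
assert(s.init == s - s.max)
```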

@szeiger
Contributor Author

szeiger commented Oct 26, 2018

Any further comments on this? It would be very useful to have this code available for my HashSet / HashMap work.

@plokhotnyuk
Contributor

plokhotnyuk commented Oct 26, 2018

@szeiger please also test them on sizes beyond 10K, just to be sure there are no unexpected trends due to increasing memory access times, etc.

Just see how the RTree from the Archery library becomes inefficient at 1M+ sizes in the following benchmarks:

https://github.com/Sizmek/rtree2d

@szeiger
Contributor Author

szeiger commented Oct 29, 2018

I'll run it for some larger sizes overnight. I don't expect any surprises, though: the data structures are unchanged, and the new algorithms are much better in big-O terms.

@retronym
Member

The only thing missing IMO is a few tweaks to Ordering.{seqDerivedOrdering, OptionOrdering, Tuple{2..9}} to implement equals, which would increase the chance of hitting this PR's optimized paths guarded by that.ordering == this.ordering.
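To illustrate the point: two independently derived Orderings are distinct instances, so reference equality fails and the fast path is skipped unless equals is structural. A minimal sketch (hypothetical `PairOrdering`, not the actual Ordering implementations in the standard library):

```scala
// A derived Ordering with structural equality via a case class.
// Two instances built from equal underlying orderings compare equal,
// so a guard like `that.ordering == this.ordering` can succeed.
final case class PairOrdering[A](ord: Ordering[A]) extends Ordering[(A, A)] {
  def compare(x: (A, A), y: (A, A)): Int = {
    val c = ord.compare(x._1, y._1)
    if (c != 0) c else ord.compare(x._2, y._2)
  }
}

val o1 = PairOrdering(Ordering.Int)
val o2 = PairOrdering(Ordering.Int) // independently created

assert(o1 == o2)            // structural equals: fast path eligible
assert(o1.compare((1, 2), (1, 3)) < 0)
```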

@szeiger
Contributor Author

szeiger commented Oct 30, 2018

Here are the results for larger sizes (up to 10 million). I used a large 8G heap with 25 warmup iterations to get stable results. I did not run the buildRandom benchmarks because computing the random permutations as part of the test setup gets ridiculously slow at those sizes. All results show the old version (the last commit before this PR) first, then the latest version of this PR.

TL;DR: The new version is always faster.

Building: 41x faster at the largest size:

[info] Benchmark                      (size)  Mode  Cnt           Score           Error  Units
[info] RedBlackTreeBenchmark.build        10  avgt   15         711.188 ±        16.267  ns/op
[info] RedBlackTreeBenchmark.build       100  avgt   15       11811.523 ±       440.435  ns/op
[info] RedBlackTreeBenchmark.build      1000  avgt   15      204182.394 ±      4085.703  ns/op
[info] RedBlackTreeBenchmark.build     10000  avgt   15     3261718.627 ±    388003.229  ns/op
[info] RedBlackTreeBenchmark.build    100000  avgt   15    35664332.358 ±   6004775.132  ns/op
[info] RedBlackTreeBenchmark.build   1000000  avgt   15   427856968.756 ±  67740563.336  ns/op
[info] RedBlackTreeBenchmark.build  10000000  avgt   15  5536266625.467 ± 564677819.057  ns/op

[info] Benchmark                      (size)  Mode  Cnt          Score         Error  Units
[info] RedBlackTreeBenchmark.build        10  avgt   15        122.381 ±       1.905  ns/op
[info] RedBlackTreeBenchmark.build       100  avgt   15        914.045 ±      19.837  ns/op
[info] RedBlackTreeBenchmark.build      1000  avgt   15       9213.197 ±     116.961  ns/op
[info] RedBlackTreeBenchmark.build     10000  avgt   15     109886.411 ±    1837.554  ns/op
[info] RedBlackTreeBenchmark.build    100000  avgt   15    1101563.315 ±   10684.387  ns/op
[info] RedBlackTreeBenchmark.build   1000000  avgt   15    9727175.025 ±  166790.510  ns/op
[info] RedBlackTreeBenchmark.build  10000000  avgt   15  134869012.067 ± 7917895.235  ns/op

Union: 37x faster

[info] Benchmark                      (size)  Mode  Cnt            Score           Error  Units
[info] RedBlackTreeBenchmark.union        10  avgt   15         3075.611 ±       182.648  ns/op
[info] RedBlackTreeBenchmark.union       100  avgt   15        54458.791 ±       767.080  ns/op
[info] RedBlackTreeBenchmark.union      1000  avgt   15       990185.585 ±     12214.923  ns/op
[info] RedBlackTreeBenchmark.union     10000  avgt   15     11198945.377 ±    122246.220  ns/op
[info] RedBlackTreeBenchmark.union    100000  avgt   15    140288177.548 ±   2472651.497  ns/op
[info] RedBlackTreeBenchmark.union   1000000  avgt   15   1613232843.533 ±  22015103.344  ns/op
[info] RedBlackTreeBenchmark.union  10000000  avgt   15  20062114443.800 ± 540163081.598  ns/op

[info] Benchmark                      (size)  Mode  Cnt          Score          Error  Units
[info] RedBlackTreeBenchmark.union        10  avgt   15       1000.995 ±       11.054  ns/op
[info] RedBlackTreeBenchmark.union       100  avgt   15       6786.508 ±       69.587  ns/op
[info] RedBlackTreeBenchmark.union      1000  avgt   15      46789.052 ±      326.772  ns/op
[info] RedBlackTreeBenchmark.union     10000  avgt   15     325078.281 ±     5923.496  ns/op
[info] RedBlackTreeBenchmark.union    100000  avgt   15    4552864.379 ±    44598.637  ns/op
[info] RedBlackTreeBenchmark.union   1000000  avgt   15   28317370.454 ±   528436.049  ns/op
[info] RedBlackTreeBenchmark.union  10000000  avgt   15  537936256.367 ± 24724121.695  ns/op

Take: 2.9x faster

[info] RedBlackTreeBenchmark.take                 10  avgt   15         588.952 ±          8.385  ns/op
[info] RedBlackTreeBenchmark.take                100  avgt   15        1467.738 ±         20.907  ns/op
[info] RedBlackTreeBenchmark.take               1000  avgt   15        3452.878 ±        353.205  ns/op
[info] RedBlackTreeBenchmark.take              10000  avgt   15        5304.936 ±        116.766  ns/op
[info] RedBlackTreeBenchmark.take             100000  avgt   15        7180.425 ±        107.780  ns/op
[info] RedBlackTreeBenchmark.take            1000000  avgt   15       11654.052 ±        226.116  ns/op
[info] RedBlackTreeBenchmark.take           10000000  avgt   15       20099.258 ±        679.384  ns/op

[info] RedBlackTreeBenchmark.take                 10  avgt   15         475.869 ±        9.909  ns/op
[info] RedBlackTreeBenchmark.take                100  avgt   15         859.611 ±       12.787  ns/op
[info] RedBlackTreeBenchmark.take               1000  avgt   15        1516.373 ±       27.999  ns/op
[info] RedBlackTreeBenchmark.take              10000  avgt   15        2352.771 ±       48.708  ns/op
[info] RedBlackTreeBenchmark.take             100000  avgt   15        3687.224 ±       47.136  ns/op
[info] RedBlackTreeBenchmark.take            1000000  avgt   15        5162.635 ±       91.196  ns/op
[info] RedBlackTreeBenchmark.take           10000000  avgt   15        6866.289 ±      127.400  ns/op

Drop: 2.3x faster

[info] RedBlackTreeBenchmark.drop                 10  avgt   15         355.681 ±          3.967  ns/op
[info] RedBlackTreeBenchmark.drop                100  avgt   15        1017.634 ±          8.692  ns/op
[info] RedBlackTreeBenchmark.drop               1000  avgt   15        2066.727 ±         28.714  ns/op
[info] RedBlackTreeBenchmark.drop              10000  avgt   15        3144.502 ±         32.831  ns/op
[info] RedBlackTreeBenchmark.drop             100000  avgt   15        6517.609 ±         74.523  ns/op
[info] RedBlackTreeBenchmark.drop            1000000  avgt   15       10654.289 ±        338.586  ns/op
[info] RedBlackTreeBenchmark.drop           10000000  avgt   15       17479.694 ±        245.481  ns/op

[info] RedBlackTreeBenchmark.drop                 10  avgt   15         385.351 ±        4.447  ns/op
[info] RedBlackTreeBenchmark.drop                100  avgt   15         847.702 ±       10.112  ns/op
[info] RedBlackTreeBenchmark.drop               1000  avgt   15        1402.512 ±       36.043  ns/op
[info] RedBlackTreeBenchmark.drop              10000  avgt   15        2054.680 ±       28.120  ns/op
[info] RedBlackTreeBenchmark.drop             100000  avgt   15        3354.316 ±       73.667  ns/op
[info] RedBlackTreeBenchmark.drop            1000000  avgt   15        5054.424 ±      104.858  ns/op
[info] RedBlackTreeBenchmark.drop           10000000  avgt   15        7294.908 ±      527.479  ns/op

Filter: 94x (drop all) / 77x (keep all) / 9.4x (keep every 2nd) faster

[info] RedBlackTreeBenchmark.filterDrop           10  avgt   15          77.137 ±          0.859  ns/op
[info] RedBlackTreeBenchmark.filterDrop          100  avgt   15         394.258 ±          5.205  ns/op
[info] RedBlackTreeBenchmark.filterDrop         1000  avgt   15        6037.047 ±         76.269  ns/op
[info] RedBlackTreeBenchmark.filterDrop        10000  avgt   15       92324.402 ±       1384.441  ns/op
[info] RedBlackTreeBenchmark.filterDrop       100000  avgt   15     3563947.762 ±      32260.417  ns/op
[info] RedBlackTreeBenchmark.filterDrop      1000000  avgt   15     5206697.702 ±      62029.899  ns/op
[info] RedBlackTreeBenchmark.filterDrop     10000000  avgt   15    56957946.000 ±     704852.239  ns/op
[info] RedBlackTreeBenchmark.filterKeep           10  avgt   15         542.846 ±          5.867  ns/op
[info] RedBlackTreeBenchmark.filterKeep          100  avgt   15       11847.485 ±        166.753  ns/op
[info] RedBlackTreeBenchmark.filterKeep         1000  avgt   15      187869.106 ±       3130.338  ns/op
[info] RedBlackTreeBenchmark.filterKeep        10000  avgt   15     2610730.703 ±      37123.850  ns/op
[info] RedBlackTreeBenchmark.filterKeep       100000  avgt   15    31456141.300 ±     426157.656  ns/op
[info] RedBlackTreeBenchmark.filterKeep      1000000  avgt   15   364448080.911 ±    4665789.207  ns/op
[info] RedBlackTreeBenchmark.filterKeep     10000000  avgt   15  4602944495.200 ±  347751305.269  ns/op
[info] RedBlackTreeBenchmark.filterPartial        10  avgt   15         336.769 ±          5.955  ns/op
[info] RedBlackTreeBenchmark.filterPartial       100  avgt   15        4359.843 ±         55.784  ns/op
[info] RedBlackTreeBenchmark.filterPartial      1000  avgt   15       82997.391 ±       1942.645  ns/op
[info] RedBlackTreeBenchmark.filterPartial     10000  avgt   15     1204650.389 ±      20353.520  ns/op
[info] RedBlackTreeBenchmark.filterPartial    100000  avgt   15    15320701.346 ±     248675.879  ns/op
[info] RedBlackTreeBenchmark.filterPartial   1000000  avgt   15   177109384.933 ±    2406252.011  ns/op
[info] RedBlackTreeBenchmark.filterPartial  10000000  avgt   15  2566722109.733 ±  927790021.482  ns/op


[info] RedBlackTreeBenchmark.filterDrop           10  avgt   15          34.267 ±        0.356  ns/op
[info] RedBlackTreeBenchmark.filterDrop          100  avgt   15         366.507 ±        8.824  ns/op
[info] RedBlackTreeBenchmark.filterDrop         1000  avgt   15        3259.273 ±       45.655  ns/op
[info] RedBlackTreeBenchmark.filterDrop        10000  avgt   15       46205.187 ±      731.185  ns/op
[info] RedBlackTreeBenchmark.filterDrop       100000  avgt   15      606050.382 ±    12799.063  ns/op
[info] RedBlackTreeBenchmark.filterDrop      1000000  avgt   15     9205635.558 ±   117597.917  ns/op
[info] RedBlackTreeBenchmark.filterDrop     10000000  avgt   15    53105538.738 ±   783707.233  ns/op
[info] RedBlackTreeBenchmark.filterKeep           10  avgt   15          43.944 ±        0.957  ns/op
[info] RedBlackTreeBenchmark.filterKeep          100  avgt   15         467.477 ±        7.480  ns/op
[info] RedBlackTreeBenchmark.filterKeep         1000  avgt   15        4141.684 ±       89.489  ns/op
[info] RedBlackTreeBenchmark.filterKeep        10000  avgt   15       54223.368 ±      971.947  ns/op
[info] RedBlackTreeBenchmark.filterKeep       100000  avgt   15      753479.349 ±    24517.525  ns/op
[info] RedBlackTreeBenchmark.filterKeep      1000000  avgt   15    10788600.997 ±   158311.845  ns/op
[info] RedBlackTreeBenchmark.filterKeep     10000000  avgt   15    59995034.246 ±  1005246.037  ns/op
[info] RedBlackTreeBenchmark.filterPartial        10  avgt   15         208.931 ±        2.314  ns/op
[info] RedBlackTreeBenchmark.filterPartial       100  avgt   15        1733.312 ±       20.788  ns/op
[info] RedBlackTreeBenchmark.filterPartial      1000  avgt   15       24077.149 ±      372.246  ns/op
[info] RedBlackTreeBenchmark.filterPartial     10000  avgt   15      236349.654 ±     4150.876  ns/op
[info] RedBlackTreeBenchmark.filterPartial    100000  avgt   15     1993738.380 ±    24857.514  ns/op
[info] RedBlackTreeBenchmark.filterPartial   1000000  avgt   15    27439847.624 ±   465332.576  ns/op
[info] RedBlackTreeBenchmark.filterPartial  10000000  avgt   15   271675590.117 ±  9560996.478  ns/op

Foreach: 1.09x faster

[info] RedBlackTreeBenchmark.foreach              10  avgt   15          41.365 ±          0.740  ns/op
[info] RedBlackTreeBenchmark.foreach             100  avgt   15         333.853 ±          3.033  ns/op
[info] RedBlackTreeBenchmark.foreach            1000  avgt   15        6804.852 ±         92.705  ns/op
[info] RedBlackTreeBenchmark.foreach           10000  avgt   15       93873.481 ±       1120.144  ns/op
[info] RedBlackTreeBenchmark.foreach          100000  avgt   15     3496643.943 ±      31690.742  ns/op
[info] RedBlackTreeBenchmark.foreach         1000000  avgt   15     4587336.564 ±      72821.358  ns/op
[info] RedBlackTreeBenchmark.foreach        10000000  avgt   15    51204380.753 ±     413331.600  ns/op

[info] RedBlackTreeBenchmark.foreach              10  avgt   15          42.673 ±        0.855  ns/op
[info] RedBlackTreeBenchmark.foreach             100  avgt   15         314.313 ±        3.308  ns/op
[info] RedBlackTreeBenchmark.foreach            1000  avgt   15        3759.742 ±       60.878  ns/op
[info] RedBlackTreeBenchmark.foreach           10000  avgt   15       43853.552 ±      588.069  ns/op
[info] RedBlackTreeBenchmark.foreach          100000  avgt   15      613734.402 ±    10589.963  ns/op
[info] RedBlackTreeBenchmark.foreach         1000000  avgt   15    12682488.842 ±   256017.234  ns/op
[info] RedBlackTreeBenchmark.foreach        10000000  avgt   15    47114846.013 ±   805289.152  ns/op

Iterator: 1.04x faster

[info] RedBlackTreeBenchmark.iterator             10  avgt   15          51.411 ±          0.936  ns/op
[info] RedBlackTreeBenchmark.iterator            100  avgt   15         428.030 ±          7.416  ns/op
[info] RedBlackTreeBenchmark.iterator           1000  avgt   15        6782.706 ±         90.232  ns/op
[info] RedBlackTreeBenchmark.iterator          10000  avgt   15       96244.361 ±       1258.228  ns/op
[info] RedBlackTreeBenchmark.iterator         100000  avgt   15     3693013.368 ±      32767.022  ns/op
[info] RedBlackTreeBenchmark.iterator        1000000  avgt   15     5104430.801 ±      58180.366  ns/op
[info] RedBlackTreeBenchmark.iterator       10000000  avgt   15    53395176.552 ±     482914.302  ns/op


[info] RedBlackTreeBenchmark.iterator             10  avgt   15          50.772 ±        1.233  ns/op
[info] RedBlackTreeBenchmark.iterator            100  avgt   15         362.707 ±        4.736  ns/op
[info] RedBlackTreeBenchmark.iterator           1000  avgt   15        3659.058 ±       82.604  ns/op
[info] RedBlackTreeBenchmark.iterator          10000  avgt   15       48327.687 ±      891.648  ns/op
[info] RedBlackTreeBenchmark.iterator         100000  avgt   15      708936.145 ±    14311.016  ns/op
[info] RedBlackTreeBenchmark.iterator        1000000  avgt   15    14341135.966 ±   158616.818  ns/op
[info] RedBlackTreeBenchmark.iterator       10000000  avgt   15    51173130.012 ±   653866.954  ns/op

Partition: 8.1x faster

[info] RedBlackTreeBenchmark.partition            10  avgt   15         690.260 ±          9.064  ns/op
[info] RedBlackTreeBenchmark.partition           100  avgt   15       10522.897 ±        107.106  ns/op
[info] RedBlackTreeBenchmark.partition          1000  avgt   15      164256.745 ±       1749.761  ns/op
[info] RedBlackTreeBenchmark.partition         10000  avgt   15     2246946.348 ±      35365.631  ns/op
[info] RedBlackTreeBenchmark.partition        100000  avgt   15    28960726.879 ±     320819.579  ns/op
[info] RedBlackTreeBenchmark.partition       1000000  avgt   15   329285443.722 ±    4876990.539  ns/op
[info] RedBlackTreeBenchmark.partition      10000000  avgt   15  4163763834.533 ±  273836350.106  ns/op

[info] RedBlackTreeBenchmark.partition            10  avgt   15         337.334 ±        6.140  ns/op
[info] RedBlackTreeBenchmark.partition           100  avgt   15        3523.397 ±       51.642  ns/op
[info] RedBlackTreeBenchmark.partition          1000  avgt   15       50217.933 ±      738.360  ns/op
[info] RedBlackTreeBenchmark.partition         10000  avgt   15      484961.665 ±     8294.156  ns/op
[info] RedBlackTreeBenchmark.partition        100000  avgt   15     3916470.169 ±   100548.868  ns/op
[info] RedBlackTreeBenchmark.partition       1000000  avgt   15    48870225.771 ±   802044.531  ns/op
[info] RedBlackTreeBenchmark.partition      10000000  avgt   15   512970638.656 ± 26196200.613  ns/op

Range: 2.1x faster

[info] RedBlackTreeBenchmark.range                10  avgt   15        2194.451 ±         23.759  ns/op
[info] RedBlackTreeBenchmark.range               100  avgt   15        5610.379 ±         68.131  ns/op
[info] RedBlackTreeBenchmark.range              1000  avgt   15       10903.010 ±        158.836  ns/op
[info] RedBlackTreeBenchmark.range             10000  avgt   15       17789.736 ±        201.725  ns/op
[info] RedBlackTreeBenchmark.range            100000  avgt   15       26826.856 ±        373.897  ns/op
[info] RedBlackTreeBenchmark.range           1000000  avgt   15       41769.107 ±        617.127  ns/op
[info] RedBlackTreeBenchmark.range          10000000  avgt   15       64698.003 ±        605.958  ns/op

[info] RedBlackTreeBenchmark.range                10  avgt   15        2299.682 ±       49.049  ns/op
[info] RedBlackTreeBenchmark.range               100  avgt   15        5632.357 ±       87.825  ns/op
[info] RedBlackTreeBenchmark.range              1000  avgt   15        8746.547 ±      165.859  ns/op
[info] RedBlackTreeBenchmark.range             10000  avgt   15       12314.250 ±      173.888  ns/op
[info] RedBlackTreeBenchmark.range            100000  avgt   15       16993.345 ±      292.728  ns/op
[info] RedBlackTreeBenchmark.range           1000000  avgt   15       22547.079 ±      384.337  ns/op
[info] RedBlackTreeBenchmark.range          10000000  avgt   15       30259.189 ±      574.381  ns/op

Slice: 2.0x faster

[info] RedBlackTreeBenchmark.slice                10  avgt   15        1215.410 ±         19.951  ns/op
[info] RedBlackTreeBenchmark.slice               100  avgt   15        2508.422 ±         40.045  ns/op
[info] RedBlackTreeBenchmark.slice              1000  avgt   15        5056.398 ±         51.528  ns/op
[info] RedBlackTreeBenchmark.slice             10000  avgt   15        8373.748 ±        101.898  ns/op
[info] RedBlackTreeBenchmark.slice            100000  avgt   15       13536.096 ±        238.148  ns/op
[info] RedBlackTreeBenchmark.slice           1000000  avgt   15       21628.415 ±        275.120  ns/op
[info] RedBlackTreeBenchmark.slice          10000000  avgt   15       32831.129 ±        357.205  ns/op

[info] RedBlackTreeBenchmark.slice                10  avgt   15        1202.986 ±       14.496  ns/op
[info] RedBlackTreeBenchmark.slice               100  avgt   15        2376.873 ±       43.072  ns/op
[info] RedBlackTreeBenchmark.slice              1000  avgt   15        4124.831 ±       70.123  ns/op
[info] RedBlackTreeBenchmark.slice             10000  avgt   15        5753.464 ±      109.260  ns/op
[info] RedBlackTreeBenchmark.slice            100000  avgt   15        8692.549 ±      151.454  ns/op
[info] RedBlackTreeBenchmark.slice           1000000  avgt   15       12366.837 ±      186.974  ns/op
[info] RedBlackTreeBenchmark.slice          10000000  avgt   15       16520.294 ±      300.931  ns/op

Transform: 27x (modify all), 26x (modify none), 30x (modify half) faster

[info] RedBlackTreeBenchmark.transformAll         10  avgt   15         466.315 ±          5.412  ns/op
[info] RedBlackTreeBenchmark.transformAll        100  avgt   15       11879.897 ±       1035.414  ns/op
[info] RedBlackTreeBenchmark.transformAll       1000  avgt   15      179692.888 ±       4366.262  ns/op
[info] RedBlackTreeBenchmark.transformAll      10000  avgt   15     2591993.564 ±     100518.377  ns/op
[info] RedBlackTreeBenchmark.transformAll     100000  avgt   15    30560053.012 ±     752721.186  ns/op
[info] RedBlackTreeBenchmark.transformAll    1000000  avgt   15   373663706.800 ±   12210085.004  ns/op
[info] RedBlackTreeBenchmark.transformAll   10000000  avgt   15  5096543713.733 ±  580711658.315  ns/op
[info] RedBlackTreeBenchmark.transformHalf        10  avgt   15         509.005 ±          6.457  ns/op
[info] RedBlackTreeBenchmark.transformHalf       100  avgt   15       12286.900 ±       1482.227  ns/op
[info] RedBlackTreeBenchmark.transformHalf      1000  avgt   15      211760.336 ±      13159.858  ns/op
[info] RedBlackTreeBenchmark.transformHalf     10000  avgt   15     2455527.897 ±      47906.187  ns/op
[info] RedBlackTreeBenchmark.transformHalf    100000  avgt   15    29293153.854 ±    1043991.786  ns/op
[info] RedBlackTreeBenchmark.transformHalf   1000000  avgt   15   384508388.956 ±   32539594.505  ns/op
[info] RedBlackTreeBenchmark.transformHalf  10000000  avgt   15  5876466576.467 ± 1356520421.616  ns/op
[info] RedBlackTreeBenchmark.transformNone        10  avgt   15         555.565 ±         47.950  ns/op
[info] RedBlackTreeBenchmark.transformNone       100  avgt   15       11093.593 ±        415.182  ns/op
[info] RedBlackTreeBenchmark.transformNone      1000  avgt   15      188302.505 ±       3370.389  ns/op
[info] RedBlackTreeBenchmark.transformNone     10000  avgt   15     2466951.080 ±      31336.642  ns/op
[info] RedBlackTreeBenchmark.transformNone    100000  avgt   15    29467337.032 ±     467481.806  ns/op
[info] RedBlackTreeBenchmark.transformNone   1000000  avgt   15   354662044.044 ±   14079646.586  ns/op
[info] RedBlackTreeBenchmark.transformNone  10000000  avgt   15  5152875076.000 ±  807901572.532  ns/op

[info] RedBlackTreeBenchmark.transformAll         10  avgt   15         113.746 ±        2.081  ns/op
[info] RedBlackTreeBenchmark.transformAll        100  avgt   15        1148.052 ±       22.343  ns/op
[info] RedBlackTreeBenchmark.transformAll       1000  avgt   15       14472.771 ±      273.904  ns/op
[info] RedBlackTreeBenchmark.transformAll      10000  avgt   15      142569.819 ±     2015.026  ns/op
[info] RedBlackTreeBenchmark.transformAll     100000  avgt   15     1590041.460 ±    20676.668  ns/op
[info] RedBlackTreeBenchmark.transformAll    1000000  avgt   15    16521935.629 ±   345793.448  ns/op
[info] RedBlackTreeBenchmark.transformAll   10000000  avgt   15   188981515.657 ± 12516431.468  ns/op
[info] RedBlackTreeBenchmark.transformHalf        10  avgt   15         124.066 ±        2.717  ns/op
[info] RedBlackTreeBenchmark.transformHalf       100  avgt   15        1205.826 ±       33.073  ns/op
[info] RedBlackTreeBenchmark.transformHalf      1000  avgt   15       14175.281 ±      289.948  ns/op
[info] RedBlackTreeBenchmark.transformHalf     10000  avgt   15      139875.767 ±     2637.454  ns/op
[info] RedBlackTreeBenchmark.transformHalf    100000  avgt   15     1582776.796 ±    31784.884  ns/op
[info] RedBlackTreeBenchmark.transformHalf   1000000  avgt   15    16369106.242 ±   331717.221  ns/op
[info] RedBlackTreeBenchmark.transformHalf  10000000  avgt   15   198291535.253 ± 14356862.064  ns/op
[info] RedBlackTreeBenchmark.transformNone        10  avgt   15          73.918 ±        0.725  ns/op
[info] RedBlackTreeBenchmark.transformNone       100  avgt   15         732.557 ±       12.134  ns/op
[info] RedBlackTreeBenchmark.transformNone      1000  avgt   15       13805.357 ±      323.921  ns/op
[info] RedBlackTreeBenchmark.transformNone     10000  avgt   15      144441.404 ±     3046.102  ns/op
[info] RedBlackTreeBenchmark.transformNone    100000  avgt   15     1594758.263 ±    33088.198  ns/op
[info] RedBlackTreeBenchmark.transformNone   1000000  avgt   15    16426482.772 ±   481371.058  ns/op
[info] RedBlackTreeBenchmark.transformNone  10000000  avgt   15   194896686.051 ± 15185294.678  ns/op

Tails: 1.3x faster

[info] RedBlackTreeBenchmark.tails                10  avgt   15         837.375 ±          9.579  ns/op
[info] RedBlackTreeBenchmark.tails               100  avgt   15       12082.553 ±        160.401  ns/op
[info] RedBlackTreeBenchmark.tails              1000  avgt   15      167608.187 ±       2136.101  ns/op
[info] RedBlackTreeBenchmark.tails             10000  avgt   15     2107005.035 ±      58861.760  ns/op
[info] RedBlackTreeBenchmark.tails            100000  avgt   15    25641063.778 ±     239951.041  ns/op
[info] RedBlackTreeBenchmark.tails           1000000  avgt   15   310572552.706 ±   20733303.653  ns/op
[info] RedBlackTreeBenchmark.tails          10000000  avgt   15  3660826566.667 ±  308380871.070  ns/op

[info] RedBlackTreeBenchmark.tails                10  avgt   15         700.462 ±        8.295  ns/op
[info] RedBlackTreeBenchmark.tails               100  avgt   15       10391.795 ±      178.447  ns/op
[info] RedBlackTreeBenchmark.tails              1000  avgt   15      137242.787 ±     2178.399  ns/op
[info] RedBlackTreeBenchmark.tails             10000  avgt   15     1650987.262 ±    32937.086  ns/op
[info] RedBlackTreeBenchmark.tails            100000  avgt   15    20421801.415 ±   297013.812  ns/op
[info] RedBlackTreeBenchmark.tails           1000000  avgt   15   242159188.117 ±  3833538.881  ns/op
[info] RedBlackTreeBenchmark.tails          10000000  avgt   15  2855601060.067 ± 33745810.699  ns/op

@szeiger
Contributor Author

szeiger commented Oct 30, 2018

@retronym I didn't want to open that can of worms any further. I'll make a separate PR for Ordering-related stuff.

@szeiger
Contributor Author

szeiger commented Oct 30, 2018

I just realized that the times for size=1000000 in the old versions of foreach and iterator appear to be wrong: they are about 1/10 of what they should be. I have no idea what could go wrong in JMH benchmarking to produce such a result. I reran a subset of those benchmarks and got the expected values.

@retronym retronym merged commit 521de62 into scala:2.13.x Nov 1, 2018
@SethTisue SethTisue added the performance the need for speed. usually compiler performance, sometimes runtime performance. label Nov 2, 2018