Optimize RedBlackTree-based immutable collections#7284
Optimize RedBlackTree-based immutable collections#7284retronym merged 7 commits intoscala:2.13.xfrom
Conversation
- Some improved inlining in RedBlackTree - Remove unnecessary `contains` check before deleting from TreeSet or TreeMap - Avoid recreating wrapper objects when appending to and deleting from TreeSets and TreeMaps without a change - Bulk building / appending without creating intermediate TreeSets or TreeMaps - Building TreeSets and TreeMaps directly without a Builder - Use null instead of () as dummy value in TreeSet
| val it = set1.iterator | ||
| var res = 0 | ||
| while(it.hasNext) | ||
| res += it.next() |
There was a problem hiding this comment.
I'd be inclined to structure the test without any auto-unboxing, e.g. by using a TreeSet[String] and just feeding the strings into the Blackhole inside the loop. I don't think it will change the results of this benchmark, but it reduces the cognitive burden :)
| override def concat(that: collection.IterableOnce[A]): TreeSet[A] = { | ||
| val it = that.iterator | ||
| var t = tree | ||
| while (it.hasNext) t = RB.update(t, it.next(), null, overwrite = false) |
There was a problem hiding this comment.
Can we do anything clever when that is another TreeSet? E.g could TreeSet((0 to 63): _*) ++ TreeSet((64 to 128): _*) perform some structural sharing?
There was a problem hiding this comment.
Yes, there are more efficient algorithms for merging two trees or building from an ordered sequence (which should be possible to generalize to inserting into an existing tree from an ordered sequence).
There was a problem hiding this comment.
These optimizations will need a way to check that they have the operands share the same Ordering. ord1 eq ord2 is a sure way to compare them, but will miss many opportunities. We could try to make == work for Ordering instances, e.g. by making TupleOrdering a case class.
There was a problem hiding this comment.
Mutable RBTree-s will still be beneficial during building, where only have a single Ordering instance.
There was a problem hiding this comment.
I spent the last 2 days implementing the algorithms from https://www.cs.cmu.edu/~guyb/papers/BFS16.pdf. The current version is very close to the paper without any optimizations for our data structures. Bulk union (or concat) for a few different scenarios combined (overlapping and non-overlapping sets of equal or different sizes) is already up to 20x faster (the larger the sets, the better).
There was a problem hiding this comment.
Make that 10x when comparing against this PR with the improved single-element concatenation.
There was a problem hiding this comment.
Wow, that's amazing! Nice work!
| def newBuilder[K : Ordering, V]: Builder[(K, V), TreeMap[K, V]] = | ||
| new ImmutableBuilder[(K, V), TreeMap[K, V]](empty) { | ||
| def addOne(elem: (K, V)): this.type = { elems = elems + elem; this } | ||
| override def addAll(xs: IterableOnce[(K, V)]): this.type = { elems = elems.concat(xs); this } |
There was a problem hiding this comment.
We could go down the path of HashMapBuilder and defer creation of the collection that wraps the RBTree until result, and make addOne just call t = RB.update(t, k, v, overwrite = true) like you've done above.
Going further, we could even use mutation of the trees during building.
/cc @joshlemer
There was a problem hiding this comment.
Do we have a good solution for the memory barrier problem in List and Vector yet? Switching to a mutable tree would get us into the same situation here.
There was a problem hiding this comment.
Yep, we're now using an explicit memory barrier in the builders of those. @joshlemer did the same recently from the Champ builders with great results.
I'd suggest that we at least pave the way for such a change by making the trees mutable with private[scala] setters in 2.13.0 (package private would be even nicer but would require us to write the data classes in Java).
|
Here are the results from the alternate encoding. The performance regression in Building in random order is faster than this PR, building in correct order is slower: All in all it doesn't look like a worthwhile improvement. |
If there are only two subclasses, JIT's bimorphic inlining will likely be just as good as making this monomorphic. |
|
Updated with better Builders (which avoid wrapping the Here are the benchmark results for the bulk Base line: Old version of this PR minus the Old version of this PR including the Optimized bulk Note that this is single-threaded but the algorithms are highly scalable. This might be a useful addition to parallel-collections. |
|
@szeiger Great results! It's useful also show the improvements in the |
|
I think it would be worthwhile for us to implement |
| else new TreeMap(RB.delete(tree, key)) | ||
| def remove(key: K): TreeMap[K,V] = { | ||
| val t = RB.delete(tree, key) | ||
| if(t eq tree) this else new TreeMap(t) |
There was a problem hiding this comment.
Extract this to a helper newTreeMapOrSelf.
I think there are a few more places it could be used: rangeImpl range.
|
I was just looking at the possibility of using mutable fields during construction. The caveat is that we'd have to switch to the alternate encoding with a single |
- Optimize building TreeMap from SortedMap - Optimize building TreeSet from SortedSet and Range - Static value for Ordering.Int.reverse so we can check for it with eq and avoid allocating a new instance for each check - Bug fix: Check for ordering when reusing existing TreeMap/TreeSet - Compare Orderings with == instead of eq
|
I just pushed another commit with further algorithmic improvements: We can build red-black trees very efficiently in linear time and fully immutable from ordered sets/maps (including Here are the new results for building from a For the largest size this is a 20x improvement over building by incremental insertion. |
- `rebalance`, `compareDepth` and `NList` are no longer needed - Use existing object where possible in TreeMap and TreeSet
|
Ending week two of red-black tree optimizations with
|
- `filter` is based on `join` - `transform` rebuilds the existing structure (which is never affected by the transformation)
|
While looking at another PR that optimizes Before: After: We didn't have an optimized Before: After: |
|
Here's an optimized After: And some smallish improvements to After: |
|
Any further comments on this? It would be very useful to have this code available for my HashSet / HashMap work. |
|
@szeiger please also test them on sizes beyond 10K just to be sure that there no unexpected trends due to increasing of memory access times, etc. Just see how RTree from the Archery library is getting too inefficient on 1M+ sizes in following benches: |
|
I'll run it for some larger sizes over night. I don't expect any surprises though. The data structures are unchanged and the new algorithms are much better in big-O terms. |
|
The only thing missing IMO is a few tweaks to |
|
Here are the results for larger sizes (up tp 10 million). I used a large 8G heap with 25 warmup iterations for good results. I did not run the TL;DR: The new version is always faster. Building: 41x faster at the largest size: Union: 37x faster Take: 2.9x faster Drop: 2.3x faster Filter: 94x (drop all) / 77x (keep all) / 9.4x (keep every 2nd) faster Foreach: 1.09x faster Iterator: 1.04x faster Partition: 8.1x faster Range: 2.1x faster Slice: 2.0x faster Transform: 27x (modify all), 26x (modify none), 30x (modify half) faster Tails: 1.3x faster |
|
@retronym I didn't want to to open that can of worms any further. I'll make a separate PR for Ordering-related stuff. |
|
I just realized that the times for size=1000000 in the old versions of |
containscheck before deleting from TreeSet or TreeMapHere's some performance data from Java 8 with HotSpot on x64. Getting reliable results was not always easy. I ran the benchmarks with
bench/jmh:run -jvmArgs "-Xms128M -Xmx128M" scala.collection.immutable.RedBlackTreeBenchmark. Giving it lots of memory led to unreliable data, with optimizations kicking in very late only when there was GC pressure, so you'd need a huge number of warmup iterations.Building from a source in the correct order and in random order, in 2.13.x:
With this change (avoiding the unnecessary intermediate
TreeSetobjects):Deleting all elements in random order in 2.13.x:
With this change (benefiting mostly from not performing the unnecessary
containschecks):Performance of iteration via
iteratorandforeachis unchanged.I spent the better part of a week on this and the resulting set of changes is rather small. There's a lot more that I tried that didn't work:
Removing
nulltrees and using a properemptytree object doesn't give you any real advantage. The way the red-black tree algorithms work you have to treat the empty tree specially in almost all cases so you may as well check for null. And even if you add anemptytree you'll still want to check for that by comparing it witheq emptybecause this is much faster than anisEmptymethod on the tree itself.Removing polymorphic dispatch of
Treemethods by makingTreefinal is promising. You have to encode the color somehow if it's not encoded in the class anymore. Adding an extra field is not a good option. The basicTreetype has a size of 32 bytes with no wasted space, just like the originalRedTreeandBlackTreetypes. Adding an extra field would increase the memory footprint by 25% (assuming 64-bit alignment), so I used the (otherwise unused) sign bit of thecountfield for the color:This works rather well in some cases:
Other benchmarks, in particular for iteration, showed decreased performance. Looking back now at the results I got when I did these tests, it's possible that they were skewed by the late optimization without limiting the heap size. I'll do another proper test run tonight. Maybe this encoding is better after all.
Inlining some of the small methods in
TreeIterator. This looks like an obvious win but it's not. Letting HotSpot take care of inlining is actually better in this case.Iteration instead of tail recursion in
foreach. We can't enforce tail recursion (marking the method@tailrecis not an option because it is also used in non-tail position) but scalac still generates better code for the tail recursion than for an explicit loop.Using a Java implementation of the tree classes. The Scala version with the
@(`inline` @getter)annotations is clunky, but so is Java. Performance is exactly the same.