Chaining hash table for repository.Index (#2812)
Conversation
As I said over at #2811, I'm not sure about the concurrent performance of this yet. I did see a second CPU core running the GC at 100% in the benchmarks.
Here's a quick real-world test using restic cat, as suggested by @aawsome. $id is the id of an existing tree blob in a 700GiB repo (local disk). Current master: … This PR: … Here, this PR so far achieves a 23% reduction in average memory use, while #2811 gets a 35% reduction. There are a few constants left to tune; I'll see if I can lower that further.
Force-pushed from f1fca35 to 44c4293.
Test coverage for indexmap.go is 98.1%, with one return statement not covered.
aawsome left a comment
Looks good so far; I have just a few small changes to propose.
Side note: How do I get information about test code coverage?
I would also add a changelog, as this is another massive improvement with respect to memory usage!
I intend to merge the changelog entry with the one for #2781, because to a user, they're the same thing. Is that ok with you?
Sure, I think it's a good idea to tell users about the overall improvement.
Re: test coverage, …
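For reference, the standard Go toolchain can report test coverage like this (the package path here is illustrative, not taken from the reply above):

```shell
# Print a per-package coverage percentage:
go test -cover ./internal/repository

# Write a coverage profile, then view it per function or in the browser:
go test -coverprofile=coverage.out ./internal/repository
go tool cover -func=coverage.out
go tool cover -html=coverage.out
```

The `-func` view shows which functions (such as an uncovered return path) are missing coverage.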
MichaelEischer left a comment
I like the hash table design in general. I've been mostly pondering over two points:
How large is the GC overhead? The indexEntries and buckets arrays add a massive amount of pointers which the garbage collector has to follow during GC. It might be comparable to the previous index implementation which stored slices in the index hashmap (which contain a pointer). So, it might turn out to be no problem at all.
What are the main design choices and are there alternatives? From what I see, the main points are: store pointers in the bucket map, allocate blocks of index entries (but still use them separately), chain entries for conflict resolution, and use rather long chains with up to four elements on average. The first two points allow for a rather smooth increase in memory usage when a hashmap grows. The latter two keep the overhead of bucket map pointers relatively low (and chaining keeps multiple entries per key instead of overwriting the old value).
Storing the index entries in the bucket array would cause large memory usage spikes while growing the hashmap, so that's not an option (unless we want to split the bucket array into parts and adjust it part by part). Just allocating individual entries leads to a lot of internal memory fragmentation, and allocating everything at once doesn't work either. Using shorter chains probably isn't that helpful, and an average chain length of 4 is much smaller than the usual number of indexes that currently exist in the master index.

This leaves open addressing/probing as an alternative to chaining. It would remove the next pointer from the indexEntry; however, in exchange the bucket array would have to be larger than the current number of entries. This becomes a problem while growing the bucket array, as even with an optimal load factor and a very small growth factor this would require keeping more than two pointers per entry (three for a growth factor of 2), which is a lot more than the chaining variant needs. The main benefit of probing would be that the hashmap wouldn't have to chase linked-list entries located at random places in memory, but could get away with an access pattern that offers better locality.
I was wondering whether it would be possible to move/reorder entries in the grow method in such a way that the linked elements of a bucket are laid out consecutively in memory. That would at least keep the last half of the chain elements next to each other. But half of the chain elements are just two index entries, so it's probably not worth the trouble...
I've yet to benchmark this.
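For readers following along, the chained design described above can be sketched roughly like this. This is an illustrative toy, not the PR's code: the key is a uint64 stand-in for the 32-byte blob ID, and the names (indexEntry, entryAllocBatch) merely follow the discussion.

```go
package main

import "fmt"

const entryAllocBatch = 256 // entries allocated per batch, as tuned in this PR

// indexEntry is chained for conflict resolution; the real entry also
// carries pack index, offset and length.
type indexEntry struct {
	key  uint64 // stand-in for the 32-byte blob ID
	next *indexEntry
}

type indexMap struct {
	buckets    []*indexEntry // bucket map stores only head pointers
	numentries uint
	free       *indexEntry // free list, refilled one batch at a time
}

func (m *indexMap) add(key uint64) {
	// Keep the average chain length at four or below.
	if m.numentries >= uint(len(m.buckets))*4 {
		m.grow()
	}
	e := m.newEntry()
	e.key = key
	h := key % uint64(len(m.buckets))
	e.next = m.buckets[h] // chain in front of any existing entries
	m.buckets[h] = e
	m.numentries++
}

func (m *indexMap) get(key uint64) *indexEntry {
	if len(m.buckets) == 0 {
		return nil
	}
	for e := m.buckets[key%uint64(len(m.buckets))]; e != nil; e = e.next {
		if e.key == key {
			return e
		}
	}
	return nil
}

// newEntry allocates entries a whole batch at a time: individual
// allocations would fragment, one huge allocation would spike.
func (m *indexMap) newEntry() *indexEntry {
	if m.free == nil {
		batch := make([]indexEntry, entryAllocBatch)
		for i := range batch {
			batch[i].next = m.free
			m.free = &batch[i]
		}
	}
	e := m.free
	m.free = e.next
	return e
}

func (m *indexMap) grow() {
	old := m.buckets
	m.buckets = make([]*indexEntry, 2*len(m.buckets)+1)
	for _, e := range old {
		for e != nil {
			next := e.next
			h := e.key % uint64(len(m.buckets))
			e.next = m.buckets[h]
			m.buckets[h] = e
			e = next
		}
	}
}

func main() {
	var m indexMap
	for i := uint64(0); i < 1000; i++ {
		m.add(i)
	}
	fmt.Println(m.get(123) != nil, m.get(5000) == nil)
}
```

Growing only reallocates the pointer-sized bucket array, never the entries themselves, which is what gives the smooth memory curve discussed above.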
The main alternatives that I considered are a probing table, which has the problems you spotted, or cuckoo hashing, which can handle very high load factors. The latter looked to be much more complicated and I quickly gave up on it. Thinking out loud, there is perhaps a middle ground: a probing table that stores summaries of the keys and a pointer to the entry, viz.

```go
type bucket struct {
	hash  uint
	entry *indexEntry // no longer needs a next pointer
}
```

Say we set the maximum load factor at 80% and grow in powers of two. Then the table will be 60% full on average and the worst-case overhead will be 16*(1+2/3)*n ≈ 26.7n bytes. Dropping the next pointer saves eight bytes, so we pay 74.7 bytes per entry. The benefit would be the access pattern: the buckets get packed four to a cache line (on amd64/arm64), so lookup might be faster despite the probing. Insert would no longer be worst-case O(1), but nearly all the pointer indirection is gone. I'll see if I can get this working.
Also, the second entry in a pair necessarily spills over into the next cache line, because the SHA-256 is 32 bytes.
No need to save the hash, as it can be easily computed. And that is the main drawback of hashmaps with a load factor smaller than 1: you get memory spikes when growing...
The hash is saved so that four of them will be in the L1 cache before we follow a single pointer. That's also the only benefit of this proposed scheme.
The standard Go hash maps have a tophash field in their buckets, which seems to be the standard way to avoid following unnecessary pointers and comparing entries: the field just contains the top 8 bits of an entry's hash (ignoring a few reserved values). That way, after checking just that single byte, you only have a 1/256 chance of a hash collision. I think we can live with a slightly slower hashmap in exchange for the better memory efficiency. After all, a merged hashmap is fast enough that the lookup-heavy …
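As a toy illustration of the tophash idea from Go's runtime map: store the top 8 bits of the hash next to each slot and compare that single byte before doing the full key comparison. The reserved-value handling below mirrors the runtime's approach, but the constant and function names are this sketch's own.

```go
package main

import "fmt"

const minTopHash = 5 // the smallest values are reserved as slot markers

// tophash extracts the top 8 bits of a 64-bit hash, bumping values that
// would collide with the reserved markers (as Go's runtime map does).
func tophash(h uint64) uint8 {
	top := uint8(h >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	fmt.Printf("%#x\n", tophash(0xdeadbeef12345678)) // top byte of the hash
}
```

A lookup then rejects almost all non-matching slots with a one-byte comparison, and only falls back to comparing the full 32-byte ID on a tophash match.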
Completely agree! The most important point is that we use a hashmap that does not have significant memory overhead even in edge cases like directly after growing and which is not too slow. This proposed solution has the following properties:
So if we spend more energy on this solution, IMO the only point worth it is to benchmark and test whether we missed some severe performance issues in nasty cases (like the GC topic or concurrency).
I just did the following with master and this PR: This gives 221MiB max resident vs. 232MiB on master, while the elapsed time is practically the same. That's a small margin, but note that just running …
I've just run a rather simple test with the check command (with #2328 merged), which issues lots of … With #2818 the lookup costs inside …
My money is on the tophash field for those 13s. Map calls memequal too, but we make four calls on average per lookup, while map makes one plus some byte comparisons. My attempts to optimize the comparison have not borne much fruit. Either that, or we're seeing a cache/TLB effect.
@greatroar The standard hashmap uses an … In the chained hashtable there's no benefit to expect from an additional tophash field, as the tophash field would be part of the indexEntry and would therefore end up in the same cache line as the blob ID. (A separate array would also be possible, but then the hashmap would always have to look at two places; grouping the topHash values at the start of a batch allocation has basically the same problem. Also, the chained entries are very likely each stored in a different allocation batch.) And then it's easier to simply check the first byte of the full entry ID (which is effectively identical to the tophash) and avoid spending one byte on the tophash (and there are no alignment issues). What might be worth a try is to move the …
No significant difference. Note that comparing two ids may not be done byte-for-byte due to vectorization.
The overall hashmap design seems to be settled by now. The two main open points I see right now are the amount of parallelism in grow and the entryAllocBatch size.
In my opinion, the indexMap should get a few simple tests for add, len, foreach, foreachWithID and get. The Index and MasterIndex tests probably already cover most of the code, but I'm not sure whether these would catch all problems in the indexMap.
MichaelEischer left a comment
There are a few more tiny test issues that need addressing. But besides those, the only remaining issue I want to address before merging is the entryAllocBatch size. I'm currently running a few more benchmarks. The results currently look like most runtime variations and unexpected slowdowns between 8 and 256 disappear when I pass -benchtime 10s to the benchmark, which indicates that the different allocation sizes trigger different GC behavior/interleaving with the running code.
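For reference, comparisons like the ones quoted in this thread are typically produced with go test's benchmark flags plus the benchstat tool from golang.org/x/perf; the package path and file names here are illustrative:

```shell
# Run the index benchmarks 20 times with a longer benchtime,
# once on each branch, writing the raw results to a file:
go test -run='^$' -bench 'Index' -benchtime 10s -count 20 ./internal/repository > new.txt

# Compare the two result sets; benchstat reports deltas and p-values:
benchstat old.txt new.txt
```

A higher `-count` gives benchstat enough samples to separate real regressions from GC-induced run-to-run noise.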
Force-pushed from 17e4e7c to 9025f9f.
I just reran the latest benchmarks (fast createRandomIndex, proper size for DecodeIndex*) with entryAllocBatch = 8 as old and 256 as new, without changing benchtime. These are the results: … In summary, constructing indexes gets faster, while lookups get a bit slower. Still nothing that merged indexes couldn't solve, I think, given that the loss is mainly in the lookups of unknown blobs. The overallocation is statistically significant but still tiny.
I've also run a few of the lookup benchmarks with … So I'd say we use …
Switched to 256, rebased everything, and re-ran all the benchmarks (see commit messages; this took several hours). If the tests pass, I'm done.
createRandomIndex was using the global RNG, which locks on every call. It was also using twice as many random numbers as necessary and doing a float division in every iteration of the inner loop. BenchmarkDecodeIndex was using too short an input, especially for a parallel version. (It may now be using one that is a bit large.)

Results on linux/amd64, -benchtime=3s -count=20:

```
name                                     old time/op    new time/op       delta
PackerManager-8                          178ms ± 0%     178ms ± 0%        ~              (p=0.165 n=20+20)
DecodeIndex-8                            13.6µs ± 2%    4539886.8µs ± 0%  +33293901.38%  (p=0.000 n=20+18)
IndexHasUnknown-8                        44.4ns ± 7%    44.4ns ± 5%       ~              (p=0.873 n=20+19)
IndexHasKnown-8                          49.2ns ± 3%    48.3ns ± 0%       -1.86%         (p=0.000 n=20+16)
IndexAlloc-8                             802ms ± 1%     758ms ± 1%        -5.51%         (p=0.000 n=20+19)
MasterIndexLookupSingleIndex-8           124ns ± 1%     122ns ± 0%        -1.41%         (p=0.000 n=20+14)
MasterIndexLookupMultipleIndex-8         373ns ± 2%     369ns ± 2%        -1.13%         (p=0.001 n=20+20)
MasterIndexLookupSingleIndexUnknown-8    67.8ns ± 3%    68.4ns ± 5%       ~              (p=0.753 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8  316ns ± 3%     315ns ± 3%        ~              (p=0.846 n=20+20)
SaveAndEncrypt-8                         30.5ms ± 1%    30.2ms ± 1%       -1.09%         (p=0.000 n=19+19)
LoadTree-8                               527µs ± 1%     540µs ± 1%        +2.37%         (p=0.000 n=19+20)
LoadBlob-8                               5.65ms ± 0%    5.64ms ± 0%       -0.21%         (p=0.000 n=19+18)
LoadAndDecrypt-8                         7.07ms ± 2%    5.93ms ± 0%       -16.15%        (p=0.000 n=19+20)
LoadIndex-8                              32.1ms ± 2%    25.1ms ± 0%       -21.64%        (p=0.000 n=20+18)

name                                     old speed      new speed         delta
PackerManager-8                          296MB/s ± 0%   296MB/s ± 0%      ~              (p=0.159 n=20+20)
SaveAndEncrypt-8                         138MB/s ± 1%   139MB/s ± 1%      +1.10%         (p=0.000 n=19+19)
LoadBlob-8                               177MB/s ± 0%   177MB/s ± 0%      +0.21%         (p=0.000 n=19+18)
LoadAndDecrypt-8                         141MB/s ± 2%   169MB/s ± 0%      +19.24%        (p=0.000 n=19+20)

name                                     old alloc/op   new alloc/op      delta
PackerManager-8                          91.8kB ± 0%    91.8kB ± 0%       ~              (p=0.826 n=19+12)
IndexAlloc-8                             786MB ± 0%     786MB ± 0%        +0.01%         (p=0.000 n=20+20)
SaveAndEncrypt-8                         21.0MB ± 0%    21.0MB ± 0%       -0.00%         (p=0.012 n=20+19)

name                                     old allocs/op  new allocs/op     delta
PackerManager-8                          1.41k ± 0%     1.41k ± 0%        ~              (all equal)
IndexAlloc-8                             977k ± 0%      977k ± 0%         +0.01%         (p=0.022 n=20+20)
SaveAndEncrypt-8                         73.0 ± 0%      73.0 ± 0%         ~              (all equal)
```
These are faster to construct but slower to access. The allocation rate is halved, and the peak memory usage almost halved compared to the standard map. Benchmark results on linux/amd64, -benchtime=3s -count=20:

```
name                                           old time/op    new time/op    delta
PackerManager-8                                178ms ± 0%     178ms ± 0%     ~        (p=0.231 n=20+20)
DecodeIndex-8                                  4.54s ± 0%     4.30s ± 0%     -5.20%   (p=0.000 n=18+17)
DecodeIndexParallel-8                          4.54s ± 0%     4.30s ± 0%     -5.22%   (p=0.000 n=19+18)
IndexHasUnknown-8                              44.4ns ± 5%    50.5ns ±11%    +13.82%  (p=0.000 n=19+17)
IndexHasKnown-8                                48.3ns ± 0%    51.5ns ±12%    +6.68%   (p=0.001 n=16+20)
IndexAlloc-8                                   758ms ± 1%     616ms ± 1%     -18.69%  (p=0.000 n=19+19)
IndexAllocParallel-8                           234ms ± 3%     204ms ± 2%     -12.60%  (p=0.000 n=20+18)
MasterIndexLookupSingleIndex-8                 122ns ± 0%     145ns ± 9%     +18.44%  (p=0.000 n=14+20)
MasterIndexLookupMultipleIndex-8               369ns ± 2%     429ns ± 8%     +16.27%  (p=0.000 n=20+20)
MasterIndexLookupSingleIndexUnknown-8          68.4ns ± 5%    74.9ns ±13%    +9.47%   (p=0.000 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8        315ns ± 3%     369ns ±11%     +17.14%  (p=0.000 n=20+20)
MasterIndexLookupParallel/known,indices=5-8    743ns ± 1%     816ns ± 2%     +9.87%   (p=0.000 n=17+17)
MasterIndexLookupParallel/unknown,indices=5-8  238ns ± 1%     260ns ± 2%     +9.14%   (p=0.000 n=19+20)
MasterIndexLookupParallel/known,indices=10-8   1.01µs ± 3%    1.11µs ± 2%    +9.79%   (p=0.000 n=19+20)
MasterIndexLookupParallel/unknown,indices=10-8 222ns ± 0%     269ns ± 2%     +20.83%  (p=0.000 n=16+20)
MasterIndexLookupParallel/known,indices=20-8   1.06µs ± 2%    1.19µs ± 2%    +12.95%  (p=0.000 n=19+18)
MasterIndexLookupParallel/unknown,indices=20-8 413ns ± 1%     530ns ± 1%     +28.19%  (p=0.000 n=18+20)
SaveAndEncrypt-8                               30.2ms ± 1%    30.4ms ± 0%    +0.71%   (p=0.000 n=19+19)
LoadTree-8                                     540µs ± 1%     576µs ± 1%     +6.73%   (p=0.000 n=20+20)
LoadBlob-8                                     5.64ms ± 0%    5.64ms ± 0%    ~        (p=0.883 n=18+17)
LoadAndDecrypt-8                               5.93ms ± 0%    5.95ms ± 1%    ~        (p=0.247 n=20+19)
LoadIndex-8                                    25.1ms ± 0%    24.5ms ± 1%    -2.54%   (p=0.000 n=18+17)

name                                           old speed      new speed      delta
PackerManager-8                                296MB/s ± 0%   296MB/s ± 0%   ~        (p=0.229 n=20+20)
SaveAndEncrypt-8                               139MB/s ± 1%   138MB/s ± 0%   -0.71%   (p=0.000 n=19+19)
LoadBlob-8                                     177MB/s ± 0%   177MB/s ± 0%   ~        (p=0.890 n=18+17)
LoadAndDecrypt-8                               169MB/s ± 0%   168MB/s ± 1%   ~        (p=0.227 n=20+19)

name                                           old alloc/op   new alloc/op   delta
PackerManager-8                                91.8kB ± 0%    91.8kB ± 0%    ~        (p=0.772 n=12+19)
IndexAlloc-8                                   786MB ± 0%     400MB ± 0%     -49.04%  (p=0.000 n=20+18)
IndexAllocParallel-8                           786MB ± 0%     401MB ± 0%     -49.04%  (p=0.000 n=19+15)
SaveAndEncrypt-8                               21.0MB ± 0%    21.0MB ± 0%    +0.00%   (p=0.000 n=19+19)

name                                           old allocs/op  new allocs/op  delta
PackerManager-8                                1.41k ± 0%     1.41k ± 0%     ~        (all equal)
IndexAlloc-8                                   977k ± 0%      907k ± 0%      -7.18%   (p=0.000 n=20+20)
IndexAllocParallel-8                           977k ± 0%      907k ± 0%      -7.17%   (p=0.000 n=19+15)
SaveAndEncrypt-8                               73.0 ± 0%      73.0 ± 0%      ~        (all equal)
```
MichaelEischer left a comment
LGTM. Thanks a lot @greatroar and @aawsome for all the discussions and work on this!
Thank you both for the thorough review!
What is the purpose of this change? What does it change?
This implements a custom chaining hash table for repository.Index.
Index merging (#2799) should be easy to add to this PR.
Was the change discussed in an issue or in the forum before?
Alternative to #2811, with better performance on the benchmarks.
Checklist
- I have added a changelog/unreleased/ file that describes the changes for our users (template here)
- I have run gofmt on the code in all commits