Chaining hash table for repository.Index #2812

Merged

MichaelEischer merged 2 commits into restic:master from greatroar:chaining on Jul 20, 2020

Conversation

@greatroar
Contributor

@greatroar greatroar commented Jun 28, 2020

What is the purpose of this change? What does it change?

This implements a custom chaining hash table for repository.Index.

Index merging (#2799) should be easy to add to this PR.

Was the change discussed in an issue or in the forum before?

Alternative to #2811, with better performance on the benchmarks.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@greatroar
Contributor Author

As I said over at #2811, I'm not sure about the concurrent performance of this yet. I did see a second CPU core running the GC at 100% in the benchmarks.

@greatroar
Contributor Author

Here's a quick real-world test using restic cat, as suggested by @aawsome. $id is the id of an existing tree blob in a 700GiB repo (local disk).

Current master:

$ /usr/bin/time ./restic cat blob $id >/dev/null 
12.77user 0.32system 0:04.39elapsed 298%CPU (0avgtext+0avgdata 705296maxresident)k
0inputs+0outputs (0major+125964minor)pagefaults 0swaps
$ /usr/bin/time ./restic cat blob $id >/dev/null 
12.70user 0.35system 0:04.01elapsed 325%CPU (0avgtext+0avgdata 717884maxresident)k
0inputs+0outputs (0major+127844minor)pagefaults 0swaps
$ /usr/bin/time ./restic cat blob $id >/dev/null 
12.90user 0.34system 0:04.10elapsed 322%CPU (0avgtext+0avgdata 727872maxresident)k
0inputs+0outputs (0major+139900minor)pagefaults 0swaps

This PR:

$ /usr/bin/time ./restic cat blob $id >/dev/null 
13.77user 0.35system 0:04.38elapsed 322%CPU (0avgtext+0avgdata 569968maxresident)k
0inputs+0outputs (0major+136284minor)pagefaults 0swaps
$ /usr/bin/time ./restic cat blob $id >/dev/null 
13.60user 0.28system 0:03.99elapsed 348%CPU (0avgtext+0avgdata 533716maxresident)k
0inputs+0outputs (0major+123454minor)pagefaults 0swaps
$ /usr/bin/time ./restic cat blob $id >/dev/null 
13.91user 0.31system 0:04.07elapsed 349%CPU (0avgtext+0avgdata 570852maxresident)k
0inputs+0outputs (0major+137047minor)pagefaults 0swaps

#2811 at 64ecc66:

$ /usr/bin/time ./restic -p ~/.restic/pw cat blob $id >/dev/null 
13.34user 0.22system 0:04.87elapsed 278%CPU (0avgtext+0avgdata 467384maxresident)k
0inputs+0outputs (0major+105581minor)pagefaults 0swaps
$ /usr/bin/time ./restic -p ~/.restic/pw cat blob $id >/dev/null 
13.22user 0.27system 0:04.18elapsed 322%CPU (0avgtext+0avgdata 478512maxresident)k
0inputs+0outputs (0major+109278minor)pagefaults 0swaps
$ /usr/bin/time ./restic -p ~/.restic/pw cat blob $id >/dev/null 
13.27user 0.24system 0:04.15elapsed 325%CPU (0avgtext+0avgdata 468200maxresident)k
0inputs+0outputs (0major+104110minor)pagefaults 0swaps

Here, this PR so far achieves a 23% reduction in average memory use, while #2811 gets a 35% reduction. There are a few constants left to tune; I'll see if I can lower it further.

@greatroar greatroar force-pushed the chaining branch 2 times, most recently from f1fca35 to 44c4293 Compare June 28, 2020 14:06
@greatroar
Contributor Author

greatroar commented Jun 28, 2020

Following @aawsome's advice, I reorganised the duplicates and increased the maximum load factor to make more efficient use of memory. In the first restic cat tests, I now get a 32% reduction and I suppose I can get up to the level of #2811 by increasing it further.

@greatroar
Contributor Author

Test coverage for indexmap.go is 98.1%; only one return statement is not covered.

Contributor

@aawsome aawsome left a comment


Looks good so far; I have just a few small changes to propose.

Side note: How do I get information about test code coverage?

@aawsome
Contributor

aawsome commented Jun 28, 2020

I would also add a changelog, as this is another massive improvement with respect to memory usage!

@greatroar
Contributor Author

I intend to merge the changelog entry with the one for #2781, because to a user, they're the same thing. Is that ok with you?

@aawsome
Contributor

aawsome commented Jun 28, 2020

I intend to merge the changelog entry with the one for #2781, because to a user, they're the same thing. Is that ok with you?

Sure - I think it is a good idea to tell users about the overall improvement.

@greatroar
Contributor Author

Re: test coverage,

go test -coverprofile=cov
go tool cover -html cov

Member

@MichaelEischer MichaelEischer left a comment


I like the hash table design in general. I've been mostly pondering over two points:

How large is the GC overhead? The indexEntries and buckets arrays add a massive amount of pointers which the garbage collector has to follow during GC. It might be comparable to the previous index implementation which stored slices in the index hashmap (which contain a pointer). So, it might turn out to be no problem at all.

What are the main design choices and are there alternatives? From what I see the main points are to store pointers in the bucket map, allocate blocks of index entries (but still use them separately), chain entries for conflict resolution and use rather long chains with up to four elements on average. The first two points allow for a rather smooth increase in memory usage when a hashmap grows. The latter two points keep the overhead of bucket map pointers relatively low. (And to keep multiple entries per key instead of overwriting the old value).

Storing the index entries in the bucket array would cause large memory usage spikes while growing the hashmap, so that's not an option (unless we want to split the bucket array into parts and adjust it part by part). Just allocating individual entries leads to a lot of internal memory fragmentation, and allocating everything at once doesn't work either.

Using shorter chains probably isn't that helpful, and an average chain length of 4 is much smaller than the usual number of indexes that currently exist in the master index.

This leaves open addressing/probing as an alternative to chaining. It would remove the next pointer from the indexEntry; however, in exchange the bucket array would have to be larger than the current number of entries. This becomes a problem while growing the bucket array, as even with an optimal load factor and a very small grow factor this would require keeping more than two pointers per entry (three for a growth factor of 2), which is a lot more than the chaining variant needs. The main benefit of probing would be that the hashmap wouldn't have to chase linked-list entries located at random places in memory but could get away with an access pattern that offers better locality.

I was wondering whether it would be possible to move/reorder entries in the grow method in such a way that the linked elements of a bucket are laid out consecutively in memory. That would at least keep the last half of the chain elements next to each other. But half of the chain elements are just two index entries, so it's probably not worth the trouble...

@greatroar
Contributor Author

How large is the GC overhead? The indexEntries and buckets arrays add a massive amount of pointers which the garbage collector has to follow during GC. It might be comparable to the previous index implementation which stored slices in the index hashmap (which contain a pointer). So, it might turn out to be no problem at all.

I've yet to benchmark this.

What are the main design choices and are there alternatives? From what I see the main points are to store pointers in the bucket map, allocate blocks of index entries (but still use them separately), chain entries for conflict resolution and use rather long chains with up to four elements on average. The first two points allow for a rather smooth increase in memory usage when a hashmap grows. The latter two points keep the overhead of bucket map pointers relatively low. (And to keep multiple entries per key instead of overwriting the old value).

The main alternatives that I considered are a probing table, which has the problems you spotted, or cuckoo hashing, which can handle very high load factors. The latter looked to be much more complicated and I quickly gave up on it.

Thinking out loud, there is perhaps a middle ground: a probing table that stores summaries of the keys and a pointer to the entry, viz.

type bucket struct {
	hash  uint
	entry *indexEntry // no longer needs a next pointer
}

Say we set the maximum load factor at 80% and grow in powers of two. Then the table will be 60% full on average and the worst case overhead will be 16*(1+2/3)*n = 26.7n bytes. Dropping the next pointer saves eight bytes, and we pay 74.7 bytes per entry.

The benefit would be the access pattern. The buckets get packed four to a cache line (on amd64/arm64), so lookup might be faster despite the probing. Insert would no longer be worst-case O(1), but nearly all the pointer indirection is gone.
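
A sketch of this hypothetical middle ground (names and details are mine, not code from the PR): linear probing over buckets that cache the hash next to the entry pointer, so a probe only dereferences the pointer when the cached hash matches.

```go
package main

import "fmt"

// indexEntry no longer needs a next pointer in this scheme.
type indexEntry struct {
	id [32]byte
}

// bucket caches the hash so probing can reject slots without
// following the entry pointer.
type bucket struct {
	hash  uint
	entry *indexEntry
}

type probeMap struct {
	buckets []bucket // length a power of two, kept below 100% full
}

func hashID(id [32]byte) uint {
	// Illustrative hash: reuse the ID's leading bytes.
	return uint(id[0]) | uint(id[1])<<8 | uint(id[2])<<16 | uint(id[3])<<24
}

func (m *probeMap) get(id [32]byte) *indexEntry {
	mask := uint(len(m.buckets) - 1)
	h := hashID(id)
	for i := h & mask; m.buckets[i].entry != nil; i = (i + 1) & mask {
		// Check the cached hash first; only follow the pointer on a match.
		if m.buckets[i].hash == h && m.buckets[i].entry.id == id {
			return m.buckets[i].entry
		}
	}
	return nil
}

func (m *probeMap) add(e *indexEntry) {
	mask := uint(len(m.buckets) - 1)
	h := hashID(e.id)
	i := h & mask
	for m.buckets[i].entry != nil {
		i = (i + 1) & mask
	}
	m.buckets[i] = bucket{hash: h, entry: e}
}

func main() {
	m := probeMap{buckets: make([]bucket, 8)}
	e := &indexEntry{}
	e.id[0] = 7
	m.add(e)
	fmt.Println(m.get(e.id) == e) // prints true
}
```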

I'll see if I can get this working.

I was wondering whether it would be possible to move/reorder entries in the grow method in such a way that the linked elements of a bucket are laid out consecutively in memory. That would at keep the last half of the chain elements next to each other. But half of the chain elements are just two index entries, so it's probably not worth the trouble...

Also, the second entry in a pair necessarily spills over into the next cache line because the SHA-256 is 32 bytes.

@aawsome
Contributor

aawsome commented Jul 3, 2020

type bucket struct {
	hash  uint
	entry *indexEntry // no longer needs a next pointer
}

Say we set the maximum load factor at 80% and grow in powers of two. Then the table will be 60% full on average and the worst case overhead will be 16*(1+2/3)*n = 26.7n bytes. Dropping the next pointer saves eight bytes, and we pay 74.7 bytes per entry.

No need to save the hash as it can be easily computed.
If you save *indexEntry in your bucket and use probing, you gain 8 bytes per index entry. In your current implementation, the buckets need between 2 and 4 bytes per index entry. So if you could use a probing table with load factor between 2/3 and 4/5 (that should be the maximum load factor!), you get identical memory usage. However, if you grow a hash table with load factor 80% by doubling its buckets, it will result in a hash table with load factor 40%. Then the buckets will use 5/2 * 8 bytes = 20 bytes per entry.

And that is the main drawback of hashmaps with load factor smaller than 1 - you get memory spikes when growing...
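
The overhead figures traded back and forth here follow from the load factor alone. A small sketch (8-byte bucket slots, growth by doubling) reproduces both the 2-to-4-bytes range for the chaining table (average chain length between 2 and 4) and the 10-to-20-bytes range for a probing table with 80% maximum load:

```go
package main

import "fmt"

// Bucket-array bytes per entry swing between the cost at maximum load
// (just before growing) and at half that load (just after doubling).
func main() {
	const slotSize = 8.0 // one pointer per bucket slot

	// 0.8 = probing table at 80% max load; 4.0 = chaining with
	// an average chain length of up to 4 entries per bucket.
	for _, maxLoad := range []float64{0.8, 4.0} {
		minLoad := maxLoad / 2 // load factor right after doubling
		fmt.Printf("max load %.1f: %.1f to %.1f bucket bytes/entry\n",
			maxLoad, slotSize/maxLoad, slotSize/minLoad)
	}
}
```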

@greatroar
Contributor Author

No need to save the hash as it can be easily computed.

The hash is saved so four of them will be in the L1 cache before we follow a single pointer. That's also the only benefit of this proposed scheme.

@MichaelEischer
Member

No need to save the hash as it can be easily computed.

The hash is saved so four of them will be in the L1 cache before we follow a single pointer. That's also the only benefit of this proposed scheme.

The standard go hash maps have a tophash field in their buckets which seems to be the standard way to avoid following unnecessary pointers/comparing entries: The field just contains the top 8 bits of an entry's hash (ignoring a few reserved values). That way after checking just that single byte you only have a 1/256 chance for a hash collision.
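
For reference, the tophash computation looks roughly like this (a simplified sketch modeled on the Go runtime's map implementation; minTopHash and the reserved marker range follow runtime/map.go):

```go
package main

import "fmt"

// Values below minTopHash are reserved as bucket-state markers
// (empty, evacuated, ...) in the runtime's buckets.
const minTopHash = 5

// tophash keeps only the top 8 bits of the hash, shifted out of
// the reserved range, so a one-byte compare filters most misses.
func tophash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	// Two different hashes usually already differ in their tophash,
	// so the full key is rarely compared on a miss.
	fmt.Println(tophash(0xdeadbeef12345678), tophash(0x0123456789abcdef))
}
```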

I think we can live with a slightly slower hashmap in exchange for the better memory efficiency. After all, a merged hashmap is fast enough that the lookup-heavy check command only spends a tiny fraction of its time there.

@aawsome
Contributor

aawsome commented Jul 4, 2020

I think we can live with a slightly slower hashmap in exchange for the better memory efficiency. After all a merged hashmap is fast enough that the lookup-heavy check command only spends a tiny fraction of it's time there.

Completely agree! The most important point is that we use a hashmap that does not have significant memory overhead even in edge cases like directly after growing and which is not too slow.

This proposed solution has the following properties:

  • it uses much less memory and allocations compared to the standard go hashmap
  • it handles duplicates very smoothly
  • it has a very "constant" total overhead (between 10 and 12 bytes per entry)
  • it is perfect to implement merging as this can be done almost without newly allocating memory
  • it is already faster than the current implementation in some tests and way faster once the indexes are merged

So if we spend more energy on this solution, IMO the only remaining point is to benchmark and test whether we missed some severe performance issues in nasty cases (like the GC topic or concurrency).

@greatroar
Contributor Author

I just did the following with master and this PR:

restic init
echo 3 | sudo tee /proc/sys/vm/drop_caches # flush kernel page cache
/usr/bin/time restic backup 83GiBofrandomdata/

This gives 221MiB max resident vs. 232 on master, while the elapsed time is practically the same. That's a small margin, but note that just running /usr/bin/time restic with no arguments takes 10MiB and the indices that this produces are only 8MiB on disk. So at the very least there is no regression.

@MichaelEischer
Member

How large is the GC overhead? The indexEntries and buckets arrays add a massive amount of pointers which the garbage collector has to follow during GC. It might be comparable to the previous index implementation which stored slices in the index hashmap (which contain a pointer). So, it might turn out to be no problem at all.

I've yet to benchmark this.

I've just run a rather simple test with the check command (with #2328 merged), which issues lots of LookupSize calls. The old index (the hashmap with slices) seems to be the fastest regarding index lookup performance: 12.5 seconds + 3s background GC marking. The current master takes 16.5s without noticeable GC marking time. And the chaining hash map takes 24s + 4s background GC marking. Interestingly most of the time is spent in memeqbody (13s), that is e.id == id. The hash function comes in next with 3s for siphash. And the for-loop in get also takes around 3s according to pprof.

With #2818 the lookup costs inside LookupSize (should) drop a lot, so this slightly more expensive index lookup should be no problem.

/usr/bin/time shows me the following maximum resident set size for my check runs:
644MB for this PR, 715MB for the current index on master and 891MB for the old index.

@greatroar
Contributor Author

greatroar commented Jul 5, 2020

My money is on the tophash field for those 13s. Map calls memequal too, but we do four calls on average for a lookup, while map does one plus some byte comparisons. My attempts to optimize the comparison have not borne much fruit. Either that, or we're seeing a cache/TLB effect.

@MichaelEischer
Member

MichaelEischer commented Jul 11, 2020

@greatroar The standard hashmap uses an [8]uint8 array to store the tophash for the entries in a bucket. That way all tophash fields are loaded at once into the cache and thus it's really cheap to check whether any of the slots in a bucket might match. Actually the only reason I see why/how the tophash can help performance is cache effects.

In the chained hashtable there's no benefit to expect from an additional tophash field, as the tophash field would be part of the indexEntry and therefore end up in the same cache line as the blob ID. (A separate array would also be possible, but then the hashmap would always have to look at two places; grouping the tophash values at the start of a batch allocation has basically the same problem. Also, the chained entries are very likely each stored in a different allocation batch.) And then it's easier to simply check the first byte of the full entry ID (which is effectively identical to the tophash) and avoid spending one byte on the tophash (and there are no alignment issues).

What might be worth a try is to move the next pointer in the indexEntry before the id field. If the id of an index entry does not match, then the hashmap only has to compare the first (few) byte(s) of the id before following the next pointer. By moving the next pointer and the start of the blob ID closer together, this should reduce the chance that both values end up in different cache lines (8 byte distance vs. 32 byte). (Or maybe there's enough prefetching magic happening in modern processors that a short jump backwards ends up being more expensive than a longer one forwards.)

@greatroar
Contributor Author

No significant difference. Note that comparing two ids may not be done byte-for-byte due to vectorization.

Member

@MichaelEischer MichaelEischer left a comment


The overall hashmap design seems to be settled by now. The two main open points I see right now are the amount of parallelism in grow and the entryAllocBatch size.

In my opinion the indexMap should get a few simple tests for add, len, foreach, foreachWithID and get. The Index and MasterIndex test probably already cover most of the code, but I'm not sure whether these would catch all problems in the indexMap.

Member

@MichaelEischer MichaelEischer left a comment


There are a few more tiny test issues that need addressing. But besides those, the only remaining issue which I want to address before merging is the entryAllocBatch size. I'm currently running a few more benchmarks. It currently looks like most runtime variations and unexpected slowdowns between batch sizes of 8 and 256 disappear when I pass -benchtime 10s to the benchmark, which indicates that the different allocation sizes trigger different GC behavior/interleaving with the running code.

@greatroar greatroar force-pushed the chaining branch 2 times, most recently from 17e4e7c to 9025f9f Compare July 17, 2020 06:33
@greatroar
Contributor Author

greatroar commented Jul 17, 2020

I just reran the latest benchmarks (fast createRandomIndex, proper size for DecodeIndex*) with entryAllocBatch = 8 as old and 256 as new, without changing benchtime. These are the results:

name                                            old time/op    new time/op    delta
IndexAlloc-8                                       688ms ± 8%     656ms ± 5%   -4.68%  (p=0.006 n=20+20)
IndexAllocParallel-8                               306ms ±16%     233ms ±14%  -23.73%  (p=0.000 n=20+20)
IndexMapHash-8                                    3.61µs ± 0%    3.61µs ± 1%     ~     (p=0.484 n=9+10)
PackerManager-8                                    181ms ± 1%     179ms ± 0%   -0.77%  (p=0.000 n=10+9)
DecodeIndex-8                                      4.34s ± 0%     4.34s ± 1%     ~     (p=0.604 n=9+10)
DecodeIndexParallel-8                              4.34s ± 0%     4.34s ± 1%     ~     (p=0.842 n=10+9)
IndexHasUnknown-8                                 49.4ns ±12%    53.0ns ±16%     ~     (p=0.196 n=10+10)
IndexHasKnown-8                                   54.3ns ±17%    51.7ns ± 9%     ~     (p=0.210 n=10+10)
MasterIndexLookupSingleIndex-8                     144ns ± 5%     138ns ± 3%   -4.38%  (p=0.007 n=10+9)
MasterIndexLookupMultipleIndex-8                   419ns ± 6%     424ns ± 4%     ~     (p=0.424 n=10+10)
MasterIndexLookupSingleIndexUnknown-8             73.5ns ± 9%    70.7ns ± 8%     ~     (p=0.305 n=10+9)
MasterIndexLookupMultipleIndexUnknown-8            351ns ±10%     353ns ± 3%     ~     (p=0.328 n=9+9)
MasterIndexLookupParallel/known,indices=5-8        859ns ± 2%     879ns ± 5%   +2.30%  (p=0.011 n=10+10)
MasterIndexLookupParallel/unknown,indices=5-8      278ns ± 2%     295ns ± 1%   +6.31%  (p=0.000 n=10+8)
MasterIndexLookupParallel/known,indices=10-8      1.17µs ± 1%    1.14µs ± 1%   -2.57%  (p=0.000 n=10+10)
MasterIndexLookupParallel/unknown,indices=10-8     261ns ± 1%     253ns ± 1%   -3.07%  (p=0.000 n=10+10)
MasterIndexLookupParallel/known,indices=20-8      1.29µs ± 0%    1.21µs ± 2%   -5.71%  (p=0.000 n=7+10)
MasterIndexLookupParallel/unknown,indices=20-8     564ns ± 1%     533ns ± 1%   -5.63%  (p=0.000 n=10+10)
SaveAndEncrypt-8                                  98.8ns ± 1%    98.5ns ± 0%     ~     (p=0.348 n=9+9)
LoadTree-8                                         577µs ± 1%     575µs ± 0%   -0.50%  (p=0.011 n=9+9)
LoadBlob-8                                        5.70ms ± 2%    5.72ms ± 2%     ~     (p=0.280 n=10+10)
LoadAndDecrypt-8                                  5.93ms ± 0%    5.93ms ± 0%     ~     (p=0.243 n=9+10)
LoadIndex-8                                       24.7ms ± 0%    24.6ms ± 1%   -0.50%  (p=0.003 n=10+10)

name                                            old alloc/op   new alloc/op   delta
IndexAlloc-8                                       401MB ± 0%     401MB ± 0%   +0.00%  (p=0.000 n=18+19)
IndexAllocParallel-8                               400MB ± 0%     400MB ± 0%   +0.00%  (p=0.000 n=20+20)
IndexMapHash-8                                     0.00B          0.00B          ~     (all equal)
PackerManager-8                                   91.6kB ± 0%    91.6kB ± 0%     ~     (p=0.291 n=9+9)
SaveAndEncrypt-8                                   1.00B ± 0%     1.00B ± 0%     ~     (all equal)

name                                            old allocs/op  new allocs/op  delta
IndexAlloc-8                                       1.12M ± 0%     0.91M ± 0%  -18.79%  (p=0.000 n=19+18)
IndexAllocParallel-8                               1.12M ± 0%     0.91M ± 0%  -18.79%  (p=0.000 n=20+20)
IndexMapHash-8                                      0.00           0.00          ~     (all equal)
PackerManager-8                                    1.41k ± 0%     1.41k ± 0%     ~     (p=0.173 n=10+10)
SaveAndEncrypt-8                                    0.00           0.00          ~     (all equal)

name                                            old speed      new speed      delta
IndexMapHash-8                                  1.13GB/s ± 0%  1.13GB/s ± 1%     ~     (p=0.497 n=9+10)
PackerManager-8                                  291MB/s ± 1%   294MB/s ± 0%   +0.77%  (p=0.000 n=10+9)
SaveAndEncrypt-8                                42.5TB/s ± 1%  42.6TB/s ± 0%     ~     (p=0.297 n=9+9)
LoadBlob-8                                       176MB/s ± 2%   175MB/s ± 2%     ~     (p=0.280 n=10+10)
LoadAndDecrypt-8                                 169MB/s ± 0%   169MB/s ± 0%     ~     (p=0.242 n=9+10)

In summary, constructing indexes gets faster, while lookups get a bit slower. Still nothing that merged indexes couldn't solve, I think, given that the loss is mainly in the lookups of unknown blobs. The overallocation is statistically significant but still tiny.

@MichaelEischer
Member

I've also run a few of the lookup benchmarks with -benchtime 10s and for those there's nearly no difference between a batchsize of 8 or 256, with slightly better performance for a batchsize of 256.

name                                            old time/op  new time/op  delta
MasterIndexLookupSingleIndex-8                   134ns ± 6%   131ns ± 6%    ~     (p=0.173 n=20+20)
MasterIndexLookupMultipleIndex-8                 376ns ± 7%   364ns ± 7%  -3.26%  (p=0.011 n=19+19)
MasterIndexLookupSingleIndexUnknown-8           60.2ns ±11%  60.5ns ±11%    ~     (p=0.700 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8          285ns ± 7%   287ns ±13%    ~     (p=0.808 n=19+20)
MasterIndexLookupParallel/known,indices=5-8      342ns ± 5%   340ns ± 5%    ~     (p=0.440 n=20+20)
MasterIndexLookupParallel/unknown,indices=5-8    232ns ± 8%   232ns ± 6%    ~     (p=0.995 n=20+20)
MasterIndexLookupParallel/known,indices=10-8     684ns ± 7%   673ns ± 4%    ~     (p=0.122 n=20+20)
MasterIndexLookupParallel/unknown,indices=10-8   318ns ± 4%   316ns ± 9%    ~     (p=0.399 n=20+19)
MasterIndexLookupParallel/known,indices=20-8     720ns ± 8%   707ns ± 9%    ~     (p=0.168 n=20+19)
MasterIndexLookupParallel/unknown,indices=20-8   487ns ± 4%   490ns ± 4%    ~     (p=0.285 n=20+20)

So I'd say we use 256 as batch size, which also has the benefit of a slightly lower memory usage.
@greatroar Do you want to rebase / squash / clean up some of the commits before merging? Or did I miss any points which still need discussion?

@greatroar
Contributor Author

Switched to 256, rebased everything and re-ran all the benchmarks (see commit messages; this took several hours). If the tests pass, I'm done.

greatroar added 2 commits July 19, 2020 13:58
createRandomIndex was using the global RNG, which locks on every call.
It was also using twice as many random numbers as necessary and doing
a float division in every iteration of the inner loop.

BenchmarkDecodeIndex was using too short an input, especially for a
parallel version. (It may now be using one that is a bit large.)

Results on linux/amd64, -benchtime=3s -count=20:

name                                     old time/op    new time/op       delta
PackerManager-8                             178ms ± 0%        178ms ± 0%           ~     (p=0.165 n=20+20)
DecodeIndex-8                              13.6µs ± 2%  4539886.8µs ± 0%  +33293901.38%  (p=0.000 n=20+18)
IndexHasUnknown-8                          44.4ns ± 7%       44.4ns ± 5%           ~     (p=0.873 n=20+19)
IndexHasKnown-8                            49.2ns ± 3%       48.3ns ± 0%         -1.86%  (p=0.000 n=20+16)
IndexAlloc-8                                802ms ± 1%        758ms ± 1%         -5.51%  (p=0.000 n=20+19)
MasterIndexLookupSingleIndex-8              124ns ± 1%        122ns ± 0%         -1.41%  (p=0.000 n=20+14)
MasterIndexLookupMultipleIndex-8            373ns ± 2%        369ns ± 2%         -1.13%  (p=0.001 n=20+20)
MasterIndexLookupSingleIndexUnknown-8      67.8ns ± 3%       68.4ns ± 5%           ~     (p=0.753 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8     316ns ± 3%        315ns ± 3%           ~     (p=0.846 n=20+20)
SaveAndEncrypt-8                           30.5ms ± 1%       30.2ms ± 1%         -1.09%  (p=0.000 n=19+19)
LoadTree-8                                  527µs ± 1%        540µs ± 1%         +2.37%  (p=0.000 n=19+20)
LoadBlob-8                                 5.65ms ± 0%       5.64ms ± 0%         -0.21%  (p=0.000 n=19+18)
LoadAndDecrypt-8                           7.07ms ± 2%       5.93ms ± 0%        -16.15%  (p=0.000 n=19+20)
LoadIndex-8                                32.1ms ± 2%       25.1ms ± 0%        -21.64%  (p=0.000 n=20+18)

name                                     old speed      new speed         delta
PackerManager-8                           296MB/s ± 0%      296MB/s ± 0%           ~     (p=0.159 n=20+20)
SaveAndEncrypt-8                          138MB/s ± 1%      139MB/s ± 1%         +1.10%  (p=0.000 n=19+19)
LoadBlob-8                                177MB/s ± 0%      177MB/s ± 0%         +0.21%  (p=0.000 n=19+18)
LoadAndDecrypt-8                          141MB/s ± 2%      169MB/s ± 0%        +19.24%  (p=0.000 n=19+20)

name                                     old alloc/op   new alloc/op      delta
PackerManager-8                            91.8kB ± 0%       91.8kB ± 0%           ~     (p=0.826 n=19+12)
IndexAlloc-8                                786MB ± 0%        786MB ± 0%         +0.01%  (p=0.000 n=20+20)
SaveAndEncrypt-8                           21.0MB ± 0%       21.0MB ± 0%         -0.00%  (p=0.012 n=20+19)

name                                     old allocs/op  new allocs/op     delta
PackerManager-8                             1.41k ± 0%        1.41k ± 0%           ~     (all equal)
IndexAlloc-8                                 977k ± 0%         977k ± 0%         +0.01%  (p=0.022 n=20+20)
SaveAndEncrypt-8                             73.0 ± 0%         73.0 ± 0%           ~     (all equal)
These are faster to construct but slower to access. The allocation rate
is halved, and the peak memory usage almost halved, compared to the standard map.

Benchmark results on linux/amd64, -benchtime=3s -count=20:

name                                            old time/op    new time/op    delta
PackerManager-8                                    178ms ± 0%     178ms ± 0%     ~     (p=0.231 n=20+20)
DecodeIndex-8                                      4.54s ± 0%     4.30s ± 0%   -5.20%  (p=0.000 n=18+17)
DecodeIndexParallel-8                              4.54s ± 0%     4.30s ± 0%   -5.22%  (p=0.000 n=19+18)
IndexHasUnknown-8                                 44.4ns ± 5%    50.5ns ±11%  +13.82%  (p=0.000 n=19+17)
IndexHasKnown-8                                   48.3ns ± 0%    51.5ns ±12%   +6.68%  (p=0.001 n=16+20)
IndexAlloc-8                                       758ms ± 1%     616ms ± 1%  -18.69%  (p=0.000 n=19+19)
IndexAllocParallel-8                               234ms ± 3%     204ms ± 2%  -12.60%  (p=0.000 n=20+18)
MasterIndexLookupSingleIndex-8                     122ns ± 0%     145ns ± 9%  +18.44%  (p=0.000 n=14+20)
MasterIndexLookupMultipleIndex-8                   369ns ± 2%     429ns ± 8%  +16.27%  (p=0.000 n=20+20)
MasterIndexLookupSingleIndexUnknown-8             68.4ns ± 5%    74.9ns ±13%   +9.47%  (p=0.000 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8            315ns ± 3%     369ns ±11%  +17.14%  (p=0.000 n=20+20)
MasterIndexLookupParallel/known,indices=5-8        743ns ± 1%     816ns ± 2%   +9.87%  (p=0.000 n=17+17)
MasterIndexLookupParallel/unknown,indices=5-8      238ns ± 1%     260ns ± 2%   +9.14%  (p=0.000 n=19+20)
MasterIndexLookupParallel/known,indices=10-8      1.01µs ± 3%    1.11µs ± 2%   +9.79%  (p=0.000 n=19+20)
MasterIndexLookupParallel/unknown,indices=10-8     222ns ± 0%     269ns ± 2%  +20.83%  (p=0.000 n=16+20)
MasterIndexLookupParallel/known,indices=20-8      1.06µs ± 2%    1.19µs ± 2%  +12.95%  (p=0.000 n=19+18)
MasterIndexLookupParallel/unknown,indices=20-8     413ns ± 1%     530ns ± 1%  +28.19%  (p=0.000 n=18+20)
SaveAndEncrypt-8                                  30.2ms ± 1%    30.4ms ± 0%   +0.71%  (p=0.000 n=19+19)
LoadTree-8                                         540µs ± 1%     576µs ± 1%   +6.73%  (p=0.000 n=20+20)
LoadBlob-8                                        5.64ms ± 0%    5.64ms ± 0%     ~     (p=0.883 n=18+17)
LoadAndDecrypt-8                                  5.93ms ± 0%    5.95ms ± 1%     ~     (p=0.247 n=20+19)
LoadIndex-8                                       25.1ms ± 0%    24.5ms ± 1%   -2.54%  (p=0.000 n=18+17)

name                                            old speed      new speed      delta
PackerManager-8                                  296MB/s ± 0%   296MB/s ± 0%     ~     (p=0.229 n=20+20)
SaveAndEncrypt-8                                 139MB/s ± 1%   138MB/s ± 0%   -0.71%  (p=0.000 n=19+19)
LoadBlob-8                                       177MB/s ± 0%   177MB/s ± 0%     ~     (p=0.890 n=18+17)
LoadAndDecrypt-8                                 169MB/s ± 0%   168MB/s ± 1%     ~     (p=0.227 n=20+19)

name                                            old alloc/op   new alloc/op   delta
PackerManager-8                                   91.8kB ± 0%    91.8kB ± 0%     ~     (p=0.772 n=12+19)
IndexAlloc-8                                       786MB ± 0%     400MB ± 0%  -49.04%  (p=0.000 n=20+18)
IndexAllocParallel-8                               786MB ± 0%     401MB ± 0%  -49.04%  (p=0.000 n=19+15)
SaveAndEncrypt-8                                  21.0MB ± 0%    21.0MB ± 0%   +0.00%  (p=0.000 n=19+19)

name                                            old allocs/op  new allocs/op  delta
PackerManager-8                                    1.41k ± 0%     1.41k ± 0%     ~     (all equal)
IndexAlloc-8                                        977k ± 0%      907k ± 0%   -7.18%  (p=0.000 n=20+20)
IndexAllocParallel-8                                977k ± 0%      907k ± 0%   -7.17%  (p=0.000 n=19+15)
SaveAndEncrypt-8                                    73.0 ± 0%      73.0 ± 0%     ~     (all equal)
Member

@MichaelEischer MichaelEischer left a comment


LGTM. Thanks a lot @greatroar and @aawsome for all the discussions and work on this!

@MichaelEischer MichaelEischer merged commit 82c9088 into restic:master Jul 20, 2020
@greatroar greatroar deleted the chaining branch July 21, 2020 08:02
@greatroar
Contributor Author

Thank you both for the thorough review!

@greatroar greatroar mentioned this pull request Sep 18, 2021