Chaining hash table for repository.Index (#2812)
Conversation
As I said over at #2811, I'm not sure about the concurrent performance of this yet. I did see a second CPU core running the GC at 100% in the benchmarks.
Here's a quick real-world test using restic cat, as suggested by @aawsome. $id is the id of an existing tree blob in a 700GiB repo (local disk). Current master: … This PR: … Here, this PR so far achieves a 23% reduction in average memory use, while #2811 gets a 35% reduction. There are a few constants left to tune; I'll see if I can lower that further.
Force-pushed from f1fca35 to 44c4293.
Test coverage for indexmap.go is 98.1%, with one return statement not covered.
aawsome left a comment
Looks good so far; I have just a few small changes to propose.
Side note: How do I get information about test code coverage?
I would also add a changelog, as this is another massive improvement with respect to memory usage!
I intend to merge the changelog entry with the one for #2781, because to a user, they're the same thing. Is that ok with you?
Sure, I think it's a good idea to tell users about the overall improvement.
Re: test coverage, …
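For reference, the standard Go toolchain can report test coverage like this (the package path here is illustrative, not taken from the reply above):

```shell
# Print a per-package coverage percentage:
go test -cover ./internal/repository

# Write a coverage profile, then view it per function or in the browser:
go test -coverprofile=coverage.out ./internal/repository
go tool cover -func=coverage.out
go tool cover -html=coverage.out
```

The `-func` view shows which functions (such as an uncovered return path) are missing coverage.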
MichaelEischer left a comment
I like the hash table design in general. I've been mostly pondering over two points:
How large is the GC overhead? The indexEntries and buckets arrays add a massive amount of pointers which the garbage collector has to follow during GC. It might be comparable to the previous index implementation which stored slices in the index hashmap (which contain a pointer). So, it might turn out to be no problem at all.
What are the main design choices and are there alternatives? From what I see, the main points are: store pointers in the bucket map, allocate blocks of index entries (but still use them separately), chain entries for conflict resolution, and use rather long chains with up to four elements on average. The first two points allow for a rather smooth increase in memory usage when a hashmap grows. The latter two keep the overhead of bucket map pointers relatively low (and chaining keeps multiple entries per key instead of overwriting the old value).
Storing the index entries in the bucket array would cause large memory usage spikes while growing the hashmap, so that's not an option (unless we want to split the bucket array into parts and adjust it part by part). Just allocating individual entries leads to a lot of internal memory fragmentation, and allocating everything at once doesn't work either. Using shorter chains probably isn't that helpful, and an average chain length of 4 is much smaller than the usual number of indexes that currently exist in the master index.

This leaves open addressing/probing as an alternative to chaining. It would remove the next pointer from the indexEntry; however, in exchange the bucket array would have to be larger than the current number of entries. This becomes a problem while growing the bucket array, as even with an optimal load factor and a very small growth factor this would require keeping more than two pointers per entry (three for a growth factor of 2), which is a lot more than the chaining variant needs. The main benefit of probing would be that the hashmap wouldn't have to chase linked-list entries located at random places in memory, but could get away with an access pattern that offers better locality.
I was wondering whether it would be possible to move/reorder entries in the grow method in such a way that the linked elements of a bucket are laid out consecutively in memory. That would at least keep the last half of the chain elements next to each other. But half of the chain elements are just two index entries, so it's probably not worth the trouble...
I've yet to benchmark this.
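For readers following along, the chained design described above can be sketched roughly like this. This is an illustrative toy, not the PR's code: the key is a uint64 stand-in for the 32-byte blob ID, and the names (indexEntry, entryAllocBatch) merely follow the discussion.

```go
package main

import "fmt"

const entryAllocBatch = 256 // entries allocated per batch, as tuned in this PR

// indexEntry is chained for conflict resolution; the real entry also
// carries pack index, offset and length.
type indexEntry struct {
	key  uint64 // stand-in for the 32-byte blob ID
	next *indexEntry
}

type indexMap struct {
	buckets    []*indexEntry // bucket map stores only head pointers
	numentries uint
	free       *indexEntry // free list, refilled one batch at a time
}

func (m *indexMap) add(key uint64) {
	// Keep the average chain length at four or below.
	if m.numentries >= uint(len(m.buckets))*4 {
		m.grow()
	}
	e := m.newEntry()
	e.key = key
	h := key % uint64(len(m.buckets))
	e.next = m.buckets[h] // chain in front of any existing entries
	m.buckets[h] = e
	m.numentries++
}

func (m *indexMap) get(key uint64) *indexEntry {
	if len(m.buckets) == 0 {
		return nil
	}
	for e := m.buckets[key%uint64(len(m.buckets))]; e != nil; e = e.next {
		if e.key == key {
			return e
		}
	}
	return nil
}

// newEntry allocates entries a whole batch at a time: individual
// allocations would fragment, one huge allocation would spike.
func (m *indexMap) newEntry() *indexEntry {
	if m.free == nil {
		batch := make([]indexEntry, entryAllocBatch)
		for i := range batch {
			batch[i].next = m.free
			m.free = &batch[i]
		}
	}
	e := m.free
	m.free = e.next
	return e
}

func (m *indexMap) grow() {
	old := m.buckets
	m.buckets = make([]*indexEntry, 2*len(m.buckets)+1)
	for _, e := range old {
		for e != nil {
			next := e.next
			h := e.key % uint64(len(m.buckets))
			e.next = m.buckets[h]
			m.buckets[h] = e
			e = next
		}
	}
}

func main() {
	var m indexMap
	for i := uint64(0); i < 1000; i++ {
		m.add(i)
	}
	fmt.Println(m.get(123) != nil, m.get(5000) == nil)
}
```

Growing only reallocates the pointer-sized bucket array, never the entries themselves, which is what gives the smooth memory curve discussed above.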
The main alternatives that I considered are a probing table, which has the problems you spotted, or cuckoo hashing, which can handle very high load factors. The latter looked to be much more complicated and I quickly gave up on it. Thinking out loud, there is perhaps a middle ground: a probing table that stores summaries of the keys and a pointer to the entry, viz.

```go
type bucket struct {
	hash  uint
	entry *indexEntry // no longer needs a next pointer
}
```

Say we set the maximum load factor at 80% and grow in powers of two. Then the table will be 60% full on average and the worst-case overhead will be 16*(1+2/3)*n ≈ 26.7n bytes. Dropping the next pointer saves eight bytes, so we pay 74.7 bytes per entry. The benefit would be the access pattern: the buckets get packed four to a cache line (on amd64/arm64), so lookup might be faster despite the probing. Insert would no longer be worst-case O(1), but nearly all the pointer indirection is gone. I'll see if I can get this working.
Also, the second entry in a pair necessarily spills over into the next cache line, because the SHA-256 is 32 bytes.
No need to save the hash, as it can be easily computed. And that is the main drawback of hashmaps with a load factor smaller than 1: you get memory spikes when growing...
The hash is saved so that four of them will be in the L1 cache before we follow a single pointer. That's also the only benefit of this proposed scheme.
The standard Go hash maps have a tophash field in their buckets, which seems to be the standard way to avoid following unnecessary pointers and comparing entries: the field just contains the top 8 bits of an entry's hash (ignoring a few reserved values). That way, after checking just that single byte, you only have a 1/256 chance of a hash collision. I think we can live with a slightly slower hashmap in exchange for the better memory efficiency. After all, a merged hashmap is fast enough that the lookup-heavy …
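As a toy illustration of the tophash idea from Go's runtime map: store the top 8 bits of the hash next to each slot and compare that single byte before doing the full key comparison. The reserved-value handling below mirrors the runtime's approach, but the constant and function names are this sketch's own.

```go
package main

import "fmt"

const minTopHash = 5 // the smallest values are reserved as slot markers

// tophash extracts the top 8 bits of a 64-bit hash, bumping values that
// would collide with the reserved markers (as Go's runtime map does).
func tophash(h uint64) uint8 {
	top := uint8(h >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	fmt.Printf("%#x\n", tophash(0xdeadbeef12345678)) // top byte of the hash
}
```

A lookup then rejects almost all non-matching slots with a one-byte comparison, and only falls back to comparing the full 32-byte ID on a tophash match.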
Completely agree! The most important point is that we use a hashmap that does not have significant memory overhead even in edge cases like directly after growing and which is not too slow. This proposed solution has the following properties:
So if we spend more energy on this solution, IMO the only point worth it is to benchmark and test whether we missed some severe performance issues in nasty cases (like the GC topic or concurrency).
I just did the following with master and this PR: This gives 221MiB max resident vs. 232MiB on master, while the elapsed time is practically the same. That's a small margin, but note that just running …
I've just run a rather simple test with the check command (with #2328 merged), which issues lots of … With #2818 the lookup costs inside …
My money is on the tophash field for those 13s. Map calls memequal too, but we make four calls on average per lookup, while map makes one plus some byte comparisons. My attempts to optimize the comparison have not borne much fruit. Either that, or we're seeing a cache/TLB effect.
@greatroar The standard hashmap uses an … In the chained hashtable there's no benefit to expect from an additional tophash field, as the tophash field would be part of the indexEntry and would therefore end up in the same cache line as the blob ID. (A separate array would also be possible, but then the hashmap would always have to look at two places; grouping the topHash values at the start of a batch allocation has basically the same problem. Also, the chained entries are very likely each stored in a different allocation batch.) And then it's easier to simply check the first byte of the full entry ID (which is effectively identical to the tophash) and avoid spending one byte on the tophash (and there are no alignment issues). What might be worth a try is to move the …
No significant difference. Note that comparing two ids may not be done byte-for-byte due to vectorization.
The overall hashmap design seems to be settled by now. The two main open points I see right now are the amount of parallelism in grow and the entryAllocBatch size.
In my opinion, the indexMap should get a few simple tests for add, len, foreach, foreachWithID and get. The Index and MasterIndex tests probably already cover most of the code, but I'm not sure whether these would catch all problems in the indexMap.
MichaelEischer left a comment
There are a few more tiny test issues that need addressing. But besides those, the only remaining issue I want to address before merging is the entryAllocBatch size. I'm currently running a few more benchmarks. The results currently look like most runtime variations and unexpected slowdowns between 8 and 256 disappear when I pass -benchtime 10s to the benchmark, which indicates that the different allocation sizes trigger different GC behavior/interleaving with the running code.
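For reference, comparisons like the ones quoted in this thread are typically produced with go test's benchmark flags plus the benchstat tool from golang.org/x/perf; the package path and file names here are illustrative:

```shell
# Run the index benchmarks 20 times with a longer benchtime,
# once on each branch, writing the raw results to a file:
go test -run='^$' -bench 'Index' -benchtime 10s -count 20 ./internal/repository > new.txt

# Compare the two result sets; benchstat reports deltas and p-values:
benchstat old.txt new.txt
```

A higher `-count` gives benchstat enough samples to separate real regressions from GC-induced run-to-run noise.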
Force-pushed from 17e4e7c to 9025f9f.
I just reran the latest benchmarks (fast createRandomIndex, proper size for DecodeIndex*) with entryAllocBatch = 8 as old and 256 as new, without changing benchtime. These are the results: … In summary, constructing indexes gets faster, while lookups get a bit slower. Still nothing that merged indexes couldn't solve, I think, given that the loss is mainly in the lookups of unknown blobs. The overallocation is statistically significant but still tiny.
I've also run a few of the lookup benchmarks with … So I'd say we use …
Switched to 256, rebased everything, and re-ran all the benchmarks (see commit messages; this took several hours). If the tests pass, I'm done.
createRandomIndex was using the global RNG, which locks on every call. It was also using twice as many random numbers as necessary and doing a float division in every iteration of the inner loop. BenchmarkDecodeIndex was using too short an input, especially for a parallel version. (It may now be using one that is a bit large.)

Results on linux/amd64, -benchtime=3s -count=20:

```
name                                     old time/op    new time/op       delta
PackerManager-8                          178ms ± 0%     178ms ± 0%        ~              (p=0.165 n=20+20)
DecodeIndex-8                            13.6µs ± 2%    4539886.8µs ± 0%  +33293901.38%  (p=0.000 n=20+18)
IndexHasUnknown-8                        44.4ns ± 7%    44.4ns ± 5%       ~              (p=0.873 n=20+19)
IndexHasKnown-8                          49.2ns ± 3%    48.3ns ± 0%       -1.86%         (p=0.000 n=20+16)
IndexAlloc-8                             802ms ± 1%     758ms ± 1%        -5.51%         (p=0.000 n=20+19)
MasterIndexLookupSingleIndex-8           124ns ± 1%     122ns ± 0%        -1.41%         (p=0.000 n=20+14)
MasterIndexLookupMultipleIndex-8         373ns ± 2%     369ns ± 2%        -1.13%         (p=0.001 n=20+20)
MasterIndexLookupSingleIndexUnknown-8    67.8ns ± 3%    68.4ns ± 5%       ~              (p=0.753 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8  316ns ± 3%     315ns ± 3%        ~              (p=0.846 n=20+20)
SaveAndEncrypt-8                         30.5ms ± 1%    30.2ms ± 1%       -1.09%         (p=0.000 n=19+19)
LoadTree-8                               527µs ± 1%     540µs ± 1%        +2.37%         (p=0.000 n=19+20)
LoadBlob-8                               5.65ms ± 0%    5.64ms ± 0%       -0.21%         (p=0.000 n=19+18)
LoadAndDecrypt-8                         7.07ms ± 2%    5.93ms ± 0%       -16.15%        (p=0.000 n=19+20)
LoadIndex-8                              32.1ms ± 2%    25.1ms ± 0%       -21.64%        (p=0.000 n=20+18)

name                                     old speed      new speed         delta
PackerManager-8                          296MB/s ± 0%   296MB/s ± 0%      ~              (p=0.159 n=20+20)
SaveAndEncrypt-8                         138MB/s ± 1%   139MB/s ± 1%      +1.10%         (p=0.000 n=19+19)
LoadBlob-8                               177MB/s ± 0%   177MB/s ± 0%      +0.21%         (p=0.000 n=19+18)
LoadAndDecrypt-8                         141MB/s ± 2%   169MB/s ± 0%      +19.24%        (p=0.000 n=19+20)

name                                     old alloc/op   new alloc/op      delta
PackerManager-8                          91.8kB ± 0%    91.8kB ± 0%       ~              (p=0.826 n=19+12)
IndexAlloc-8                             786MB ± 0%     786MB ± 0%        +0.01%         (p=0.000 n=20+20)
SaveAndEncrypt-8                         21.0MB ± 0%    21.0MB ± 0%       -0.00%         (p=0.012 n=20+19)

name                                     old allocs/op  new allocs/op     delta
PackerManager-8                          1.41k ± 0%     1.41k ± 0%        ~              (all equal)
IndexAlloc-8                             977k ± 0%      977k ± 0%         +0.01%         (p=0.022 n=20+20)
SaveAndEncrypt-8                         73.0 ± 0%      73.0 ± 0%         ~              (all equal)
```
These are faster to construct but slower to access. The allocation rate is halved, and the peak memory usage almost halved compared to the standard map. Benchmark results on linux/amd64, -benchtime=3s -count=20:

```
name                                           old time/op    new time/op    delta
PackerManager-8                                178ms ± 0%     178ms ± 0%     ~        (p=0.231 n=20+20)
DecodeIndex-8                                  4.54s ± 0%     4.30s ± 0%     -5.20%   (p=0.000 n=18+17)
DecodeIndexParallel-8                          4.54s ± 0%     4.30s ± 0%     -5.22%   (p=0.000 n=19+18)
IndexHasUnknown-8                              44.4ns ± 5%    50.5ns ±11%    +13.82%  (p=0.000 n=19+17)
IndexHasKnown-8                                48.3ns ± 0%    51.5ns ±12%    +6.68%   (p=0.001 n=16+20)
IndexAlloc-8                                   758ms ± 1%     616ms ± 1%     -18.69%  (p=0.000 n=19+19)
IndexAllocParallel-8                           234ms ± 3%     204ms ± 2%     -12.60%  (p=0.000 n=20+18)
MasterIndexLookupSingleIndex-8                 122ns ± 0%     145ns ± 9%     +18.44%  (p=0.000 n=14+20)
MasterIndexLookupMultipleIndex-8               369ns ± 2%     429ns ± 8%     +16.27%  (p=0.000 n=20+20)
MasterIndexLookupSingleIndexUnknown-8          68.4ns ± 5%    74.9ns ±13%    +9.47%   (p=0.000 n=20+20)
MasterIndexLookupMultipleIndexUnknown-8        315ns ± 3%     369ns ±11%     +17.14%  (p=0.000 n=20+20)
MasterIndexLookupParallel/known,indices=5-8    743ns ± 1%     816ns ± 2%     +9.87%   (p=0.000 n=17+17)
MasterIndexLookupParallel/unknown,indices=5-8  238ns ± 1%     260ns ± 2%     +9.14%   (p=0.000 n=19+20)
MasterIndexLookupParallel/known,indices=10-8   1.01µs ± 3%    1.11µs ± 2%    +9.79%   (p=0.000 n=19+20)
MasterIndexLookupParallel/unknown,indices=10-8 222ns ± 0%     269ns ± 2%     +20.83%  (p=0.000 n=16+20)
MasterIndexLookupParallel/known,indices=20-8   1.06µs ± 2%    1.19µs ± 2%    +12.95%  (p=0.000 n=19+18)
MasterIndexLookupParallel/unknown,indices=20-8 413ns ± 1%     530ns ± 1%     +28.19%  (p=0.000 n=18+20)
SaveAndEncrypt-8                               30.2ms ± 1%    30.4ms ± 0%    +0.71%   (p=0.000 n=19+19)
LoadTree-8                                     540µs ± 1%     576µs ± 1%     +6.73%   (p=0.000 n=20+20)
LoadBlob-8                                     5.64ms ± 0%    5.64ms ± 0%    ~        (p=0.883 n=18+17)
LoadAndDecrypt-8                               5.93ms ± 0%    5.95ms ± 1%    ~        (p=0.247 n=20+19)
LoadIndex-8                                    25.1ms ± 0%    24.5ms ± 1%    -2.54%   (p=0.000 n=18+17)

name                                           old speed      new speed      delta
PackerManager-8                                296MB/s ± 0%   296MB/s ± 0%   ~        (p=0.229 n=20+20)
SaveAndEncrypt-8                               139MB/s ± 1%   138MB/s ± 0%   -0.71%   (p=0.000 n=19+19)
LoadBlob-8                                     177MB/s ± 0%   177MB/s ± 0%   ~        (p=0.890 n=18+17)
LoadAndDecrypt-8                               169MB/s ± 0%   168MB/s ± 1%   ~        (p=0.227 n=20+19)

name                                           old alloc/op   new alloc/op   delta
PackerManager-8                                91.8kB ± 0%    91.8kB ± 0%    ~        (p=0.772 n=12+19)
IndexAlloc-8                                   786MB ± 0%     400MB ± 0%     -49.04%  (p=0.000 n=20+18)
IndexAllocParallel-8                           786MB ± 0%     401MB ± 0%     -49.04%  (p=0.000 n=19+15)
SaveAndEncrypt-8                               21.0MB ± 0%    21.0MB ± 0%    +0.00%   (p=0.000 n=19+19)

name                                           old allocs/op  new allocs/op  delta
PackerManager-8                                1.41k ± 0%     1.41k ± 0%     ~        (all equal)
IndexAlloc-8                                   977k ± 0%      907k ± 0%      -7.18%   (p=0.000 n=20+20)
IndexAllocParallel-8                           977k ± 0%      907k ± 0%      -7.17%   (p=0.000 n=19+15)
SaveAndEncrypt-8                               73.0 ± 0%      73.0 ± 0%      ~        (all equal)
```
MichaelEischer left a comment
LGTM. Thanks a lot @greatroar and @aawsome for all the discussions and work on this!
Thank you both for the thorough review!
What is the purpose of this change? What does it change?
This implements a custom chaining hash table for repository.Index.
Index merging (#2799) should be easy to add to this PR.
Was the change discussed in an issue or in the forum before?
Alternative to #2811, with better performance on the benchmarks.
Checklist
- I have added a changelog/unreleased/ file that describes the changes for our users (template here)
- I have run gofmt on the code in all commits