Use low memory map for in-memory index - about 45% less memory for index #2811

Closed

aawsome wants to merge 4 commits into restic:master from aawsome:index-lowmem

Conversation

@aawsome (Contributor) commented Jun 27, 2020

UPDATE: This PR is only left for reference; IMO #2812 should be pursued instead. It gives slightly worse memory results but much better handling and performance.

What is the purpose of this change? What does it change?

Use a low memory map for the in-memory index.

In fact, a few changes come with this PR:

  • Use separate maps for treeBlobs and dataBlobs, which saves 1 byte per index entry.
  • Replace go's map entirely with one that uses much less memory and doesn't show worst-case behavior in memory usage. The trade-off is that this map is slower, needs an extra sorting step, and shows performance spikes when inserting (growing is done all at once). IMO these drawbacks are acceptable within restic.
  • Remove the extra handling of duplicate blobs, as this "feature" is now integrated into the map.

For more details about the low-memory map, see internal/repository/index_inmem.go.
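The general technique can be sketched as follows. This is a minimal, hypothetical Go sketch, not the PR's actual code: entries (duplicates included) are appended to one flat slice, an explicit sort step runs before lookups, and reads use binary search instead of hashing, which trades lookup speed for per-entry overhead.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// entry is a stand-in for an index entry: a 32-byte blob ID plus payload.
type entry struct {
	key  [32]byte
	pack uint32 // illustrative payload (pack index, offset, ...)
}

// lowMemMap stores entries in one flat, sorted slice instead of a Go map,
// avoiding per-bucket pointers and hash-map overallocation.
type lowMemMap struct {
	entries []entry
	sorted  bool
}

// add appends an entry; duplicates are simply appended as well.
func (m *lowMemMap) add(e entry) {
	m.entries = append(m.entries, e)
	m.sorted = false
}

// sortEntries is the "extra sorting step": it must run once after
// inserting, before lookups are possible.
func (m *lowMemMap) sortEntries() {
	sort.Slice(m.entries, func(i, j int) bool {
		return bytes.Compare(m.entries[i].key[:], m.entries[j].key[:]) < 0
	})
	m.sorted = true
}

// get returns the first entry for key, if any, via binary search.
func (m *lowMemMap) get(key [32]byte) (entry, bool) {
	if !m.sorted {
		m.sortEntries()
	}
	i := sort.Search(len(m.entries), func(i int) bool {
		return bytes.Compare(m.entries[i].key[:], key[:]) >= 0
	})
	if i < len(m.entries) && m.entries[i].key == key {
		return m.entries[i], true
	}
	return entry{}, false
}

func main() {
	var m lowMemMap
	m.add(entry{key: [32]byte{1}, pack: 7})
	m.add(entry{key: [32]byte{2}, pack: 9})
	e, ok := m.get([32]byte{1})
	fmt.Println(ok, e.pack) // true 7
}
```

The overhead per entry is essentially the slice's spare capacity, which explains both the memory savings and the insertion spikes when the backing array grows.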

Here is the comparison of the repository benchmarks against master (with #2781 already merged!):

name                                     old time/op    new time/op    delta
PackerManager-4                             288ms ± 5%     262ms ± 0%    -8.94%  (p=0.000 n=10+8)
DecodeIndex-4                              18.4µs ± 4%    23.6µs ± 2%   +28.55%  (p=0.000 n=10+10)
IndexHasUnknown-4                          61.7ns ± 8%   247.0ns ± 2%  +300.19%  (p=0.000 n=10+8)
IndexHasKnown-4                            63.3ns ± 2%   279.8ns ± 2%  +341.68%  (p=0.000 n=9+9)
IndexAlloc-4                                1.08s ± 3%     1.23s ± 2%   +14.51%  (p=0.000 n=10+10)
MasterIndexLookupSingleIndex-4              173ns ± 5%     430ns ± 1%  +148.97%  (p=0.000 n=10+9)
MasterIndexLookupMultipleIndex-4            515ns ± 8%    1855ns ± 6%  +260.19%  (p=0.000 n=9+10)
MasterIndexLookupSingleIndexUnknown-4      90.9ns ± 6%   268.4ns ± 5%  +195.24%  (p=0.000 n=10+10)
MasterIndexLookupMultipleIndexUnknown-4     435ns ± 5%    1643ns ± 5%  +277.42%  (p=0.002 n=4+10)
LoadTree-4                                  459µs ± 1%     460µs ± 2%      ~     (p=0.481 n=10+10)
LoadBlob-4                                 8.19ms ± 1%    8.18ms ± 0%      ~     (p=0.549 n=10+9)
LoadAndDecrypt-4                           10.2ms ± 4%    10.1ms ± 3%      ~     (p=0.190 n=10+10)
LoadIndex-4                                41.6ms ± 2%    42.8ms ± 3%    +2.94%  (p=0.000 n=10+10)
SaveAndEncrypt-4                            127ns ± 1%     204ns ± 2%   +60.66%  (p=0.000 n=9+10)

name                                     old speed      new speed      delta
PackerManager-4                           183MB/s ± 5%   201MB/s ± 0%    +9.74%  (p=0.000 n=10+8)
LoadBlob-4                                122MB/s ± 1%   122MB/s ± 0%      ~     (p=0.534 n=10+9)
LoadAndDecrypt-4                         97.9MB/s ± 4%  99.3MB/s ± 3%      ~     (p=0.190 n=10+10)
SaveAndEncrypt-4                         33.0TB/s ± 1%  20.6TB/s ± 1%   -37.54%  (p=0.000 n=10+9)

name                                     old alloc/op   new alloc/op   delta
PackerManager-4                            91.0kB ± 0%    91.0kB ± 0%      ~     (p=1.117 n=9+10)
IndexAlloc-4                                786MB ± 0%     430MB ± 0%   -45.28%  (p=0.000 n=10+10)
SaveAndEncrypt-4                            2.00B ± 0%     3.00B ± 0%   +50.00%  (p=0.000 n=10+10)

name                                     old allocs/op  new allocs/op  delta
PackerManager-4                             1.41k ± 0%     1.41k ± 0%      ~     (p=0.917 n=9+10)
IndexAlloc-4                                 977k ± 0%      995k ± 0%    +1.90%  (p=0.000 n=10+10)
SaveAndEncrypt-4                             0.00           0.00           ~     (all equal)

As expected, some timings get worse, but these are internal timings and I didn't see any impact on restic commands (though I didn't measure those either). Note the -45.28% in memory usage.

Also note that with this PR, huge index speed improvements are possible when merging the indexes; see #2799.

Was the change discussed in an issue or in the forum before?

The ideas come from #2523 and #2799.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@greatroar (Contributor) commented Jun 28, 2020

Interesting.

For comparison, I've been trying custom hash tables as well, but started with a simple chaining one. That also seemed to solve the memory use issue, but may come with some subtle performance problems: sequential performance was down between 8% and 31% on the benchmarks, and I could see a second core running the GC at 100%, so concurrent performance might be worse.
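For readers who want a picture of the chaining approach, here is a minimal, illustrative Go sketch (not the code from #2812): collisions are resolved by linking entries, at the cost of one pointer per entry, and the many small heap nodes are what the GC has to traverse.

```go
package main

import "fmt"

// chainEntry holds a key and a pointer to the next entry in the same bucket.
type chainEntry struct {
	key  uint64
	val  int
	next *chainEntry
}

// chainMap is a minimal chaining hash table: collisions are linked, so
// load factors above 1 degrade gracefully instead of forcing a rehash.
type chainMap struct {
	buckets []*chainEntry
}

func newChainMap(n int) *chainMap {
	return &chainMap{buckets: make([]*chainEntry, n)}
}

func (m *chainMap) bucket(key uint64) int {
	return int(key % uint64(len(m.buckets)))
}

// insert prepends a new entry to its bucket's chain.
func (m *chainMap) insert(key uint64, val int) {
	b := m.bucket(key)
	m.buckets[b] = &chainEntry{key: key, val: val, next: m.buckets[b]}
}

// get walks the chain of the key's bucket.
func (m *chainMap) get(key uint64) (int, bool) {
	for e := m.buckets[m.bucket(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.val, true
		}
	}
	return 0, false
}

func main() {
	m := newChainMap(8)
	m.insert(1, 10)
	m.insert(9, 90) // collides with key 1 in an 8-bucket table
	v, ok := m.get(9)
	fmt.Println(v, ok) // 90 true
}
```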

@greatroar (Contributor):
I put up my chaining version as an alternative in #2812 because its allocation rate is comparable, but its lookup speed is a lot better than this version's. It also admits merging more easily, I think, if we still want to try that some day.

@aawsome (Contributor, Author) commented Jun 28, 2020

> I put up my chaining version as an alternative in #2812 because its allocation rate is comparable, but its lookup speed is a lot better than this version's. It also admits merging more easily, I think, if we still want to try that some day.

Your linked list looks quite promising (and IMO memory consumption could be decreased much further by not using a separate "duplicates" list). I'm a bit puzzled, though: my approach should lead to at most around 5 bytes of overhead per index entry, while your approach (with separate duplicates) should add 32 extra bytes.
So I think the "new allocs" don't indicate peak memory usage, just the total amount newly allocated.

Or I missed some issue in my implementation...

@greatroar (Contributor):
allocs/op is the number of calls to the memory allocator. alloc/op is the total amount of memory requested, not the peak memory usage. I'm currently trying to benchmark both PRs against master on some realistic tasks, and I'm getting conflicting results: sometimes mine wins, sometimes yours. Throughput isn't much different. Do you know what restic command is most likely to allocate indexes and not much else?

You're right about the duplicates handling, that's the part I couldn't quite get right.

@greatroar (Contributor):
Also, I don't really get your 5 vs. 32 bytes comment; care to elaborate?

@aawsome (Contributor, Author) commented Jun 28, 2020

> Do you know what restic command is most likely to allocate indexes and not much else?

I would say restic cat blob is your candidate 😉

> You're right about the duplicates handling, that's the part I couldn't quite get right.

IMO you should just include them in your chained list (aka linked list) and implement get and getAll methods like I did in this PR. This will save you 24 bytes per index entry.
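A hypothetical sketch of that suggestion (the node type here is illustrative, not the PR's indexEntry): duplicates of a key simply share the chain, get returns the first match, and getAll collects all of them, so no separate duplicates slice is needed.

```go
package main

import "fmt"

// node is an illustrative chained index entry; duplicates of the same key
// live in the same linked list, with no separate duplicates slice.
type node struct {
	key  uint64
	pack int
	next *node
}

// get returns the first entry for key.
func get(head *node, key uint64) (*node, bool) {
	for e := head; e != nil; e = e.next {
		if e.key == key {
			return e, true
		}
	}
	return nil, false
}

// getAll returns every entry for key, including duplicates.
func getAll(head *node, key uint64) []*node {
	var out []*node
	for e := head; e != nil; e = e.next {
		if e.key == key {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// key 5 is stored twice (a duplicate blob), key 7 once.
	head := &node{key: 5, pack: 1,
		next: &node{key: 7, pack: 2,
			next: &node{key: 5, pack: 3}}}
	first, _ := get(head, 5)
	fmt.Println(first.pack, len(getAll(head, 5))) // 1 2
}
```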

@aawsome (Contributor, Author) commented Jun 28, 2020

> Also, I don't really get your 5 vs. 32 bytes comment; care to elaborate?

The entry size for this PR is 32 bytes + 16 bytes = 48 bytes (49 bytes for your approach, since you also store the BlobType).

For this PR: on average 1/2 byte of overhead per index entry + 1/2 unused page per 48 total pages. 1/2 page is 16 * 48 bytes and 48 pages hold around 32 * 48 entries. This is another 2 bytes of overhead.

So I actually expect 2.5 bytes of overhead per entry, but the worst case will be about double that amount.

For your approach I only took into account the extra pointer (8 bytes) and the duplicates slice (24 bytes). This is already 32 bytes per entry. In addition there is the overhead of managing the buckets, which is 8 bytes per bucket at on average 2/3 of a bucket per entry. So the total overhead should be 37 bytes on average. If you remove the duplicates slice, it will still be 13 bytes per entry on average and a bit more in the worst case.

@greatroar (Contributor):
I think you're missing some bytes of malloc overhead. The Go malloc, IIRC, allocates in 64-byte multiples, so your buckets are bigger than you think. My use of a free list takes care of some of that.

Anyway, I found a way to store the duplicates in a separate list and get the indexEntry down to 64 bytes. Debugging as we speak :)

@aawsome (Contributor, Author) commented Jun 28, 2020

> I think you're missing some bytes of malloc overhead. The Go malloc, IIRC, allocates in 64-byte multiples, so your buckets are bigger than you think. My use of a free list takes care of some of that.

I think it is 8-byte multiples, or 16-byte multiples up to 288 bytes, or 32-byte multiples up to 512 bytes.

EDIT: 16-byte multiples up to 256 bytes 😉

@greatroar (Contributor):
Judging from the malloc code, you're right about the tiny sizes. Good to know.

packs restic.IDs
treePacks restic.IDs
m sync.Mutex
byType map[restic.BlobType]*lowMemMap
@greatroar (Contributor) commented on this code:
This little map may well be a performance killer on the benchmark. In #2812, I inline two lazily-initialized header structures in the Index.

@aawsome (Contributor, Author) replied:
It didn't contribute much to the performance. However, using byType []*lowMemMap is the better option: it can be indexed directly with byType[tpe], where tpe is a restic.BlobType.
I changed the implementation, so you can use it as a reference for modifying your PR (I also left a comment there).
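The pattern can be illustrated as follows. This is a hypothetical sketch, not restic's actual types: because BlobType is a small consecutive integer constant, it can index a slice or array directly, replacing a map lookup entirely (the PR uses a slice []*lowMemMap; a fixed-size array behaves the same here).

```go
package main

import "fmt"

// BlobType mirrors restic.BlobType: small consecutive integer constants.
type BlobType uint8

const (
	DataBlob BlobType = iota
	TreeBlob
	numBlobTypes
)

// lowMemMap is a placeholder for the real per-type map.
type lowMemMap struct{ entries []string }

type Index struct {
	// one map per blob type; byType[tpe] replaces a map[BlobType] lookup
	byType [numBlobTypes]*lowMemMap
}

// forType returns the map for tpe, lazily initializing it.
func (idx *Index) forType(tpe BlobType) *lowMemMap {
	if idx.byType[tpe] == nil {
		idx.byType[tpe] = &lowMemMap{}
	}
	return idx.byType[tpe]
}

func main() {
	var idx Index
	idx.forType(TreeBlob).entries = append(idx.forType(TreeBlob).entries, "tree1")
	fmt.Println(len(idx.forType(TreeBlob).entries), len(idx.forType(DataBlob).entries)) // 1 0
}
```

Indexing an array avoids both the hashing and the pointer chasing of a map[restic.BlobType] key, which is why it sidesteps the benchmark concern raised above.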

@aawsome (Contributor, Author) commented Jun 28, 2020

After also adding a linked-list variant (the approach of @greatroar), IMO that approach should be used. I'm changing this to a draft for reference and will close it as soon as #2812 is merged.

@aawsome aawsome marked this pull request as draft June 28, 2020 14:39
@aawsome (Contributor, Author) commented Jul 21, 2020

Closed, as #2812 is merged.

@aawsome aawsome closed this Jul 21, 2020
@aawsome aawsome deleted the index-lowmem branch November 7, 2020 08:24