Use low memory map for in-memory index - about 45% less memory for index #2811

Closed

aawsome wants to merge 4 commits into restic:master from aawsome:index-lowmem

Conversation

@aawsome (Contributor) commented Jun 27, 2020

UPDATE: This PR is only left for reference; IMO #2812 should be pursued instead. It gives slightly worse memory results but much better handling and performance.

What is the purpose of this change? What does it change?

Use a low memory map for the in-memory index.

In fact, a few changes come with this PR:

  • Use separate maps for treeBlobs and dataBlobs, which saves 1 byte per index entry.
  • Replace go's map entirely with one that uses much less memory and doesn't show worst-case behavior in memory usage. The trade-off is that this map is slower, needs an extra sorting step, and shows performance spikes when inserting (growing is done all at once). IMO these drawbacks are acceptable within restic.
  • Remove the extra handling of duplicate blobs, as this "feature" is now integrated into the map.

For more details about the low-memory map, see internal/repository/index_inmem.go.
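The general technique can be sketched as follows. This is a minimal, hypothetical Go sketch, not the PR's actual code: entries (duplicates included) are appended to one flat slice, an explicit sort step runs before lookups, and reads use binary search instead of hashing, which trades lookup speed for per-entry overhead.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// entry is a stand-in for an index entry: a 32-byte blob ID plus payload.
type entry struct {
	key  [32]byte
	pack uint32 // illustrative payload (pack index, offset, ...)
}

// lowMemMap stores entries in one flat, sorted slice instead of a Go map,
// avoiding per-bucket pointers and hash-map overallocation.
type lowMemMap struct {
	entries []entry
	sorted  bool
}

// add appends an entry; duplicates are simply appended as well.
func (m *lowMemMap) add(e entry) {
	m.entries = append(m.entries, e)
	m.sorted = false
}

// sortEntries is the "extra sorting step": it must run once after
// inserting, before lookups are possible.
func (m *lowMemMap) sortEntries() {
	sort.Slice(m.entries, func(i, j int) bool {
		return bytes.Compare(m.entries[i].key[:], m.entries[j].key[:]) < 0
	})
	m.sorted = true
}

// get returns the first entry for key, if any, via binary search.
func (m *lowMemMap) get(key [32]byte) (entry, bool) {
	if !m.sorted {
		m.sortEntries()
	}
	i := sort.Search(len(m.entries), func(i int) bool {
		return bytes.Compare(m.entries[i].key[:], key[:]) >= 0
	})
	if i < len(m.entries) && m.entries[i].key == key {
		return m.entries[i], true
	}
	return entry{}, false
}

func main() {
	var m lowMemMap
	m.add(entry{key: [32]byte{1}, pack: 7})
	m.add(entry{key: [32]byte{2}, pack: 9})
	e, ok := m.get([32]byte{1})
	fmt.Println(ok, e.pack) // true 7
}
```

The overhead per entry is essentially the slice's spare capacity, which explains both the memory savings and the insertion spikes when the backing array grows.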

Here is the comparison of the repository benchmarks against master (with #2781 already merged!):

name                                     old time/op    new time/op    delta
PackerManager-4                             288ms ± 5%     262ms ± 0%    -8.94%  (p=0.000 n=10+8)
DecodeIndex-4                              18.4µs ± 4%    23.6µs ± 2%   +28.55%  (p=0.000 n=10+10)
IndexHasUnknown-4                          61.7ns ± 8%   247.0ns ± 2%  +300.19%  (p=0.000 n=10+8)
IndexHasKnown-4                            63.3ns ± 2%   279.8ns ± 2%  +341.68%  (p=0.000 n=9+9)
IndexAlloc-4                                1.08s ± 3%     1.23s ± 2%   +14.51%  (p=0.000 n=10+10)
MasterIndexLookupSingleIndex-4              173ns ± 5%     430ns ± 1%  +148.97%  (p=0.000 n=10+9)
MasterIndexLookupMultipleIndex-4            515ns ± 8%    1855ns ± 6%  +260.19%  (p=0.000 n=9+10)
MasterIndexLookupSingleIndexUnknown-4      90.9ns ± 6%   268.4ns ± 5%  +195.24%  (p=0.000 n=10+10)
MasterIndexLookupMultipleIndexUnknown-4     435ns ± 5%    1643ns ± 5%  +277.42%  (p=0.002 n=4+10)
LoadTree-4                                  459µs ± 1%     460µs ± 2%      ~     (p=0.481 n=10+10)
LoadBlob-4                                 8.19ms ± 1%    8.18ms ± 0%      ~     (p=0.549 n=10+9)
LoadAndDecrypt-4                           10.2ms ± 4%    10.1ms ± 3%      ~     (p=0.190 n=10+10)
LoadIndex-4                                41.6ms ± 2%    42.8ms ± 3%    +2.94%  (p=0.000 n=10+10)
SaveAndEncrypt-4                            127ns ± 1%     204ns ± 2%   +60.66%  (p=0.000 n=9+10)

name                                     old speed      new speed      delta
PackerManager-4                           183MB/s ± 5%   201MB/s ± 0%    +9.74%  (p=0.000 n=10+8)
LoadBlob-4                                122MB/s ± 1%   122MB/s ± 0%      ~     (p=0.534 n=10+9)
LoadAndDecrypt-4                         97.9MB/s ± 4%  99.3MB/s ± 3%      ~     (p=0.190 n=10+10)
SaveAndEncrypt-4                         33.0TB/s ± 1%  20.6TB/s ± 1%   -37.54%  (p=0.000 n=10+9)

name                                     old alloc/op   new alloc/op   delta
PackerManager-4                            91.0kB ± 0%    91.0kB ± 0%      ~     (p=1.117 n=9+10)
IndexAlloc-4                                786MB ± 0%     430MB ± 0%   -45.28%  (p=0.000 n=10+10)
SaveAndEncrypt-4                            2.00B ± 0%     3.00B ± 0%   +50.00%  (p=0.000 n=10+10)

name                                     old allocs/op  new allocs/op  delta
PackerManager-4                             1.41k ± 0%     1.41k ± 0%      ~     (p=0.917 n=9+10)
IndexAlloc-4                                 977k ± 0%      995k ± 0%    +1.90%  (p=0.000 n=10+10)
SaveAndEncrypt-4                             0.00           0.00           ~     (all equal)

As expected, some timings get worse, but these are internal timings and I didn't see any impact on restic commands (though I didn't measure those either). Note the -45.28% in memory usage.

Also note that with this PR, huge index speed improvements are possible when merging the indexes; see #2799.

Was the change discussed in an issue or in the forum before?

The ideas come from #2523 and #2799.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@greatroar (Contributor) commented Jun 28, 2020

Interesting.

For comparison, I've been trying custom hash tables as well, but started with a simple chaining one. That also seemed to solve the memory use issue, but may come with some subtle performance problems: sequential performance was down between 8% and 31% on the benchmarks, and I could see a second core running the GC at 100%, so concurrent performance might be worse.
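For readers who want a picture of the chaining approach, here is a minimal, illustrative Go sketch (not the code from #2812): collisions are resolved by linking entries, at the cost of one pointer per entry, and the many small heap nodes are what the GC has to traverse.

```go
package main

import "fmt"

// chainEntry holds a key and a pointer to the next entry in the same bucket.
type chainEntry struct {
	key  uint64
	val  int
	next *chainEntry
}

// chainMap is a minimal chaining hash table: collisions are linked, so
// load factors above 1 degrade gracefully instead of forcing a rehash.
type chainMap struct {
	buckets []*chainEntry
}

func newChainMap(n int) *chainMap {
	return &chainMap{buckets: make([]*chainEntry, n)}
}

func (m *chainMap) bucket(key uint64) int {
	return int(key % uint64(len(m.buckets)))
}

// insert prepends a new entry to its bucket's chain.
func (m *chainMap) insert(key uint64, val int) {
	b := m.bucket(key)
	m.buckets[b] = &chainEntry{key: key, val: val, next: m.buckets[b]}
}

// get walks the chain of the key's bucket.
func (m *chainMap) get(key uint64) (int, bool) {
	for e := m.buckets[m.bucket(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.val, true
		}
	}
	return 0, false
}

func main() {
	m := newChainMap(8)
	m.insert(1, 10)
	m.insert(9, 90) // collides with key 1 in an 8-bucket table
	v, ok := m.get(9)
	fmt.Println(v, ok) // 90 true
}
```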

@greatroar (Contributor):
I put up my chaining version as an alternative in #2812 because its allocation rate is comparable, but its lookup speed is a lot better than this version's. It also admits merging more easily, I think, if we still want to try that some day.

@aawsome (Contributor, Author) commented Jun 28, 2020

> I put up my chaining version as an alternative in #2812 because its allocation rate is comparable, but its lookup speed is a lot better than this version's. It also admits merging more easily, I think, if we still want to try that some day.

Your linked list looks quite promising (and IMO memory consumption could be decreased much further by not using a separate "duplicates" list). I'm a bit puzzled, though: my approach should lead to at most around 5 bytes of overhead per index entry, while your approach (with separate duplicates) should add 32 extra bytes.
So I think the "new allocs" don't indicate peak memory usage, just the total amount newly allocated.

Or I missed some issue in my implementation...

@greatroar (Contributor):
allocs/op is the number of calls to the memory allocator. alloc/op is the total amount of memory requested, not the peak memory usage. I'm currently trying to benchmark both PRs against master on some realistic tasks, and I'm getting conflicting results: sometimes mine wins, sometimes yours. Throughput isn't much different. Do you know what restic command is most likely to allocate indexes and not much else?

You're right about the duplicates handling, that's the part I couldn't quite get right.

@greatroar (Contributor):
Also, I don't really get your 5 vs. 32 bytes comment; care to elaborate?

@aawsome (Contributor, Author) commented Jun 28, 2020

> Do you know what restic command is most likely to allocate indexes and not much else?

I would say restic cat blob is your candidate 😉

> You're right about the duplicates handling, that's the part I couldn't quite get right.

IMO you should just include them in your chained list (aka linked list) and implement get and getAll methods like I did in this PR. This will save you 24 bytes per index entry.
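A hypothetical sketch of that suggestion (the node type here is illustrative, not the PR's indexEntry): duplicates of a key simply share the chain, get returns the first match, and getAll collects all of them, so no separate duplicates slice is needed.

```go
package main

import "fmt"

// node is an illustrative chained index entry; duplicates of the same key
// live in the same linked list, with no separate duplicates slice.
type node struct {
	key  uint64
	pack int
	next *node
}

// get returns the first entry for key.
func get(head *node, key uint64) (*node, bool) {
	for e := head; e != nil; e = e.next {
		if e.key == key {
			return e, true
		}
	}
	return nil, false
}

// getAll returns every entry for key, including duplicates.
func getAll(head *node, key uint64) []*node {
	var out []*node
	for e := head; e != nil; e = e.next {
		if e.key == key {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// key 5 is stored twice (a duplicate blob), key 7 once.
	head := &node{key: 5, pack: 1,
		next: &node{key: 7, pack: 2,
			next: &node{key: 5, pack: 3}}}
	first, _ := get(head, 5)
	fmt.Println(first.pack, len(getAll(head, 5))) // 1 2
}
```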

@aawsome (Contributor, Author) commented Jun 28, 2020

> Also, I don't really get your 5 vs. 32 bytes comment; care to elaborate?

The entry size for this PR is 32 bytes + 16 bytes = 48 bytes (49 bytes for your approach, since you also store the BlobType).

For this PR: on average 1/2 byte of overhead per index entry + 1/2 unused page per 48 total pages. 1/2 page is 16 * 48 bytes and 48 pages hold around 32 * 48 entries. This is another 2 bytes of overhead.

So I actually expect 2.5 bytes of overhead per entry, but the worst case will be about double that amount.

For your approach I only took into account the extra pointer (8 bytes) and the duplicates slice (24 bytes). This is already 32 bytes per entry. In addition there is the overhead of managing the buckets, which is 8 bytes per bucket at on average 2/3 of a bucket per entry. So the total overhead should be 37 bytes on average. If you remove the duplicates slice, it will still be 13 bytes per entry on average and a bit more in the worst case.

@greatroar (Contributor):
I think you're missing some bytes of malloc overhead. The Go malloc, IIRC, allocates in 64-byte multiples, so your buckets are bigger than you think. My use of a free list takes care of some of that.

Anyway, I found a way to store the duplicates in a separate list and get the indexEntry down to 64 bytes. Debugging as we speak :)

@aawsome (Contributor, Author) commented Jun 28, 2020

> I think you're missing some bytes of malloc overhead. The Go malloc, IIRC, allocates in 64-byte multiples, so your buckets are bigger than you think. My use of a free list takes care of some of that.

I think it is 8-byte multiples, or 16-byte multiples up to 288 bytes, or 32-byte multiples up to 512 bytes.

EDIT: 16-byte multiples up to 256 bytes 😉

@greatroar (Contributor):
Judging from the malloc code, you're right about the tiny sizes. Good to know.

packs restic.IDs
treePacks restic.IDs
m sync.Mutex
byType map[restic.BlobType]*lowMemMap
@greatroar (Contributor) commented on this code:
This little map may well be a performance killer on the benchmark. In #2812, I inline two lazily-initialized header structures in the Index.

@aawsome (Contributor, Author) replied:
It didn't contribute much to the performance. However, using byType []*lowMemMap is the better option: it can be indexed directly with byType[tpe], where tpe is a restic.BlobType.
I changed the implementation, so you can use it as a reference for modifying your PR (I also left a comment there).
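The pattern can be illustrated as follows. This is a hypothetical sketch, not restic's actual types: because BlobType is a small consecutive integer constant, it can index a slice or array directly, replacing a map lookup entirely (the PR uses a slice []*lowMemMap; a fixed-size array behaves the same here).

```go
package main

import "fmt"

// BlobType mirrors restic.BlobType: small consecutive integer constants.
type BlobType uint8

const (
	DataBlob BlobType = iota
	TreeBlob
	numBlobTypes
)

// lowMemMap is a placeholder for the real per-type map.
type lowMemMap struct{ entries []string }

type Index struct {
	// one map per blob type; byType[tpe] replaces a map[BlobType] lookup
	byType [numBlobTypes]*lowMemMap
}

// forType returns the map for tpe, lazily initializing it.
func (idx *Index) forType(tpe BlobType) *lowMemMap {
	if idx.byType[tpe] == nil {
		idx.byType[tpe] = &lowMemMap{}
	}
	return idx.byType[tpe]
}

func main() {
	var idx Index
	idx.forType(TreeBlob).entries = append(idx.forType(TreeBlob).entries, "tree1")
	fmt.Println(len(idx.forType(TreeBlob).entries), len(idx.forType(DataBlob).entries)) // 1 0
}
```

Indexing an array avoids both the hashing and the pointer chasing of a map[restic.BlobType] key, which is why it sidesteps the benchmark concern raised above.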

@aawsome (Contributor, Author) commented Jun 28, 2020

After also adding a linked-list variant (the approach of @greatroar), IMO that approach should be used. I'm changing this to a draft for reference and will close it as soon as #2812 is merged.

@aawsome aawsome marked this pull request as draft June 28, 2020 14:39
@aawsome (Contributor, Author) commented Jul 21, 2020

Closed, as #2812 is merged.

@aawsome aawsome closed this Jul 21, 2020
@aawsome aawsome deleted the index-lowmem branch November 7, 2020 08:24