Optimize in-memory index memory consumption #2338
MichaelEischer wants to merge 2 commits into restic:master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2338      +/-   ##
==========================================
- Coverage   50.95%   46.93%    -4.02%
==========================================
  Files         178      178
  Lines       14548    14562      +14
==========================================
- Hits         7413     6835     -578
- Misses       6069     6704     +635
+ Partials     1066     1023      -43
As Go is able to fully inline and optimize these function calls, the code just becomes shorter and less duplicated.
Optimize for the default case where each blob is stored in exactly one pack. Duplicate blobs can only appear if a backup manages to upload a pack file but no corresponding index. In addition, the prune command will remove these duplicated blobs. Therefore, only a small fraction of blobs will ever be present in multiple packs.

Previously, the blob-handle-to-pack-ID map stored a slice pointing to a list of pack entries. The slice part, which allows storing multiple entries, is rarely needed, however. This patch replaces it with a pointer to a single pack entry descriptor and a separate map to handle "overflows", that is, multiple pack entries for a single blob handle.

The simpler option of directly storing the entry descriptor in the map would require more memory in the common case: a map contains a lot of unused buckets which also consume memory. On amd64, a pointer to an entry is 8 bytes, a slice header is 24 bytes, and a pack entry requires 48 bytes. Maps in Go have a maximum load factor L of 6.5/8, which right after growing results in a load factor of 3.25/8. With N being the number of map slots, this yields the following memory usage:

Previous: N*24 + N*L*48
Pointer:  N*8  + N*L*48
Inline:   N*48

The Pointer variant obviously needs less memory than Previous. Inline needs more memory than the Pointer variant if

Inline - Pointer > 0 <=> N * (48 - 8 - L*48) > 0 <=> L < 5/6

As 5/6 > the maximum load factor 6.5/8, the Pointer variant always requires less memory.

Note that shrinking a pack entry to 40 bytes would make the Inline variant preferable for L > 2/3: the Pointer variant would still allocate 48 bytes, as the Go memory allocator has no pool for 40-byte objects and therefore uses the next bigger size class, 48 bytes. However, L > 2/3 covers only approximately one third of the usual load factor range, so the Pointer variant would still be preferable on average.
Thanks @MichaelEischer for this proposal! I agree that we need a solution to the memory consumption problem, and you have correctly identified the index data structure as the core point to improve 👍 About your solution, I'm not confident that a 10-15% improvement is enough. I like the idea of not storing duplicate information for blobs that appear in the index more than once (and this situation should anyway be cleaned up by the next prune). But I think there are even better solutions, as the PackID for packs containing many blobs is still saved in memory multiple times, and there is a notable overhead from all the maps, which I think can be removed without losing too much performance. Please have a look at Maybe some of the restic maintainers can give directions on where to work...
@aawsome: Yes, I know that a 10-15% improvement is not that much. The PackIDs can definitely be stored separately, which should improve memory usage by quite a lot. However, as the pull request is already several months old, I can't remember why I ignored that optimization. |
Closing as this pull request is superseded by either https://github.com/aawsome/restic/tree/refactor-index or https://github.com/MichaelEischer/restic/tree/opt-hash-idx |
@MichaelEischer are there PRs for those to follow? |
What is the purpose of this change? What does it change?
Slightly reduce the memory usage of the in-memory index that is used by most restic commands. The memory usage should be 10-15% lower than before. This is achieved by optimizing for the usual situation where each blob is contained in a single pack file.
The pull request still needs tests which trigger the code paths with multiple packs. I'm not sure whether there are currently any tests that explicitly generate blobs which are stored in multiple packs.
Was the change discussed in an issue or in the forum before?
The problem of high memory usage shows up in several issues: #2284, #1988, #1723. However, the approach of this pull request was not yet discussed in an issue. The underlying scalability issue is not tackled by this pull request.
Checklist
- Add a changelog entry in changelog/unreleased/ that describes the changes for our users (template here)
- Run gofmt on the code in all commits