Skip to content

Optimize in-memory index memory consumption#2338

Closed
MichaelEischer wants to merge 2 commits intorestic:masterfrom
MichaelEischer:smaller-inmemory-index
Closed

Optimize in-memory index memory consumption#2338
MichaelEischer wants to merge 2 commits intorestic:masterfrom
MichaelEischer:smaller-inmemory-index

Conversation

@MichaelEischer
Copy link
Copy Markdown
Member

What is the purpose of this change? What does it change?

Slightly reduce the memory usage of the in-memory index that is used by most restic commands. The memory usage should be 10-15% lower than before. This is achieved by optimizing for the usual situation where each blob is contained in a single pack file.

The pull request still needs tests which trigger the code paths with multiple packs. I'm not sure whether there are currently any tests that explicitly generate blobs which are stored in multiple packs.

Was the change discussed in an issue or in the forum before?

The problem of high memory usage shows up in several issues #2284 , #1988 , #1723 . However, the approach of this pull request was not yet discussed in an issue. The underlying scalability issue is not tackled by this pull request.

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • No user visible parts. I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jul 13, 2019

Codecov Report

Merging #2338 into master will decrease coverage by 4.01%.
The diff coverage is 71.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2338      +/-   ##
==========================================
- Coverage   50.95%   46.93%   -4.02%     
==========================================
  Files         178      178              
  Lines       14548    14562      +14     
==========================================
- Hits         7413     6835     -578     
- Misses       6069     6704     +635     
+ Partials     1066     1023      -43
Impacted Files Coverage Δ
internal/repository/index.go 64.33% <71.11%> (-3.15%) ⬇️
internal/backend/b2/b2.go 0% <0%> (-79.55%) ⬇️
internal/backend/swift/swift.go 0% <0%> (-78.83%) ⬇️
internal/backend/gs/gs.go 0% <0%> (-74%) ⬇️
internal/backend/azure/azure.go 0% <0%> (-69.46%) ⬇️
internal/backend/swift/config.go 34.69% <0%> (-57.15%) ⬇️
internal/archiver/file_saver.go 85.96% <0%> (-2.64%) ⬇️
internal/backend/test/tests.go 59.95% <0%> (-0.66%) ⬇️
internal/archiver/blob_saver.go 97.53% <0%> (+2.46%) ⬆️
internal/checker/checker.go 70.25% <0%> (+3.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e9eefa...6be0d42. Read the comment docs.

As go is able to fully inline and optimize function calls the code just
becomes shorter and less duplicated.
Optimize for the default case where each blob is stored in exactly
one blob. Duplicate blobs can only appear if a backup manages to upload
a pack file but no corresponding index. In addition, the prune command
will remove these duplicated blobs. Therefore, only a small fraction of
blobs will ever be present in multiple packs.

Previously the blob-handle-to-pack-id-map stored a slice pointing to a
list of pack entries. The slice part which allows storing multiple
entries is, however, rarely needed. This patch replaces it with a
pointer to a single pack entry descriptor and a separate map to handle
"overflows", that is multiple pack entries for a single blob handle.

The simpler option of directly storing the entry descriptor in the map
would require more memory in the common case: A map contains a lot of
unused buckets which also consume memory. On amd64 a pointer to an
entry is 8 bytes, a slice is 24 bytes in size, and a pack entry requires
48 bytes. Maps in Go have a max. load factor L of 6.5/8 which after
growing will result in a load factor L of 3.25/8. With N being the
number of map slots this yields the following memory usage:

Previous: N*24 + N*L*48
Pointer: N*8 + N*L*48
Inline: N*48

The Pointer variant obviously needs less memory than Previous.
Inline needs more memory than the Pointer variant if
Inline - Pointer > 0 <=> N * (48 - 8 - L*48) > 0 <=> L < 5/6
As 5/6 > max. load factor 6.5/8 the Pointer variant always requires less
memory.

Note that shrinking a pack entry to 40 bytes would make the Inline
variant preferable for L > 2/3. The Pointer variant still needs to
allocate 48 bytes as the Go memory allocator has no pool for 40byte
objects and therefore the next biggest, that is 48 bytes, will be used.
L > 2/3 would cover approximately one third of the usual load factor
range, such that the Pointer variant would still be preferable on
average.
@aawsome
Copy link
Copy Markdown
Contributor

aawsome commented Jan 12, 2020

Thanks @MichaelEischer for this proposal! I agree that we need a solution to the memory consumption problem and you correctly have identified the index data structure as the core point to improve 👍
This is also what I tried to discuss in #1988 and #2523.

About your solution I'm not confident that 10-15% improvement is enough. I like the idea of not storing duplicate information of blobs being in the index more than once (and this situation should anyway be cleaned by the next prune run or its alternatives, see #2513).
I would however prefer a solution where duplicates are not even saved in the in-memory index data structure. I did hardly find any code where the []PackedBlob in Lookup was really needed (I think it was the find command) and would propose to change the interface to Lookup(ID, BlobType) (*PackedBlob, bool). Then your PR would become much simpler while still achiving the same savings.

But I think there are even better solutions as the PackID for packs containing many blobs is still multiple times saved in memory and there is a notable overhead of all the maps which I think can be removed without loosing too much performance. Please have a look at index_lowmem2.go in index-alternatives. This is however still WIP and needs more testing.

Maybe some of the restic maintainers can give directions where to work on...

@MichaelEischer
Copy link
Copy Markdown
Member Author

The []PackedBlob array is also used in Repository.loadBlob to try to get the blob from an already cached pack. It could also be useful, when trying to fix a repository with damaged files. However, the simplified interface would make the code a lot simpler.

@aawsome : Yes, I know that 10-15% improvement is not that much. The PackIDs can definitely be stored separately which should improve memory usage by quite a lot. However, as the pull request is already several months old I can't remember why I've ignored that optimization.

@MichaelEischer
Copy link
Copy Markdown
Member Author

@MichaelEischer MichaelEischer deleted the smaller-inmemory-index branch February 10, 2020 21:25
@tcurdt
Copy link
Copy Markdown

tcurdt commented Feb 10, 2020

@MichaelEischer are there PRs for those to follow?

@aawsome
Copy link
Copy Markdown
Contributor

aawsome commented Feb 11, 2020

@tcurdt For the refactor-index branch see #2523 for the discussion.
And yes a PR will follow...

@aawsome aawsome mentioned this pull request Jun 12, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants