Skip to content

Change condition for full index#2749

Merged
MichaelEischer merged 1 commit intorestic:masterfrom
aawsome:fix-fullindex
Jun 10, 2020
Merged

Change condition for full index#2749
MichaelEischer merged 1 commit intorestic:masterfrom
aawsome:fix-fullindex

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented May 24, 2020

What is the purpose of this change? What does it change?

Index files are saved when they are "full". This is determined by the number of blobs within and the age of the index. The current coding translates to the following decision table:

age < MinAge age in [MinAge, MaxAge] age > MaxAge
#blobs < MinBlobs not full not full full
#blobs in [MinBlobs, MaxBlobs] not full not full full
#blobs > MaxBlobs not full full full

This is a bit counter-intuitive and has the following shortcomings:

  • MinBlobs is specified, but effectively not used as it does not change the behavior
  • an index is never full if it has not reached the age MinAge - despite it can can possible contain millions of index entries.

This PR changes the decision table to:

EDIT: condition has been even more simplified, see below in the comments

age < MinAge age in [MinAge, MaxAge] age > MaxAge
#blobs < MinBlobs not full not full full
#blobs in [MinBlobs, MaxBlobs] not full full full
#blobs > MaxBlobs full full full

Moreover it changes the value of MinBlobs to 50 and MaxBlobs to 50.000 (was 2.000)

Was the change discussed in an issue or in the forum before?

No, but the discussion came up in #2718.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have not added tests for all changes in this PR - no extra tests needed.
  • I have not added documentation for the changes (in the manual) - no extra docu needed.
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here) - do we really need one for this minor change?
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome aawsome mentioned this pull request May 24, 2020
11 tasks
@aawsome aawsome mentioned this pull request Jun 6, 2020
7 tasks
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the intuition behind saving an index once it's more than two minutes old (MinAge) and contains at least 50 blobs (MinBlobs). Assuming you have at least 1MByte/s upload speed and backup a large file. Then with 1.5MB blobs (the average size) this would still be 80 blobs within 2 minutes and therefore immediately trigger an index upload. So this would degrade to Full() == (age>MinAge or blobs>MaxBlobs).

I'd propose to simplify this whole method instead and just check for (age>MaxAge or blobs>MaxBlobs) (notice the MaxAge instead of MinAge). And then maybe reduce MaxAge to 10 minutes.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 7, 2020

I don't really understand the intuition behind saving an index once it's more than two minutes old (MinAge) and contains at least 50 blobs (MinBlobs). Assuming you have at least 1MByte/s upload speed and backup a large file. Then with 1.5MB blobs (the average size) this would still be 80 blobs within 2 minutes and therefore immediately trigger an index upload. So this would degrade to Full() == (age>MinAge or blobs>MaxBlobs).

I also didn't understand why upper and lower limits were needed 😉 Just wanted to fix it in a way that it more consistent and intuitive...

I'd propose to simplify this whole method instead and just check for (age>MaxAge or blobs>MaxBlobs) (notice the MaxAge instead of MinAge). And then maybe reduce MaxAge to 10 minutes.

I agree and will change that soon..

@aawsome aawsome force-pushed the fix-fullindex branch 2 times, most recently from 536837f to 785c63b Compare June 7, 2020 18:52
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jun 7, 2020

I changed the condition as suggested.

@MichaelEischer
Copy link
Copy Markdown
Member

The new limit of MaxBlobs = 50k might seem to be just a round number that was picked arbitrarily, but actually it provides a bunch of useful properties:

  • One blob entry in an index requires approx. 120 bytes, each listed pack file adds another 85 bytes. For small blobs we can basically ignore the per pack overhead, while for large blobs that overhead applies approximately once every for blobs. This yield 120-141 bytes per blob on average. With 50k blobs the resulting index has between 6 and 7 MB. This is slightly larger than the minimum pack size of 4MB.
  • It reduces the memory required to load the (intermediate) index files. Go hashmaps use a load factor of 6.5/8 and allocate an array whose size is a power of two. For 50k entries this requires an array with 65536 slots. The maximum number of entries for that array size would be 53248, which is pretty close to the minimum blobs per pack of 50k. As pack files are added as a whole to the index and a pack can contain hundreds of blobs, we need the wiggle room of a few thousand blobs. This PR also obsoletes Optimize indexMaxBlobs for lower memory usage #2631 .

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch!

@MichaelEischer MichaelEischer merged commit 6856d1e into restic:master Jun 10, 2020
@aawsome aawsome deleted the fix-fullindex branch June 11, 2020 06:18
@aawsome aawsome mentioned this pull request Jun 12, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants