Don't save exact duplicates in merged index by aawsome · Pull Request #2863 · restic/restic

aawsome · 2020-08-02T07:13:37Z

What does this PR change? What problem does it solve?

Don't save exact duplicates (i.e. same blob ID, same pack, same offset and same length) in the index. This situation can happen in rare cases where superseded index files are not deleted. Saving exact duplicates doesn't help anywhere but may cause problems (more memory usage, and possible trouble within #2842).
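
To make the definition of an "exact duplicate" concrete, here is a minimal sketch in Go. The types `blobID`, `packID` and `entry` and the function `isExactDuplicate` are hypothetical simplifications for illustration, not restic's actual index types.

```go
package main

import "fmt"

// blobID and packID are hypothetical stand-ins for 32-byte content IDs.
type blobID [32]byte
type packID [32]byte

// entry is a hypothetical, simplified index entry: it records where one
// blob is stored.
type entry struct {
	id     blobID
	pack   packID
	offset uint32
	length uint32
}

// isExactDuplicate reports whether a and b describe the same blob at the
// same location: same blob ID, same pack, same offset and same length.
func isExactDuplicate(a, b entry) bool {
	// All fields are comparable, so == compares every field.
	return a == b
}

func main() {
	a := entry{id: blobID{1}, pack: packID{7}, offset: 0, length: 128}

	// The same entry read again, e.g. from a superseded index file.
	b := a

	// The same blob stored in another pack: a duplicate blob, but not an
	// exact duplicate entry.
	c := a
	c.pack = packID{8}

	fmt.Println(isExactDuplicate(a, b)) // true
	fmt.Println(isExactDuplicate(a, c)) // false
}
```

Only the first case is affected by this change; a blob that is stored in several packs still yields several index entries.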

Was the change discussed in an issue or in the forum before?

see #2839

Checklist

I have read the Contribution Guidelines
I have enabled maintainer edits for this PR
I have added tests for all changes in this PR
I have added documentation for the changes (in the manual)
There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

internal/repository/indexmap.go

greatroar · 2020-08-03T10:43:27Z

I haven't investigated the problem, but the code looks good. Insertion will be more expensive, but is still O(1) expected. Merge reduces to add so that is covered as well (maybe add a test for that?).

internal/repository/indexmap_test.go

aawsome · 2020-08-04T05:24:17Z

There is another issue I just came across when thinking about this change once again: the packIDs are not deduplicated so far.
I need to think about how best to do this and will mark this as WIP in the meantime.

aawsome · 2020-08-04T14:48:15Z

I changed this so that the check is now performed in Index.merge(). IndexMap stays unchanged now.

MichaelEischer

@aawsome Did you check the masterIndex benchmarks regarding the performance impact of the additional foreach loop?

internal/repository/index.go

internal/repository/master_index_test.go

aawsome · 2020-08-05T04:59:47Z

> @aawsome Did you check the masterIndex benchmarks regarding the performance impact of the additional foreach loop?

Note that the additional foreachID loop is itself O(1) (at least if the underlying hash is not degenerate). Hence the total complexity of the merge is still roughly O(n) (neglecting hash growth of idx), with n being the number of elements of the index to be merged, i.e. idx2.

Moreover, this change only affects the merge itself, and in real-world scenarios this occurs during:

- loading the index from index files into memory (should be dominated by loading and decrypting/deserializing the files)
- saving a new finished index during backup (this should happen quite rarely, as backup itself should be mainly dominated by actually saving data)

So I do not expect any impact on real-life usage.
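
The merge with the duplicate check discussed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: the `index` type is a plain Go map keyed by a string ID, and `entry`, `foreachID` and `merge` are hypothetical names modelled on the ones used in the discussion.

```go
package main

import "fmt"

// entry is a hypothetical, simplified location of a blob inside a pack.
type entry struct {
	pack   string // stand-in for a pack ID
	offset uint32
	length uint32
}

// index maps a blob ID to all locations where that blob is stored.
// A plain map is an assumption made for this sketch.
type index map[string][]entry

// foreachID calls fn for every entry stored under id.
func (idx index) foreachID(id string, fn func(entry)) {
	for _, e := range idx[id] {
		fn(e)
	}
}

// merge adds all entries of idx2 to idx and skips exact duplicates.
// For each entry of idx2 it only visits the entries of idx with the same
// blob ID, which is usually a very short list.
func (idx index) merge(idx2 index) {
	for id, entries := range idx2 {
		for _, e2 := range entries {
			found := false
			idx.foreachID(id, func(e entry) {
				if e == e2 {
					found = true
				}
			})
			if !found {
				idx[id] = append(idx[id], e2)
			}
		}
	}
}

func main() {
	idx := index{"blob1": {{pack: "p1", offset: 0, length: 10}}}
	idx2 := index{
		"blob1": {
			{pack: "p1", offset: 0, length: 10}, // exact duplicate, skipped
			{pack: "p2", offset: 5, length: 10}, // same blob in another pack, kept
		},
		"blob2": {{pack: "p2", offset: 15, length: 20}},
	}

	idx.merge(idx2)
	fmt.Println(len(idx["blob1"]), len(idx["blob2"])) // 2 1
}
```

The outer loops visit each element of idx2 once, so the total work stays proportional to the size of idx2 as long as each blob ID has only a few entries in idx.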

About benchmarking: It seems to me that none of the current benchmarks actually includes merging of indexes in the measurements.
I added the benchmark MasterIndexAlloc, which creates and merges indexes under measurement.
Here are the results comparing 5e63294 with b112533:

name                old time/op    new time/op    delta
MasterIndexAlloc-4     334ms ± 1%     460ms ± 1%  +37.68%  (p=0.000 n=9+8)

name                old alloc/op   new alloc/op   delta
MasterIndexAlloc-4     258MB ± 0%     258MB ± 0%     ~     (p=0.101 n=10+8)

name                old allocs/op  new allocs/op  delta
MasterIndexAlloc-4      317k ± 0%      317k ± 0%     ~     (p=0.267 n=10+6)
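
For readers who want to reproduce this kind of measurement, the following is a minimal, hypothetical sketch of a benchmark that builds and merges indexes inside the timed loop. It is not the MasterIndexAlloc benchmark from this PR: `entry`, `index`, `buildIndex`, `merge` and `benchmarkMerge` are illustrative names, and the index is a plain Go map. It uses `testing.Benchmark` from the standard library so it can run as an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

// entry is a hypothetical, simplified location of a blob inside a pack.
type entry struct {
	pack           string
	offset, length uint32
}

// index maps a (numeric, for simplicity) blob ID to its locations.
type index map[int][]entry

// buildIndex creates an index with n blobs, each stored once in pack.
func buildIndex(n int, pack string) index {
	idx := make(index, n)
	for i := 0; i < n; i++ {
		idx[i] = []entry{{pack: pack, offset: uint32(i), length: 1}}
	}
	return idx
}

// merge adds idx2 into idx, skipping exact duplicates.
func (idx index) merge(idx2 index) {
	for id, entries := range idx2 {
	next:
		for _, e2 := range entries {
			for _, e := range idx[id] {
				if e == e2 {
					continue next // exact duplicate: do not store it again
				}
			}
			idx[id] = append(idx[id], e2)
		}
	}
}

// benchmarkMerge creates and merges indexes under measurement, so both
// allocation and the duplicate check are included in the timings.
func benchmarkMerge(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		idx := buildIndex(1000, "p1")
		idx.merge(buildIndex(1000, "p1")) // every entry is an exact duplicate
		idx.merge(buildIndex(1000, "p2")) // every entry is new
	}
}

func main() {
	res := testing.Benchmark(benchmarkMerge)
	fmt.Println(res.String(), res.MemString())
}
```

Running the same benchmark on two commits and comparing the outputs with a tool such as benchstat gives a table in the format shown above.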

MichaelEischer

LGTM.

> Note that the additional foreachID loop is itself O(1) (at least if the underlying hash is not degenerate). Hence the total complexity of the merge is still roughly O(n) (neglecting hash growth of idx), with n being the number of elements of the index to be merged, i.e. idx2.

The foreachID loop could run for a rather large number of iterations if a single blob exists many times in different packs. But in that case the repository is probably really broken. So amortized O(1) would be slightly more precise (not that it matters).

The overhead of roughly 40% seems fine to me; I was a bit worried that we could end up with an overhead of maybe 2x-3x.

aawsome mentioned this pull request Aug 2, 2020

Rebuild index in prune by using in-memory index #2842 (merged)

greatroar reviewed Aug 3, 2020

internal/repository/indexmap.go (outdated, resolved)

MichaelEischer reviewed Aug 3, 2020

internal/repository/indexmap_test.go (outdated, resolved)

aawsome marked this pull request as draft August 4, 2020 05:24

aawsome force-pushed the index-no-duplicates branch from 5fa36d8 to 6ad5808 Compare August 4, 2020 14:44

aawsome marked this pull request as ready for review August 4, 2020 14:47

aawsome changed the title from "Don't save exact duplicates in indexmap" to "Don't save exact duplicates in merged index" Aug 4, 2020

MichaelEischer reviewed Aug 4, 2020

internal/repository/index.go (outdated, resolved)

internal/repository/master_index_test.go (outdated, resolved)

aawsome added 2 commits August 5, 2020 06:32

Add benchmark MasterIndexAlloc

5e63294

Don't save exact duplicates when merging indexes

b112533

aawsome force-pushed the index-no-duplicates branch from 6ad5808 to b112533 Compare August 5, 2020 04:50

MichaelEischer approved these changes Aug 8, 2020

MichaelEischer merged commit eca0f0a into restic:master Aug 8, 2020

aawsome deleted the index-no-duplicates branch August 31, 2020 12:18
