Skip to content

Don't save exact duplicates in merged index#2863

Merged
MichaelEischer merged 2 commits intorestic:masterfrom
aawsome:index-no-duplicates
Aug 8, 2020
Merged

Don't save exact duplicates in merged index#2863
MichaelEischer merged 2 commits intorestic:masterfrom
aawsome:index-no-duplicates

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Aug 2, 2020

What does this PR change? What problem does it solve?

Don't save exact duplicates (i.e. same blob ID, same pack, same offset and same length) in index. This situation can happen in rare cases where superseded index files are not deleted. Saving exact duplicates doesn't help anywhere but may pose some problems (more memory usage; and possible trouble within #2842)

Was the change discussed in an issue or in the forum before?

see #2839

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@greatroar
Copy link
Copy Markdown
Contributor

I haven't investigated the problem, but the code looks good. Insertion will be more expensive, but is still O(1) expected. Merge reduces to add so that is covered as well (maybe add a test for that?).

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 4, 2020

There is another issue I just came around when thinking about this change once again: The packIDs are not deduplicated so far..
I need to think about how to best do this and change this to WIP in meanwhile..

@aawsome aawsome marked this pull request as draft August 4, 2020 05:24
@aawsome aawsome force-pushed the index-no-duplicates branch from 5fa36d8 to 6ad5808 Compare August 4, 2020 14:44
@aawsome aawsome marked this pull request as ready for review August 4, 2020 14:47
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 4, 2020

I changed this such that now the check is performed in Index.merge(). IndexMap` stays unchanged now.

@aawsome aawsome changed the title Don't save exact duplicates in indexmap Don't save exact duplicates in merged index Aug 4, 2020
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aawsome Did you check the masterIndex benchmarks regarding the performance impact of the additional foreach loop?

@aawsome aawsome force-pushed the index-no-duplicates branch from 6ad5808 to b112533 Compare August 5, 2020 04:50
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 5, 2020

@aawsome Did you check the masterIndex benchmarks regarding the performance impact of the additional foreach loop?

Note that the additional foreachID loop is O(1) itself (at least if the undelying hash is not degenerated). Hence the total complexity of the merge is still roughly (neglecting hash growing of idx) O(n) with n being the number of elements of the index to be merged, i.e. idx2.

Moreover this change only affects the merge itself, and in real-world-scenarios this occurs during

  • loading the index from index files into memory (should be dominated by loading and decrypting/deserializing the files)
  • saving a new finished index during backup (this should happen quite rarely as backup itself should be mainly dominated by actually saving data)

So I do not expect any impact on real-life usage.

About benchmarking: Seems to me that none of the current benchmarks actually includes merging of indexes in the measurements.
I added the benchmark MasterIndexAlloc which creates and merges Indexes under measurement.
Here are the results comparing 5e63294 with b112533:

name                old time/op    new time/op    delta
MasterIndexAlloc-4     334ms ± 1%     460ms ± 1%  +37.68%  (p=0.000 n=9+8)

name                old alloc/op   new alloc/op   delta
MasterIndexAlloc-4     258MB ± 0%     258MB ± 0%     ~     (p=0.101 n=10+8)

name                old allocs/op  new allocs/op  delta
MasterIndexAlloc-4      317k ± 0%      317k ± 0%     ~     (p=0.267 n=10+6)

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Note that the additional foreachID loop is O(1) itself (at least if the undelying hash is not degenerated). Hence the total complexity of the merge is still roughly (neglecting hash growing of idx) O(n) with n being the number of elements of the index to be merged, i.e. idx2.

foreachID could end up a rather large value if a single blob exists lots of times in different packs. But in that case the repository is probably really broken. So amortized O(1) would be slightly more precise (not that it matters).

The overhead of 40% seems fine for me, I was worried a bit that we could end up with an overhead of maybe 2x-3x.

@MichaelEischer MichaelEischer merged commit eca0f0a into restic:master Aug 8, 2020
@aawsome aawsome deleted the index-no-duplicates branch August 31, 2020 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants