Don't save exact duplicates in merged index#2863
Don't save exact duplicates in merged index#2863MichaelEischer merged 2 commits intorestic:masterfrom
Conversation
|
I haven't investigated the problem, but the code looks good. Insertion will be more expensive, but is still O(1) expected. Merge reduces to add so that is covered as well (maybe add a test for that?). |
|
There is another issue I just came around when thinking about this change once again: The packIDs are not deduplicated so far.. |
5fa36d8 to
6ad5808
Compare
|
I changed this such that now the check is performed in |
MichaelEischer
left a comment
There was a problem hiding this comment.
@aawsome Did you check the masterIndex benchmarks regarding the performance impact of the additional foreach loop?
6ad5808 to
b112533
Compare
Note that the additional Moreover this change only affects the merge itself, and in real-world-scenarios this occurs during
So I do not expect any impact on real-life usage. About benchmarking: Seems to me that none of the current benchmarks actually includes merging of indexes in the measurements. |
MichaelEischer
left a comment
There was a problem hiding this comment.
LGTM.
Note that the additional
foreachIDloop isO(1)itself (at least if the undelying hash is not degenerated). Hence the total complexity of the merge is still roughly (neglecting hash growing ofidx)O(n)withnbeing the number of elements of the index to be merged, i.e.idx2.
foreachID could end up a rather large value if a single blob exists lots of times in different packs. But in that case the repository is probably really broken. So amortized O(1) would be slightly more precise (not that it matters).
The overhead of 40% seems fine for me, I was worried a bit that we could end up with an overhead of maybe 2x-3x.
What does this PR change? What problem does it solve?
Don't save exact duplicates (i.e. same blob ID, same pack, same offset and same length) in index. This situation can happen in rare cases where superseded index files are not deleted. Saving exact duplicates doesn't help anywhere but may pose some problems (more memory usage; and possible trouble within #2842)
Was the change discussed in an issue or in the forum before?
see #2839
Checklist
changelog/unreleased/that describes the changes for our users (template here)gofmton the code in all commits