Skip to content

Rebuild index in prune by using in-memory index#2842

Merged
MichaelEischer merged 4 commits intorestic:masterfrom
aawsome:rebuild-index-inmem
Nov 6, 2020
Merged

Rebuild index in prune by using in-memory index#2842
MichaelEischer merged 4 commits intorestic:masterfrom
aawsome:rebuild-index-inmem

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Jul 19, 2020

What is the purpose of this change? What does it change?

Optimize the rebuilding of the index during prune by using the in-memory index instead of reading all pack headers.

Was the change discussed in an issue or in the forum before?

This was a part of #2718 that has been extracted to be merged after #2718 has been merged.
To not save exact duplicates multiple times in the rebuilt index, either #2839 or #2863 should also be merged.

closes #1599
closes #3049

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have updated the documentation for the changes (in the manual)
  • There's an updated file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome aawsome mentioned this pull request Jul 19, 2020
11 tasks
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jul 19, 2020

Note that actually new index files are checked for "fullness" and saved only after having processed a complete index. This is not totally correct and will be a problem, if this is used on one big merged index. (see #2818 )

@aawsome aawsome force-pushed the rebuild-index-inmem branch from 7506450 to ba714d2 Compare July 24, 2020 07:37
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jul 24, 2020

After #2818 has been merged, I rebased this PR. Now also the newly generated index is checked for "fullness" after writing the contents of pack.

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from bf6692c to af8ddbd Compare July 28, 2020 17:12
@aawsome aawsome force-pushed the rebuild-index-inmem branch from af8ddbd to e038163 Compare August 2, 2020 05:16
@aawsome aawsome mentioned this pull request Aug 2, 2020
8 tasks
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Aug 2, 2020

rebased after #2840 is merged.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented below the current implementation can forget blobs in rewritten packs and will cause check to warn about packs which are listed in multiple indexes. Please fix this.

This PR also needs tests. There is currently no test coverage for RebuildIndex. I think we also need a changelog entry, nothing too fancy as it will eventually be merged with the one for the main prune PR.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 5, 2020

@MichaelEischer Thanks for your feedback!
If it's ok for you, I would suggest we first focus on #2718 and deal with this PR after.
Then the changelog of #2718 can be extended here.

Also I'm a bit short on time ATM and would like to spend this making #2718 merge-ready.

@MichaelEischer
Copy link
Copy Markdown
Member

Also I'm a bit short on time ATM and would like to spend this making #2718 merge-ready.

There's no need to hurry (too much). I actually hope to merge the VSS support (which shouldn't take too long anymore) before the prune rewrite PRs. That way we could release the VSS support soon (hopefully) and give the prune changes several weeks on master before getting them into a release.

Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving prune a try with the optimizations from this PR, #2718 and #2941, I've noticed that the index rebuild is missing a progress bar, please add one (Maybe based on the number of packs which are already added?). And obviously the performance is far better than before :-) .

@aawsome aawsome force-pushed the rebuild-index-inmem branch from e038163 to 41a2135 Compare October 9, 2020 18:02
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 9, 2020

Ok, I just realized that (as you can see in the failing test - one of my "edge case repos") this PR should really only be merged after #2718!

The reason is that blobs not existing in the index are processed by current prune (as the first step is to rebuild the index) but then not saved in the rebuilt index (as this only saves the contents loaded from the index files).

This situation will change with #2718 which would simply fail for this setting and would request a manual rebuild-index. So I'll build this PR on top of #2718.

@aawsome aawsome force-pushed the rebuild-index-inmem branch from 41a2135 to 65097bf Compare October 10, 2020 06:36
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 10, 2020

I rebased this to #2718
Now there are two open things left: The test case and the progress bar.

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from 8099cc1 to 384fa80 Compare October 10, 2020 17:08
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 10, 2020

I added the progress bar. Using packs as counter needs information from Repack how many packs have been written - which is unfortunately not available. I used blobs as counter. This information is also not available as we would need the information how many blobs have been actually written by Repack. However, if there are no duplicates or all duplicates have been repacked (one of those should be the usual case), prune knows the exact number.

I'm a bit struggling about the test case. Of course, the code is used in the integration tests and works there. As this saves the index to the repository, I cannot think about another good test except some kind of integration test. Which extra test case did you have in mind?

@MichaelEischer
Copy link
Copy Markdown
Member

MichaelEischer commented Oct 11, 2020

It would be possible to recalculate the number of packs by iterating over the index to collect all packs and then removing the rewritten ones. I haven't tested the blob counter yet, so i can't judge whether that's too noisy or not.

For the test case I have something like internal/index.TestIndexSave in mind.

@aawsome aawsome force-pushed the rebuild-index-inmem branch from 384fa80 to e511551 Compare October 13, 2020 04:18
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 13, 2020

It would be possible to recalculate the number of packs by iterating over the index to collect all packs and then removing the rewritten ones. I haven't tested the blob counter yet, so i can't judge whether that's too noisy or not.

I wanted to avoid the iteration and now count blobs before and after repacking to get the exact number of blobs that Repack created. This allows to exactly know how many blobs Save must save.
I don't know if I understood your "noisyness" concern correctly, but the progress bar will only update itself at most every minTickerTime (set to 1/60 s).

For the test case I have something like internal/index.TestIndexSave in mind.

I boldly copied (and adapted) that test function into internal/repositry 😉

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from b276c25 to 6dbca52 Compare October 16, 2020 18:16
@MichaelEischer
Copy link
Copy Markdown
Member

@aawsome I've added a hook some time ago that allows injecting / wrapping a backend into OpenRepository. Search for env.gopts.backendTestHook in the integration tests.

@aawsome aawsome force-pushed the rebuild-index-inmem branch from 6d606bf to 4e51b0a Compare October 17, 2020 21:12
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 17, 2020

@aawsome I've added a hook some time ago that allows injecting / wrapping a backend into OpenRepository. Search for env.gopts.backendTestHook in the integration tests.

Great, that worked for me! The test is added (also tested that the test it failes without this PR; data is listed twice) and I also mention it in the changelog.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Oct 18, 2020

@MichaelEischer Struggling with the test yesterday, I forgot to mention another change:
When changing the progress bar from blobs to packs, I also changed internal/index.EachByPack to yield a structure such that idx.StorePack can be used in MasterIndex.Save instead of idx.Store - this is bit faster and it doesn't fill the packIDToIndex map.

@aawsome aawsome force-pushed the rebuild-index-inmem branch 3 times, most recently from 373e986 to 504eeef Compare October 21, 2020 04:49
@aawsome aawsome mentioned this pull request Oct 28, 2020
8 tasks
@aawsome aawsome force-pushed the rebuild-index-inmem branch from 504eeef to 662ce76 Compare November 3, 2020 20:23
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 3, 2020

rebased this to the new version of #2718 (only docu and the PruneOptions in test was affected)

@aawsome aawsome force-pushed the rebuild-index-inmem branch from 662ce76 to 028105a Compare November 3, 2020 20:36
@aawsome aawsome force-pushed the rebuild-index-inmem branch from 028105a to c33a1cf Compare November 5, 2020 10:16
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Nov 5, 2020

rebased after #2718 was merged.

@aawsome aawsome force-pushed the rebuild-index-inmem branch from c33a1cf to 3a0f03d Compare November 5, 2020 10:24
Copy link
Copy Markdown
Member

@fd0 fd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment about the computation of packs added, apart from that I'm happy. Thanks for your work!

@fd0 fd0 force-pushed the rebuild-index-inmem branch from 7f4802f to 47277c4 Compare November 6, 2020 19:24
Copy link
Copy Markdown
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High download bandwidth usage on prune Small changes, long prune time

4 participants