Skip to content

Optimize index#2507

Closed
aawsome wants to merge 5 commits intorestic:masterfrom
aawsome:optimize-index
Closed

Optimize index#2507
aawsome wants to merge 5 commits intorestic:masterfrom
aawsome:optimize-index

Conversation

@aawsome
Copy link
Copy Markdown
Contributor

@aawsome aawsome commented Dec 9, 2019

What is the purpose of this change? What does it change?

Optimize creation of new index so that already existing index files are used.
Creation of new index is used at the moment in prune and rebuild-index.

With this pull request the index creation is much faster, especially for remote repositories.

Was the change discussed in an issue or in the forum before?

The change is similar to one part of PR #2340 and the problem is discussed in many issues, e.g. #2162

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

Alexander Weiss added 3 commits December 8, 2019 21:57
Optimize generation of a new index: First load the actual index and use the
information where possible.
This should give an enormous speedup as usually packs are already included
in the index and moreover the index is cached.

Commands which are affected and should speed up:
- rebuild-index
- prune
…Ds from oldIDx

Use the size given by List for packs.
Additionally remove the used ids from oldIdx so that GC could free memory.
fix AddPack so that multiple calls with same ID are allowed.
@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Dec 9, 2019

Seems like the original PR broke some tests with packs in more than one index. This should be fixed now.

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Dec 17, 2019

While still being confident that this PR can improve the current situation for some users, I now believe that on the long run a refactoring of rebuild-index and prune (and thus removing internal/index) should be favored.

@irasnyd
Copy link
Copy Markdown

irasnyd commented Jan 3, 2020

Hi @aawsome,

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository. I don't know how to tell if it was faster or slower than normal, since this was my first prune of this very large repository. As far as I can tell, it worked great. Backups have continued normally without any issue since the repository was pruned (almost 2 weeks ago).

It isn't much feedback, but it's the best I can do. I wanted to give something back, since I had used and tested this PR. Thanks!

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Jan 5, 2020

@irasnyd

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository.

Just to clarify: As far as I understood you used cleanup-index and cleanup-packs from #2513 instead of prune, right?
In this case this PR was not used at all. These two commands do not use internal/index as there is already an index data structure in repository/index. Using both data structure multiplies the memory usage and is one of the reasons why prune has a memory issue (besides the performance issue due to reading all packs which can be improved by using this PR)

@irasnyd
Copy link
Copy Markdown

irasnyd commented Jan 6, 2020

@irasnyd

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository.

Just to clarify: As far as I understood you used cleanup-index and cleanup-packs from #2513 instead of prune, right?

Yes, that is correct. I only used the cleanup-index and cleanup-packs commands from #2513. I did not use prune.

In this case this PR was not used at all. These two commands do not use internal/index as there is already an index data structure in repository/index. Using both data structure multiplies the memory usage and is one of the reasons why prune has a memory issue (besides the performance issue due to reading all packs which can be improved by using this PR)

Apologies for my confusion. I understand better what is going on now. Thanks!

return nil
}

if res, ok := oldIdx.Packs[id]; ok {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a race condition with the for res := range outputCh loop. Two coroutines are working on idx and I don't see any synchronization in the AddPack method.

Replacing repo.ListPack with a lookup into the old index should avoid that race condition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better: Just send the pack from the oldIdx over the outputCh channel

if err := idx.AddPack(res.ID, size, res.Entries); err != nil {
return err
}
if err := oldIdx.RemovePack(id); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove the old pack? Is this done to reduce memory usage?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that was what I was thinking. I actually don't know if the GC is really freeing something in practise, though...

func (idx *Index) AddPack(id restic.ID, size int64, entries []restic.Blob) error {
if _, ok := idx.Packs[id]; ok {
return errors.Errorf("pack %v already present in the index", id.Str())
debug.Log("pack %v already present in the index", id.Str())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this change? According to my understand of the code it just replaces loading a pack with reusing the old index for that pack. This shouldn't lead to any pack being added twice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the call in New this change is needed for but the call in Load. Seems there are some integration tests with index where duplicate packs are present in the index (either within one index file or over different index files)...

@aawsome
Copy link
Copy Markdown
Contributor Author

aawsome commented Feb 5, 2020

I close this PR for two reasons:

@aawsome aawsome closed this Feb 5, 2020
@aawsome aawsome deleted the optimize-index branch February 5, 2020 14:09
@aawsome aawsome mentioned this pull request May 14, 2020
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants