Conversation
Optimize generation of a new index: First load the actual index and use the information where possible. This should give an enormous speedup as usually packs are already included in the index and moreover the index is cached. Commands which are affected and should speed up: - rebuild-index - prune
…Ds from oldIDx Use the size given by List for packs. Additionally remove the used ids from oldIdx so that GC could free memory.
fix AddPack so that multiple calls with same ID are allowed.
|
Seems like the original PR broke some tests with packs in more than one index. This should be fixed now. |
|
While still being confident that this PR can improve the current situation for some users, I now believe that on the long run a refactoring of |
|
Hi @aawsome, I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository. I don't know how to tell if it was faster or slower than normal, since this was my first prune of this very large repository. As far as I can tell, it worked great. Backups have continued normally without any issue since the repository was pruned (almost 2 weeks ago). It isn't much feedback, but it's the best I can do. I wanted to give something back, since I had used and tested this PR. Thanks! |
Just to clarify: As far as I understood you used |
Yes, that is correct. I only used the
Apologies for my confusion. I understand better what is going on now. Thanks! |
internal/index/index.go
Outdated
| return nil | ||
| } | ||
|
|
||
| if res, ok := oldIdx.Packs[id]; ok { |
There was a problem hiding this comment.
This looks like a race condition with the for res := range outputCh loop. Two coroutines are working on idx and I don't see any synchronization in the AddPack method.
Replacing repo.ListPack with a lookup into the old index should avoid that race condition.
There was a problem hiding this comment.
Or better: Just send the pack from the oldIdx over the outputCh channel
| if err := idx.AddPack(res.ID, size, res.Entries); err != nil { | ||
| return err | ||
| } | ||
| if err := oldIdx.RemovePack(id); err != nil { |
There was a problem hiding this comment.
Why do you remove the old pack? Is this done to reduce memory usage?
There was a problem hiding this comment.
Yes that was what I was thinking. I actually don't know if the GC is really freeing something in practise, though...
| func (idx *Index) AddPack(id restic.ID, size int64, entries []restic.Blob) error { | ||
| if _, ok := idx.Packs[id]; ok { | ||
| return errors.Errorf("pack %v already present in the index", id.Str()) | ||
| debug.Log("pack %v already present in the index", id.Str()) |
There was a problem hiding this comment.
What is the reason for this change? According to my understand of the code it just replaces loading a pack with reusing the old index for that pack. This shouldn't lead to any pack being added twice.
There was a problem hiding this comment.
It is not the call in New this change is needed for but the call in Load. Seems there are some integration tests with index where duplicate packs are present in the index (either within one index file or over different index files)...
|
I close this PR for two reasons:
|
What is the purpose of this change? What does it change?
Optimize creation of new index so that already existing index files are used.
Creation of new index is used at the moment in prune and rebuild-index.
With this pull request the index creation is much faster, especially for remote repositories.
Was the change discussed in an issue or in the forum before?
The change is similar to one part of PR #2340 and the problem is discussed in many issues, e.g. #2162
Checklist
changelog/unreleased/that describes the changes for our users (template here)gofmton the code in all commits