Optimize index by aawsome · Pull Request #2507 · restic/restic

aawsome · 2019-12-09T13:06:36Z

What is the purpose of this change? What does it change?

Optimize the creation of a new index so that already existing index files are reused.
At the moment, a new index is created in prune and rebuild-index.

With this pull request the index creation is much faster, especially for remote repositories.

Was the change discussed in an issue or in the forum before?

The change is similar to one part of PR #2340, and the problem is discussed in many issues, e.g. #2162.

Checklist

I have read the Contribution Guidelines
I have added tests for all changes in this PR
I have added documentation for the changes (in the manual)
There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

Optimize generation of a new index: First load the existing index and use its information where possible. This should give an enormous speedup, as packs are usually already included in the index and, moreover, the index is cached.

Commands which are affected and should speed up:
- rebuild-index
- prune
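The idea in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: the types `PackID`, `Blob`, `Pack` and `Index`, the function `buildIndex`, and the `readPack` callback are simplified stand-ins invented for illustration, and `readPack` represents the expensive step of reading a pack from the backend.

```go
package main

import "fmt"

// PackID and Blob are simplified, hypothetical stand-ins for restic's own types.
type PackID string

type Blob struct {
	ID     string
	Length uint
}

// Pack holds what an index knows about one pack file.
type Pack struct {
	Size    int64
	Entries []Blob
}

// Index maps pack IDs to their contents.
type Index struct {
	Packs map[PackID]Pack
}

// buildIndex creates a new index for the packs in listed, which maps each
// pack ID to the size reported by the backend listing. Packs already described
// by the old index are copied; only the remaining packs are read via readPack.
// The second return value counts how many packs had to be read.
func buildIndex(listed map[PackID]int64, old *Index, readPack func(PackID) ([]Blob, error)) (*Index, int, error) {
	idx := &Index{Packs: make(map[PackID]Pack, len(listed))}
	reads := 0
	for id, size := range listed {
		if p, ok := old.Packs[id]; ok {
			// Reuse the known entries, but take the size from the listing.
			idx.Packs[id] = Pack{Size: size, Entries: p.Entries}
			continue
		}
		entries, err := readPack(id)
		if err != nil {
			return nil, reads, err
		}
		reads++
		idx.Packs[id] = Pack{Size: size, Entries: entries}
	}
	return idx, reads, nil
}

func main() {
	old := &Index{Packs: map[PackID]Pack{
		"pack-a": {Size: 10, Entries: []Blob{{ID: "blob-1", Length: 10}}},
	}}
	listed := map[PackID]int64{"pack-a": 12, "pack-b": 20}
	readPack := func(id PackID) ([]Blob, error) {
		return []Blob{{ID: "blob-from-" + string(id), Length: 20}}, nil
	}
	idx, reads, err := buildIndex(listed, old, readPack)
	if err != nil {
		panic(err)
	}
	fmt.Printf("indexed %d packs, read %d from the backend\n", len(idx.Packs), reads)
}
```

In this sketch the number of backend reads equals the number of listed packs that are missing from the old index, which is where the speedup for remote repositories would come from.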


aawsome · 2019-12-09T14:41:04Z

Seems like the original PR broke some tests with packs in more than one index. This should be fixed now.

aawsome · 2019-12-17T08:33:24Z

While I am still confident that this PR can improve the current situation for some users, I now believe that in the long run a refactoring of rebuild-index and prune (and thus the removal of internal/index) should be favored.

irasnyd · 2020-01-03T00:27:32Z

Hi @aawsome,

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository. I don't know how to tell if it was faster or slower than normal, since this was my first prune of this very large repository. As far as I can tell, it worked great. Backups have continued normally without any issue since the repository was pruned (almost 2 weeks ago).

It isn't much feedback, but it's the best I can do. I wanted to give something back, since I had used and tested this PR. Thanks!

aawsome · 2020-01-05T11:47:50Z

@irasnyd

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository.

Just to clarify: As far as I understood you used cleanup-index and cleanup-packs from #2513 instead of prune, right?
In this case this PR was not used at all. These two commands do not use internal/index, as there is already an index data structure in repository/index. Using both data structures multiplies the memory usage and is one of the reasons why prune has a memory issue (besides the performance issue due to reading all packs, which can be improved by using this PR).

irasnyd · 2020-01-06T17:34:25Z

@irasnyd

I used this PR in conjunction with #2513 during my prune of a large (12M object / ~55TB) AWS S3 backed repository.

Just to clarify: As far as I understood you used cleanup-index and cleanup-packs from #2513 instead of prune, right?

Yes, that is correct. I only used the cleanup-index and cleanup-packs commands from #2513. I did not use prune.

In this case this PR was not used at all. These two commands do not use internal/index, as there is already an index data structure in repository/index. Using both data structures multiplies the memory usage and is one of the reasons why prune has a memory issue (besides the performance issue due to reading all packs, which can be improved by using this PR).

Apologies for my confusion. I understand better what is going on now. Thanks!

MichaelEischer · 2020-01-12T14:55:43Z

internal/index/index.go

 				return nil
 			}

+			if res, ok := oldIdx.Packs[id]; ok {


This looks like a race condition with the for res := range outputCh loop. Two goroutines are working on idx, and I don't see any synchronization in the AddPack method.

Replacing repo.ListPack with a lookup into the old index should avoid that race condition.

Or better: Just send the pack from the oldIdx over the outputCh channel.
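The channel-based suggestion can be sketched roughly as follows. This is an illustration under assumptions, not the fix that was committed: `collect`, `result`, `load` and the worker count are hypothetical, and a plain map stands in for the index. The point is that worker goroutines only send results, so the single receiving loop is the only writer.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical message type carrying one pack's entries.
type result struct {
	id      string
	entries []string
}

// collect builds a map from pack ID to entries. Workers never write to the
// map; they forward either the old entry or a freshly loaded one over
// outputCh, and the loop at the end is the sole writer.
func collect(ids []string, old map[string][]string, load func(string) []string) map[string][]string {
	inputCh := make(chan string)
	outputCh := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range inputCh {
				if entries, ok := old[id]; ok {
					// Known pack: send the old entry instead of adding it here.
					outputCh <- result{id: id, entries: entries}
					continue
				}
				outputCh <- result{id: id, entries: load(id)}
			}
		}()
	}

	// Feed the IDs, then close outputCh once all workers are done.
	go func() {
		for _, id := range ids {
			inputCh <- id
		}
		close(inputCh)
		wg.Wait()
		close(outputCh)
	}()

	idx := make(map[string][]string)
	for res := range outputCh {
		idx[res.id] = res.entries
	}
	return idx
}

func main() {
	old := map[string][]string{"pack-a": {"blob-1"}}
	load := func(id string) []string { return []string{"loaded-" + id} }
	idx := collect([]string{"pack-a", "pack-b", "pack-c"}, old, load)
	fmt.Println("packs in new index:", len(idx))
}
```

Concurrent reads of `old` are safe here only because nothing writes to it while the workers run; removing entries from it inside a worker would reintroduce the need for synchronization.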

MichaelEischer · 2020-01-12T14:57:04Z

internal/index/index.go

+				if err := idx.AddPack(res.ID, size, res.Entries); err != nil {
+					return err
+				}
+				if err := oldIdx.RemovePack(id); err != nil {


Why do you remove the old pack? Is this done to reduce memory usage?

aawsome · 2020-01-12

Yes, that was what I was thinking. I actually don't know whether the GC really frees anything in practice, though...

MichaelEischer · 2020-01-12T14:58:38Z

internal/index/index.go

 func (idx *Index) AddPack(id restic.ID, size int64, entries []restic.Blob) error {
 	if _, ok := idx.Packs[id]; ok {
-		return errors.Errorf("pack %v already present in the index", id.Str())
+		debug.Log("pack %v already present in the index", id.Str())


What is the reason for this change? According to my understanding of the code, it just replaces loading a pack with reusing the old index for that pack. This shouldn't lead to any pack being added twice.

aawsome · 2020-01-12

This change is not needed for the call in New but for the call in Load. It seems there are some integration tests where duplicate packs are present in the index (either within one index file or across different index files)...
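A small sketch of an AddPack that tolerates repeated calls with the same ID, for illustration only. The diff excerpt above does not show what happens after the debug.Log call, so keeping the first entry and returning nil is an assumption here. The simplified `Index` and `Pack` types and the `Duplicates` counter (standing in for the debug log) are hypothetical.

```go
package main

import "fmt"

// Pack and Index are simplified, hypothetical stand-ins for the real types.
type Pack struct {
	Size    int64
	Entries []string
}

type Index struct {
	Packs      map[string]Pack
	Duplicates int // hypothetical counter, used here instead of a debug log
}

// AddPack adds a pack to the index. A repeated call with an ID that is
// already present is not an error: the first entry is kept (an assumption
// of this sketch) and the duplicate is only counted.
func (idx *Index) AddPack(id string, size int64, entries []string) error {
	if _, ok := idx.Packs[id]; ok {
		idx.Duplicates++
		return nil
	}
	idx.Packs[id] = Pack{Size: size, Entries: entries}
	return nil
}

func main() {
	idx := &Index{Packs: make(map[string]Pack)}
	// "pack-a" appears twice, as if it were listed in two index files.
	for _, id := range []string{"pack-a", "pack-b", "pack-a"} {
		if err := idx.AddPack(id, 42, []string{"blob-of-" + id}); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%d packs, %d duplicate(s) skipped\n", len(idx.Packs), idx.Duplicates)
}
```

The trade-off is that a genuinely conflicting duplicate (same ID, different contents) would also pass silently, which the original error return would have surfaced.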

aawsome · 2020-02-05T14:09:13Z

I am closing this PR for two reasons:

- With #2513 (Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index') there is functionality to replace prune, which I assume to be even faster than prune with the optimization from this PR. Moreover, this PR is not needed for #2513.
- In #2523 (Comparison of approaches to tackle index memory usage) I already proposed a new index implementation which IMO should be enhanced such that it can also be used within #2513, and then 'prune' can be replaced.

Alexander Weiss added 3 commits December 8, 2019 21:57

Correct size when using stored index in index.New() and remove used I…

8ef8399

…Ds from oldIDx Use the size given by List for packs. Additionally remove the used ids from oldIdx so that GC could free memory.

Apply go fmt

83340f3

aawsome mentioned this pull request Dec 9, 2019

handle large prune much more efficent #2162

Closed

Fix AddPack

9a1a08d

fix AddPack so that multiple calls with same ID are allowed.

This was referenced Dec 13, 2019

Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index' #2513

Closed

restic with "cold" storage (here: OVH Cloud Archive) #2504

Closed

MichaelEischer reviewed Jan 12, 2020


Fix race condition

a894008

aawsome mentioned this pull request Jan 12, 2020

Discussion: Future of prune and rebuild-index #2547

Closed

aawsome closed this Feb 5, 2020

aawsome deleted the optimize-index branch February 5, 2020 14:09

aawsome mentioned this pull request May 14, 2020

Reimplementation of prune #2718

Merged

11 tasks

aawsome wants to merge 5 commits into restic:master from aawsome:optimize-index


3 participants
