
Merge in-memory indexes; use multiple maps #2799

Closed
aawsome wants to merge 5 commits into restic:master from aawsome:merge-index

Conversation

Contributor

@aawsome aawsome commented Jun 18, 2020

What is the purpose of this change? What does it change?

Merge in-memory indexes when they are finalized.
This increases performance especially for large repositories with many index files.
To avoid large spikes in memory consumption when maps are resized, this PR also introduces sharded maps to store the index entries. By using 256 shards, one byte of key information can be dropped, which also reduces the memory required for large indexes a bit.

Was the change discussed in an issue or in the forum before?

The idea comes from #2523.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

Member

@MichaelEischer MichaelEischer left a comment


I'm a bit worried about the memory overhead this PR might have. When the index hash map has to be resized, Go doubles the size of the backing array and then slowly migrates the entries into the new hashmap. That way we might end up using three times the memory actually needed.

During the discussions in #2523 I did propose to split the hashtable into e.g. 10 smaller parts and then grow those in turns until they are close to reaching the maximum load factor. That did at least avoid the three times peak memory usage and is still rather fast. A slightly cleaned up version of that merging logic is available here: MichaelEischer@2d43a0b#diff-c1bc3bce10dd7a8df44bed943acb01c3

Contributor Author

aawsome commented Jun 19, 2020

I'm a bit worried about the memory overhead this PR might have. When the index hash map has to be resized, Go doubles the size of the backing array and then slowly migrates the entries into the new hashmap. That way we might end up using three times the memory actually needed.

I didn't dive into the Go map implementation too much, but it seems you are right, this might be an issue. I tried to add a benchmark BenchmarkMasterIndexLookupMultipleIndexMerged and it failed with a memory allocation error on my 8GB machine.

I changed the benchmark (a 200k pack file index shouldn't occur in reality) to use 50 indexes with 20k packs. The benchmark results of this much more realistic setting show a huge speed improvement.

I think that we should try to find a replacement for the blobs map - as the implementation of the map causes this trouble - and leave the Index and MasterIndex logic as is.

However, as long as we risk a huge memory overhead, this should not be merged; I'll change this PR to draft mode.

@aawsome aawsome marked this pull request as draft June 19, 2020 05:42
Contributor

greatroar commented Jun 19, 2020

A simple sharded hash table grows at the same rate as a regular one, but avoids the big temporary allocation (only one shard grows at a time). It can also prevent lock contention. E.g. (untested):

type idxMap [32]struct {
	sync.Mutex
	m map[restic.BlobHandle]indexEntry
}

func (m *idxMap) get(h restic.BlobHandle) (indexEntry, bool) {
	i := h.ID[0] % byte(len(m))
	m[i].Lock()
	defer m[i].Unlock()

	entry, ok := m[i].m[h]
	return entry, ok
}

Contributor Author

aawsome commented Jun 19, 2020

@greatroar I was thinking about something like this.
Maybe even use 256 shards - this would allow us to use h.ID[1:] as the map key and save 1 additional byte per index entry 😄
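The idea can be sketched as follows (a self-contained illustration with stand-in types, not restic's actual ones; the real implementation in this PR may differ): because the shard is selected by the first ID byte, the map key only needs to store the remaining 31 bytes.

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in types for illustration only; restic's real BlobHandle and
// index entry types differ.
type blobID [32]byte
type entry struct{ packIdx, offset, length uint32 }

// shardedMap distributes entries over 256 shards selected by the first
// ID byte. Because the shard index already encodes that byte, the map
// key only needs the remaining 31 bytes, saving one byte per entry.
type shardedMap [256]struct {
	sync.Mutex
	m map[[31]byte]entry
}

func tail(id blobID) (t [31]byte) {
	copy(t[:], id[1:])
	return
}

func (s *shardedMap) set(id blobID, e entry) {
	sh := &s[id[0]]
	sh.Lock()
	defer sh.Unlock()
	if sh.m == nil {
		sh.m = make(map[[31]byte]entry)
	}
	sh.m[tail(id)] = e
}

func (s *shardedMap) get(id blobID) (entry, bool) {
	sh := &s[id[0]]
	sh.Lock()
	defer sh.Unlock()
	e, ok := sh.m[tail(id)]
	return e, ok
}

func main() {
	var s shardedMap
	id := blobID{0x2a, 1, 2, 3}
	s.set(id, entry{packIdx: 7, offset: 100, length: 50})
	e, ok := s.get(id)
	fmt.Println(e.packIdx, ok) // 7 true
}
```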

Contributor Author

aawsome commented Jun 19, 2020

Ok, as the implementation of the sharded index was not too complex, I've added it to this PR.

A first benchmark shows that the speed improvement of the merged index should outweigh the additional overhead introduced by the shards. It also slightly reduces the memory used by the index for large repositories (but introduces a constant overhead for managing the 256 shards).

@aawsome aawsome marked this pull request as ready for review June 19, 2020 07:20
Contributor Author

aawsome commented Jun 19, 2020

Here are the benchmark results of internal/repository (the BenchmarkMasterIndexLookupMultipleIndexReal* benchmarks are added with this PR)

Old:

BenchmarkPackerManager-4                           	       4	 272467956 ns/op	 193.42 MB/s	   91178 B/op	    1411 allocs/op
BenchmarkDecodeIndex-4                             	   66586	     17468 ns/op
BenchmarkIndexHasUnknown-4                         	21096319	        57.1 ns/op
BenchmarkIndexHasKnown-4                           	19841572	        61.1 ns/op
BenchmarkIndexAlloc-4                              	       2	 964482906 ns/op	785897508 B/op	  976686 allocs/op
BenchmarkMasterIndexLookupSingleIndex-4            	 7649384	       155 ns/op
BenchmarkMasterIndexLookupMultipleIndex-4          	 2455197	       474 ns/op
BenchmarkMasterIndexLookupSingleIndexUnknown-4     	14100042	        85.4 ns/op
BenchmarkMasterIndexLookupMultipleIndexUnknown-4   	 3046612	       396 ns/op
BenchmarkSaveAndEncrypt-4                          	 8935536	       129 ns/op	32436382.67 MB/s	       2 B/op	       0 allocs/op
BenchmarkLoadTree-4                                	    3351	    354630 ns/op
BenchmarkLoadBlob-4                                	     140	   8288340 ns/op	 120.65 MB/s
BenchmarkLoadAndDecrypt-4                          	     100	  10114474 ns/op	  98.87 MB/s
BenchmarkLoadIndex-4                               	      30	  41519666 ns/op

New:

BenchmarkPackerManager-4                                     	       4	 267857058 ns/op	 196.75 MB/s	   91174 B/op	    1411 allocs/op
BenchmarkDecodeIndex-4                                       	   22050	     52617 ns/op
BenchmarkIndexHasUnknown-4                                   	18344560	        65.8 ns/op
BenchmarkIndexHasKnown-4                                     	17026891	        70.6 ns/op
BenchmarkIndexAlloc-4                                        	       1	1099485149 ns/op	759550568 B/op	  982149 allocs/op
BenchmarkMasterIndexLookupSingleIndex-4                      	 5698429	       221 ns/op
BenchmarkMasterIndexLookupMultipleIndex-4                    	 2507014	       506 ns/op
BenchmarkMasterIndexLookupMultipleIndexMerged-4              	 7457487	       165 ns/op
BenchmarkMasterIndexLookupSingleIndexUnknown-4               	 7913674	       144 ns/op
BenchmarkMasterIndexLookupMultipleIndexUnknown-4             	 2839654	       422 ns/op
BenchmarkMasterIndexLookupMultipleIndexMergedUnknown-4       	12912753	        92.0 ns/op
BenchmarkMasterIndexLookupMultipleIndexReal-4                	  149528	      8000 ns/op
BenchmarkMasterIndexLookupMultipleIndexRealMerged-4          	 7547283	       160 ns/op
BenchmarkMasterIndexLookupMultipleIndexRealUnknown-4         	  151024	      8089 ns/op
BenchmarkMasterIndexLookupMultipleIndexRealMergedUnknown-4   	12840684	        91.8 ns/op
BenchmarkSaveAndEncrypt-4                                    	 8539178	       140 ns/op	30004544.64 MB/s	       2 B/op	       0 allocs/op
BenchmarkLoadTree-4                                          	    3098	    357358 ns/op
BenchmarkLoadBlob-4                                          	     141	   8457648 ns/op	 118.24 MB/s
BenchmarkLoadAndDecrypt-4                                    	     115	   9897292 ns/op	 101.04 MB/s
BenchmarkLoadIndex-4                                         	      31	  40329784 ns/op

So a pure lookup within Index seems to be about 9ns slower than before, and with MasterIndex there is some more overhead (about 60ns per lookup).
However, with multiple indexes in MasterIndex you already see the benefit in the 5 index files à 200k packs example - compare e.g. the old BenchmarkMasterIndexLookupMultipleIndex-4 with the new BenchmarkMasterIndexLookupMultipleIndexMerged-4.
In the more real-world scenario (100 index files à 10k pack files) you can see that the merged index also scales well with many indexes being loaded and merged - as expected.

Memory usage seems to be reduced by about 3%.

@aawsome aawsome changed the title Merge in-memory indexes Merge in-memory indexes; use sharded maps Jun 19, 2020
Contributor

greatroar commented Jun 19, 2020

You can get a nice overview of benchmark results from the benchstat tool:

$ git checkout $oldversion
$ go test -bench=. -count=10 | tee old
$ git checkout $newversion
$ go test -bench=. -count=10 | tee new
$ benchstat old new

For new benchmarks, this will of course not give a comparison, but you can do benchstat new to get some descriptive statistics.

Member

MichaelEischer commented Jun 20, 2020

tl;dr Sharding the index into equal parts does not and cannot fix the memory usage problem.

(Sorry for writing this much. This comment somehow got 'slightly' out of hand...)

"Proof" by example

Sharding the index into equal parts does not fix the memory usage problem.

The 3 times memory overhead still applies for the sharded index variant. Using the following code snippet in the BenchmarkMasterIndexLookupMultipleIndexReal benchmark (or any other) which triggers a GC before and after creating the MasterIndex and reads the memory stats afterwards, I get this memory usage:

| Number of indexes with 10k packs | Allocated heap memory in MB |
|---|---|
| 75 | 501.300 |
| 83 | 1365.902 |
| 90 | 878.233 |

At 75 indexes the hashtables are pretty close to reaching the maximum load factor. At 83 we're already quite a bit beyond that point, such that each of the index shards has allocated a new bucket array (with 2 times the size of the old one) and is in the process of migrating entries (more on why the memory usage increased by a factor of 2.7 and not 3 later). At 90 indexes the memory usage is down to two thirds of the maximum, as the migration to the new bucket arrays has been completed.

i := 75 // number of indexes with 10k packs
runtime.GC()
var m runtime.MemStats
runtime.ReadMemStats(&m)

mIdx, _ := createRandomMasterIndex(rand.New(rand.NewSource(0)), i, 10000)
mIdx.FinalizeNotFinalIndexes()
mIdx.MergeFinalIndexes()

runtime.GC()
var n runtime.MemStats
runtime.ReadMemStats(&n)
runtime.KeepAlive(mIdx)
b.Logf("%d %.3f", i, float32((n.HeapAlloc-m.HeapAlloc)/1024)/1024)

Hashmap growth in Go

To understand why the memory usage spikes at that point, we have to take a closer look at hashmaps in Go first, as these are used as datastore for the index shards.

Basics

(Based on the hashmap source code https://github.com/golang/go/blob/60f78765022a59725121d3b800268adffe78bde3/src/runtime/map.go)

Hashmaps in Go allocate an array of 2^B buckets to back the hashmap (line 120, 344). Entries are accessed by calculating a hash value of the accessed key and then using its low-order bits as index into the hashmap (just use the lowest B bits). Each bucket contains 8 entries (line 66, 149). Hash collisions (i.e. two different keys end up in the same bucket) are handled by first searching for a free spot in the bucket and if necessary chaining to overflow buckets, which take the form of a simple linked list (line 16).

The memory layout of a bucket looks as follows; keys and values are grouped together to avoid a memory usage overhead due to alignment of the contained datatypes (line 155). The tophash is used to mark unused entries and as a quick check whether the corresponding key in the bucket might match the one to be queried/inserted (line 12, 88).

type bmap struct {
    tophash  [8]uint8
    keys     [8]restic.BlobHandle
    values   [8]indexEntry
    overflow *bmap
}

A restic.BlobHandle is 33 bytes and an indexEntry is 16 bytes large, which yields 408 bytes on a 64-bit platform. When allocated in a large array, each bucket requires exactly that many bytes, whereas a single bmap object allocated for an overflow bucket requires 416 bytes, as this is the smallest matching malloc size class. (The current sharding code just stores a restic.ID instead of a restic.BlobHandle, which saves 8 bytes.)

Hashmap growth

The hashmap uses a maximum load factor (elements in hashmap divided by backing array capacity) of 6.5 entries per bucket which when reached triggers the hashmap growth (line 33, 70). In the following I use the more common (?) notation of specifying the load factor in relation to the number of entries in the backing array: 6.5/8 = 0.8125.

Growing a hashmap means that a new backing array with twice the previous size (B += 1) is allocated, which the hashmap has to keep in parallel to the old backing array until all entries are migrated into the new buckets. This is where the memory usage spikes to roughly three times the previous usage.

A special feature of the hashmap in Go is that it migrates buckets incrementally in a process called evacuation and not all entries at once, as the latter would result in a huge latency spike. Each assignment or deletion to the hashmap triggers the migration of roughly two buckets (the bucket that should be accessed and the first not-yet-evacuated bucket; if the former is already migrated, nothing happens), that is up to 16 entries (actually all overflow buckets are also migrated along with their parent bucket). The evacuation works in principle by moving the entries of these two buckets one by one into the new backing array.

Give me numbers!

Putting this into numbers, a hashmap should grow when reaching 13/16 * 8*2^B = 0.8125 * 8*2^B entries and finish the migration after adding another 1/1.5 * 2^B entries (the formula assumes the same B as before growth; no 8* before 2^B this time). The 1.5 is a crude approximation of the number of buckets moved on each hashmap modification: each modification moves one bucket plus the accessed one if it was not yet migrated. As the number of migrated buckets thus increases roughly linearly, we end up on average with a 50-50 chance of accessing a migrated bucket during the evacuation process. That is, the total expected number of entries to complete the migration is (13/16 + 1/12) * 8*2^B = 0.896 * 8*2^B entries.

The code used for the measurements is available at https://gist.github.com/MichaelEischer/e3280247264101864b26c162fd823df9 (using a trick from https://hackernoon.com/some-insights-on-maps-in-golang-rm5v3ywh to access the hashmap internals) and can be run using go test -run '^$' -bench . -benchtime 2x -v ./....

The extracted and modified index test allows the creation of indexes with a specified number of entries (not packs!; it might overshoot by a few entries). In addition it also prints the number of used overflow buckets and whether an evacuation is in progress.

| Maximum entries before growing | Buckets | Approx. overflow buckets | Approx. evacuation completion | Memory usage in MB | Bytes per entry |
|---|---|---|---|---|---|
| 13307 | 2^11 | 422 | 14750 | 1.022 | 80.608 |
| 212987 | 2^15 | 6846 | 236000 | 16.428 | 80.878 |
| 851965 | 2^17 | 27314 | 943000 | 65.556 | 80.685 |

The maximum number of entries before the backing array grows is pretty much identical to the expected value, except that due to the way the index is created it might be necessary to subtract a buffer of up to 8 entries.

The evacuation completion column only contains approximate values as the exact numbers vary due to varying hash seeds used to calculate the key hashes. The approximate values correspond to roughly 0.9 * 8*2^B entries which is slightly more than the expected number of entries.

Let's now continue with a closer look at the memory usage per index entry. For a single entry in a fully loaded hashmap we get 16/13*408/8 bytes per hashmap entry. As pack IDs are stored separately, we also have to account for these: createRandomIndex() ends up creating on average 8.67 blobs per pack. In sum we get 16/13*408/8 + 32/8.67 = 66.46 bytes per entry; the measured value is thus approx. 21% higher than the expected memory usage.

This discrepancy nearly completely disappears when we account for the overflow buckets, whose number is up to 20.9% of the number of normal buckets (see table). Note that this seems to be the standard expected number of overflow buckets (line 43). A hashmap preallocates an additional 1/16 of the bucket array for overflow buckets (line 350) which then benefit from the lower memory storage size of 408 instead of 416 when allocated individually (line 262). 16/13*(408 + 408*1/16 + 416*0.1465 + 8*0.209)/8 + 32/8.67 = 80.02 (backing array + preallocated overflow buckets + individually allocated overflow buckets (0.209-1/16 = 0.1465) + the hashmap keeps an extra pointer to each overflow bucket (line 141)).

With variables that gives the following formula for bytes per entry: 1/load_factor*(408 + 408*1/16 + 416*max(0, overflow_factor-1/16) + 8*overflow_factor)/8 + 32/blobs_per_pack where 408 is the size of the bmap struct and 416 corresponds to 408 rounded up to the next malloc size class.

The optimal number of bytes per entry (without further compression) would be 33+16 + 32/8.67 = 52.69. Conversely, this means that the minimal storage overhead possible using the Go hashmap is 52%.

Optimal hashmap load

Right after completing the growth of a hashmap at 0.9*2^B we get (assuming 1.5% overflow buckets) 2/0.9*(408 + 408*1/16 + 416*0 + 8*0.015)/8 + 32/8.67 = 124.14 bytes per entry.

When moving in ten steps from the evacuation completion point to right before the next growth point, the memory usage per index entry steadily decreases. That is for the most efficient data storage each hashmap should be filled as close as possible to the maximum load factor.

| Count | Packs | Buckets | Overflow | Memory in MB | Bytes per entry |
|---|---|---|---|---|---|
| 1888002 | 217928 | 524288 | 6051 | 223.743 | 124.265 |
| 2056872 | 237390 | 524288 | 10035 | 225.516 | 114.967 |
| 2225732 | 256914 | 524288 | 15430 | 225.570 | 106.270 |
| 2394595 | 276419 | 524288 | 22692 | 225.609 | 98.793 |
| 2563457 | 295889 | 524288 | 32179 | 227.891 | 93.218 |
| 2732326 | 315365 | 524288 | 43628 | 232.277 | 89.141 |
| 2901187 | 334792 | 524288 | 56735 | 237.570 | 85.865 |
| 3070052 | 354304 | 524288 | 72194 | 243.821 | 83.277 |
| 3238914 | 373830 | 524288 | 89652 | 253.606 | 82.103 |
| 3407781 | 393335 | 524288 | 109532 | 261.681 | 80.520 |

Sharding the index

The index sharding variant distributes index entries to shards based on the top byte of their blob ID. As the IDs are SHA256 hashes, this yields a uniform distribution to shards for normal data. When modelling this as a binomial distribution where a blob can either end up in a specific shard or not, we get p = 1/256 with an expected value of n / 256 blobs in the selected shard when adding n blobs in total. The variance is given by n * 1/256 * (1 - 1/256) = n * 255 / 256^2. For large n this converges to a normal distribution. The number of blobs in a shard lies with a probability of 99.73% within three times the standard deviation (square root of the variance), that is within an absolute deviation of up to 0.187 * sqrt(n) from n / 256. Expressed as a relative deviation we get 47.872/sqrt(n).

The master index with 75 indexes with 10k packs each contains about 6.5 million blobs, which for the above formula results in an expected relative deviation of up to 1.88%, which means that the sizes of most sharded indexes will quite likely differ by less than 3.8%. Referring back to the hashmap growth details, we see that a hashmap has to grow by 10.8% to complete the evacuation to a larger backing array. This means that all sharded hashmaps will undergo the evacuation at the same time. This also shows up nicely in an expanded version of the initial memory usage table:

| Number of indexes with 10k packs | Allocated heap memory in MB |
|---|---|
| 77 | 508.984 |
| 78 | 582.850 |
| 79 | 1136.313 |
| 80 | 1362.396 |
| 86 | 1360.301 |
| 87 | 1158.267 |
| 88 | 885.805 |
| 89 | 878.208 |
| 100 | 885.416 |

The formulas used to estimate the growth and evacuation completions points yield 78.7 and 86.7 as the expected number of indexes which align very well with the measurements.

As mentioned earlier, the memory usage "just" jumps to 2.67 times its initial value. When accounting for the overflow buckets we get (2+1.209)/1.209 = 2.65, which matches the observed memory usage growth.

For 80 indexes the memory overhead grows to nearly 206 bytes per entry which is 3.9 times the optimum. For 100 indexes the situation is far better: 107 bytes per entry which is roughly twice the optimum.

Sanity check

The discussion and the tests so far assume equally distributed blob IDs. In the following I take a look at whether this assumption holds for a real backup repository. For the one used to back up my computer, the sharded indexes end up with the following numbers of blobs:

[4752 4755 4622 4705 4799 4709 4625 4616 4671 4744 4669 4714 4583 4685 4720 4667 4641 4666 4618 4734 4716 4704 4749 4616 4754 4499 4672 4677 4713 4595 4701 4674 4652 4737 4650 4599 4624 4721 4687 4649 4601 4742 4533 4653 4641 4670 4678 4643 4617 4652 4554 4730 4704 4663 4634 4637 4555 4665 4696 4612 4811 4718 4793 4568 4564 4649 4727 4618 4761 4711 4721 4633 4685 4706 4681 4569 4740 4646 4751 4621 4572 4626 4732 4699 4581 4633 4746 4659 4754 4567 4686 4653 4647 4605 4557 4657 4807 4570 4685 4668 4745 4700 4716 4737 4680 4670 4669 4669 4723 4671 4572 4664 4740 4656 4604 4674 4644 4703 4770 4698 4583 4683 4674 4645 4705 4667 4580 4648 4712 4742 4641 4745 4701 4617 4704 4641 4627 4769 4719 4633 4553 4708 4480 4738 4603 4639 4632 4654 4687 4725 4572 4660 4757 4752 4604 4676 4671 4623 4691 4667 4661 4609 4650 4652 4771 4562 4533 4711 4673 4684 4739 4705 4743 4652 4792 4768 4734 4643 4624 4624 4550 4613 4762 4712 4665 4673 4572 4715 4718 4771 4743 4538 4603 4578 4629 4645 4677 4699 4515 4685 4739 4685 4656 4625 4636 4618 4516 4775 4630 4685 4727 4684 4693 4634 4636 4602 4669 4671 4678 4751 4686 4641 4664 4665 4642 4600 4738 4684 4607 4747 4538 4703 4751 4702 4616 4737 4603 4616 4645 4651 4640 4693 4675 4683 4738 4778 4734 4635 4734 4494 4733 4626 4761 4671 4655 4741]

These are 1.195 million blobs in total, which gives an expected number of 4668 blobs per index and an expected deviation below 4.38%, or up to 204 entries. The minimum of 4480 and the maximum of 4811 stay within that limit. To sum up, the model assumptions and the real repository match nicely.

Conclusion

As discussed previously the memory usage of the sharded index can spike to nearly 4 times the optimal memory usage. This is an inherent problem of an index with equal-sized shards and also applies to the simplest solution of using only one shard, that is a simple single hashmap.

So, what is the solution then? The key to avoiding the memory usage spike is to get rid of "equal" in "equal-sized shards". That way only a few or even just a single shard will be in the evacuation phase at a given point in time. Shards should also be kept as near as possible to the maximum load factor to be memory efficient.

It's probably possible to skew the key ranges each shard is responsible for, such that e.g. the first shard gets nearly twice (1.9x) as many entries as the last one (and then linearly (?) interpolate between those). That way only about 11% of the shards would be evacuated at the same time. However a lot of the shards would end up with a size that leads to relatively low load factors and thus a rather high value for bytes per entry (my sketch calculations ended up with 120 bytes or more).

Another option would be dynamically adjusting the key ranges, but that sounds like a bad idea to me, as it would involve juggling a lot of entries between shards. If the shards are kept close to their maximum capacity, this could even cause ripples across multiple shards.

My suggestion would be to assign entries to shards based on the filling level of a shard (independent of the blob ID) and just use a low number of shards. An entry would end up in the first shard with the lowest memory costs (regarding whether a new backing array for the hashmap would be required and how large it would be). That is, if an index can store the entry without allocating, then one of those will be selected. Otherwise one of the smallest indexes is grown. Using the first cheapest index has another nice property: once the hashmap of that index has grown, it will be able to store lots of new entries with zero allocation costs, and it will be the first hashmap with these costs. That is, that hashmap will be quickly filled up to just below the maximum load factor. Only then will the next hashmap be grown and filled up, and so on. That way only a single hashmap can be in the evacuation phase and have a low load factor. (See MichaelEischer@2d43a0b#diff-c1bc3bce10dd7a8df44bed943acb01c3 for a hacky implementation of that growth strategy.)
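A minimal self-contained sketch of this growth strategy (all names and the soft-limit bookkeeping are illustrative assumptions, not the linked implementation):

```go
package main

import "encoding/binary"

// growingShards assigns entries to the first shard that still has room
// below its soft limit, so at any point in time at most one underlying
// hashmap is growing.
type growingShards struct {
	shards []map[[32]byte]uint64
	limits []int // soft capacity per shard, doubled one shard at a time
}

func newGrowingShards(n, initialLimit int) *growingShards {
	g := &growingShards{
		shards: make([]map[[32]byte]uint64, n),
		limits: make([]int, n),
	}
	for i := range g.shards {
		g.shards[i] = make(map[[32]byte]uint64)
		g.limits[i] = initialLimit
	}
	return g
}

func (g *growingShards) insert(k [32]byte, v uint64) {
	// Prefer the first shard below its limit: once grown, a shard is
	// filled up close to its limit before the next one has to grow.
	for i, s := range g.shards {
		if len(s) < g.limits[i] {
			s[k] = v
			return
		}
	}
	// All shards are full: grow the smallest one and insert there.
	smallest := 0
	for i := range g.shards {
		if len(g.shards[i]) < len(g.shards[smallest]) {
			smallest = i
		}
	}
	g.limits[smallest] *= 2
	g.shards[smallest][k] = v
}

// lookup has to scan all shards, as the shard can no longer be derived
// from the key alone.
func (g *growingShards) lookup(k [32]byte) (uint64, bool) {
	for _, s := range g.shards {
		if v, ok := s[k]; ok {
			return v, true
		}
	}
	return 0, false
}

func key(i uint64) (k [32]byte) {
	binary.BigEndian.PutUint64(k[:8], i)
	return
}

func main() {
	g := newGrowingShards(4, 100)
	for i := uint64(0); i < 350; i++ {
		g.insert(key(i), i)
	}
	if _, ok := g.lookup(key(42)); !ok {
		panic("lookup failed")
	}
}
```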

The only downside is that a lookup would then have to iterate through all shards/indexes to look up an entry. The memory overhead relative to the 81 bytes per entry for an optimal load factor would then be roughly 2.67*1/parts + (parts-1)/parts = 1 + 1.67/parts. With 5 to 10 parts this yields an overhead of just 33% or 17%, respectively. This should have a lower overhead than the current index implementation on master.

I think that we should try to find a replacement for the blobs map - as the implementation of the map causes this trouble - and leave the Index and MasterIndex logic as is.

A hashmap always has to allocate a new backing array and keep it in memory together with the existing backing array during the evacuation steps. If the evacuation happens all at once (or if the backing array just contains pointers to the buckets, which ends up in pointer hell) then it should be possible to decrease the 3x memory overhead to "just" 2x. Undercutting this requires either that the hashmap can grow its backing array by less than a factor of 2 or that only a part of the hashmaps is grown at once (as in my suggestion).

// After calling, there will be only one big final index in MasterIndex
// containing all final index contents.
// Indexes that are not final are left untouched.
func (mi *MasterIndex) MergeFinalIndexes() {
Member


It would be much more convenient if a new index were merged automatically when a final index is added via Insert.

Contributor Author


Yes, it would be. There is however a problem with taking care of superseded index files: if treated correctly, they should be removed from the MasterIndex. But when reading all index files, an already superseded index could be read and inserted while the index which supersedes it is only read and inserted later. In that case we would have to remove the entries again. This is why I first read and Insert all index files (Insert can easily take care of deleting "newly" superseded ones) and only then merge the final indexes.

A correct implementation of treating superseded indexes in Insert is in #2718 - however I think, I should make a separate PR out of it.

Contributor Author

aawsome commented Jun 21, 2020

tl;dr Sharding the index into equal parts does not and cannot fix the memory usage problem.

@MichaelEischer Thank you for the explanation! I thought about problems with sharding and mistakenly assumed that this could happen but would be pretty unlikely. I did not take into account that the hash property ensures that all entries are distributed about equally over the shards - damn!

Mhh - so we are back in the discussion of #2523 and it seems there is no easy way to design a suitable data structure (I already thought about whether we should open this and implement the sharded map - that, however, wasn't too complex).

Just a side remark - but of course it doesn't salvage a non-optimal design by pointing to design problems of other solutions:

The current implementation with a map per index file can also run into the same issue if all or most of the index files reach a critical size where their maps need 3x the memory. This may be less probable, and #2749 made it more unlikely, but it can of course still happen...

@MichaelEischer
Member

Mhh - so we are back in the discussion of #2523 and it seems there is no easy way to design a suitable data structure (I already thought about whether we should open this and implement the sharded map - that, however, wasn't too complex).

Yes, it's that discussion all over again. However, I think that it should, as an interim solution, be rather easy to fix the index lookup performance while keeping the memory usage at roughly the current level. For that we'd just have to merge the indexes such that there are at most 5 to 10 indexes (see "my suggestion" in my previous comment for more details on that idea). That would keep the lookup speed similar no matter how large the index gets and would allow growing one index at a time. The peak memory usage should be at about two times the optimum (ca 100 bytes per entry), which is already smaller than the on-disk representation. And for my personal backup repository it could even reduce the memory usage which is currently at 122 bytes per entry.

Contributor Author

aawsome commented Jun 21, 2020

Yes, it's that discussion all over again. However, I think that it should, as an interim solution, be rather easy to fix the index lookup performance while keeping the memory usage at roughly the current level. For that we'd just have to merge the indexes such that there are at most 5 to 10 indexes (see "my suggestion" in my previous comment for more details on that idea). That would keep the lookup speed similar no matter how large the index gets and would allow growing one index at a time. The peak memory usage should be at about two times the optimum (ca 100 bytes per entry), which is already smaller than the on-disk representation. And for my personal backup repository it could even reduce the memory usage which is currently at 122 bytes per entry.

I think we can do better than linearly searching some (maybe 5-10, as you proposed) indexes on every lookup.
In fact, we can use the property that the IDs are equally distributed and use maps which by design keep a constant ratio between their numbers of elements. This way we can ensure that (at least for large maps, where the law of large numbers "holds") at most one map is at the critical size at a time.

I implemented it with three maps and the ratios 12/32, 11/32, 9/32 (see the last commit). It's tested but not benchmarked yet. Of course other ratios and maybe more than three maps are also possible and may work out better.
The more maps, the closer the actual memory overhead gets to the average memory overhead. The tradeoff is more maps to search and a slower function for deriving the actual map to use. My feeling is that around 5 maps would suffice.

@aawsome aawsome changed the title Merge in-memory indexes; use sharded maps Merge in-memory indexes; use multiple maps Jun 21, 2020
Contributor Author

aawsome commented Jun 21, 2020

A bit of math:

If we want to have n maps such that

len(map_0) = k*len(map_1) = k^2*len(map_2) = ... = k^(n-1)*len(map_(n-1))

and len(map_(n-1)) = k/2 * len(map_0) (so that the smallest map, once doubled, becomes the new largest), we get k = 2^(1/n).

Now, how many entries should be in each map?

We have

```
total_len := len(map_0) + len(map_1) + ... + len(map_(n-1))
           = (1 + k + k^2 + ... + k^(n-1)) * len(map_(n-1))
           = (k^n - 1)/(k - 1) * len(map_(n-1))
```

thus (note that k^n = 2):

```
len(map_(n-1)) = (2^(1/n) - 1) * total_len
```

(the other map lengths can easily be computed with the first formula)

To get the sum of the length of the i largest maps, we compute

```
len(map_0) + len(map_1) + ... + len(map_(i-1))
= total_len - (len(map_i) + len(map_(i+1)) + ... + len(map_(n-1)))
= total_len - (1 + k + ... + k^(n-i-1)) * len(map_(n-1))
= (1 - (k^(n-i) - 1)/(k - 1) * (k - 1)) * total_len
= (2 - 2^((n-i)/n)) * total_len
```

If we use 5 maps (n=5), we get a factor k = 2^(1/5) = 1.1487, which means 14.87% more entries for the next-larger map. This should be enough buffer compared to the 10% mentioned by @MichaelEischer.

Then we get the following ranges:

```
256 * (2 - 2^(4/5)) = 66
256 * (2 - 2^(3/5)) = 124
256 * (2 - 2^(2/5)) = 174
256 * (2 - 2^(1/5)) = 218
```

I modified the code to use this "optimal" distribution across 5 maps.
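The formula above can be checked numerically. A small sketch (not the PR's actual code) that computes the first-byte cut-offs for n maps from `(2 - 2^((n-i)/n)) * 256`:

```go
package main

import (
	"fmt"
	"math"
)

// cutoffs returns the first-byte boundaries for distributing 256 key
// prefixes over n maps: the i largest maps together should hold a
// fraction of (2 - 2^((n-i)/n)) of all entries.
func cutoffs(n int) []int {
	c := make([]int, n-1)
	for i := 1; i < n; i++ {
		c[i-1] = int(math.Round(256 * (2 - math.Pow(2, float64(n-i)/float64(n)))))
	}
	return c
}

func main() {
	fmt.Println(cutoffs(5)) // [66 124 174 218]
}
```

This reproduces the four boundaries listed above for n=5.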

@aawsome
Contributor Author

aawsome commented Jun 26, 2020

> Yes, it's that discussion all over again. However, I think that it should, as an interim solution, be rather easy to fix the index lookup performance while keeping the memory usage at roughly the current level. For that we'd just have to merge the indexes such that there are at most 5 to 10 indexes (see "my suggestion" in my previous comment for more details on that idea). That would keep the lookup speed similar no matter how large the index gets and would allow growing one index at a time. The peak memory usage should be at about two times the optimum (ca 100 bytes per entry), which is already smaller than the on-disk representation. And for my personal backup repository it could even reduce the memory usage which is currently at 122 bytes per entry.

How to proceed with this PR?

OK, as a good in-memory representation of the index data seems to be the crucial part, I think we should take the following steps:

  1. design a good in-memory index representation
  2. after that is finished, continue work on this PR, which should then be just about merging.

I'll convert this PR to a draft in the meantime.

How to proceed with the in-memory index?

Regarding a possible solution for the first point: I agree that a sorted list is optimal with respect to memory usage, but not with respect to inserts. Hence I used the last evenings to try out a combined approach:

I used a hash map to distribute the index data into large "buckets" containing around 1000-2000 index entries each. The buckets themselves are stored as paged lists with a page size of 32. As usually 32-64 pages are in use, the average fill level of a page is above 31.5 entries, i.e. more than 98%. Sorting 1000-2000 entries per bucket is feasible if done carefully. When growing the hash map, at most one page needs to be added per bucket, which gives an extra overhead of less than 3%. Also, no lazy growing is implemented, as peaks in insertion time should not matter within restic.

A few remarks about this solution:

  1. Note that instead of a "paged list" one could also use a linked list, similar to the overflow buckets in Go's hashmap implementation. The paged list has the advantage of cheap indexing, which makes sorting, searching and inserting faster.
  2. In contrast to Go's hashmap implementation, having more than one page per bucket (i.e. the existence of overflow buckets, in the other vocabulary) is the usual and intended case here, in order to get a high fill level of the pages.
  3. Using 32 instead of 8 entries per page reduces the pointer overhead from 1 byte per entry to 1/4 byte per entry.
  4. As all buckets can be treated independently, sorting and growing can easily be parallelized.
  5. As the key of this hashmap (the blob ID) is already a hash, there is no need for another hash function. In fact, using the first 8 bytes of the ID is sufficient.
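A minimal sketch of such a paged-list bucket, assuming a fixed page size of 32 (all names and the entry layout are illustrative, not restic's actual types):

```go
package main

import "fmt"

// Entries are stored in fixed-size pages, so growing a bucket allocates
// at most one new page, while index arithmetic still gives O(1) access,
// which keeps sorting and binary search simple compared to a linked list.

const pageSize = 32 // one 8-byte page pointer per 32 entries = 1/4 byte per entry

type entry struct {
	id uint64 // stand-in for the first 8 bytes of the blob ID
}

type page [pageSize]entry

type bucket struct {
	pages []*page
	n     int // total number of entries in this bucket
}

func (b *bucket) insert(e entry) {
	if b.n%pageSize == 0 { // all pages full: add one
		b.pages = append(b.pages, new(page))
	}
	b.pages[b.n/pageSize][b.n%pageSize] = e
	b.n++
}

// at returns a pointer to the i-th entry via simple index arithmetic.
func (b *bucket) at(i int) *entry {
	return &b.pages[i/pageSize][i%pageSize]
}

func main() {
	b := &bucket{}
	for i := 0; i < 40; i++ {
		b.insert(entry{id: uint64(i)})
	}
	fmt.Println(b.n, len(b.pages), b.at(39).id) // 40 2 39
}
```

With full pages the only per-entry overhead over a flat sorted array is the page pointer, matching the sub-3% growth overhead described above.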

I already started a first test implementation; it gave stable memory usage below 55 bytes per entry for the test case with on average 8.67 blobs per pack. This is quite close to the minimum of 52.69 bytes 😄 First performance tests also showed acceptable performance. I'm going to clean things up and prepare a separate PR for this.

@aawsome
Contributor Author

aawsome commented Jun 27, 2020

I added PR #2811 and will work on this as a follow-up PR.

@aawsome
Contributor Author

aawsome commented Jun 28, 2020

As #2812 seems very promising, I suggest building this PR on top of it.

#2811 has a little more potential for low memory usage; however, the implementation is much more complicated, it is way slower than #2812, and the handling (the necessity to sort) is worse. Moreover, as @greatroar correctly pointed out, with #2812 it is possible to implement a merge with almost no extra allocation, which I consider a great feature 👍

```go
// The first index is always final and the one to merge into
newIdx := mi.idx[:1]
for i := 1; i < len(mi.idx); i++ {
	idx := mi.idx[i]
```
Member

@MichaelEischer MichaelEischer Jul 1, 2020


This should be complemented with a `mi.idx[i] = nil` assignment to get rid of the reference to the old index. That way merge also no longer has to modify idx2.
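For illustration, a self-contained sketch of the loop with that suggestion applied; the `Index`/`MasterIndex` types and the `merge` method here are simplified stand-ins, not restic's actual API:

```go
package main

import "fmt"

type Index struct {
	final   bool
	entries map[string]int
}

func (idx *Index) merge(other *Index) {
	for k, v := range other.entries {
		idx.entries[k] = v
	}
}

type MasterIndex struct{ idx []*Index }

// mergeFinal folds all finalized indexes into the first one. Setting
// mi.idx[i] = nil drops the reference so each merged index can be
// garbage-collected as soon as its entries have been copied.
func (mi *MasterIndex) mergeFinal() {
	newIdx := mi.idx[:1] // the first index is final and the merge target
	for i := 1; i < len(mi.idx); i++ {
		idx := mi.idx[i]
		mi.idx[i] = nil // release the reference to the old index
		if idx.final {
			newIdx[0].merge(idx)
		} else {
			newIdx = append(newIdx, idx) // keep non-final indexes
		}
	}
	mi.idx = newIdx
}

func main() {
	mi := &MasterIndex{idx: []*Index{
		{final: true, entries: map[string]int{"a": 1}},
		{final: true, entries: map[string]int{"b": 2}},
		{final: false, entries: map[string]int{"c": 3}},
	}}
	mi.mergeFinal()
	fmt.Println(len(mi.idx), mi.idx[0].entries["b"], mi.idx[1].entries["c"]) // 2 2 3
}
```

Note that `newIdx` aliases the backing array of `mi.idx`, so the append writes at or before the position already read and nil'ed, keeping the merge allocation-free.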

@aawsome
Contributor Author

aawsome commented Jul 4, 2020

As this discussion has become somewhat "polluted" with index-related topics, I opened #2818, which implements merging of indexes based on #2812.

@aawsome aawsome closed this Jul 17, 2020
@aawsome aawsome deleted the merge-index branch July 22, 2020 20:08