Merge in-memory indexes; use multiple maps #2799
aawsome wants to merge 5 commits into restic:master
Conversation
MichaelEischer left a comment
I'm a bit worried about the memory overhead this PR might have. When the index hash map has to be resized, Go doubles the size of the backing array and then slowly migrates the entries into the new hashmap. That way we might end up needing three times as much memory as is actually needed.
During the discussions in #2523 I proposed splitting the hashtable into e.g. 10 smaller parts and then growing those in turns until they are close to reaching the maximum load factor. That at least avoids the three times peak memory usage and is still rather fast. A slightly cleaned up version of that merging logic is available here: MichaelEischer@2d43a0b#diff-c1bc3bce10dd7a8df44bed943acb01c3
I didn't dive into the Go map implementation too much, but it seems you are right; this might be an issue. I changed the benchmark (a 200k pack file index shouldn't occur in reality) to use 50 indexes with 20k packs each. The benchmark results of this much more realistic setting show a huge speed improvement. I think that we should try to find a replacement for the current index data structure. However, as long as we risk a huge memory overhead, this should not be merged; I'll change this PR to draft mode.
A simple sharded hash table grows at the same rate as a regular one, but avoids the big temporary allocation (only one shard grows at a time). It can also prevent lock contention. E.g. (untested):

```go
type idxMap [32]struct {
	sync.Mutex
	m map[restic.BlobHandle]indexEntry
}

func (m *idxMap) get(h restic.BlobHandle) (indexEntry, bool) {
	i := h.ID[0] % byte(len(m))
	m[i].Lock()
	defer m[i].Unlock()
	entry, ok := m[i].m[h]
	return entry, ok
}
```
@greatroar I was thinking about something like this.
Ok, as the implementation of the sharded index was not too complex, I've added it to this PR. A first benchmark showed that the speed improvement of the merged index should outweigh the additional overhead introduced by the shards. It also slightly reduces the memory used by the index for large repositories (but introduces a constant overhead to manage the 256 shards)
-> makes index operations faster, especially for large repositories
Here are the benchmark results (old vs. new). Pure lookup got faster, and memory usage seems to be reduced by about 3%.
You can get a nice overview of benchmark results from the benchstat tool. For new benchmarks this will of course not give a comparison, but you can still run it on a single result file to summarize the measurements.
tl;dr: Sharding the index into equal parts does not and cannot fix the memory usage problem. (Sorry for writing this much. This comment somehow got 'slightly' out of hand...)

"Proof" by example

Sharding the index into equal parts does not fix the memory usage problem. The 3 times memory overhead still applies for the sharded index variant. I used the following code snippet in the index benchmark. At 75 indexes the hashtables are pretty close to reaching the maximum load factor. At 83 we're already quite a bit beyond that point, such that each of the index shards has allocated a new bucket array (with 2 times the size of the old one) and is in the process of migrating entries (more on why the memory usage just increased by a factor of 2.7 and not 3 later). At 90 indexes the memory usage is down to 2/3 of the maximum, as the migration to the new bucket array has been completed.

```go
i := 75 // number of indexes with 10k packs
runtime.GC()
var m runtime.MemStats
runtime.ReadMemStats(&m)
mIdx, _ := createRandomMasterIndex(rand.New(rand.NewSource(0)), i, 10000)
mIdx.FinalizeNotFinalIndexes()
mIdx.MergeFinalIndexes()
runtime.GC()
var n runtime.MemStats
runtime.ReadMemStats(&n)
runtime.KeepAlive(mIdx)
b.Logf("%d %.3f", i, float32((n.HeapAlloc-m.HeapAlloc)/1024)/1024)
```

Hashmap growth in Go

To understand why the memory usage spikes at that point, we have to take a closer look at hashmaps in Go first, as these are used as the datastore for the index shards.

Basics

(Based on the hashmap source code: https://github.com/golang/go/blob/60f78765022a59725121d3b800268adffe78bde3/src/runtime/map.go)

Hashmaps in Go allocate an array of buckets of type

```go
type bmap struct {
	tophash  [8]uint8
	keys     [8]restic.BlobHandle
	values   [8]indexEntry
	overflow *bmap
}
```

A restic.BlobHandle has 33 bytes and an indexEntry is 16 bytes large, which yields 408 bytes per bucket on a 64-bit platform. When allocated in a large array, each bucket requires exactly that many bytes, whereas a single bmap object allocated for an overflow bucket requires 416 bytes, as this is the smallest matching malloc size class. (The current sharding code just stores a restic.ID instead of a restic.BlobHandle, which saves 8 bytes.)

Hashmap growth

The hashmap uses a maximum load factor (elements in the hashmap divided by backing array capacity) of 6.5 entries per bucket, which when reached triggers hashmap growth (line 33, 70). In the following I use the more common (?) notation of specifying the load factor in relation to the number of entry slots in the backing array: 6.5/8 = 0.8125. Growing a hashmap means that a new backing array with twice the previous size (B += 1) is allocated, which the hashmap has to keep in parallel to the old backing array until all entries are migrated into the new buckets. This is where the memory usage spikes to roughly three times the previous usage. A special feature of the hashmap in Go is that it migrates buckets incrementally, in a process called evacuation, and not all entries at once, as the latter would result in a huge latency spike. Each assignment or deletion on the hashmap triggers the migration of roughly two buckets (the bucket that should be accessed and the first not-yet-evacuated bucket; if the former is already migrated, nothing happens), that is up to 16 entries (actually all overflow buckets are also migrated along with their parent bucket). The evacuation works in principle by moving the entries of these two buckets one by one into the new backing array.
Give me numbers!

Putting this into numbers, a hashmap should grow when reaching 6.5 * 2^B entries, that is, when hitting the maximum load factor of 6.5 entries per bucket. The code used for the measurements is available at https://gist.github.com/MichaelEischer/e3280247264101864b26c162fd823df9 (using a trick from https://hackernoon.com/some-insights-on-maps-in-golang-rm5v3ywh to access the hashmap internals) and can be run following the instructions in the gist. The extracted and modified index test allows the creation of indexes with a specified number of entries (not packs!; it might overshoot by a few entries). In addition it also prints the number of used overflow buckets and whether an evacuation is in progress.
The maximum number of entries before the backing array grows is pretty much identical to the expected value, except that due to the way the index is created it might be necessary to subtract a buffer of up to 8 entries. The evacuation completion column only contains approximate values, as the exact numbers vary with the hash seeds used to calculate the key hashes.

Let's now continue with a close look at the memory usage per index entry. For a single entry in a fully loaded hashmap we get 408 / 6.5 ≈ 62.8 bytes, somewhat less than the measured values. This discrepancy nearly completely disappears when we account for the overflow buckets, whose number is up to 20.9% of the number of normal buckets (see table). Note that this seems to be the standard expected number of overflow buckets (line 43). A hashmap preallocates an additional 1/16 of the bucket array for overflow buckets (line 350), which then benefit from the lower memory storage size of 408 instead of 416 bytes when allocated individually (line 262). Putting this into a formula gives the bytes per entry as a function of the load factor. The optimal number of bytes per entry (without further compression) would be 33 + 16 = 49 bytes, just the key and value data.

Optimal hashmap load

Right after completing the growth of a hashmap, the load factor is only half the maximum, 6.5/16 ≈ 0.41, and the memory usage per entry is correspondingly at its highest. When moving in ten steps from the evacuation completion point to right before the next growth point, the memory usage per index entry steadily decreases. That is, for the most efficient data storage, each hashmap should be filled as close as possible to the maximum load factor.
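The load-factor arithmetic can be illustrated with a small calculation. This sketch ignores overflow buckets, so the numbers are a bit lower than the measured ones:

```go
package main

import "fmt"

// perEntry returns the bytes of bucket storage per index entry at a given
// load factor (entries per bucket slot), ignoring overflow buckets.
func perEntry(load float64) float64 {
	const bucketSize = 408.0 // bytes per 8-slot bucket, from the layout above
	const slots = 8.0
	return bucketSize / (slots * load)
}

func main() {
	// At the maximum load factor of 6.5/8 the storage cost is lowest;
	// right after a growth the load factor is halved and the cost doubles.
	fmt.Printf("at max load (6.5/8): %.1f bytes/entry\n", perEntry(6.5/8))
	fmt.Printf("right after growth:  %.1f bytes/entry\n", perEntry(6.5/16))
}
```

This shows why a map should sit as close to the maximum load factor as possible: per-entry storage roughly doubles directly after a growth.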
Sharding the index

The index sharding variant distributes index entries to shards based on the top byte of their blob ID. As the IDs are SHA-256 hashes, this yields a uniform distribution over the shards for normal data. When modelling this as a binomial distribution, where a blob either ends up in a specific shard or not, we get an expected shard size of n/256 with a small standard deviation. The master index with 75 indexes of 10k packs each contains about 6.5 million blobs, which for the above formula results in an absolute expected deviation of only a few hundred entries per shard.
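The binomial model can be sketched numerically. Treating the expected deviation as a 3-sigma bound is my assumption; the blob count is taken from the real-repository example in this thread:

```go
package main

import (
	"fmt"
	"math"
)

// threeSigma returns three standard deviations of a Binomial(n, p)
// shard size, used here as a bound on the expected deviation.
func threeSigma(n, p float64) float64 {
	return 3 * math.Sqrt(n*p*(1-p))
}

func main() {
	const shards = 256.0
	n, p := 1195000.0, 1/shards // blob count from the example repository

	mean := n * p
	dev := threeSigma(n, p)
	fmt.Printf("mean %.0f, deviation %.0f (%.2f%%)\n", mean, dev, 100*dev/mean)
}
```

The result matches the numbers quoted for the real repository: a mean of about 4668 blobs per shard with a deviation of roughly 4.4%.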
The formulas used to estimate the growth and evacuation completion points yield 78.7 and 86.7 as the expected numbers of indexes, which align very well with the measurements. As mentioned earlier, the memory usage "just" jumps to 2.67 times its initial value when accounting for the overflow buckets. For 80 indexes the memory overhead grows to nearly 206 bytes per entry, which is 3.9 times the optimum. For 100 indexes the situation is far better: 107 bytes per entry, which is roughly twice the optimum.

Sanity check

The discussion and the tests so far assume equally distributed blob IDs. In the following I take a look at whether this assumption holds for a real backup repository. For the one used to back up my computer, the sharded indexes end up with the following numbers of blobs:

[4752 4755 4622 4705 4799 4709 4625 4616 4671 4744 4669 4714 4583 4685 4720 4667 4641 4666 4618 4734 4716 4704 4749 4616 4754 4499 4672 4677 4713 4595 4701 4674 4652 4737 4650 4599 4624 4721 4687 4649 4601 4742 4533 4653 4641 4670 4678 4643 4617 4652 4554 4730 4704 4663 4634 4637 4555 4665 4696 4612 4811 4718 4793 4568 4564 4649 4727 4618 4761 4711 4721 4633 4685 4706 4681 4569 4740 4646 4751 4621 4572 4626 4732 4699 4581 4633 4746 4659 4754 4567 4686 4653 4647 4605 4557 4657 4807 4570 4685 4668 4745 4700 4716 4737 4680 4670 4669 4669 4723 4671 4572 4664 4740 4656 4604 4674 4644 4703 4770 4698 4583 4683 4674 4645 4705 4667 4580 4648 4712 4742 4641 4745 4701 4617 4704 4641 4627 4769 4719 4633 4553 4708 4480 4738 4603 4639 4632 4654 4687 4725 4572 4660 4757 4752 4604 4676 4671 4623 4691 4667 4661 4609 4650 4652 4771 4562 4533 4711 4673 4684 4739 4705 4743 4652 4792 4768 4734 4643 4624 4624 4550 4613 4762 4712 4665 4673 4572 4715 4718 4771 4743 4538 4603 4578 4629 4645 4677 4699 4515 4685 4739 4685 4656 4625 4636 4618 4516 4775 4630 4685 4727 4684 4693 4634 4636 4602 4669 4671 4678 4751 4686 4641 4664 4665 4642 4600 4738 4684 4607 4747 4538 4703 4751 4702 4616 4737 4603 4616 4645 4651 4640 4693
4675 4683 4738 4778 4734 4635 4734 4494 4733 4626 4761 4671 4655 4741]

These are 1.195 million blobs in total, which gives an expected number of 4668 blobs per index and an expected deviation below 4.38%, or up to 204 entries. The minimum of 4480 and the maximum of 4811 stay within that limit. To sum up, the model assumptions and the real repository match nicely.

Conclusion

As discussed previously, the memory usage of the sharded index can spike to nearly 4 times the optimal memory usage. This is an inherent problem of an index with equal-sized shards, and it also applies to the simplest solution of using only one shard, that is, a single hashmap. So, what is the solution then? The key to avoiding the memory usage spike is to get rid of "equal" in "equal-sized shards". That way only a few, or even just a single shard, will be in the evacuation phase at any given point in time. Shards should also be kept as near as possible to the maximum load factor to be memory efficient. It's probably possible to skew the key ranges the shards are responsible for, such that e.g. the first shard gets nearly twice (1.9x) as many entries as the last one (and then linearly (?) interpolate between those). That way only about 11% of the shards would be evacuated at the same time. However, a lot of the shards would end up with sizes that lead to relatively low load factors and thus a rather high number of bytes per entry (my sketch calculations ended up with 120 bytes or more). Another option would be to dynamically adjust the key ranges, but that sounds like a bad idea to me, as it would involve juggling a lot of entries between shards. If the shards are kept close to their maximum capacity, this could even cause ripples across multiple shards. My suggestion would be to assign entries to shards based on the fill level of a shard (independent of the blob ID) and just use a low number of shards.
An entry would end up in the first shard with the lowest memory costs (regarding whether a new backing array for the hashmap would be required and how large it would be). That is, if an index can store the entry without allocating, then one of those will be selected. Otherwise one of the smallest indexes is grown. Using the first cheapest index has another nice property: once the hashmap of that index has grown, it will be able to store lots of new entries with zero allocation costs, and it will be the first hashmap with these costs. That is, the hashmap will be quickly filled up to just below the maximum load factor. Only then will the next hashmap be grown and filled up, and so on. That way only a single hashmap can be in the evacuation phase and have a low load factor. (See MichaelEischer@2d43a0b#diff-c1bc3bce10dd7a8df44bed943acb01c3 for a hacky implementation of that growth strategy.) The only downside is that a lookup would then have to iterate through all shards/indexes to look up an entry. The memory overhead relative to the 81 bytes per entry for an optimal load factor would then stay small.
A hashmap always has to allocate a new backing array and keep it in memory together with the existing backing array during the evacuation steps. If the evacuation happens all at once (or when the backing array just contains pointers to the buckets, and ends up in pointer hell), then it should be possible to decrease the 3 times memory overhead to "just" two times. Undercutting this requires either that the hashmap can grow its backing array by less than a factor of 2, or that only a part of the hashmaps is grown at once (as in my suggestion).
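The suggested growth strategy (fill the cheapest shard first, grow only one shard at a time) can be sketched as a toy model. All names here are illustrative, and the per-shard capacity is simulated bookkeeping only; it does not control the runtime's real map growth:

```go
package main

import "fmt"

type blobID [32]byte
type entry struct{ pack, offset, length uint32 }

// shard pairs a map with a simulated capacity: the number of entries it
// may hold before we consider it "full" and in need of growing.
type shard struct {
	m   map[blobID]entry
	cap int
}

type unevenIndex struct{ shards []shard }

func newUnevenIndex(n, initialCap int) *unevenIndex {
	u := &unevenIndex{shards: make([]shard, n)}
	for i := range u.shards {
		u.shards[i] = shard{m: map[blobID]entry{}, cap: initialCap}
	}
	return u
}

// insert prefers a shard that still has room, so at most one shard is
// growing at any time; if all are full, the smallest one is doubled.
func (u *unevenIndex) insert(id blobID, e entry) {
	for i := range u.shards {
		if len(u.shards[i].m) < u.shards[i].cap {
			u.shards[i].m[id] = e
			return
		}
	}
	smallest := 0
	for i := range u.shards {
		if len(u.shards[i].m) < len(u.shards[smallest].m) {
			smallest = i
		}
	}
	u.shards[smallest].cap *= 2
	u.shards[smallest].m[id] = e
}

// lookup has to scan all shards, the main cost of uneven sharding.
func (u *unevenIndex) lookup(id blobID) (entry, bool) {
	for _, s := range u.shards {
		if e, ok := s.m[id]; ok {
			return e, ok
		}
	}
	return entry{}, false
}

func main() {
	u := newUnevenIndex(4, 4)
	for i := 0; i < 20; i++ {
		u.insert(blobID{byte(i)}, entry{pack: uint32(i)})
	}
	// Shard 0 has grown to 8 entries; shards 1-3 stay full at 4.
	for i := range u.shards {
		fmt.Println(i, len(u.shards[i].m))
	}
}
```

The point of the sketch is the shape of the result: shards end up unevenly filled, and only one shard at a time is below its maximum fill level.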
```go
// After calling, there will be only one big final index in MasterIndex
// containing all final index contents.
// Indexes that are not final are left untouched.
func (mi *MasterIndex) MergeFinalIndexes() {
```
It would be much more convenient if a new index were merged automatically when a final index is added via Insert.
Yes, it would be. There is however a problem when taking care of superseded index files: if treated correctly, they should be removed from the MasterIndex. However, when reading all index files, an already superseded index could be read and inserted while the index that supersedes it is only read and inserted later. In this case we would have to remove the entries again. This is why I first read and Insert all index files (Insert can easily take care of deleting "newly" superseded ones) and only then merge the final indexes.
A correct implementation of treating superseded indexes in Insert is in #2718 - however, I think I should make a separate PR out of it.
@MichaelEischer Thank you for the explanation! I thought about problems with sharding and fell for the fallacy of thinking this could happen but would be pretty unlikely. I did not take into account that the hash property ensures that all entries will be distributed about equally over the shards - damn! Mhh - so we are back in the discussion of #2523, and it seems there is no easy way to design a suitable data structure (I already thought about whether we should reopen that and implement the sharded map - that, however, wasn't too complex). Just a side remark - of course it doesn't salvage a non-optimal design to point to design problems of other solutions: the current implementation with a map per index file can also exhibit the same issue, if all or most of the index files reach a critical size where their maps need 3x the memory at the same time. This may be less probable, and #2749 made it more unlikely, but of course it can still happen...
Yes, it's that discussion all over again. However, I think that it should, as an interim solution, be rather easy to fix the index lookup performance while keeping the memory usage at roughly the current level. For that we'd just have to merge the indexes such that there are at most 5 to 10 of them (see "my suggestion" in my previous comment for more details on that idea). That would keep the lookup speed similar no matter how large the index gets and would allow growing one index at a time. The peak memory usage should be at about two times the optimum (ca. 100 bytes per entry), which is already smaller than the on-disk representation. And for my personal backup repository it could even reduce the memory usage, which is currently at 122 bytes per entry.
I think we can do better than always linearly searching some (maybe 5-10, as you proposed) indexes every time. I implemented it with three maps and ratios 12/32, 11/32, 9/32 (see the last commit). It's tested but not benchmarked yet. Of course any other ratio, and maybe more than three maps, is also possible and may work out better.
A bit of math: if we want to minimize the average number of maps searched per lookup while keeping each map close to its maximum load factor, the question is how many entries should be in each map. (The lengths of the other maps can easily be computed with the first formula.) To get the sum of the lengths, we add them up. If we use 5 maps, we get a corresponding set of "optimal" ranges. I modified the code to use this "optimal" distribution over 5 maps.
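For illustration, the average number of maps probed for a successful lookup under the 12/32, 11/32, 9/32 split mentioned above can be computed as follows (assuming maps are searched in order of decreasing size):

```go
package main

import "fmt"

// expectedProbes computes the average number of maps searched for a
// successful lookup, given the fraction of all entries in each map
// and assuming maps are probed in order.
func expectedProbes(ratios []float64) float64 {
	e := 0.0
	for i, r := range ratios {
		e += float64(i+1) * r
	}
	return e
}

func main() {
	// The 12/32, 11/32, 9/32 split from the comment above.
	fmt.Printf("%.3f maps probed on average\n",
		expectedProbes([]float64{12.0 / 32, 11.0 / 32, 9.0 / 32}))
}
```

With these ratios a hit costs about 1.9 probes on average, compared to exactly 2 for three equal-sized maps, which is why a skewed split helps.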
How to proceed with this PR?

Ok, as finding a good in-memory representation of the index data seems to be crucial, I think we should do the following steps:
I'll convert this PR to draft in the meantime.

How to proceed with the in-memory index?

About a possible solution for the first point: I agree that a sorted list is optimal with respect to memory usage, but not with respect to inserts. Hence I used the last evenings to try out a combined approach: I used a hash map to distribute the index data into large "buckets" containing around 1000-2000 index entries. These are themselves saved in a paged list with a page size of 32. As usually 32-64 pages are used, the average fill level of a page is actually above 31.5 entries, or more than 98%. Sorting 1000-2000 entries for each bucket is feasible if done carefully. When growing the hashmap, at most one page needs to be added per bucket, which gives an extra overhead of less than 3%. Also, no lazy growing is implemented, as peaks in insertion times should not play a role within restic. A few remarks about this solution:
I already started a first test implementation; it gave a stable memory usage below 55 bytes per entry for the test case with on average 8.67 blobs per pack. This is quite close to the minimum of 52.69 bytes 😄 First performance tests also showed acceptable performance. I'm going to clean things up and prepare a separate PR for this.
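The paged list described above (fixed page size of 32) can be sketched like this; the type and field names are illustrative, not restic's actual implementation:

```go
package main

import "fmt"

const pageSize = 32

type entry struct{ pack, offset, length uint32 }

// pagedList appends entries into fixed-size pages, so growing never
// copies existing entries and wastes at most one partial page.
type pagedList struct {
	pages []*[pageSize]entry
	n     int
}

func (l *pagedList) add(e entry) {
	if l.n == len(l.pages)*pageSize {
		// All pages full: allocate exactly one more page.
		l.pages = append(l.pages, new([pageSize]entry))
	}
	l.pages[l.n/pageSize][l.n%pageSize] = e
	l.n++
}

func (l *pagedList) get(i int) entry {
	return l.pages[i/pageSize][i%pageSize]
}

func main() {
	var l pagedList
	for i := 0; i < 100; i++ {
		l.add(entry{pack: uint32(i)})
	}
	fmt.Println(len(l.pages), l.n) // 4 100
}
```

With many pages per bucket, only the last page is partially filled, which is where the high average fill level quoted above comes from.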
I added PR #2811 and will work on this as a follow-up PR.
As #2812 seems to be very promising, I suggest building this PR on top of it. #2811 has a little more potential for low memory usage; however, the implementation is much more complicated, it is way slower than #2812, and the handling (the necessity to sort) is worse. Moreover, as @greatroar correctly pointed out, with #2812 it is possible to implement a merge with almost no extra allocation, which I consider a great feature 👍
```go
// The first index is always final and the one to merge into
newIdx := mi.idx[:1]
for i := 1; i < len(mi.idx); i++ {
	idx := mi.idx[i]
```
This should be complemented with a mi.idx[i] = nil assignment to get rid of the reference to the old index. That way merge also no longer has to modify idx2.
What is the purpose of this change? What does it change?
Merge in-memory indexes when they are finalized.
This increases performance especially for large repositories with many index files.
To avoid large spikes in memory consumption when maps are resized, this PR also introduces sharded maps to save the index entries. By using 256 shards one byte of key information can be removed which also reduces the memory required for large indexes a bit.
Was the change discussed in an issue or in the forum before?
The idea comes from #2523.
Checklist
- [ ] I have added a changelog entry in changelog/unreleased/ that describes the changes for our users (template here)
- [ ] I have run gofmt on the code in all commits