Write & mmap chunks in the background #10709
prymitive wants to merge 6 commits into prometheus:main from
Conversation
/prombench main
FYI the main improvement here, I think, is the way writes are scheduled.
Note that the rule until now is that TSDB only uses one core, therefore one goroutine.
The chunk write queue, which was added a few releases ago and recently disabled after it caused the big performance degradation noticed in the linked issue, already runs an extra goroutine to process writes.
Each background thread costs 1.33 cores that every user needs to provision, as you need to assume the worst case that they're all active at the same time.
As I've mentioned on #10377 (comment), running mmap off a goroutine scheduled every N minutes is arbitrary. The goal was to move this whole process away from the append() path so it doesn't block scrapes & rules. I wasn't sure what the best moment to mmap extra chunks is, so I went with a background goroutine. Happy to remove this goroutine, but I'm not sure where & when mmap should be triggered (assuming the rest of the patch can stay).
Could we do it at the previous head compaction? After the very first, we should have a good idea of how many we'll need.
I can move it here: Line 967 in 707600d. But one thing to consider is that mmapping of chunks was added to lower memory usage, so if we only mmap extra chunks before compaction, which seems to happen just before writing out blocks (so every 2h by default?), we'll hold more in-memory (not mmapped) chunks and so have higher memory usage than at the moment.
82ffc60 to 471eab5
I wonder if mmap should also be forced in Line 581 in 9558b9b
/prombench cancel
Benchmark cancel is in progress.
We already have an async queue that does this (see #10051). It works but had some bottleneck somewhere around locks.

I have not read your PR in detail yet, but I see that you have a loop to go over all series and m-map them. I don't think we want to m-map all series at the same time. Also, I would avoid making headChunk a slice; it is simpler to work with a single chunk here.

If it interests you, would you like to take a look at the existing queue and see how we can improve it? (We disabled it by default here: #10425)

(Edit: I see that you reported the bottlenecks in #10377)
I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?
Enabling chunk_write_queue causes severe degradation, which is why #10377 was raised. The patch for queue locks didn't show any noticeable performance improvement for my workload. I did dig a bit more into both the code and the history of changes around mmapping chunks, and I realized that the queue added in #10051 was aimed at fixing the exact issue I'm observing. But it seems that it was focused on fixing remote write problems, so it might have improved writes with the unfortunate side effect of degrading reads under some conditions. I see this patch as an alternative solution to the same problem. In both cases you keep chunks in memory until you're ready to mmap them. The difference is that the queue variant moves in-memory chunks to a new single component with additional locking that might have new/different bottlenecks, while I was trying to keep everything as parallel and distributed as it is (since that works well) and do the mmapping in a more ordered and controllable way.
I think that the queue doesn't really help here because it calls prometheus/tsdb/chunks/head_chunks.go Lines 382 to 383 in 59727ab, which essentially serialises all writes. Without the queue this path is more expensive, since we need to do the actual write, but it seems that this code path is expensive enough even if we just queue the chunk rather than do a full write.

When a TSDB block needs to be written and suddenly we need to create a new chunk for every time series being scraped and every rule being evaluated, we hit a race window. Let's say scrapes and rules have a 1 minute interval. We hit the TSDB block max duration and start creating new chunks for everything. A rule evaluation produces a sample, or a scrape returns a sample, and we need to append it. The problem I'm hitting is that I have an instance with 20-26M time series. Since WriteChunk is essentially serialised, the more chunks I have in Prometheus the longer any blocking will last. This is made worse the more rules I have that use chunks produced by other rules. So eventually I hit the 1 minute (rule interval) and 2 minute (query timeout) marks and see failures.

Since the problem is appends getting blocked, which only happens when we need to cut new chunks, and the queue doesn't seem enough to remediate this issue, I've decided that the easiest way to deal with it is to get rid of blocking on the append() path. If we don't need to block and wait for mmap when appending, then appends always have the same performance level. Since mmapping of chunks was added to lower memory usage we still do want to mmap it all, but this patch moves that to run around GC time, as suggested by @brian-brazil.
8d348bc to ca6ca84
Oh, I see it now. We will trigger the function every time Compact() is called, which does not depend on compaction actually happening. I would suggest moving that call inside

I will review the rest of the code soon.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing review comment Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing PR review comments Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
a1774ff to d81c0b0
codesome left a comment

Nice work! I just have a bunch of small comments now.
```go
if err := db.compactBlocks(); err != nil {
	return err
}

return nil
```

This change is not needed now.
```go
	require.NoError(t, app.Commit())
	maxt++
}
db.head.mmapHeadChunks()
```

The db.Close() below will end up calling this, so it's not required here I guess?
```go
// Save memory by mmapping all but last chunk for all time series with multiple chunks
// stored in series.headChunks.
// We need to keep last chunk around so it can receive appends, but every other chunk
// holds older samples and so is only used for reads and so can be safely mmapped.
```

I see comments like this spread around everywhere. I suggest removing this comment and instead adding a comment on the mmapHeadChunks function explaining what it is doing and why. This comment could be replaced by:

```go
// We attempt mmapping of head chunks regularly.
```
```go
if mmapped > 0 {
	level.Info(h.logger).Log("msg", "Finished mmapping head chunks", "chunks", mmapped, "duration", time.Since(start).String())
}
```

I don't think we need this log line. I can see this becoming spammy very soon (large setup with something being m-mapped every minute).

Alternatively, we can make it a debug log.

I was thinking of turning this into a metric instead.
```go
// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
mmappedChunks []*mmappedChunk
headChunk     *memChunk   // Most recent chunk in memory that's still being built.
headChunks    []*memChunk // In-memory chunks not yet written & mmapped, including the last chunk that's still being built.
```

Suggested change:

```go
headChunks []*memChunk // In-memory chunks not yet mmapped, including the last chunk that's still being built.
```
```go
// is len(s.mmappedChunks), it represents the next chunk, which is the head chunk.
ix := int(id) - int(s.firstChunkID)
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks) {
```

ix is 0-based. Previously len(s.mmappedChunks) meant the last head chunk. Now len(s.mmappedChunks)+len(s.headChunks)-1 is the last chunk.

Suggested change:

```go
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks)-1 {
```

Because tests didn't fail, we need to add a test case such that `return s.headChunks[ix], false, nil` would have panicked.
```go
totalSamples += d.chunk.NumSamples()
if j < ix {
	previousSamples += d.chunk.NumSamples()
}
```

We can call d.chunk.NumSamples() only once by assigning it to a variable. It actually reads the chunk bytes and decodes the number.
```go
}()

const chunkRange = 2000

type testCaseT struct {
	chunkRange int
```

chunkRange is the same for all cases, so we can remove this field from here.
```go
if tc.mmappedSamples > 0 {
	for i := 0; i < tc.mmappedSamples; i += 5 {
		ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
		require.True(t, ok, "sample append failed")
	}
	s.mmapHeadChunks(chunkDiskMapper)
	require.Greater(t, len(s.mmappedChunks), 0)
}

s.truncateChunksBefore(2000, 0)
// append samples that will stay in the head
if tc.headSamples > 0 {
	for i := tc.mmappedSamples; i < tc.mmappedSamples+tc.headSamples; i += 5 {
		ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
		require.True(t, ok, "sample append failed")
	}
	require.Greater(t, len(s.headChunks), 0)
}
```

You can simplify this by having a single loop from 0 to tc.headSamples with a condition inside like:

```go
if i == tc.mmappedSamples {
	// mmap
	// check for >0
}
```

(While you do that, can you change the comments to start with a capital letter where possible and end with a `.`?)
```go
require.NoError(t, err)
require.Equal(t, lastChunk, chk)

// verify that we can read
```

Suggested change:

```go
// Verify that we can read.
```
/prombench main

One last benchmark after some recent changes.
/prombench restart main
/prombench cancel
Benchmark cancel is in progress.
/prombench main

Some CI issue earlier and I messed up the CI by restarting it. Trying again.
Prombench start CI failed again with the same error: But we still have a load of around 1m series. So will let it run for a while.
There is a sustained 5% increase in memory RSS; I believe it is from the switch to a slice of headChunks and other factors. CPU usage is also slightly up. (link to prombench dashboard here) I would like to borrow @bboreham's expertise here. @bboreham: this basically helps with the latency spikes described in #10377. Apparently, the async mmap queue was not helpful. This PR triggers the mmap of in-memory chunks in the background by holding a slice of in-memory chunks per series. Does the memory/CPU increase look concerning, or is there a way to optimize it?
/prombench cancel

I think we have enough data for now.
Benchmark cancel is in progress.
A slice needs an extra 24 bytes if I recall. That's 24MB per 1 million time series. But at the same time there was another slice added for out-of-order chunks, and we also have another slice for mmapped chunks. An obvious saving would be if we managed to keep all the different chunks on a single slice, regardless of whether they are head, mmapped or OoO chunks. That would require a lot more changes, and there are bits of code that I found difficult to get my head around quickly, as they need to get the right chunk from a passed offset (see

The chunk write queue and other alternative solutions are likely to eat up some extra memory too, but for those it might be a bit harder to calculate the exact memory usage growth due to their more dynamic nature. Plus with queues you don't pay that extra memory usage all the time (except for the base structure to have the queue running), so from a purely "bytes paid" PoV the slice is more of a small but permanent cost that scales with the number of time series.
Since you changed a single pointer to a slice, it should be +16 bytes per series. (When looking at RSS this number is doubled by Go's garbage collector.) Would it work as a linked list? I have pondered making that change for
I cannot think of any reason why it wouldn't. Reading from memSeries would become a bit more complicated, since referencing chunks via
I think in practice those indices are only ever used from an iterator stepping through the set of chunks, but the code assumes this can be effectively expressed by an integer.
Here's a linked list version of this PR - #11818
Closing this one, there's #11818 instead.
Mmapping chunks is expensive, so we should keep it off the hot path to avoid blocking scrapes and rule evaluations.
This change allows TSDB to keep the most recent chunks in memory and write & mmap them on a schedule via a background goroutine.
Signed-off-by: Łukasz Mierzwa l.mierzwa@gmail.com
This is an attempt to address #10377, early testing shows that it does in fact resolve most of the issue for me.
@codesome / @replay