
Write & mmap chunks in the background#10709

Closed
prymitive wants to merge 6 commits into prometheus:main from prymitive:head_chunks

Conversation

@prymitive
Contributor

Mmapping chunks is expensive, so we should keep it off the hot path to avoid blocking scrapes and rule evaluations.
This change allows TSDB to keep the most recent chunks in memory and write & mmap them on a schedule from a background goroutine.

Signed-off-by: Łukasz Mierzwa l.mierzwa@gmail.com

This is an attempt to address #10377, early testing shows that it does in fact resolve most of the issue for me.

@codesome / @replay

@prymitive
Contributor Author

Screenshot from 2022-05-17 17-26-00
Screenshot from 2022-05-17 17-25-45
Screenshot from 2022-05-17 17-25-09
Screenshot from 2022-05-17 17-24-42

@roidelapluie
Member

/prombench main

@prombot
Contributor

prombot commented May 17, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@prymitive
Contributor Author

FYI the main improvement here, I think, is the way writes are scheduled.
There's now one big loop that iterates over all chunks, locks each one, and mmaps it. So we have strict ordering, which helps avoid getting blocked and waiting long for a lock.
Before, chunk appends would trigger the mmap, so at some point we would have a huge number of chunks competing for a lock; some of them would get unlucky and be stuck waiting, blocking reads and appends to those chunks.
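The scheduling idea described here can be sketched roughly like this, with illustrative types and names (not the actual Prometheus internals):

```go
package main

import (
	"fmt"
	"sync"
)

// memChunk and memSeries are simplified stand-ins for the Prometheus head
// structures; all names and fields here are illustrative, not the real API.
type memChunk struct {
	samples int
}

type memSeries struct {
	mu         sync.Mutex
	headChunks []*memChunk // oldest first; the last element still receives appends
}

// mmapHeadChunks visits every series in one strict order, holding each series
// lock only long enough to write out that series' full chunks. This mirrors
// the idea above: a single ordered loop instead of many appenders racing for
// the same locks.
func mmapHeadChunks(series []*memSeries) int {
	mmapped := 0
	for _, s := range series {
		s.mu.Lock()
		// All but the last chunk are complete, so they are safe to write & mmap.
		for len(s.headChunks) > 1 {
			// Real code would call chunkDiskMapper.WriteChunk() here.
			s.headChunks = s.headChunks[1:]
			mmapped++
		}
		s.mu.Unlock()
	}
	return mmapped
}

func main() {
	series := []*memSeries{
		{headChunks: []*memChunk{{samples: 120}, {samples: 120}, {samples: 40}}},
		{headChunks: []*memChunk{{samples: 120}, {samples: 60}}},
	}
	fmt.Println("mmapped", mmapHeadChunks(series), "chunks") // mmapped 3 chunks
}
```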

@prymitive
Contributor Author

prymitive commented May 17, 2022

More screenshots with Prometheus metrics after deploying latest patch.

Query and rule evaluation duration did improve for p90:
Screenshot from 2022-05-17 21-14-26
Screenshot from 2022-05-17 21-14-08

And I'm no longer seeing missed evaluations:
Screenshot from 2022-05-17 21-13-47

Sample rate is more stable, before we had a dip when scrapes get blocked followed by a spike as it got unblocked:
Screenshot from 2022-05-17 21-13-06

And rule evaluation duration is now more stable, no more big spikes in duration as rules get blocked:
Screenshot from 2022-05-17 21-12-45

But there is a slight memory increase, which I need to look into and see if there's anything I can do about it:
(green - RSS, yellow - Go alloc)
Screenshot from 2022-05-17 21-12-15

@roidelapluie
Member

Note that the rule until now is that TSDB only uses one core, therefore one goroutine.

@prymitive
Contributor Author

Note that the rule until now is that TSDB only uses one core, therefore one goroutine.

The chunk write queue, which was added a few releases ago and recently disabled after it caused the big performance degradation described in the linked issue, already runs an extra goroutine to process writes.

@brian-brazil
Contributor

Each background thread is 1.33 cores that every user needs to provision, as you need to presume the worst case that they're all active at the same time.

@prymitive
Contributor Author

Each background thread is 1.33 cores that every user needs to provision, as you need to presume the worst case that they're all active at the same time.

As I've mentioned in #10377 (comment), running mmap off a goroutine scheduled every N minutes is arbitrary. The goal was to move this whole process away from the append() path so it doesn't block scrapes & rules. I wasn't sure what the best moment to mmap extra chunks is, so I went with a background goroutine. Happy to remove this goroutine, but then I'm not sure where & when mmap should be triggered (assuming the rest of the patch can stay).

@brian-brazil
Contributor

Could we do it at the previous head compaction? After the very first, we should have a good idea of how many we'll need.

@prymitive
Contributor Author

prymitive commented May 18, 2022

I can move it here -

func (db *DB) compactHead(head *RangeHead) error {

But one thing to consider here, I think, is that mmapping of chunks was added to lower memory usage. If we only mmap extra chunks before compaction, which seems to happen before writing out blocks (so every 2h by default?), we'll have more in-memory (not mmapped) chunks and so higher memory usage than at the moment.
If we mmap more often we should be able to keep memory usage closer to current levels.

@prymitive prymitive force-pushed the head_chunks branch 4 times, most recently from 82ffc60 to 471eab5 Compare May 18, 2022 11:18
@prymitive
Contributor Author

I wonder if mmap should also be forced in ChunkSnapshot() -

func (h *Head) ChunkSnapshot() (*ChunkSnapshotStats, error) {

@prymitive
Contributor Author

Deployed the latest patch to another instance and it also shows improvement. No longer seeing rule evaluation duration spikes leading to missed iterations.
Memory usage looks similar to vanilla v2.35.0.

Screenshot from 2022-05-18 14-54-01
Screenshot from 2022-05-18 14-53-38
Screenshot from 2022-05-18 14-53-19
Screenshot from 2022-05-18 14-52-59
Screenshot from 2022-05-18 14-52-21

@roidelapluie
Member

/prombench cancel

@prombot
Contributor

prombot commented May 18, 2022

Benchmark cancel is in progress.

Member

@codesome codesome left a comment


We already have an async queue that does this (see #10051). It works but had a bottleneck somewhere on locks.

I have not read your PR in detail yet, but I see that you have a loop to go over all series and m-map them. I don't think we want to m-map all series at the same time. Also, I would avoid making headChunk a slice; it is simpler to work with a single chunk here.

If it interests you, would you like to take a look at the existing queue and see how we can improve it? (We disabled it by default here #10425)

(Edit: I see that you reported the bottlenecks in #10377)

@codesome
Member

codesome commented Jun 4, 2022

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

@prymitive
Contributor Author

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

Enabling the chunk_write_queue causes severe degradation, which is why #10377 was raised. The patch for queue locks didn't show any noticeable performance improvement for my workload.
Without the queue enabled, the servers I run that have 20+ million time series show a one- to two-minute-long performance issue as new chunks are being mmapped. Enabling the queue turns that into a 5+ minute window of poor performance.

I did dig a bit more into both the code and the history of changes around mmapping chunks, and realized that the queue added in #10051 was aimed at fixing the exact issue I'm observing, but it seems it was focused on fixing remote-write problems, so it might have improved writes with the unfortunate side effect of degrading reads under some conditions.

I see this patch as an alternative solution to the same problem. In both cases you keep chunks in memory until you're ready to mmap them. The difference is that the queue variant moves in-memory chunks to a new single component with additional locking that might have new/different bottlenecks, while I was trying to keep everything as parallel and distributed as it is (since that works well) and do the mmapping in a more ordered and controllable way.
With the queue, we hit an hour on the wall clock and all in-memory chunks are suddenly sent to the queue, where they fight for locks. With this patch there's a loop that writes one chunk at a time, minimising any potential for lock congestion.

@prymitive
Contributor Author

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

I think that the queue doesn't really help here because it calls WriteChunk(), which then needs to hold a lock

cdm.evtlPosMtx.Lock()
defer cdm.evtlPosMtx.Unlock()

which essentially serialises all writes. Without the queue this path is more expensive, since we need to do the actual write, but it seems that this code path is expensive enough even if we just queue the work rather than doing a full write.

When a tsdb block needs to be written and we suddenly need to create a new chunk for every time series being scraped and every rule being evaluated, we hit a race window. Let's say scrapes and rules have a 1 minute interval. We hit the tsdb block max duration and start creating new chunks for everything. A rule evaluation produces a sample, or a scrape returns one, and we need to append it.
As we call append(), tsdb cuts a new chunk.
Since cutting a new chunk is blocking (because of locks like the one mentioned above), we are suddenly against the clock: our append is potentially blocked by appends on other chunks.
If we get unblocked within a minute then everything works, but if our append is blocked for more than a minute then we've missed the scrape/rule interval.
And that's just the write path; there are also reads to consider.
Our chunk might need to be read for queries or rule evaluations, and if it's blocked for more than 2 minutes then any query using it will time out.
So it's very important to keep any blocking under the scrape/rule interval and the query timeout, otherwise we can't keep up.

The problem I'm hitting is that I have an instance with 20-26M time series. Since WriteChunk is essentially serialised, the more chunks I have in Prometheus the longer any blocking will last. This is made worse the more rules I have that use chunks produced by other rules. So eventually I hit the 1 minute (rule interval) and 2 minute (query timeout) marks and see failures.

Since the problem is appends getting blocked, which only happens when we need to cut new chunks, and the queue doesn't seem to be enough to remediate this issue, I've decided that the easiest way to deal with it is to get rid of blocking on the append() path. If we don't need to block and wait for the mmap when appending, then appends always have the same performance level.

Now, since mmapping of chunks was added to lower memory usage, we still want to mmap it all, but this patch moves that work to run around GC time, as suggested by @brian-brazil.
Mmapping is done in a loop, one chunk at a time. This has the advantage that we minimise any blocking: since we mmap one chunk at a time, the lock in WriteChunk is unlikely to be held, and even if it is, that's far fewer chunks contending for it than during the minute when the tsdb block is written and all time series need new chunks.
It was also mentioned that we want to avoid having more than a single goroutine for tsdb, and this patch does add one, but the queue code also needs to spin up an additional goroutine to service writes, which seems to violate the same requirement.

@codesome
Member

Oh, I see it now. We will trigger the function every time Compact() is called, and it does not depend on a compaction actually happening.

I would suggest moving that call inside DB.run(), in the reloadBlocks() case. It would be good not to depend on the compaction call cycle and to reliably trigger it every minute. If the frequency turns out to be too high, we can always tweak it in the future.

I will review the rest of the code soon.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing review comment

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing PR review comments

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Member

@codesome codesome left a comment


Nice work! I just have a bunch of small comments now.

Comment on lines +1110 to +1114
if err := db.compactBlocks(); err != nil {
return err
}

return nil
Member

This change is not needed now.

require.NoError(t, app.Commit())
maxt++
}
db.head.mmapHeadChunks()
Member

The db.Close() below will end up calling this, so it's not required here, I guess?

Comment on lines +903 to +906
// Save memory by mmapping all but last chunk for all time series with multiple chunks'
// stored in series.headChunks.
// We need to keep last chunk around so it can receive appends, but every other chunk
// holds older samples and so is only used for reads and so can be safely mmapped.
Member

I see comments like this spread around everywhere. I suggest removing this comment and instead adding a comment on the mmapHeadChunks function explaining what it is doing and why.

This comment could be replaced by

Suggested change
// Save memory by mmapping all but last chunk for all time series with multiple chunks'
// stored in series.headChunks.
// We need to keep last chunk around so it can receive appends, but every other chunk
// holds older samples and so is only used for reads and so can be safely mmapped.
// We attempt mmapping of head chunks regularly.

Comment on lines +1571 to +1573
if mmapped > 0 {
level.Info(h.logger).Log("msg", "Finished mmapping head chunks", "chunks", mmapped, "duration", time.Since(start).String())
}
Member

I don't think we need this log line. I can see this becoming spammy very soon (a large setup with something being m-mapped every minute).

Member

Alternatively, we can make it a debug log

Contributor Author

I was thinking of turning this into a metric instead

// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
mmappedChunks []*mmappedChunk
headChunk *memChunk // Most recent chunk in memory that's still being built.
headChunks []*memChunk // In-memory chunks not yet written & mmapped, including the last chunk that's still being built.
Member

Suggested change
headChunks []*memChunk // In-memory chunks not yet written & mmapped, including the last chunk that's still being built.
headChunks []*memChunk // In-memory chunks not yet mmapped, including the last chunk that's still being built.

// is len(s.mmappedChunks), it represents the next chunk, which is the head chunk.
ix := int(id) - int(s.firstChunkID)
if ix < 0 || ix > len(s.mmappedChunks) {
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks) {
Member

ix is 0 based. Previously len(s.mmappedChunks) meant the last head chunk. Now len(s.mmappedChunks)+len(s.headChunks)-1 is the last chunk.

Suggested change
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks) {
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks)-1 {

Because tests didn't fail, we need to add a test case such that return s.headChunks[ix], false, nil would have panicked.
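To illustrate the off-by-one, here is a standalone sketch of the two bounds checks using plain ints instead of the real slices (function names are invented for the illustration):

```go
package main

import "fmt"

// validIndexBuggy mirrors the bounds check under review: ix is a 0-based
// index across mmapped chunks followed by head chunks, so the last valid
// index is nMmapped+nHead-1. Comparing ix against the plain sum accepts one
// index past the end.
func validIndexBuggy(ix, nMmapped, nHead int) bool {
	return !(ix < 0 || ix > nMmapped+nHead)
}

// validIndexFixed applies the suggested -1 correction.
func validIndexFixed(ix, nMmapped, nHead int) bool {
	return !(ix < 0 || ix > nMmapped+nHead-1)
}

func main() {
	// With 2 mmapped chunks and 1 head chunk, valid indices are 0..2.
	fmt.Println(validIndexBuggy(3, 2, 1)) // true: would panic on headChunks[ix]
	fmt.Println(validIndexFixed(3, 2, 1)) // false: correctly rejected
}
```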

Comment on lines +658 to +661
totalSamples += d.chunk.NumSamples()
if j < ix {
previousSamples += d.chunk.NumSamples()
}
Member

We can call d.chunk.NumSamples() only once by assigning it to a variable. It actually reads the chunk bytes and decodes the number.

}()
const chunkRange = 2000
type testCaseT struct {
chunkRange int
Member

chunkRange is the same for all cases. So we can remove this field from here.

Comment on lines +789 to +805
if tc.mmappedSamples > 0 {
for i := 0; i < tc.mmappedSamples; i += 5 {
ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
require.True(t, ok, "sample append failed")
}
s.mmapHeadChunks(chunkDiskMapper)
require.Greater(t, len(s.mmappedChunks), 0)
}

s.truncateChunksBefore(2000, 0)
// append samples that will stay in the head
if tc.headSamples > 0 {
for i := tc.mmappedSamples; i < tc.mmappedSamples+tc.headSamples; i += 5 {
ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
require.True(t, ok, "sample append failed")
}
require.Greater(t, len(s.headChunks), 0)
}
Member

You can simplify this by having a loop from 0 to tc.headSamples with a condition inside like

if i == tc.mmappedSamples {
    // mmap
    // check for >0
} 

(while you do that, can you change the comment to start with a capital letter where possible and end with a .?)

require.NoError(t, err)
require.Equal(t, lastChunk, chk)

// verify that we can read
Member

Suggested change
// verify that we can read
// Verify that we can read.

@codesome
Member

/prombench main

One last benchmark after some recent changes.

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

/prombench restart main

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

/prombench cancel

@prombot
Contributor

prombot commented Dec 23, 2022

Benchmark cancel is in progress.

@codesome
Member

/prombench main

Some CI issue earlier and I messed up the CI by restarting it. Trying again.

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

Prombench start CI failed again with the same error:

09:21:36 gke.go:594: error while applying a resource err:error applying 'manifests/prombench/benchmark/6_loadgen.yaml' err:Request for 'applying deployment:prometheus-loadgen-querier' hasn't completed after retrying 50 times

But we still have a load of around 1M series, so I will let it run for a while.

@codesome
Member

codesome commented Dec 23, 2022

There is a sustained 5% increase in the memory RSS, I believe it is the switch to a slice of headChunks and other factors. CPU usage is also slightly up. (link to prombench dashboard here)

I would like to borrow @bboreham's expertise here. @bboreham: This basically helps with latency spikes as described in #10377. Apparently, the async mmap queue was not helpful. This PR triggers the mmap of in-memory chunks in the background by holding a slice of in-memory chunks per series. Does the memory/CPU increase look concerning or is there a way to optimize it?

@codesome
Member

/prombench cancel

I think we have enough data for now.

@prombot
Contributor

prombot commented Dec 23, 2022

Benchmark cancel is in progress.

@prymitive
Contributor Author

prymitive commented Dec 23, 2022

There is a sustained 5% increase in the memory RSS, I believe it is the switch to a slice of headChunks and other factors.

A slice needs an extra 24 bytes if I recall correctly. That's 24MB per 1 million time series.
Last time I checked (#11003 (comment)) memSeries was using 224 bytes, so an extra 24 bytes is over 10% more just for baseline memory usage.
That being said, most memory in memSeries will be eaten by samples and labels. In my setup I usually see around 3.2KB per memSeries, and against that number 24 bytes is a <1% memory usage bump. Real-world overhead depends on how big your memSeries really are.
But the bottom line is obvious here: having a slice uses more memory.

But at the same time another slice was added for out-of-order chunks, and we also have a slice for mmapped chunks. An obvious saving would be to keep all the different chunks on a single slice, regardless of whether they are head, mmapped or OOO chunks. That would require a lot more changes, but there are bits of code I found difficult to get my head around quickly, as they need to pick the right chunk from a passed offset (see tsdb/head_read.go changes), and having a single slice could potentially simplify some of that, which would be great.

The chunk write queue and other alternative solutions are likely to eat some extra memory too, but their exact memory usage growth is harder to calculate due to their more dynamic nature. Plus, with queues you don't pay that extra memory all the time (beyond the base structure needed to keep the queue running), so from a purely "bytes paid" PoV the slice is a small but permanent cost that scales with the number of time series.
The upside is that it removes the blocking behaviour from the hot path. Another solution could be to keep two pointers on memSeries: one for the current head chunk and one for the previous one that should be mmapped. That only adds 8 bytes. The reason I went for the slice is that it felt safer: with two pointers I need to ensure the "previous" chunk is mmapped before the current head chunk gets full, otherwise we need to block again; there's no such worry with a slice.

@bboreham
Member

Since you changed a single pointer to a slice, it should be +16 bytes per series. (When looking at RSS this number is doubled by Go's garbage collector.)

Would it work as a linked list? I have pondered making that change for mmappedChunks and oooMmappedChunks.

@prymitive
Contributor Author

Would it work as a linked list?

I cannot think of any reason why it wouldn't. Reading from memSeries would become a bit more complicated, since referencing chunks via series.headChunks[3] wouldn't be possible and we would have to iterate over the list until we get to the 4th chunk. But that's the main complication I can think of. No idea how that would impact read performance.
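A minimal sketch of what indexed access over a linked list of chunks might look like (illustrative types, not the real memChunk):

```go
package main

import "fmt"

// memChunk sketches the linked-list layout discussed here: each chunk points
// at the previous (older) chunk, so indexed access becomes iteration.
type memChunk struct {
	minTime, maxTime int64
	prev             *memChunk // older chunk; nil for the oldest
}

// atIndex walks back from the newest chunk to the chunk at 0-based position
// ix counted from the oldest, returning nil if ix is out of range.
func atIndex(newest *memChunk, ix int) *memChunk {
	// First find the list length.
	n := 0
	for c := newest; c != nil; c = c.prev {
		n++
	}
	if ix < 0 || ix >= n {
		return nil
	}
	// Element ix from the oldest is n-1-ix steps back from the newest.
	c := newest
	for steps := n - 1 - ix; steps > 0; steps-- {
		c = c.prev
	}
	return c
}

func main() {
	oldest := &memChunk{minTime: 0, maxTime: 100}
	mid := &memChunk{minTime: 100, maxTime: 200, prev: oldest}
	newest := &memChunk{minTime: 200, maxTime: 300, prev: mid}
	fmt.Println(atIndex(newest, 0).minTime) // 0 (oldest chunk)
	fmt.Println(atIndex(newest, 2).minTime) // 200 (newest chunk)
}
```

Indexed access is O(n) per lookup here, which is why the read-performance impact is worth measuring; as noted below, in practice the indices come from an iterator stepping through the chunks anyway.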

@bboreham
Member

I think in practice those indices are only ever used from an iterator stepping through the set of chunks, but the code assumes this can be effectively expressed by an integer.

@prymitive
Contributor Author

Here's a linked list version of this PR - #11818

@prymitive
Contributor Author

Closing this one, there's #11818 instead

