
Write & mmap chunks in the background#10709

Closed
prymitive wants to merge 6 commits into prometheus:main from prymitive:head_chunks

Conversation

@prymitive
Contributor

Mmapping chunks is expensive, so we should keep it off the hot path to avoid blocking scrapes and rule evaluations.
This change allows TSDB to keep the most recent chunks in memory and write & mmap them on a schedule from a background goroutine.

Signed-off-by: Łukasz Mierzwa l.mierzwa@gmail.com

This is an attempt to address #10377, early testing shows that it does in fact resolve most of the issue for me.

@codesome / @replay

@prymitive
Contributor Author

Screenshot from 2022-05-17 17-26-00
Screenshot from 2022-05-17 17-25-45
Screenshot from 2022-05-17 17-25-09
Screenshot from 2022-05-17 17-24-42

@roidelapluie
Member

/prombench main

@prombot
Contributor

prombot commented May 17, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@prymitive
Contributor Author

FYI the main improvement here, I think, is the way writes are scheduled.
There's now one big loop that iterates over all chunks, locks each one, and mmaps it. So we have strict ordering, which helps avoid getting blocked and waiting long for a lock.
Before, chunk appends would trigger the mmap, so at some point we would have a huge number of chunks competing for a lock; some of them would get unlucky and be stuck waiting, blocking reads and appends to those chunks.
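The scheduling idea described here can be sketched roughly like this, with illustrative types and names (not the actual Prometheus internals):

```go
package main

import (
	"fmt"
	"sync"
)

// memChunk and memSeries are simplified stand-ins for the Prometheus head
// structures; all names and fields here are illustrative, not the real API.
type memChunk struct {
	samples int
}

type memSeries struct {
	mu         sync.Mutex
	headChunks []*memChunk // oldest first; the last element still receives appends
}

// mmapHeadChunks visits every series in one strict order, holding each series
// lock only long enough to write out that series' full chunks. This mirrors
// the idea above: a single ordered loop instead of many appenders racing for
// the same locks.
func mmapHeadChunks(series []*memSeries) int {
	mmapped := 0
	for _, s := range series {
		s.mu.Lock()
		// All but the last chunk are complete, so they are safe to write & mmap.
		for len(s.headChunks) > 1 {
			// Real code would call chunkDiskMapper.WriteChunk() here.
			s.headChunks = s.headChunks[1:]
			mmapped++
		}
		s.mu.Unlock()
	}
	return mmapped
}

func main() {
	series := []*memSeries{
		{headChunks: []*memChunk{{samples: 120}, {samples: 120}, {samples: 40}}},
		{headChunks: []*memChunk{{samples: 120}, {samples: 60}}},
	}
	fmt.Println("mmapped", mmapHeadChunks(series), "chunks") // mmapped 3 chunks
}
```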

@prymitive
Contributor Author

prymitive commented May 17, 2022

More screenshots with Prometheus metrics after deploying latest patch.

Query and rule evaluation duration did improve for p90:
Screenshot from 2022-05-17 21-14-26
Screenshot from 2022-05-17 21-14-08

And I'm no longer seeing missed evaluations:
Screenshot from 2022-05-17 21-13-47

Sample rate is more stable, before we had a dip when scrapes get blocked followed by a spike as it got unblocked:
Screenshot from 2022-05-17 21-13-06

And rule evaluation duration is now more stable, no more big spikes in duration as rules get blocked:
Screenshot from 2022-05-17 21-12-45

But there is a slight memory increase, which I need to look into and see if there's anything I can do about it:
(green - RSS, yellow - Go alloc)
Screenshot from 2022-05-17 21-12-15

@roidelapluie
Member

Note that the rule until now is that TSDB only uses one core, therefore one goroutine.

@prymitive
Contributor Author

Note that the rule until now is that TSDB only uses one core, therefore one goroutine.

The chunk write queue, which was added a few releases ago and recently disabled after it caused the big performance degradation described in the linked issue, already runs an extra goroutine to process writes.

@brian-brazil
Contributor

Each background thread is 1.33 cores that every user needs to provision, as you need to presume the worst case that they're all active at the same time.

@prymitive
Contributor Author

Each background thread is 1.33 cores that every user needs to provision, as you need to presume the worst case that they're all active at the same time.

As I've mentioned in #10377 (comment), running mmap off a goroutine scheduled every N minutes is arbitrary. The goal was to move this whole process away from the append() path so it doesn't block scrapes & rules. I wasn't sure what the best moment to mmap extra chunks is, so I went with a background goroutine. Happy to remove this goroutine, but then I'm not sure where & when mmap should be triggered (assuming the rest of the patch can stay).

@brian-brazil
Contributor

Could we do it at the previous head compaction? After the very first, we should have a good idea of how many we'll need.

@prymitive
Contributor Author

prymitive commented May 18, 2022

I can move it here -

func (db *DB) compactHead(head *RangeHead) error {

But one thing to consider here, I think, is that mmapping of chunks was added to lower memory usage. If we only mmap extra chunks before compaction, which seems to happen before writing out blocks (so every 2h by default?), we'll have more in-memory (not mmapped) chunks and so higher memory usage than at the moment.
If we mmap more often we should be able to keep memory usage closer to current levels.

@prymitive prymitive force-pushed the head_chunks branch 4 times, most recently from 82ffc60 to 471eab5 Compare May 18, 2022 11:18
@prymitive
Contributor Author

I wonder if mmap should also be forced in ChunkSnapshot() -

func (h *Head) ChunkSnapshot() (*ChunkSnapshotStats, error) {

@prymitive
Contributor Author

Deployed the latest patch to another instance and it also shows improvement. No longer seeing rule evaluation duration spikes leading to missed iterations.
Memory usage looks similar to vanilla v2.35.0.

Screenshot from 2022-05-18 14-54-01
Screenshot from 2022-05-18 14-53-38
Screenshot from 2022-05-18 14-53-19
Screenshot from 2022-05-18 14-52-59
Screenshot from 2022-05-18 14-52-21

@roidelapluie
Member

/prombench cancel

@prombot
Contributor

prombot commented May 18, 2022

Benchmark cancel is in progress.

Member

@codesome codesome left a comment


We already have an async queue that does this (see #10051). It works but had a bottleneck somewhere on locks.

I have not read your PR in detail yet, but I see that you have a loop to go over all series and m-map them. I don't think we want to m-map all series at the same time. Also, I would avoid making headChunk a slice; it is simpler to work with a single chunk here.

If it interests you, would you like to take a look at the existing queue and see how we can improve it? (We disabled it by default here #10425)

(Edit: I see that you reported the bottlenecks in #10377)

@codesome
Member

codesome commented Jun 4, 2022

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

@prymitive
Contributor Author

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

Enabling the chunk_write_queue causes severe degradation, which is why #10377 was raised. The patch for queue locks didn't show any noticeable performance improvement for my workload.
Without the queue enabled, the servers I run that have 20+ million time series show a one- to two-minute-long performance issue as new chunks are being mmapped. Enabling the queue turns that into a 5+ minute window of poor performance.

I did dig a bit more into both the code and the history of changes around mmapping chunks, and realized that the queue added in #10051 was aimed at fixing the exact issue I'm observing, but it seems it was focused on fixing remote-write problems, so it might have improved writes with the unfortunate side effect of degrading reads under some conditions.

I see this patch as an alternative solution to the same problem. In both cases you keep chunks in memory until you're ready to mmap them. The difference is that the queue variant moves in-memory chunks to a new single component with additional locking that might have new/different bottlenecks, while I was trying to keep everything as parallel and distributed as it is (since that works well) and do the mmapping in a more ordered and controllable way.
With the queue, we hit an hour on the wall clock and all in-memory chunks are suddenly sent to the queue, where they fight for locks. With this patch there's a loop that writes one chunk at a time, minimising any potential for lock congestion.

@prymitive
Contributor Author

I see that you have already done some investigation in #10377. Did you have the queue enabled? Did this patch with the queue make the situation better?

I think that the queue doesn't really help here because it calls WriteChunk(), which then needs to hold a lock

cdm.evtlPosMtx.Lock()
defer cdm.evtlPosMtx.Unlock()

which essentially serialises all writes. Without the queue this path is more expensive, since we need to do the actual write, but it seems that this code path is expensive enough even if we just queue the work rather than doing a full write.

When a tsdb block needs to be written and we suddenly need to create a new chunk for every time series being scraped and every rule being evaluated, we hit a race window. Let's say scrapes and rules have a 1 minute interval. We hit the tsdb block max duration and start creating new chunks for everything. A rule evaluation produces a sample, or a scrape returns one, and we need to append it.
As we call append(), tsdb cuts a new chunk.
Since cutting a new chunk is blocking (because of locks like the one mentioned above), we are suddenly against the clock: our append is potentially blocked by appends on other chunks.
If we get unblocked within a minute then everything works, but if our append is blocked for more than a minute then we've missed the scrape/rule interval.
And that's just the write path; there are also reads to consider.
Our chunk might need to be read for queries or rule evaluations, and if it's blocked for more than 2 minutes then any query using it will time out.
So it's very important to keep any blocking under the scrape/rule interval and the query timeout, otherwise we can't keep up.

The problem I'm hitting is that I have an instance with 20-26M time series. Since WriteChunk is essentially serialised, the more chunks I have in Prometheus the longer any blocking will last. This is made worse the more rules I have that use chunks produced by other rules. So eventually I hit the 1 minute (rule interval) and 2 minute (query timeout) marks and see failures.

Since the problem is appends getting blocked, which only happens when we need to cut new chunks, and the queue doesn't seem to be enough to remediate this issue, I've decided that the easiest way to deal with it is to get rid of blocking on the append() path. If we don't need to block and wait for the mmap when appending, then appends always have the same performance level.

Now, since mmapping of chunks was added to lower memory usage, we still want to mmap it all, but this patch moves that work to run around GC time, as suggested by @brian-brazil.
Mmapping is done in a loop, one chunk at a time. This has the advantage that we minimise any blocking: since we mmap one chunk at a time, the lock in WriteChunk is unlikely to be held, and even if it is, that's far fewer chunks contending for it than during the minute when the tsdb block is written and all time series need new chunks.
It was also mentioned that we want to avoid having more than a single goroutine for tsdb, and this patch does add one, but the queue code also needs to spin up an additional goroutine to service writes, which seems to violate the same requirement.

@codesome
Member

Oh, I see it now. We will trigger the function every time Compact() is called, and it does not depend on a compaction actually happening.

I would suggest moving that call inside DB.run(), in the reloadBlocks() case. It would be good not to depend on the compaction call cycle and to reliably trigger it every minute. If the frequency turns out to be too high, we can always tweak it in the future.

I will review the rest of the code soon.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing review comment

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Addressing PR review comments

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Member

@codesome codesome left a comment


Nice work! I just have a bunch of small comments now.

Comment on lines +1110 to +1114
if err := db.compactBlocks(); err != nil {
return err
}

return nil
Member

This change is not needed now.

require.NoError(t, app.Commit())
maxt++
}
db.head.mmapHeadChunks()
Member

The db.Close() below will end up calling this, so it's not required here, I guess?

Comment on lines +903 to +906
// Save memory by mmapping all but last chunk for all time series with multiple chunks'
// stored in series.headChunks.
// We need to keep last chunk around so it can receive appends, but every other chunk
// holds older samples and so is only used for reads and so can be safely mmapped.
Member

I see comments like this spread around everywhere. I suggest removing this comment and instead adding a comment on the mmapHeadChunks function explaining what it is doing and why.

This comment could be replaced by

Suggested change
// Save memory by mmapping all but last chunk for all time series with multiple chunks'
// stored in series.headChunks.
// We need to keep last chunk around so it can receive appends, but every other chunk
// holds older samples and so is only used for reads and so can be safely mmapped.
// We attempt mmapping of head chunks regularly.

Comment on lines +1571 to +1573
if mmapped > 0 {
level.Info(h.logger).Log("msg", "Finished mmapping head chunks", "chunks", mmapped, "duration", time.Since(start).String())
}
Member

I don't think we need this log line. I can see this becoming spammy very soon (a large setup with something being m-mapped every minute).

Member

Alternatively, we can make it a debug log

Contributor Author

I was thinking of turning this into a metric instead

// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
mmappedChunks []*mmappedChunk
headChunk *memChunk // Most recent chunk in memory that's still being built.
headChunks []*memChunk // In-memory chunks not yet written & mmapped, including the last chunk that's still being built.
Member

Suggested change
headChunks []*memChunk // In-memory chunks not yet written & mmapped, including the last chunk that's still being built.
headChunks []*memChunk // In-memory chunks not yet mmapped, including the last chunk that's still being built.

// is len(s.mmappedChunks), it represents the next chunk, which is the head chunk.
ix := int(id) - int(s.firstChunkID)
if ix < 0 || ix > len(s.mmappedChunks) {
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks) {
Member

ix is 0 based. Previously len(s.mmappedChunks) meant the last head chunk. Now len(s.mmappedChunks)+len(s.headChunks)-1 is the last chunk.

Suggested change
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks) {
if ix < 0 || ix > len(s.mmappedChunks)+len(s.headChunks)-1 {

Because tests didn't fail, we need to add a test case such that return s.headChunks[ix], false, nil would have panicked.
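To illustrate the off-by-one, here is a standalone sketch of the two bounds checks using plain ints instead of the real slices (function names are invented for the illustration):

```go
package main

import "fmt"

// validIndexBuggy mirrors the bounds check under review: ix is a 0-based
// index across mmapped chunks followed by head chunks, so the last valid
// index is nMmapped+nHead-1. Comparing ix against the plain sum accepts one
// index past the end.
func validIndexBuggy(ix, nMmapped, nHead int) bool {
	return !(ix < 0 || ix > nMmapped+nHead)
}

// validIndexFixed applies the suggested -1 correction.
func validIndexFixed(ix, nMmapped, nHead int) bool {
	return !(ix < 0 || ix > nMmapped+nHead-1)
}

func main() {
	// With 2 mmapped chunks and 1 head chunk, valid indices are 0..2.
	fmt.Println(validIndexBuggy(3, 2, 1)) // true: would panic on headChunks[ix]
	fmt.Println(validIndexFixed(3, 2, 1)) // false: correctly rejected
}
```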

Comment on lines +658 to +661
totalSamples += d.chunk.NumSamples()
if j < ix {
previousSamples += d.chunk.NumSamples()
}
Member

We can call d.chunk.NumSamples() only once by assigning it to a variable. It actually reads the chunk bytes and decodes the number.

}()
const chunkRange = 2000
type testCaseT struct {
chunkRange int
Member

chunkRange is the same for all cases. So we can remove this field from here.

Comment on lines +789 to +805
if tc.mmappedSamples > 0 {
for i := 0; i < tc.mmappedSamples; i += 5 {
ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
require.True(t, ok, "sample append failed")
}
s.mmapHeadChunks(chunkDiskMapper)
require.Greater(t, len(s.mmappedChunks), 0)
}

s.truncateChunksBefore(2000, 0)
// append samples that will stay in the head
if tc.headSamples > 0 {
for i := tc.mmappedSamples; i < tc.mmappedSamples+tc.headSamples; i += 5 {
ok, _ := s.append(int64(i), float64(i), 0, chunkDiskMapper, int64(tc.chunkRange))
require.True(t, ok, "sample append failed")
}
require.Greater(t, len(s.headChunks), 0)
}
Member

You can simplify this by having a loop from 0 to tc.headSamples with a condition inside like

if i == tc.mmappedSamples {
    // mmap
    // check for >0
} 

(while you do that, can you change the comment to start with a capital letter where possible and end with a .?)

require.NoError(t, err)
require.Equal(t, lastChunk, chk)

// verify that we can read
Member

Suggested change
// verify that we can read
// Verify that we can read.

@codesome
Member

/prombench main

One last benchmark after some recent changes.

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

/prombench restart main

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

/prombench cancel

@prombot
Contributor

prombot commented Dec 23, 2022

Benchmark cancel is in progress.

@codesome
Member

/prombench main

Some CI issue earlier and I messed up the CI by restarting it. Trying again.

@prombot
Contributor

prombot commented Dec 23, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10709 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

Prombench start CI failed again with the same error:

09:21:36 gke.go:594: error while applying a resource err:error applying 'manifests/prombench/benchmark/6_loadgen.yaml' err:Request for 'applying deployment:prometheus-loadgen-querier' hasn't completed after retrying 50 times

But we still have a load of around 1M series, so I will let it run for a while.

@codesome
Member

codesome commented Dec 23, 2022

There is a sustained 5% increase in the memory RSS, I believe it is the switch to a slice of headChunks and other factors. CPU usage is also slightly up. (link to prombench dashboard here)

I would like to borrow @bboreham's expertise here. @bboreham: This basically helps with latency spikes as described in #10377. Apparently, the async mmap queue was not helpful. This PR triggers the mmap of in-memory chunks in the background by holding a slice of in-memory chunks per series. Does the memory/CPU increase look concerning or is there a way to optimize it?

@codesome
Member

/prombench cancel

I think we have enough data for now.

@prombot
Contributor

prombot commented Dec 23, 2022

Benchmark cancel is in progress.

@prymitive
Contributor Author

prymitive commented Dec 23, 2022

There is a sustained 5% increase in the memory RSS, I believe it is the switch to a slice of headChunks and other factors.

A slice needs an extra 24 bytes if I recall correctly. That's 24MB per 1 million time series.
Last time I checked (#11003 (comment)) memSeries was using 224 bytes, so an extra 24 bytes is over 10% more just for baseline memory usage.
That being said, most memory in memSeries will be eaten by samples and labels. In my setup I usually see around 3.2KB per memSeries, and against that number 24 bytes is a <1% memory usage bump. Real-world overhead depends on how big your memSeries really are.
But the bottom line is obvious here: having a slice uses more memory.

But at the same time another slice was added for out-of-order chunks, and we also have a slice for mmapped chunks. An obvious saving would be to keep all the different chunks on a single slice, regardless of whether they are head, mmapped or OOO chunks. That would require a lot more changes, but there are bits of code I found difficult to get my head around quickly, as they need to pick the right chunk from a passed offset (see tsdb/head_read.go changes), and having a single slice could potentially simplify some of that, which would be great.

The chunk write queue and other alternative solutions are likely to eat some extra memory too, but their exact memory usage growth is harder to calculate due to their more dynamic nature. Plus, with queues you don't pay that extra memory all the time (beyond the base structure needed to keep the queue running), so from a purely "bytes paid" PoV the slice is a small but permanent cost that scales with the number of time series.
The upside is that it removes the blocking behaviour from the hot path. Another solution could be to keep two pointers on memSeries: one for the current head chunk and one for the previous one that should be mmapped. That only adds 8 bytes. The reason I went for the slice is that it felt safer: with two pointers I need to ensure the "previous" chunk is mmapped before the current head chunk gets full, otherwise we need to block again; there's no such worry with a slice.

@bboreham
Member

Since you changed a single pointer to a slice, it should be +16 bytes per series. (When looking at RSS this number is doubled by Go's garbage collector.)

Would it work as a linked list? I have pondered making that change for mmappedChunks and oooMmappedChunks.

@prymitive
Contributor Author

Would it work as a linked list?

I cannot think of any reason why it wouldn't. Reading from memSeries would become a bit more complicated, since referencing chunks via series.headChunks[3] wouldn't be possible and we would have to iterate over the list until we get to the 4th chunk. But that's the main complication I can think of. No idea how that would impact read performance.
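A minimal sketch of what indexed access over a linked list of chunks might look like (illustrative types, not the real memChunk):

```go
package main

import "fmt"

// memChunk sketches the linked-list layout discussed here: each chunk points
// at the previous (older) chunk, so indexed access becomes iteration.
type memChunk struct {
	minTime, maxTime int64
	prev             *memChunk // older chunk; nil for the oldest
}

// atIndex walks back from the newest chunk to the chunk at 0-based position
// ix counted from the oldest, returning nil if ix is out of range.
func atIndex(newest *memChunk, ix int) *memChunk {
	// First find the list length.
	n := 0
	for c := newest; c != nil; c = c.prev {
		n++
	}
	if ix < 0 || ix >= n {
		return nil
	}
	// Element ix from the oldest is n-1-ix steps back from the newest.
	c := newest
	for steps := n - 1 - ix; steps > 0; steps-- {
		c = c.prev
	}
	return c
}

func main() {
	oldest := &memChunk{minTime: 0, maxTime: 100}
	mid := &memChunk{minTime: 100, maxTime: 200, prev: oldest}
	newest := &memChunk{minTime: 200, maxTime: 300, prev: mid}
	fmt.Println(atIndex(newest, 0).minTime) // 0 (oldest chunk)
	fmt.Println(atIndex(newest, 2).minTime) // 200 (newest chunk)
}
```

Indexed access is O(n) per lookup here, which is why the read-performance impact is worth measuring; as noted below, in practice the indices come from an iterator stepping through the chunks anyway.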

@bboreham
Member

I think in practice those indices are only ever used from an iterator stepping through the set of chunks, but the code assumes this can be effectively expressed by an integer.

@prymitive
Contributor Author

Here's a linked list version of this PR - #11818

@prymitive
Contributor Author

Closing this one, there's #11818 instead

