
Snapshot in-memory chunks on shutdown for faster restarts #7229

Merged
bwplotka merged 1 commit into prometheus:main from codesome:snapshot-chunks
Aug 6, 2021

Conversation

@codesome (Member) commented May 8, 2020

This PR is based on this design doc: https://docs.google.com/document/d/1c90BmhErybHH94hoGNsgJC9pyaUTU3dgHoyD3dm6Kw4/edit?usp=sharing (the doc is a little outdated after some discussions here)

Snapshot happens during shutdown and not at regular intervals to avoid write amplification.

The snapshot name has the format chunk_snapshot.<last WAL segment>.<offset in the segment>. The offset into the WAL segment tells us precisely where to resume replay and avoids unnecessarily replaying the entire WAL segment.
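For illustration, parsing such a name could look like the sketch below. This is a hedged example, not the PR's actual code: the function name, the zero-padded segment string, and the error handling are my assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseChunkSnapshotName is an illustrative sketch (not the PR's real code)
// of splitting a name like "chunk_snapshot.000005.12345" into the last WAL
// segment index and the byte offset within that segment.
func parseChunkSnapshotName(name string) (segment int, offset int, err error) {
	parts := strings.Split(name, ".")
	if len(parts) != 3 || parts[0] != "chunk_snapshot" {
		return 0, 0, fmt.Errorf("not a chunk snapshot: %q", name)
	}
	if segment, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	if offset, err = strconv.Atoi(parts[2]); err != nil {
		return 0, 0, err
	}
	return segment, offset, nil
}

func main() {
	seg, off, err := parseChunkSnapshotName("chunk_snapshot.000005.12345")
	fmt.Println(seg, off, err) // 5 12345 <nil>
}
```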

Format of the snapshot
Starts with series records. Each series gets 1 record.

┌────────────────────────────┬────────────────────────────┐
│      Record Type <byte>    │    Series Ref <uint64>     │
├────────────────────────────┴────────────────────────────┤
│                Number of Labels <uvarint>               │
├────────────────────────────┬────────────────────────────┤
│    len(name_1) <uvarint>   │      name_1 <bytes>        │
├────────────────────────────┼────────────────────────────┤
│    len(val_1) <uvarint>    │      val_1 <bytes>         │
├────────────────────────────┴────────────────────────────┤
│                          . . .                          │
├────────────────────────────┬────────────────────────────┤
│    len(name_N) <uvarint>   │      name_N <bytes>        │
├────────────────────────────┼────────────────────────────┤
│    len(val_N) <uvarint>    │      val_N <bytes>         │
├────────────────────────────┴────────────────────────────┤
│                   Chunk Range <int64>                   │
├─────────────────────────────────────────────────────────┤
│                  Chunk Exists <uvarint>                 │
│  # 1 if head chunk exists, 0 otherwise to detect a nil  │
│  # chunk. The fields below exist only when it's 1.      │
├────────────────────────────┬────────────────────────────┤
│      Chunk Mint <int64>    │     Chunk Maxt <int64>     │
├────────────────────────────┴────────────────────────────┤
│                  Chunk Encoding <byte>                  │
├────────────────────────────┬────────────────────────────┤
│     len(Chunk) <uvarint>   │     Chunk <bytes>          │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[0].t <int64>   │  sampleBuf[0].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[1].t <int64>   │  sampleBuf[1].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[2].t <int64>   │  sampleBuf[2].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[3].t <int64>   │  sampleBuf[3].v <float64>  │
└────────────────────────────┴────────────────────────────┘
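To make the layout concrete, here is a minimal sketch of encoding the fixed-width and uvarint fields of a series record. This is an illustration of the byte layout above, not the PR's implementation: the function, type, and record-type constant are hypothetical names, and the sketch only covers the no-head-chunk case (Chunk Exists = 0).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical record-type value for illustration only.
const chunkSnapshotRecordTypeSeries = byte(1)

type label struct{ name, value string }

// encodeSeriesRecord sketches the series-record layout from the diagram:
// record type, series ref, label count, length-prefixed label pairs,
// chunk range, and a Chunk Exists flag of 0 (so no head-chunk fields follow).
func encodeSeriesRecord(ref uint64, lbls []label, chunkRange int64) []byte {
	buf := make([]byte, 0, 64)
	buf = append(buf, chunkSnapshotRecordTypeSeries) // Record Type <byte>
	buf = binary.BigEndian.AppendUint64(buf, ref)    // Series Ref <uint64>
	buf = binary.AppendUvarint(buf, uint64(len(lbls)))
	for _, l := range lbls {
		buf = binary.AppendUvarint(buf, uint64(len(l.name)))
		buf = append(buf, l.name...)
		buf = binary.AppendUvarint(buf, uint64(len(l.value)))
		buf = append(buf, l.value...)
	}
	buf = binary.BigEndian.AppendUint64(buf, uint64(chunkRange)) // Chunk Range <int64>
	buf = binary.AppendUvarint(buf, 0)                           // Chunk Exists = 0
	return buf
}

func main() {
	rec := encodeSeriesRecord(42, []label{{"__name__", "up"}}, 3600000)
	fmt.Println(len(rec), rec[0]) // 31 1
}
```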

Ends with a single record for tombstones. The encoded tombstones use the same encoding as the tombstones file in blocks.

┌─────────────────────────────────────────────────────────────────┐
│                        Record Type <byte>                       │
├───────────────────────────────────┬─────────────────────────────┤
│ len(Encoded Tombstones) <uvarint> │ Encoded Tombstones <bytes>  │
└───────────────────────────────────┴─────────────────────────────┘

@codesome (Member Author)

/prombench master

@prombot (Contributor) commented May 20, 2020

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-7229 and master

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.12.0

@codesome (Member Author)

/prombench cancel

@prombot (Contributor) commented May 20, 2020

Benchmark cancel is in progress.

@codesome (Member Author)

There were some bugs in the replay part in terms of handling the file; I will benchmark again after fixing them. But here is what I could gather so far:

Snapshotting chunks for ~1.5M series takes ~3.6s; let's call it 4s to allow for anomalies. That extrapolates to roughly ~30s for 10M series.

After running for 2hrs, replay of master took 2m34s. Excluding the 14s of mmap replay, that naively translates to ~11.6s per 10m worth of WAL data.

Replaying the chunk snapshot took around 11s with a snapshot interval of 10m. Since there was a bug in the code, taking the 10m of WAL replay from above, replaying the entire data with snapshots would be 11+14+11.6 = 36.6s roughly in the worst case. That is 4-5x faster replay, and I expect it to be >5x in the general case.

I forgot to check the size of snapshots.

Before I dive into fixing the issues, how do these numbers look for a trade-off? @brian-brazil

@codesome (Member Author)

Additionally, replay time won't increase linearly (or will be linear with a very small factor) with partial chunks the way WAL replay would. This was tested only for 2h of WAL data, but in extended use I guess it goes up to 3h or more, in which case the difference would be >5x.

@codesome (Member Author)

I compared only the replay times here, whereas the turnaround time is the one that matters. With <4s for taking a snapshot (most hovering around 3s), the speedup holds even for the turnaround time :)

@brian-brazil (Contributor)

To summarise:
90s for 30M series, or an additional ~15% duty cycle on I/O with 10m snapshots. That's a fair bit.

The benchmark was for ~250k/s. For 1M/s, we'd be looking at 2h of WAL replay taking ~616s with the current mmap code.

Replaying just the chunk snapshot for 30M series would be ~220s, plus ~46s for the WAL replay at 1M/s, so ~266s in all. So the saving here is only about 2x for very large Prometheus servers, and we're still over a minute.

I'm not sure this complexity is worth it given all the additional disk writes it'd cause. I'd prefer to keep things simple to allow for some future solution that'd get us under a minute for large servers.

@codesome (Member Author)

get us under a minute for large servers.

Might be possible for replays only, but for turnaround time that sounds very ambitious unless we can write out the data to disk faster or m-map partial chunks (unlikely) :) I will think of more ways.

@codesome (Member Author)

The way I see it, we should not block an optimization that is possible right now for the sake of some arbitrary target. The tradeoff seems good.

Additionally, the code and the logic are well separated from the rest of the Head block logic, and I don't see this being a complex addition. It is also safe to downgrade without issues, since it will fall back to the old replay, which makes it easy to move to any new technique that hits the target in the future. This could also live behind a flag for now, disabled by default.

@brancz (Member) commented May 26, 2020

I tend to agree with @codesome

@brian-brazil (Contributor)

I can see that argument, but the duty cycle and additional disk writes still concern me. Do you have a rough idea of what the overall impact of the writes is on top of the WAL/mmap/compactions we already do? Write amplification is something we should keep an eye on, and I think this could double it.

@brancz (Member) commented May 26, 2020

I agree we need to measure and decide based on what's measured, but I don't think anyone suggested anything contrary to that.

@codesome (Member Author)

Here are some calculations (for a single series):

Considering a time window of 2h and a 15s scrape interval, the calculations are based on how many times the chunks created in these 2h will be written to disk over their entire lifecycle.

Current writes

Chunks written in 2h to disk = 4 (from mmapping) + 4 (to write to blocks) = 8

I am not totally sure about the compression ratio between the WAL and XORChunk, but let's say a chunk in the WAL is twice the size of an XORChunk. So here it is 8 + 8 = 16. I will add +2 for the checkpoints (or should I not?). So now it's 12+2 = 14.

If we consider that a block goes through at least 2 compactions in its lifecycle, then writing those 4 chunks twice more gives 16 + (4+4) = 24.

Writes from partial chunks

Considering a 15m snapshot interval, we can say it writes 0.5 chunks at 15m, then 1 chunk at 30m. As the cycle continues, that is 4*(0.5+1) = 6 chunks per 2 hrs.

Additional writes

6 additional chunks compared to the current 24 is a 25% increase in writes.

I have not considered block index writes or the >2 compactions that will happen, so the increase in writes would be less than 20% I believe.

Did I miss something?

@brian-brazil (Contributor)

considering that a chunk in the WAL would be twice the size of an XORChunk

It's going to be about 3x with compression, presuming a conservative 2 bytes/sample in a chunk.

Considering 15m snapshot interval

Why the switch from the 10m used previously?

My math is in terms of a single sample, which is conservatively ~2 bytes. In terms of full blocks we'll currently write it out 9 times through compaction, plus another 6 bytes for a compressed WAL, which is the same as 3 writes. So our current write amplification is around 12.

Mmapping I believe works out to no more than one additional write, so that'd bring it to 13.

Snapshots of incomplete chunks every 10m could write a sample out 12 times if the chunk is never completed, which is plausible with a 60s scrape interval. Halve that for the average case, as later samples in the 2h get written fewer times, so 6. Overall impact is 13 -> 19, roughly ~46% more write amplification.

Is shaving off half of restart time worth reducing SSD lifetime by a third?

Doing the math for a Samsung 860 EVO 500GB at 200k samples/s (i.e. 15d retention just fits), which quotes 300TB of writes: that takes the official supported lifetime from ~22 months to ~15 months. Now I'd presume it'll last a bit longer than that, but it's not as if it's 22 years versus 15 years, where it'd be clear that it wouldn't be a problem.
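The lifetime figures above can be reproduced with a back-of-the-envelope calculation. This is a sketch of that arithmetic only; the 2 bytes/sample figure, the 30-day month, and the function name are assumptions from this thread, not measured values.

```go
package main

import "fmt"

// lifetimeMonths estimates SSD lifetime: ingestion rate times bytes per
// sample times write amplification gives bytes written per month, which is
// divided into the drive's quoted write endurance.
func lifetimeMonths(samplesPerSec, bytesPerSample, writeAmp, enduranceTB float64) float64 {
	bytesPerMonth := samplesPerSec * bytesPerSample * writeAmp * 30 * 24 * 3600
	return enduranceTB * 1e12 / bytesPerMonth
}

func main() {
	// 200k samples/s, ~2 bytes/sample, 300 TB endurance (Samsung 860 EVO 500GB).
	fmt.Printf("amp=13: %.1f months\n", lifetimeMonths(200e3, 2, 13, 300)) // ~22.3
	fmt.Printf("amp=19: %.1f months\n", lifetimeMonths(200e3, 2, 19, 300)) // ~15.2
}
```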

@codesome (Member Author) commented Jun 6, 2020

Just a thought: given the write amplification with regular snapshotting, one way to use this snapshotting would be to snapshot only when Prometheus is exiting. This way a graceful restart will be fast, though crashes will not benefit.

With graceful restarts being few in number, I expect little impact on write amplification.

@brian-brazil (Contributor)

That would reduce the disk writes, but then we'd be slowing down shutdown, and the code would be exercised less often, so a bug is more likely to slip in. I'm a bit wary of that, and keep in mind that the shutdown time would now need to be added to the restart time.

@codesome (Member Author) commented Jun 9, 2020

but then we'd be slowing down shutdown

and keep in mind that the shutdown time would now need to be added to the restart time.

But the overall turnaround time is still faster. Maybe even faster than the previous calculation, because with snapshotting at the end there is no WAL to replay, and snapshotting takes less time than replaying that extra WAL did in the old calculation.

Some recalculation with the snapshot at the end, using the data gathered in this comment: #7229 (comment)

For the turnaround time we only have writing the snapshot plus the replay of mmapped chunks and the snapshot (no WAL):

For 1.5M series for 2h of data

  • Master = 2m34s for replay (hence turnaround time)
  • With Snapshotting = 4s (writing snapshot) + 14s (mmap replay) + 11s (snapshot replay) + 0s (WAL replay) = 29s

So that is an ~80% reduction; better than before, because the earlier calculation assumed the worst case of having to replay some WAL.

As we will usually have more than 2h of WAL, the reduction is going to be even bigger.
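The reduction quoted above can be checked from the timings in this thread (a quick arithmetic sketch; the helper function is mine, not part of the PR):

```go
package main

import "fmt"

// reductionPct computes the percentage reduction in turnaround time:
// master replay took 2m34s (154s); with the PR, turnaround is snapshot
// write (4s) + mmap replay (14s) + snapshot replay (11s), with no WAL replay.
func reductionPct(master, withSnapshot float64) float64 {
	return (1 - withSnapshot/master) * 100
}

func main() {
	master := 154.0             // 2m34s
	snap := 4.0 + 14.0 + 11.0   // 29s total
	fmt.Printf("%.0f%% reduction\n", reductionPct(master, snap)) // 81% reduction
}
```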

the code would be exercised less so more likely a bug will slip in

The bug can be fixed :) It is similar to WAL replay: that is also exercised only when Prometheus starts, so a bug is more likely to slip in, yet we still have it.

@roidelapluie (Member)

the code would be exercised less so more likely a bug will slip in

The bug can be fixed :) It is similar to WAL replay: that is also exercised only when Prometheus starts, so a bug is more likely to slip in, yet we still have it.

Prometheus does not always shut down nicely and does not always shut down attended. So I am on Brian's side that there is an important risk here.

@codesome (Member Author) commented Jun 9, 2020

If the snapshot is corrupt, it can be skipped, falling back to the WAL.

@brian-brazil (Contributor)

I didn't say corruption, I said bug. A bug could happen at any stage of the reading or writing of the snapshot, and code that is run less often is code more likely to have undiscovered bugs.

@codesome (Member Author) commented Jun 9, 2020

Right. But we could also have extensive tests around it. I am on the side of having this feature in, maybe behind a flag and disabled by default until we are confident about it, instead of blocking it altogether.

@brian-brazil (Contributor)

Hiding it behind a flag is worse in my mind. We get all the complexity costs, and the code is exercised even less often.

@codesome (Member Author) commented Jun 9, 2020

I would not worry about complexity, as the code for this is well isolated and it is easy to turn on or off. As for the code being used less: right, it would be, but I expect Cortex/Thanos, which import TSDB, to enable it and test it extensively.

Do you have any other suggestions if we were not to have it behind a flag? Is the code being executed less often the only concern (which would make a subtle bug invisible)?

@brian-brazil (Contributor)

I would not worry about complexity as the code for this is well isolated

From a look at the PR this isn't the case, it's intertwined with the WAL reading code.

Do you have any other suggestions if we were not to have it behind a flag? Is the code being executed less often the only concern (which would make a subtle bug invisible)?

My point is more that putting it behind a flag makes it worse from my standpoint. If the main answer to the disadvantages of a feature is that it's okay because it's behind a rarely used flag, it'd be better to not have the feature at all.

A feature like this should be always enabled, and justified on that basis.

@codesome (Member Author)

After some offline discussions, here are some points on a few aspects:

Benefits

I ran a quick test of this PR as part of Cortex to verify the numbers again. I was able to conclude a 50-80+% reduction in turnaround time. Ideally there would be no WAL to replay, since the snapshot happens at the end, but due to Cortex's usage of TSDB or a bug in this PoC, the 50% number comes up with some WAL to replay. Otherwise it would be a 75%+ reduction in most cases, as calculated before.

Complexity

The majority of the code is well isolated. The writing and the loading of the snapshot are each isolated in their own method, and writing/loading a snapshot is as simple as calling those methods.

The main complexity comes from the special handling of m-mapped chunks during the creation of a series from the WAL. When you create a series from the WAL (after having already replayed the snapshot and the m-mapped chunks), you have to check whether the m-mapped chunks for the new series overlap with the chunk from the snapshot, and discard them if so (in case it's a duplicate series record). This is the only additional complexity.
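A minimal sketch of that overlap check follows. The types, names, and the exact discard rule are illustrative assumptions for the example, not the PR's real code; the point is only that m-mapped chunks overlapping the chunk restored from the snapshot get dropped as duplicates.

```go
package main

import "fmt"

// Illustrative types; not the actual TSDB structures.
type mmappedChunk struct{ minT, maxT int64 }

// dropOverlapping keeps only the m-mapped chunks that end strictly before
// the snapshot's head chunk begins; anything overlapping is discarded as a
// duplicate of data already restored from the snapshot.
func dropOverlapping(chunks []mmappedChunk, snapshotChunkMinT int64) []mmappedChunk {
	kept := chunks[:0]
	for _, c := range chunks {
		if c.maxT < snapshotChunkMinT {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	chunks := []mmappedChunk{{0, 99}, {100, 199}, {200, 299}}
	kept := dropOverlapping(chunks, 150) // snapshot head chunk starts at t=150
	fmt.Println(len(kept)) // 1: only {0,99} ends before 150
}
```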

Risks

  • We track the exact byte in the WAL file from where to start replaying. So, any races that might exist between writing a record and tracking that byte should be eliminated.
  • With the view of eliminating races, it is best if we can ensure that no samples are appended once we call the Close() method on Head.

Base automatically changed from master to main February 23, 2021 19:36
@stale stale bot removed the stale label Feb 23, 2021
@roidelapluie (Member)

Should we resume work here? Additionally, since the extra disk space needed might surprise users, it might require a feature flag.

@codesome (Member Author) commented Mar 4, 2021

We should resume, but I don't think I can get to it for many weeks to come. And yes, I was thinking of a feature flag for this too :)

@bhiravabhatla

Waiting for this PR to be merged. This is a very interesting/important feature that we would like to have. Is there a planned ETA for this?

@codesome (Member Author) commented Aug 4, 2021

I have rebased the PR and added a feature flag for this now. It is ready for review :)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
@codesome codesome requested a review from bwplotka August 5, 2021 09:14
@bwplotka (Member) left a comment

Looks solid. It's a lot of complexity, but given feature flags, we can experiment with this! LGTM, thanks!

@bwplotka bwplotka merged commit ee7e007 into prometheus:main Aug 6, 2021
@codesome (Member Author) commented Aug 7, 2021

WOW, I did not expect a direct merge XD Thanks :)
Do you have any possible suggestions for the record formats @bwplotka?

codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so it reads to allow this in case
the output channel is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the comment says reduce contention are now sharded
32,000 ways, so they won't be contended. Removing the map saves memory
and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>