
Snapshot in-memory chunks on shutdown for faster restarts #7229

Merged
bwplotka merged 1 commit into prometheus:main from codesome:snapshot-chunks
Aug 6, 2021

Conversation

@codesome (Member) commented May 8, 2020

This PR is based on this design doc: https://docs.google.com/document/d/1c90BmhErybHH94hoGNsgJC9pyaUTU3dgHoyD3dm6Kw4/edit?usp=sharing (the doc is a little outdated after some discussions here)

Snapshot happens during shutdown and not at regular intervals to avoid write amplification.

The snapshot name has the format chunk_snapshot.<last WAL segment>.<offset in the segment>. The offset into the WAL segment tells us precisely where to resume replay and avoids unnecessarily replaying the entire WAL segment.
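For illustration, parsing such a name could look like the sketch below. This is a hedged example, not the PR's actual code: the function name, the zero-padded segment string, and the error handling are my assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseChunkSnapshotName is an illustrative sketch (not the PR's real code)
// of splitting a name like "chunk_snapshot.000005.12345" into the last WAL
// segment index and the byte offset within that segment.
func parseChunkSnapshotName(name string) (segment int, offset int, err error) {
	parts := strings.Split(name, ".")
	if len(parts) != 3 || parts[0] != "chunk_snapshot" {
		return 0, 0, fmt.Errorf("not a chunk snapshot: %q", name)
	}
	if segment, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	if offset, err = strconv.Atoi(parts[2]); err != nil {
		return 0, 0, err
	}
	return segment, offset, nil
}

func main() {
	seg, off, err := parseChunkSnapshotName("chunk_snapshot.000005.12345")
	fmt.Println(seg, off, err) // 5 12345 <nil>
}
```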

Format of the snapshot
Starts with series records. Each series gets 1 record.

┌────────────────────────────┬────────────────────────────┐
│      Record Type <byte>    │    Series Ref <uint64>     │
├────────────────────────────┴────────────────────────────┤
│                Number of Labels <uvarint>               │
├────────────────────────────┬────────────────────────────┤
│    len(name_1) <uvarint>   │      name_1 <bytes>        │
├────────────────────────────┼────────────────────────────┤
│    len(val_1) <uvarint>    │      val_1 <bytes>         │
├────────────────────────────┴────────────────────────────┤
│                          . . .                          │
├────────────────────────────┬────────────────────────────┤
│    len(name_N) <uvarint>   │      name_N <bytes>        │
├────────────────────────────┼────────────────────────────┤
│    len(val_N) <uvarint>    │      val_N <bytes>         │
├────────────────────────────┴────────────────────────────┤
│                   Chunk Range <int64>                   │
├─────────────────────────────────────────────────────────┤
│                  Chunk Exists <uvarint>                 │
│  # 1 if head chunk exists, 0 otherwise to detect a nil  │
│  # chunk. The fields below exist only when it's 1.      │
├────────────────────────────┬────────────────────────────┤
│      Chunk Mint <int64>    │     Chunk Maxt <int64>     │
├────────────────────────────┴────────────────────────────┤
│                  Chunk Encoding <byte>                  │
├────────────────────────────┬────────────────────────────┤
│     len(Chunk) <uvarint>   │     Chunk <bytes>          │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[0].t <int64>   │  sampleBuf[0].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[1].t <int64>   │  sampleBuf[1].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[2].t <int64>   │  sampleBuf[2].v <float64>  │
├────────────────────────────┼────────────────────────────┤
│   sampleBuf[3].t <int64>   │  sampleBuf[3].v <float64>  │
└────────────────────────────┴────────────────────────────┘
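To make the layout concrete, here is a minimal sketch of encoding the fixed-width and uvarint fields of a series record. This is an illustration of the byte layout above, not the PR's implementation: the function, type, and record-type constant are hypothetical names, and the sketch only covers the no-head-chunk case (Chunk Exists = 0).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical record-type value for illustration only.
const chunkSnapshotRecordTypeSeries = byte(1)

type label struct{ name, value string }

// encodeSeriesRecord sketches the series-record layout from the diagram:
// record type, series ref, label count, length-prefixed label pairs,
// chunk range, and a Chunk Exists flag of 0 (so no head-chunk fields follow).
func encodeSeriesRecord(ref uint64, lbls []label, chunkRange int64) []byte {
	buf := make([]byte, 0, 64)
	buf = append(buf, chunkSnapshotRecordTypeSeries) // Record Type <byte>
	buf = binary.BigEndian.AppendUint64(buf, ref)    // Series Ref <uint64>
	buf = binary.AppendUvarint(buf, uint64(len(lbls)))
	for _, l := range lbls {
		buf = binary.AppendUvarint(buf, uint64(len(l.name)))
		buf = append(buf, l.name...)
		buf = binary.AppendUvarint(buf, uint64(len(l.value)))
		buf = append(buf, l.value...)
	}
	buf = binary.BigEndian.AppendUint64(buf, uint64(chunkRange)) // Chunk Range <int64>
	buf = binary.AppendUvarint(buf, 0)                           // Chunk Exists = 0
	return buf
}

func main() {
	rec := encodeSeriesRecord(42, []label{{"__name__", "up"}}, 3600000)
	fmt.Println(len(rec), rec[0]) // 31 1
}
```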

Ends with a single record for tombstones. The encoded tombstones use the same encoding as the tombstones file in blocks.

┌─────────────────────────────────────────────────────────────────┐
│                        Record Type <byte>                       │
├───────────────────────────────────┬─────────────────────────────┤
│ len(Encoded Tombstones) <uvarint> │ Encoded Tombstones <bytes>  │
└───────────────────────────────────┴─────────────────────────────┘

@codesome (Member Author)

/prombench master

@prombot (Contributor) commented May 20, 2020

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-7229 and master

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.12.0

@codesome (Member Author)

/prombench cancel

@prombot (Contributor) commented May 20, 2020

Benchmark cancel is in progress.

@codesome (Member Author)

There were some bugs in the replay part in terms of handling the file; I will benchmark again after fixing them. But here is what I could gather so far:

Snapshotting chunks for ~1.5M series takes ~3.6s; let's call it 4s to allow for anomalies. That extrapolates to roughly ~30s for 10M series.

After running for 2hrs, replay of master took 2m34s. Excluding the 14s of mmap replay, that naively translates to ~11.6s per 10m worth of WAL data.

Replaying the chunk snapshot took around 11s with a snapshot interval of 10m. Since there was a bug in the code, taking the 10m of WAL replay from above, replaying the entire data with snapshots would be 11+14+11.6 = 36.6s roughly in the worst case. That is 4-5x faster replay, and I expect it to be >5x in the general case.

I forgot to check the size of snapshots.

Before I dive into fixing the issues, how do these numbers look for a trade-off? @brian-brazil

@codesome (Member Author)

Additionally, replay time won't increase linearly (or will be linear with a very small factor) with partial chunks the way WAL replay would. This was tested only for 2h of WAL data, but in extended use I guess it goes up to 3h or more, in which case the difference would be >5x.

@codesome (Member Author)

I compared only the replay times here, whereas the turnaround time is the one that matters. With <4s for taking a snapshot (most hovering around 3s), the speedup holds even for the turnaround time :)

@brian-brazil (Contributor)

To summarise:
90s for 30M series, or an additional ~15% duty cycle on I/O with 10m snapshots. That's a fair bit.

The benchmark was for ~250k/s. For 1M/s, we'd be looking at 2h of WAL replay taking ~616s with the current mmap code.

Replaying just the chunk snapshot for 30M series would be ~220s, plus ~46s for the WAL replay at 1M/s, so ~266s in all. So the saving here is only about 2x for very large Prometheus servers, and we're still over a minute.

I'm not sure this complexity is worth it given all the additional disk writes it'd cause. I'd prefer to keep things simple to allow for some future solution that'd get us under a minute for large servers.

@codesome (Member Author)

get us under a minute for large servers.

Might be possible for replays only, but for turnaround time that sounds very ambitious unless we can write out the data to disk faster or m-map partial chunks (unlikely) :) I will think of more ways.

@codesome (Member Author)

The way I see it, we should not block an optimization that is possible right now for the sake of some arbitrary target. The tradeoff seems good.

Additionally, the code and the logic are well separated from the rest of the Head block logic, and I don't see this being a complex addition. It is also safe to downgrade without issues, since it will fall back to the old replay, which makes it easy to move to any new technique that hits the target in the future. This could also live behind a flag for now, disabled by default.

@brancz (Member) commented May 26, 2020

I tend to agree with @codesome

@brian-brazil (Contributor)

I can see that argument, but the duty cycle and additional disk writes still concern me. Do you have a rough idea of what the overall impact of the writes is on top of the WAL/mmap/compactions we already do? Write amplification is something we should keep an eye on, and I think this could double it.

@brancz (Member) commented May 26, 2020

I agree we need to measure and decide based on what's measured, but I don't think anyone suggested anything contrary to that.

@codesome (Member Author)

Here are some calculations (for a single series):

Considering a time window of 2h and a 15s scrape interval, the calculations are based on how many times the chunks created in these 2h will be written to disk over their entire lifecycle.

Current writes

Chunks written in 2h to disk = 4 (from mmapping) + 4 (to write to blocks) = 8

I am not totally sure about the compression ratio between the WAL and XORChunk, but let's say a chunk in the WAL is twice the size of an XORChunk. So here it is 8 + 8 = 16. I will add +2 for the checkpoints (or should I not?). So now it's 12+2 = 14.

If we consider that a block goes through at least 2 compactions in its lifecycle, then writing those 4 chunks twice more gives 16 + (4+4) = 24.

Writes from partial chunks

Considering a 15m snapshot interval, we can say it writes 0.5 chunks at 15m, then 1 chunk at 30m. As the cycle continues, that is 4*(0.5+1) = 6 chunks per 2 hrs.

Additional writes

6 additional chunks compared to the current 24 is a 25% increase in writes.

I have not considered block index writes or the >2 compactions that will happen, so the increase in writes would be less than 20% I believe.

Did I miss something?

@brian-brazil (Contributor)

considering that a chunk in the WAL would be twice the size of an XORChunk

It's going to be about 3x with compression, presuming a conservative 2 bytes/sample in a chunk.

Considering 15m snapshot interval

Why the switch from the 10m used previously?

My math is in terms of a single sample, which is conservatively ~2 bytes. In terms of full blocks we'll currently write it out 9 times through compaction, plus another 6 bytes for a compressed WAL, which is the same as 3 writes. So our current write amplification is around 12.

Mmapping I believe works out to no more than one additional write, so that'd bring it to 13.

Snapshots of incomplete chunks every 10m could write a sample out 12 times if the chunk is never completed, which is plausible with a 60s scrape interval. Halve that for the average case, as later samples in the 2h get written fewer times, so 6. Overall impact is 13 -> 19, roughly ~46% more write amplification.

Is shaving off half of restart time worth reducing SSD lifetime by a third?

Doing the math for a Samsung 860 EVO 500GB at 200k samples/s (i.e. 15d retention just fits), which quotes 300TB of writes: that takes the official supported lifetime from ~22 months to ~15 months. Now I'd presume it'll last a bit longer than that, but it's not as if it's 22 years versus 15 years, where it'd be clear that it wouldn't be a problem.
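The lifetime figures above can be reproduced with a back-of-the-envelope calculation. This is a sketch of that arithmetic only; the 2 bytes/sample figure, the 30-day month, and the function name are assumptions from this thread, not measured values.

```go
package main

import "fmt"

// lifetimeMonths estimates SSD lifetime: ingestion rate times bytes per
// sample times write amplification gives bytes written per month, which is
// divided into the drive's quoted write endurance.
func lifetimeMonths(samplesPerSec, bytesPerSample, writeAmp, enduranceTB float64) float64 {
	bytesPerMonth := samplesPerSec * bytesPerSample * writeAmp * 30 * 24 * 3600
	return enduranceTB * 1e12 / bytesPerMonth
}

func main() {
	// 200k samples/s, ~2 bytes/sample, 300 TB endurance (Samsung 860 EVO 500GB).
	fmt.Printf("amp=13: %.1f months\n", lifetimeMonths(200e3, 2, 13, 300)) // ~22.3
	fmt.Printf("amp=19: %.1f months\n", lifetimeMonths(200e3, 2, 19, 300)) // ~15.2
}
```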

@codesome (Member Author) commented Jun 6, 2020

Just a thought: given the write amplification with regular snapshotting, one way to use this snapshotting would be to snapshot only when Prometheus is exiting. This way a graceful restart will be fast, though crashes will not benefit.

With graceful restarts being few in number, I expect little impact on write amplification.

@brian-brazil (Contributor)

That would reduce the disk writes, but then we'd be slowing down shutdown, and the code would be exercised less often, so a bug is more likely to slip in. I'm a bit wary of that, and keep in mind that the shutdown time would now need to be added to the restart time.

@codesome (Member Author) commented Jun 9, 2020

but then we'd be slowing down shutdown

and keep in mind that the shutdown time would now need to be added to the restart time.

But the overall turnaround time is still faster. Maybe even faster than the previous calculation, because with snapshotting at the end there is no WAL to replay, and snapshotting takes less time than replaying that extra WAL did in the old calculation.

Some recalculation with the snapshot at the end, using the data gathered in this comment: #7229 (comment)

For the turnaround time we only have writing the snapshot plus the replay of mmapped chunks and the snapshot (no WAL):

For 1.5M series for 2h of data

  • Master = 2m34s for replay (hence turnaround time)
  • With Snapshotting = 4s (writing snapshot) + 14s (mmap replay) + 11s (snapshot replay) + 0s (WAL replay) = 29s

So that is an ~80% reduction; better than before, because the earlier calculation assumed the worst case of having to replay some WAL.

As we will usually have more than 2h of WAL, the reduction is going to be even bigger.
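The reduction quoted above can be checked from the timings in this thread (a quick arithmetic sketch; the helper function is mine, not part of the PR):

```go
package main

import "fmt"

// reductionPct computes the percentage reduction in turnaround time:
// master replay took 2m34s (154s); with the PR, turnaround is snapshot
// write (4s) + mmap replay (14s) + snapshot replay (11s), with no WAL replay.
func reductionPct(master, withSnapshot float64) float64 {
	return (1 - withSnapshot/master) * 100
}

func main() {
	master := 154.0             // 2m34s
	snap := 4.0 + 14.0 + 11.0   // 29s total
	fmt.Printf("%.0f%% reduction\n", reductionPct(master, snap)) // 81% reduction
}
```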

the code would be exercised less so more likely a bug will slip in

The bug can be fixed :) It is similar to WAL replay: that is also exercised only when Prometheus starts, so a bug is more likely to slip in, yet we still have it.

@roidelapluie (Member)

the code would be exercised less so more likely a bug will slip in

The bug can be fixed :) It is similar to WAL replay: that is also exercised only when Prometheus starts, so a bug is more likely to slip in, yet we still have it.

Prometheus does not always shut down nicely and does not always shut down attended. So I am on Brian's side that there is an important risk here.

@codesome (Member Author) commented Jun 9, 2020

If the snapshot is corrupt, it can be skipped, falling back to the WAL.

@brian-brazil (Contributor)

I didn't say corruption, I said bug. A bug could happen at any stage of the reading or writing of the snapshot, and code that is run less often is code more likely to have undiscovered bugs.

@codesome (Member Author) commented Jun 9, 2020

Right. But we could also have extensive tests around it. I am on the side of having this feature in, maybe behind a flag and disabled by default until we are confident about it, instead of blocking it altogether.

@brian-brazil (Contributor)

Hiding it behind a flag is worse in my mind. We get all the complexity costs, and the code is exercised even less often.

@codesome (Member Author) commented Jun 9, 2020

I would not worry about complexity, as the code for this is well isolated and it is easy to turn on or off. As for the code being used less: right, it would be, but I expect Cortex/Thanos, which import TSDB, to enable it and test it extensively.

Do you have any other suggestions if we were not to have it behind a flag? Is the code being executed less often the only concern (which would make a subtle bug invisible)?

@brian-brazil (Contributor)

I would not worry about complexity as the code for this is well isolated

From a look at the PR this isn't the case, it's intertwined with the WAL reading code.

Do you have any other suggestions if we were not to have it behind a flag? Is the code being executed less often the only concern (which would make a subtle bug invisible)?

My point is more that putting it behind a flag makes it worse from my standpoint. If the main answer to the disadvantages of a feature is that it's okay because it's behind a rarely used flag, it'd be better to not have the feature at all.

A feature like this should be always enabled, and justified on that basis.

@codesome (Member Author)

After some offline discussions, here are some points on a few aspects:

Benefits

I ran a quick test of this PR as part of Cortex to verify the numbers again. I was able to conclude a 50-80+% reduction in turnaround time. Ideally there would be no WAL to replay, since the snapshot happens at the end, but due to Cortex's usage of TSDB or a bug in this PoC, the 50% number comes up with some WAL to replay. Otherwise it would be a 75%+ reduction in most cases, as calculated before.

Complexity

The majority of the code is well isolated. The writing and the loading of the snapshot are each isolated in their own method, and writing/loading a snapshot is as simple as calling those methods.

The main complexity comes from the special handling of m-mapped chunks during the creation of a series from the WAL. When you create a series from the WAL (after having already replayed the snapshot and the m-mapped chunks), you have to check whether the m-mapped chunks for the new series overlap with the chunk from the snapshot, and discard them if so (in case it's a duplicate series record). This is the only additional complexity.
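A minimal sketch of that overlap check follows. The types, names, and the exact discard rule are illustrative assumptions for the example, not the PR's real code; the point is only that m-mapped chunks overlapping the chunk restored from the snapshot get dropped as duplicates.

```go
package main

import "fmt"

// Illustrative types; not the actual TSDB structures.
type mmappedChunk struct{ minT, maxT int64 }

// dropOverlapping keeps only the m-mapped chunks that end strictly before
// the snapshot's head chunk begins; anything overlapping is discarded as a
// duplicate of data already restored from the snapshot.
func dropOverlapping(chunks []mmappedChunk, snapshotChunkMinT int64) []mmappedChunk {
	kept := chunks[:0]
	for _, c := range chunks {
		if c.maxT < snapshotChunkMinT {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	chunks := []mmappedChunk{{0, 99}, {100, 199}, {200, 299}}
	kept := dropOverlapping(chunks, 150) // snapshot head chunk starts at t=150
	fmt.Println(len(kept)) // 1: only {0,99} ends before 150
}
```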

Risks

  • We track the exact byte in the WAL file from where to start replaying. So, any races that might exist between writing a record and tracking that byte should be eliminated.
  • With the view of eliminating races, it is best if we can ensure that no samples are appended once we call the Close() method on Head.

Base automatically changed from master to main February 23, 2021 19:36
@stale stale bot removed the stale label Feb 23, 2021
@roidelapluie (Member)

Should we resume work here? Additionally, since the extra disk space needed might surprise users, it might require a feature flag.

@codesome (Member Author) commented Mar 4, 2021

We should resume, but I don't think I can get to it for many weeks to come. And yes, I was thinking of a feature flag for this too :)

@bhiravabhatla

Waiting for this PR to be merged. This is a very interesting/important feature that we would like to have. Is there a planned ETA for this?

@codesome (Member Author) commented Aug 4, 2021

I have rebased the PR and added a feature flag for this now. It is ready for review :)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
@codesome codesome requested a review from bwplotka August 5, 2021 09:14
@bwplotka (Member) left a comment

Looks solid. It's a lot of complexity, but given feature flags, we can experiment with this! LGTM, thanks!

@bwplotka bwplotka merged commit ee7e007 into prometheus:main Aug 6, 2021
@codesome (Member Author) commented Aug 7, 2021

WOW, I did not expect a direct merge XD Thanks :)
Do you have any possible suggestions for the record formats @bwplotka?

codesome added a commit that referenced this pull request Aug 11, 2021
* Fix `kuma_sd` targetgroup reporting (#9157)

* Bundle all xDS targets into a single group

Signed-off-by: austin ce <austin.cawley@gmail.com>

* Snapshot in-memory chunks on shutdown for faster restarts (#7229)

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Rename links

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove Individual Data Type Caps in Per-shard Buffering for Remote Write (#8921)

* Moved everything to nPending buffer

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Simplify exemplar capacity addition

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Added pre-allocation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Don't allocate if not sending exemplars

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Avoid deadlock when processing duplicate series record (#9170)

* Avoid deadlock when processing duplicate series record

`processWALSamples()` needs to be able to send on its output channel
before it can read the input channel, so it reads to allow this in case
the output channel is full.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* processWALSamples: update comment

Previous text seems to relate to an earlier implementation.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Optimise WAL loading by removing extra map and caching min-time (#9160)

* BenchmarkLoadWAL: close WAL after use

So that goroutines are stopped and resources released

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: make series IDs co-prime with #workers

Series are distributed across workers by taking the modulus of the
ID with the number of workers, so multiples of 100 are a poor choice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* BenchmarkLoadWAL: simulate mmapped chunks

Real Prometheus cuts chunks every 120 samples, then skips those samples
when re-reading the WAL. Simulate this by creating a single mapped chunk
for each series, since the max time is all the reader looks at.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Fix comment

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Remove series map from processWALSamples()

The locks that the comment says reduce contention are now sharded
32,000 ways, so they won't be contended. Removing the map saves memory
and goes just as fast.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* loadWAL: Cache the last mmapped chunk time

So we can skip calling append() for samples it will reject.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Improvements from code review

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Full stops and capitals on comments

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Cache max time in both places mmappedChunks is updated

Including refactor to extract function `setMMappedChunks`, to reduce
code duplication.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Update head min/max time when mmapped chunks added

This ensures we have the correct values if no WAL samples are added for
that series.

Note that `mSeries.maxTime()` was always `math.MinInt64` before, since
that function doesn't consider mmapped chunks.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Split Go and React Tests (#8897)

* Added go-ci and react-ci

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Remove search keymap from new expression editor (#9184)

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>