Append metadata to the WAL in the scrape loop #10312
bwplotka merged 22 commits into prometheus:main
Conversation
cstyan
left a comment
I don't have much to say right now, looks relatively safe.
We should get a sanity check for the scrape code from @roidelapluie and for the TSDB code from @codesome.
tsdb/head_append.go
Outdated
    }

    hasNewMetadata := false
    s.Lock()
Should we potentially use the RLock() here and then only lock if we need to overwrite s.meta? Or potentially introduce a new metadataLock to memSeries?
I'm not really sure which is the best option, but personally I'd avoid adding extra fields to an already-large struct for each and every series.
Were you also thinking of something like 045ec58?
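For illustration, the RLock-then-upgrade idea could look roughly like this. This is a minimal sketch with stand-in types; the field and method names are illustrative and not the actual memSeries implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// metadata is a stand-in for the metadata.Metadata struct.
type metadata struct {
	Type, Unit, Help string
}

// memSeries is a heavily simplified stand-in; the real struct carries many more fields.
type memSeries struct {
	mtx  sync.RWMutex
	meta metadata
}

// updateMetadata takes the read lock to compare first, and only upgrades to
// the write lock when the metadata actually changed. It reports whether a
// write happened.
func (s *memSeries) updateMetadata(m metadata) bool {
	s.mtx.RLock()
	unchanged := s.meta == m
	s.mtx.RUnlock()
	if unchanged {
		return false
	}
	s.mtx.Lock()
	defer s.mtx.Unlock()
	// Re-check: another goroutine may have written between RUnlock and Lock.
	if s.meta == m {
		return false
	}
	s.meta = m
	return true
}

func main() {
	s := &memSeries{}
	fmt.Println(s.updateMetadata(metadata{Type: "counter", Help: "A"})) // first write: true
	fmt.Println(s.updateMetadata(metadata{Type: "counter", Help: "A"})) // no change: false
}
```

The re-check after taking the write lock matters; without it, two goroutines racing past the RLock comparison could both write.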
@roidelapluie @bwplotka could you have a look at this? We Grafanistas thought it wouldn't be kosher to do this just among ourselves. @tpaschalis is currently working on getting the tests passing. Beyond that, it would be great to unblock this PR.
@roidelapluie do you have time to take a look at this?

I will take a look this week, hopefully today 🤞🏽
bwplotka
left a comment
Did a first general pass. Looks clean, but let's make sure we are not introducing any regression, even for those who do not want to append metadata to the WAL (should we add this in an opt-in manner? also, should it be a feature flag?).
scrape/scrape.go
Outdated
        break loop
    }

    sl.cache.metaMtx.Lock()
Hmmm, those locks are a bit scary here. I assume lock is here because we want the latest metadata across all sources of a single metric name, right? Which could be inconsistent during scrape. Or am I wrong?
Nevertheless, we are on the hot path of scraping which has to be efficient, so we have to ideally benchmark this path. WDYT? Ideally with cases of potential lock congestion. If there is regression, we could opt in for some less consistent framework for choosing what metadata should stay.
Also, for "error?" cases where process A is registering __name__=metric1 with help ABC and process B is registering __name__=metric1 with help ABC2, we will constantly update this metadata back and forth in the worst case, right?
Sorry @bwplotka for the super late reply.
Yes, I think we can go with a feature flag until we see its effects on larger-scale Prometheus deployments.
I assume lock is here because we want the latest metadata across all sources of a single metric name, right? Which could be inconsistent during scrape. Or am I wrong?
Yes, that's what we hold as metadata in the scrapeCache. As you said, this inconsistency will happen when two different scrape_configs share a common metric name and try to simultaneously update its metadata, right? I think that yes, in the worst case that two processes register different metadata in each scrape iteration, we will be recording all these updates.
so we have to ideally benchmark this path. WDYT? Ideally with cases of potential lock congestion
I'm building a benchmark to measure the effect and will come back to you. We could also consider using read locks as we're checking whether an update is needed.
Hey @bwplotka! About the metaMtx mutex in scrapeCache; if I'm getting this right, the scrapeCache is unique per scrapeLoop.
In that case, there will be no contention for this mutex between multiple scrapers, right? Each scraper will have its own lock to use when it encounters new metadata. The things that do run concurrently and might contend for the mutex are a) the methods used by the MetadataWatcher in the QueueManager and b) the MetadataMetricsCollector which reports metrics to the scrape manager. If those use cases weren't needed, we wouldn't even need metaMtx.
If that's true, does it relieve some of the concerns around these locks?
I've run some benchmarks (tpaschalis@177d95a) to get a feel of the effect of this on the scrape loop after rebasing with the current main (070e409).
The first benchmark is the existing BenchmarkScrapeLoopAppend. The second one changes metadata on every scrape iteration, while the third one both changes metadata on every iteration and tries to obtain the metadata mutex every millisecond (not a realistic scenario with the current use cases).
$ benchstat main pr10312
name old time/op new time/op delta
ScrapeLoopAppend-8 41.9µs ± 3% 43.4µs ± 6% +3.50% (p=0.000 n=19+20)
ScrapeLoopAppend_MetadataChangesAlways-8 47.6µs ± 2% 51.7µs ± 2% +8.71% (p=0.000 n=19+20)
ScrapeLoopAppend_MetadataChangesAlwaysContended-8 48.9µs ± 3% 52.7µs ± 1% +7.88% (p=0.000 n=19+17)
How do you feel about these benchmarks?
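For readers who want to reproduce a congestion measurement without the full scrape machinery, a generic microbenchmark along these lines can be used. This is an illustrative sketch, not the benchmarks from tpaschalis@177d95a: the hot loop stands in for the scrape path taking metaMtx, and the background goroutine stands in for a MetadataWatcher-style reader.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// BenchmarkMetaLock measures mutex-protected map writes while a background
// goroutine contends on the same lock with reads.
func BenchmarkMetaLock(b *testing.B) {
	var mtx sync.Mutex
	meta := map[string]string{}
	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				mtx.Lock()
				_ = meta["metric1"] // contending read
				mtx.Unlock()
			}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		mtx.Lock()
		meta["metric1"] = "help text" // hot-path write
		mtx.Unlock()
	}
	b.StopTimer()
	close(stop)
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test`.
	res := testing.Benchmark(BenchmarkMetaLock)
	fmt.Println(res.N > 0)
}
```

Comparing this against a variant without the background goroutine gives a rough feel for how much of the delta is pure lock contention versus the extra map work.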
"if I'm getting this right, the scrapeCache is unique per scrapeLoop": this is how I read the code as well. The newLoop function, part of scrapePool, takes in the scrapeLoopOptions struct. If the cache field on that struct is nil, a new scrape cache is created. When a sync happens and the new scrape loops are started (some of which won't really have changed), the existing scrape loops are passed along where relevant.
As others have mentioned, it would be a good idea to put this behind a feature flag initially. Have a look at …

@cstyan @bwplotka I've put the feature behind a feature flag named …

The only thing that was clunky to put behind the feature flag was the new …

Just to be sure, I ran the benchmark again (with the flag disabled) to verify that there isn't any leak. Do you have any more comments at this stage? How would you prefer that we proceed?

I plan to review the tsdb changes sometime next week
codesome
left a comment
I have done some initial review for the tsdb. I am calling it initial because there are a bunch of things missing (only talking about tsdb/).
- We need unit tests for:
  - The ingestion of metadata with different error scenarios, checking that we are writing only changed metadata.
  - Making sure we are writing it into the WAL properly and are able to read it back from the WAL.
  - Checkpointing happening as expected (check my comment about checkpoints; maybe we need to change the logic).
- We are missing WAL replay for these records. Is it intentional? If it is, then every restart will add a whole lot of duplicate metadata records. If it is not, we will need to add it (including tests for that).
tsdb/wal/checkpoint.go
Outdated
    if err != nil {
        return nil, errors.Wrap(err, "decode metadata")
    }
    // Drop irrelevant metadata in place.
The implementation feels like we will keep all the historical metadata changes of a series as long as the series exists. Should we not keep just the last metadata of a series in this case?
Yeah, that's true. I thought this wouldn't have much impact as a) metadata isn't usually rapidly changing, and b) if the metadata does change then it might make sense to be able to correlate it with the values associated with a series (eg. the unit changing from seconds to milliseconds).
If we do want to keep the latest metadata entry only, we don't have any timestamps associated with them so I can't say how we'd go about it off the top of my head, but I'll take a look just in case we want to move to that direction.
It makes sense to keep a metadata entry around until all the samples after it have been removed from the checkpoint, up until the new metadata appears, for the reason you said: to make sense of the old data. But I am not sure how efficiently we can do that. One idea is to store the timestamp of the last sample seen when the metadata was updated.
If we do want to keep the latest metadata entry only, we don't have any timestamps associated with them so I can't say how we'd go about it off the top of my head, but I'll take a look just in case we want to move to that direction.
This one is relatively simple, but with one complexity. Similar to the keep(ref) method that we have, we could have another keepMeta(ref, meta), which returns true if the metadata matches the one currently present in the head block, i.e. the latest, and false otherwise. The problem here is that between 2 checkpoints a series might have metadata M1, then switch to M2, then switch back to M1, in which case we will end up keeping 2 copies of M1. We need to think about how to dedupe this. Maybe a timestamp of some sort will help identify the latest.
The WAL is written to in-order, right? We can assume that for a given ref ID, metadata entries are written in-order and we can keep only the latest metadata entry for each ref ID. We shouldn't need to include timestamps, IIUC.
Sorry for the late response. I think @rfratto's suggestion should work nicely.
One way to go about this would be to just keep track of the latest metadata for each series in the for r.Next() loop, and then flush them separately along the remaining records in the end. I'll try it out to see how it works in practice, but I also wanted to hear how it sounds to y'all.
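A minimal sketch of that keep-the-latest idea, relying only on the fact that WAL records are iterated in write order. The types and names below are stand-ins, not the actual checkpoint code:

```go
package main

import "fmt"

// refMetadata is a stand-in for record.RefMetadata.
type refMetadata struct {
	Ref  uint64
	Unit string
	Help string
}

// dedupeMetadata keeps only the last metadata record seen per series ref.
// Because the WAL is written in order, iterating front to back and
// overwriting the map entry leaves exactly the latest entry for each ref.
func dedupeMetadata(records []refMetadata, keep func(uint64) bool) []refMetadata {
	latest := make(map[uint64]refMetadata)
	var order []uint64 // first-seen order, for deterministic output
	for _, m := range records {
		if !keep(m.Ref) {
			continue // series no longer exists; drop its metadata
		}
		if _, ok := latest[m.Ref]; !ok {
			order = append(order, m.Ref)
		}
		latest[m.Ref] = m
	}
	out := make([]refMetadata, 0, len(latest))
	for _, ref := range order {
		out = append(out, latest[ref])
	}
	return out
}

func main() {
	recs := []refMetadata{
		{Ref: 1, Help: "old"},
		{Ref: 2, Help: "dropped"},
		{Ref: 1, Help: "new"},
	}
	keep := func(ref uint64) bool { return ref != 2 } // pretend series 2 was deleted
	fmt.Println(dedupeMetadata(recs, keep)) // only the latest entry for series 1 survives
}
```

This handles the M1 → M2 → M1 case without timestamps, since only the final occurrence is flushed.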
Edit: I've implemented the deduplication logic in 36cb2b6, along with a relevant test. Let me know how it looks to you; I'll try to see if I can bring some benchmarks.
With this, I think we might be able to avoid complicating the WAL storage format.
If we do not care about old metadata, then definitely, just keep the last one!
One of the use-cases brought up was keeping track of changing metadata. Do we not care about knowing what the series meant a month ago and only care about what it means now?
Hey @codesome thanks for all the comments!
Yes, true! I wanted some feedback on the general direction before starting, but I'll go ahead and add some tests to increase our level of confidence.
When I initially discussed this with @gotjosh, the idea was to split up the work into smaller PRs for easier review; that is
The work here is only concerned with the first part, so that's why Metadata records are just skipped over when reading. I see your point about adding duplicate records, but do you think this is an acceptable tradeoff, since this is behind a feature flag and wouldn't directly impact everyone? Or do you feel we should prioritize reading them into a …?

@tpaschalis Here the name and values would be …

Reading of the WAL on startup can be done in a separate PR as long as we are clear that it is easily readable into memory.

Regarding the checkpointing logic, we need to brainstorm more ideas to get rid of duplicates. It is possible that we need to modify the storage format for metadata depending on how checkpointing is to be done (for example, a timestamp with the metadata), so it should be done in this PR (or merge this PR and change the format before the next release, which is also acceptable, but I would not do that because the next release is planned near the end of June and I will be unable to review for the reasons stated later). I will think of ways in parallel, but note that I am on PTO from tomorrow until June 17ish (worst case, I am back on June 20).

While this is behind an experimental flag, I prefer not to make breaking changes to the formats when possible, because they are painful to handle and leave behind tech debt.
cc @bwplotka: as the co-maintainer of TSDB and until Ganesh is back from PTO, do you have any feedback on the proposed changes in the WAL format, the TSDB code, or the recent benchmarks?
cstyan
left a comment
Just some minor comments on top of what we talked about during our call today.
Hey team :) The newly proposed storage format is implemented in 208a4bb and seems to be working out. The encoding/decoding logic is simpler, and we're able to correctly skip over unknown fields without assuming anything about their size or being concerned about their ordering. Let me know what you think!
    const (
        UnknownMT      MetricType = 0
        Counter        MetricType = 1
        Gauge          MetricType = 2
        Histogram      MetricType = 3
        GaugeHistogram MetricType = 4
        Summary        MetricType = 5
        Info           MetricType = 6
        Stateset       MetricType = 7
    )
LGTM assuming these are the current metric types and how/where we want to encode them for source of truth. @codesome ?
Could you provide some more context on this please? I'm not sure what you meant 😕
I just happen to know that these are the types present in client_golang. (Edit: also the ones recognized by the scrape loop.)
FWIW, after a recent discussion with @cstyan, I've run a benchmark to verify the growth in WAL size and replay duration by quickly patching in support for reading back the metadata records in tpaschalis@abba445. I hope this is helpful for understanding the worst case when someone opts in to the feature.

The benchmark is more on the heavyweight, worst-case-ceiling side; it simulates adding 50k records like the following in the scrape loop's appender …

What I observed was …
codesome
left a comment
Nice work! I believe tests for appending metadata and verifying it in the WAL and its replay will follow this PR, so I won't block this PR on that. There are only a couple of things that I would like to see done in this PR: fixing the append semantics, and optimising memory usage for the checkpointing.
I have only reviewed the TSDB-related stuff (tsdb/*, storage/*) and haven't reviewed the scraping or any other part.
    repl := 0
    for _, m := range metadata {
        if keep(m.Ref) {
            latestMetadataMap[m.Ref] = m
I am a little unsure about doing it this way. Since there is a few % of extra memory usage in general with metadata, holding this map will increase the peak memory required a bit more.
What I would suggest is having another keep method, let's call it keepMetadata(ref, metadata). Just like keep() checks if the series exists, keepMetadata(..) will check if the series exists, and if it does, whether the metadata matches the one present in memory. That way we don't have to keep this map around.
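A sketch of the suggested keepMetadata filter, with a stand-in head type holding the latest in-memory metadata per series (the names here are illustrative, not the actual Prometheus API):

```go
package main

import "fmt"

// metadata is a stand-in for the metadata.Metadata struct.
type metadata struct{ Type, Unit, Help string }

// head is a stand-in for the in-memory head block.
type head struct {
	series map[uint64]metadata // latest metadata per live series
}

// keepMetadata mirrors the suggested checkpoint filter: keep a metadata
// record only if the series is still live AND the record matches the latest
// metadata held in memory. No separate dedup map is needed.
func (h *head) keepMetadata(ref uint64, m metadata) bool {
	cur, ok := h.series[ref]
	return ok && cur == m
}

func main() {
	h := &head{series: map[uint64]metadata{7: {Type: "counter", Help: "current"}}}
	fmt.Println(h.keepMetadata(7, metadata{Type: "counter", Help: "current"})) // true: latest
	fmt.Println(h.keepMetadata(7, metadata{Type: "counter", Help: "stale"}))   // false: outdated
	fmt.Println(h.keepMetadata(9, metadata{}))                                 // false: series gone
}
```

The tradeoff versus the map-based approach: this avoids the peak-memory cost of latestMetadataMap, at the price of one lookup against the head per record.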
tsdb/head.go
Outdated
    ref  chunks.HeadSeriesRef
    lset labels.Labels
    meta metadata.Metadata
A note for the future (no action required right now): does this have any noticeable memory impact when the feature is disabled? Should it be a pointer instead?
By my quick count this struct was previously 22 8-byte words, and adding meta would take it to 28 (+ 3 strings).
For 10 million series this adds 60MB, typically 120MB in memory usage before GC.
(Not counting the contents of the strings, which I hope are shared across all series with the same values)
which I hope are shared across all series with the same values
@tpaschalis is this the case? The only things that I see can be shared are the type and unit.
Sorry, I went from your question "when feature is disabled" to the values, which obviously only exist when the feature is enabled. However, it's valid to ask about the memory impact in that case too.
No, the UNIT and HELP strings are not currently shared between series.
Should we look into interning them? We have a PR for being able to share the interner used by remote_write with the scrape loop, but admittedly it has made little progress.
(of course, any comments to the other PR or ideas here are more than welcome 😄 )
Just to clarify: where the same metric appears many times in a single scrape (example), the help string will only be parsed once; are you sure that value isn't shared?
Interning will help further, sharing across different scrapes, but has some concurrency implications.
Oh I see what you mean. We aren't changing anything about the way the HELP or UNIT strings are parsed; they are indeed shared between all series with the same metric name, but not across scrapes yet.
    if hasNewMetadata {
        a.metadata = append(a.metadata, record.RefMetadata{
            Ref:  s.ref,
            Type: record.GetMetricType(meta.Type),
Feels weird to me that we do a text->number conversion here inside tsdb.
Why not do it at scrape time where we do all the other text parsing?
Thanks for bringing this up. This is relevant to your next comment as well.
The enum-like MetricType was added as an encoding/decoding-specific optimisation after this suggestion, and it didn't occur to me to make this conversion in the scrape loop.
One drawback I see is that textparse.MetricType is shared between many structs, so there would be a bigger change to review around the scrape loop.
For example, the current MetricType implementation of how metadata arrives at remote_write would either need an extra conversion, or it would need to propagate the enum type correctly, resulting in even more changes.
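For illustration, doing the text-to-enum conversion at scrape time could look roughly like this. The constants mirror the enum quoted earlier; the function name and the exact string values are assumptions, not the PR's actual code:

```go
package main

import "fmt"

// MetricType mirrors the enum from the record package shown above.
type MetricType uint8

const (
	UnknownMT MetricType = iota
	Counter
	Gauge
	Histogram
	GaugeHistogram
	Summary
	Info
	Stateset
)

// metricTypeFromText maps the textparse string form to the WAL enum once,
// at scrape time, so tsdb never has to do text parsing on the hot path.
var metricTypeFromText = map[string]MetricType{
	"counter":        Counter,
	"gauge":          Gauge,
	"histogram":      Histogram,
	"gaugehistogram": GaugeHistogram,
	"summary":        Summary,
	"info":           Info,
	"stateset":       Stateset,
}

// getMetricType falls back to UnknownMT for anything it doesn't recognize,
// so unknown inputs never fail the append.
func getMetricType(t string) MetricType {
	if mt, ok := metricTypeFromText[t]; ok {
		return mt
	}
	return UnknownMT
}

func main() {
	fmt.Println(getMetricType("counter"), getMetricType("bogus")) // 1 0
}
```

With this, the tsdb side would only ever see the numeric form, and the string form stays confined to the scrape and remote_write layers.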
    // Metadata stores a series' metadata information.
    type Metadata struct {
        Type textparse.MetricType
Why is this string-like rather than enum-like?
I can see it was a string previously, but now you are introducing an enum it seems unnatural to have to convert it back to a string to populate this field.
Tied to the previous comment; mainly for backwards compatibility and not having to touch more files in an already-big PR.
I can produce a commit showing how using the enum here would look and get back to you for a second round.
When we're setting metadata entries in the scrapeCache, we're using the p.Help(), p.Unit(), p.Type() helpers, which retrieve the series name and use it as the cache key. When checking for cache entries, though, we used p.Series() as the key, which included the metric name _with_ its labels. That meant that we were never actually hitting the cache. We're fixing this by utilizing the __name__ internal label for correctly getting the cache entries after they've been set by setHelp(), setType() or setUnit(). Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
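The fix described in that commit message boils down to keying the lookup on the __name__ label rather than the full series string. A toy sketch with a heavily simplified cache type (the real scrapeCache stores full metadata entries and uses labels.Labels):

```go
package main

import "fmt"

// labels is a simplified stand-in for labels.Labels.
type labels map[string]string

// scrapeCache is heavily simplified: the key point is that metadata entries
// are keyed by metric name, which is how setHelp()/setType()/setUnit() store them.
type scrapeCache struct {
	metadata map[string]string // metric name -> help text
}

// getMetadata looks the entry up by the __name__ label rather than the full
// series string (name plus labels), so lookups hit what the setters stored.
func (c *scrapeCache) getMetadata(lset labels) (string, bool) {
	help, ok := c.metadata[lset["__name__"]]
	return help, ok
}

func main() {
	c := &scrapeCache{metadata: map[string]string{"http_requests_total": "Total requests."}}
	lset := labels{"__name__": "http_requests_total", "job": "api"}
	help, ok := c.getMetadata(lset)
	fmt.Println(help, ok)
}
```

Keying on the full series string would miss for every labeled series, which is exactly the cache-never-hit bug the commit fixes.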
Just saw this was merged, congrats and great work.
Signed-off-by: Paschalis Tsilias paschalist0@gmail.com
Changes
We've recently merged support for Metadata in the TSDB on d1122e0.
This PR uses a feature flag for enabling the scrape loop to detect changes in metadata in both new and old(cached) series, and pass them along to the MetadataUpdater interface. The metadata we're passing along will be stored in the headAppender and encoded in the WAL.
This PR updates Prometheus to store metadata in its Write-Ahead Log. To achieve this, a new MetadataAppender interface is introduced and embedded in Appender. Its method implementation is used to store a RefMetadata slice as a headAppender field, which is then encoded in the WAL according to the proposed format. The scrape loop is also modified to detect changes in metadata in both new and old (cached) series and pass them along.
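A rough sketch of the shape this takes. The types below are simplified stand-ins; the actual interface and record definitions live in storage/ and tsdb/record and may differ in detail:

```go
package main

import "fmt"

// Metadata mirrors the metadata.Metadata struct described above (simplified).
type Metadata struct {
	Type string
	Unit string
	Help string
}

// MetadataUpdater is a sketch of the interface the scrape loop calls into;
// the real method signature may differ.
type MetadataUpdater interface {
	UpdateMetadata(ref uint64, m Metadata) (uint64, error)
}

// refMetadata pairs a series ref with its metadata, like record.RefMetadata.
type refMetadata struct {
	Ref uint64
	Metadata
}

// headAppender buffers metadata updates alongside samples until Commit,
// when they would be encoded as a metadata record in the WAL.
type headAppender struct {
	metadata []refMetadata
}

func (a *headAppender) UpdateMetadata(ref uint64, m Metadata) (uint64, error) {
	a.metadata = append(a.metadata, refMetadata{Ref: ref, Metadata: m})
	return ref, nil
}

func main() {
	a := &headAppender{}
	var app MetadataUpdater = a
	app.UpdateMetadata(1, Metadata{Type: "counter", Help: "Total requests."})
	fmt.Println(len(a.metadata)) // 1
}
```

The scrape loop only calls UpdateMetadata when its cache detects a change, so unchanged metadata never reaches the WAL.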
Prior Art
This is an attempt at a revival of #7771 and other discussions around storing and utilizing Metadata throughout Prometheus.
Unlike the reference work, this only deals with appending metadata to the WAL, and aims to start a bigger discussion around how we can make Metadata more useful throughout Prometheus.
As such, reading from the WAL and sending it over to remote_write is not in the scope of this PR and will be handled in a future one. Required changes to the agent mode are also not currently present.
Older Mailing list discussion: https://groups.google.com/g/prometheus-developers/c/w-eotGenBGg/m/UZWdxzcBBwAJ
Older Design doc: https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
Notes
After I get some feedback on the overall idea, I will extend the test suite to cover more cases and increase overall confidence in how the feature works and interacts with different pieces of Prometheus.

We've merged the TSDB code parts, with tests covering the encoding/decoding functionality and checkpointing, verifying the format of the data on disk as well as in memory during writing and after a replay.
I'll be adding more tests to gain more confidence in the way that the scrape loop handles the updating of metadata.