[RW2.0] Enable fast metadata lookup for Remote Write 2.0 #17436
pipiland2612 wants to merge 18 commits into prometheus:main
Conversation
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Do you mind adding some TL;DR on high-level choices?
bwplotka left a comment:
Great progress! Let's implement one of the options and see the overhead of supporting help (or type and unit as well, without the type-and-unit feature flag). Once you implement this, we can run prombench to test things out "easily".
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Thanks for pointing out the options; please check the latest two commits.
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
storage/remote/metadata_watcher.go
```go
// lookupMetadata performs the actual metadata lookup from active targets.
func (mw *MetadataWatcher) lookupMetadata(metricFamily string) *scrape.MetricMetadata {
	for _, tset := range mw.manager.TargetsActive() {
```
The loops make me a bit nervous about how well this would perform in a large Prometheus installation with a lot of targets.
Another option we could explore in a different PR behind feature flag:
- keep metadata in TSDB head memory per series, so that we have a map from storage.SeriesRef -> metadata
- in the queue_manager, when we receive a sample, we also get the series reference, so we could do a single lookup in TSDB to get the metadata right then
Questions:
- how much overhead would this be?
- is there some concurrency issue? Can it happen that the reference is wrong somehow by the time the queue manager gets it (reused somehow)?
(This solution could naturally evolve into persistent metadata, but that's another story @jesusvazquez )
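The per-series idea above can be sketched as a map from series reference to metadata. This is a simplified stand-in, not the actual Prometheus code: `SeriesRef` and `Metadata` here are hypothetical substitutes for `storage.SeriesRef` and the scrape metadata type, and `metadataCache` is an invented name.

```go
package main

import (
	"fmt"
	"sync"
)

// SeriesRef and Metadata are simplified stand-ins for storage.SeriesRef
// and the scrape metadata payload discussed in this thread.
type SeriesRef uint64

type Metadata struct {
	Type, Unit, Help string
}

// metadataCache is a hypothetical per-series cache: the scrape side writes
// entries as it appends samples, the queue manager reads them by series ref.
type metadataCache struct {
	mtx sync.RWMutex
	m   map[SeriesRef]Metadata
}

func newMetadataCache() *metadataCache {
	return &metadataCache{m: make(map[SeriesRef]Metadata)}
}

func (c *metadataCache) Set(ref SeriesRef, md Metadata) {
	c.mtx.Lock()
	c.m[ref] = md
	c.mtx.Unlock()
}

// Get is an O(1) lookup, replacing the O(targets x families) scan above.
func (c *metadataCache) Get(ref SeriesRef) (Metadata, bool) {
	c.mtx.RLock()
	md, ok := c.m[ref]
	c.mtx.RUnlock()
	return md, ok
}

func main() {
	c := newMetadataCache()
	c.Set(42, Metadata{Type: "counter", Unit: "seconds", Help: "Total seconds."})
	if md, ok := c.Get(42); ok {
		fmt.Println(md.Type, md.Unit)
	}
}
```

The RWMutex answers the concurrency question only partially: it makes the map safe, but not the question of whether a reused series ref can point at stale metadata, which would need a test against the real head.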
Hi @krajorama, thanks for the suggestion. Indeed, looping through every target is very expensive. I will try using a ref map to see if this approach is viable, and maybe also write a concurrency test for it.
Hi @bwplotka @krajorama, I have implemented the map-ref lookup at 7a2e2a5. PTAL, do you think this is a good approach? If so, I think I can delete the ScrapeCache approach and its unit test, and update the PR description. Also, the lint is failing because I added another if/else, but I will fix it once the final approach is decided.
Nice! First of all, can we add a benchmark or use an existing one?
We could use the build-write-request benchmark, but we would need to add a subcase for a large cache source. If we use this method for the existing direct mapping, maybe the special case is less relevant.
But can we?
We double-checked, and this metadata entry in a head series[ref].memSeries is only set when metadata-wal-records is enabled (((: it's only called then: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L1038
And that's interesting, because we already have a persistence path for metadata with this flag, so we are kind of going in circles. However, there might be a way toward a hybrid model, where we:
- avoid metadata-wal-records (avoid the WAL part)
- sort of move the metadata cache from the scrape cache to the series' memSeries metadata field (behind some feature flag)?
I bet we cannot "move" it easily, because we need to build an extra mapping from metric family name -> seriesRef/meta for the old functionality (PRW1 and the metadata APIs), but maybe we can do this additionally for the RW2 lookups?
Eventually, series.getByID like the above (we need to somehow add this metadata WITHOUT the wal-records flag) essentially reuses the map and potentially the metadata objects (if they're the same objects as in the scrape cache?), so it might be worth it!
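The "extra mapping" mentioned above can be pictured as a family-name index layered on top of the ref-keyed store, so the old family-keyed lookups (PRW1, metadata APIs) can still be served. Everything here is a hypothetical sketch: `familyIndex`, its fields, and the string-valued metadata are invented stand-ins, not the real head structures.

```go
package main

import "fmt"

// familyIndex is a hypothetical stand-in: a ref-keyed metadata store plus
// the extra family-name index that the old family-keyed APIs would need.
type familyIndex struct {
	byRef    map[uint64]string   // seriesRef -> metadata (help text here)
	byFamily map[string][]uint64 // metric family name -> series refs
}

func newFamilyIndex() *familyIndex {
	return &familyIndex{byRef: map[uint64]string{}, byFamily: map[string][]uint64{}}
}

// Add records metadata under the series ref and indexes the ref by family.
func (i *familyIndex) Add(family string, ref uint64, help string) {
	i.byRef[ref] = help
	i.byFamily[family] = append(i.byFamily[family], ref)
}

// LookupByFamily serves a family-keyed query from the ref-keyed store by
// resolving the family to its refs first (here: the first ref wins).
func (i *familyIndex) LookupByFamily(family string) (string, bool) {
	refs := i.byFamily[family]
	if len(refs) == 0 {
		return "", false
	}
	return i.byRef[refs[0]], true
}

func main() {
	idx := newFamilyIndex()
	idx.Add("http_requests_total", 1, "Total HTTP requests.")
	help, ok := idx.LookupByFamily("http_requests_total")
	fmt.Println(ok, help)
}
```

The cost of this second index is the reason "moving" the cache is not free: the RW2 path only needs the ref-keyed half, but the family index has to be maintained anyway for the older consumers.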
So yeah, the biggest question is how we inject those metadata entries.
Right. So the simplest solution would be to also call Appender.UpdateMetadata if the new feature flag is enabled (we should decide on the name), but not log to the WAL without the metadata-in-WAL feature flag. This basically gives us a series-level cache for metadata, on top of the scrape cache for now. Note that the OTel collector also depends on the metadata cache at the moment, but that can also be fixed.
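The proposed split can be sketched roughly as follows: always refresh the in-memory per-series metadata, and gate only the WAL record on the metadata-wal-records flag. The types and field names below are simplified stand-ins for the head appender, not the actual implementation.

```go
package main

import "fmt"

// seriesMeta and head are hypothetical stand-ins for the real head
// appender pieces discussed above.
type seriesMeta struct{ Type, Help string }

type head struct {
	appendMetadataToWAL bool                  // metadata-wal-records feature flag
	seriesCache         map[uint64]seriesMeta // series-level cache: feeds RW2 lookups
	wal                 []seriesMeta          // stand-in for WAL metadata records
}

// updateMetadata always refreshes the per-series cache, and only appends
// a WAL record when the metadata-wal-records flag is on.
func (h *head) updateMetadata(ref uint64, m seriesMeta) {
	h.seriesCache[ref] = m
	if h.appendMetadataToWAL {
		h.wal = append(h.wal, m) // persistence path stays opt-in
	}
}

func main() {
	h := &head{seriesCache: map[uint64]seriesMeta{}}
	h.updateMetadata(7, seriesMeta{Type: "counter", Help: "demo"})
	fmt.Println(len(h.seriesCache), len(h.wal))
}
```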
Yes, I like this. In our future appender interface we will allow appending metadata with Append, so this will work too.
- note that OTEL collector also depends on metadata cache at the moment. But that can also be fixed.
Yes, same as PRW1 and the metadata APIs, so we can think later about how to get those to use the series-level cache 👍🏽
It's so funny: it seems we can get more functionality (per-series mapping) at near (or exactly) zero cost by default 🙈, just using what we already have ;P
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed fcfc97d to 7a2e2a5
Hi @krajorama, thanks for your suggestion. I have addressed it at 5156481. Do you think this approach is good? If so, do you think we should now remove the ScrapeCache work?
Force-pushed 57d1be3 to 72debc2
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed 72debc2 to ccca244
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed fe9b2a7 to 80690b6
Hi @bwplotka, @krajorama, I have implemented the new feature. PTAL! As for the PR description and changelog, I decided to update them later, once there is consensus on the name for this new feature.
```diff
-		if len(b.metadata) > 0 {
+		// Only log metadata to WAL if metadata-wal-records feature is enabled.
+		// Metadata is always committed to memory (see commitMetadata), but WAL logging is optional.
+		if len(b.metadata) > 0 && a.head.opts.AppendMetadata {
```
Should we do this earlier? Do we need to build up and use memory for the metadata batch in the first place?
So it feels like it will work; the question is whether it's efficient and clean. Should we have a totally separate path for metadata-wal-records vs. the metadata cache update, like a different field?
Anyway, we can benchmark now and iterate.
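The "do this earlier" idea could look roughly like this: gate batch accumulation itself on the flag, rather than filtering at commit time, so a disabled flag never allocates the batch. Again a simplified stand-in sketch; `batch`, `metaRecord`, and `walEnabled` are invented names, not the appender's actual fields.

```go
package main

import "fmt"

// metaRecord stands in for a metadata WAL record; batch for the
// appender's per-commit metadata buffer discussed above.
type metaRecord struct{ Help string }

type batch struct {
	walEnabled bool // stand-in for the metadata-wal-records flag
	metadata   []metaRecord
}

// add only buffers the record if the WAL path will actually consume it,
// so a disabled flag costs no batch memory at all.
func (b *batch) add(m metaRecord) {
	if !b.walEnabled {
		return
	}
	b.metadata = append(b.metadata, m)
}

func main() {
	off := &batch{walEnabled: false}
	on := &batch{walEnabled: true}
	off.add(metaRecord{Help: "x"})
	on.add(metaRecord{Help: "x"})
	fmt.Println(len(off.metadata), len(on.metadata))
}
```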
```go
		c.web.AppendMetadata = true
		c.tsdb.AppendMetadata = true
		logger.Info("Experimental metadata records in WAL enabled")
	case "metadata-series-cache":
```
So to benchmark this, we need to enable this config on this branch.
We can either do a hack where we temporarily enable it in this PR (default "metadata-series-cache" to on), or adjust the benchmark as per the custom benchmark flow:
- https://github.com/prometheus/test-infra/blob/master/prombench/README.md#triggering-tests-via-github-comment
- https://github.com/prometheus/test-infra/tree/master/prombench/manifests/prombench#prombench-benchmark-scenario-configuration (with https://github.com/prometheus/test-infra/blob/master/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml changes)
Thank you for the suggestion. I will address this ASAP.
This is looking promising. While I wish we could make the flow simpler and faster, we can iterate and improve later behind the new flag. Do you mind addressing #17436 (comment) @pipiland2612? Then I can run prombench to see things in action (:
Hi @bwplotka, I have turned the temporary feature on. PTAL! Thanks
/prombench main
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️
Compared versions:
After the successful deployment (check status here), the benchmarking results can be viewed at:
Available Commands:
@bwplotka, I have checked out the benchmark. It seems like nothing has changed?
Can you elaborate on "nothing has changed"? What do you mean? Have you checked memory and CPU? I saw a ~5% increase or so in the initial 1h, but it might have stabilized since; please check, ideally with screenshots (:
I've finally got back around to this PR (Tuesday is my OSS day). I think there's one problem 😞: if we look at Appender Commit, it first writes to the WAL and then writes to memory. That means that on the first scrape for a series, remote write might get the sample before the metadata is written into the memSeries, in which case we'll send the sample without metadata. With metadata in the WAL this would not happen, as we log metadata first. I'm not sure what the solution is. Maybe combine the two solutions and use metadata in the WAL when the series is new / metadata changes?
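The hybrid idea in the comment above could be sketched as a predicate deciding when a metadata WAL record is worth writing. This is an illustrative sketch only; `shouldLogToWAL`, `meta`, and the parameters are hypothetical names, not code from this PR.

```go
package main

import "fmt"

type meta struct{ Type, Help string }

// shouldLogToWAL sketches the hybrid proposal: write a metadata WAL record
// only when the series is new or its metadata changed, so the WAL entry
// lands before the first sample can race ahead of the in-memory write,
// while unchanged series keep relying on the series-level cache.
func shouldLogToWAL(prev *meta, next meta, newSeries bool) bool {
	return newSeries || prev == nil || *prev != next
}

func main() {
	m := meta{Type: "counter", Help: "x"}
	fmt.Println(shouldLogToWAL(nil, m, true))  // new series: log
	fmt.Println(shouldLogToWAL(&m, m, false))  // unchanged: skip
	changed := meta{Type: "counter", Help: "y"}
	fmt.Println(shouldLogToWAL(&m, changed, false)) // metadata changed: log
}
```

This keeps the WAL volume close to the metadata-wal-records path only on series churn, which is when the race described above can actually bite.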
Good point, although I wonder if we will practically hit that race. But indeed it's something we need to rethink. I think solution (A) is to have a separate flow for this feature that makes sense irrespective of the UpdateMetadata flow. I don't understand why we cannot always update a memSeries that already exists in the head (prometheus/tsdb/head_append.go line 528 in f50ff0a). In fact, we could merge as-is and ITERATE with different implementations of this flag. Also, (A) might be easier with the interface change I'm working on right now (it will be easier with a unified method). Looking at the numbers here, I see a slight increase in CPU (~6%) and memory (~10%). This is likely due to the aggregation of metadata records we don't really need.
Also, something does not work @pipiland2612: the receiver does not have metadata. I'd suggest spinning up a quick test with the same containers, using some bash or an e2e framework similar to https://github.com/thanos-io/thanos/blob/main/test/e2e/query_test.go, to investigate this and play with it locally.
Oh, and the extra memory is definitely coming from prometheus/tsdb/head_append.go line 1690 in e43f1ba. I think there's room to eventually use memSeries for the whole metadata flow (including APIs). I'll play around with this, but we shouldn't block this feature flag (once it works e2e); we can iterate IMO.
/prombench cancel
Benchmark cancel is in progress.




Which issue(s) does the PR fix:
High level architecture:
Does this PR introduce a user-facing change?