[RW2.0] Enable fast metadata look up for Remote Write 2.0#17436

Closed
pipiland2612 wants to merge 18 commits into prometheus:main from pipiland2612:feature_parity

Conversation

Contributor

@pipiland2612 pipiland2612 commented Oct 31, 2025

Which issue(s) does the PR fix:

High level architecture:

Does this PR introduce a user-facing change?

[FEATURE]: Enable fast metadata look up for Remote Write 2.0

Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@bwplotka
Member

bwplotka commented Nov 3, 2025

Do you mind adding some TL;DR on high level choices?

Member

@bwplotka bwplotka left a comment


Great progress! Let's implement one of the options and see the overhead of supporting help (or type and unit as well, without the type-and-unit feature flag). Once you implement this, we can run prombench to test things out "easily".

Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@pipiland2612
Contributor Author

Great progress! Let's implement one of the options and see the overhead of supporting help (or type and unit as well, without the type-and-unit feature flag). Once you implement this, we can run prombench to test things out "easily".

Thanks for pointing out the options, please check the latest 2 commits.

Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>

// lookupMetadata performs the actual metadata lookup from active targets.
func (mw *MetadataWatcher) lookupMetadata(metricFamily string) *scrape.MetricMetadata {
	for _, tset := range mw.manager.TargetsActive() {
Member


The loops make me a bit nervous about how well this would perform in a large Prometheus installation that has a lot of targets.

Another option we could explore in a different PR behind feature flag:

  • keep metadata in TSDB head memory per series, so that we have a map from storage.SeriesRef -> metadata
  • in the queue_manager, when we receive a sample, we also get the series reference, so we could do a single lookup in TSDB to get the metadata right then

Questions:

  • how much overhead would this be?
  • is there some concurrency issue? Can it happen that the reference is somehow wrong by the time the queue manager gets it? (reused somehow?)

(This solution could naturally evolve into persistent metadata, but that's another story @jesusvazquez )

Contributor Author


Hi @krajorama, thanks for the suggestion. Indeed, looping through every target is very expensive. I will try to use a ref map and see if this approach is viable, and maybe also write a concurrency test for it.

Contributor Author

@pipiland2612 pipiland2612 Nov 10, 2025


Hi @bwplotka @krajorama, I have implemented the ref-map lookup in 7a2e2a5. PTAL, do you think this is a good approach? If so, I think I can delete the ScrapeCache approach and its unit test, and also update the PR description. The lint is also failing because I added another if/else, but I will fix it once the final approach is decided.

Member


nice! First of all, can we add a benchmark or use an existing one?

We could use the existing build-write-request benchmark, but we need to add a subcase for a large cache source. If we use this method for the existing direct mapping, maybe a special case is less relevant.

But can we?

We double-checked, and this metadata entry in a head series[ref].memSeries is only set when metadata-wal-records is enabled (((: it's only called there: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L1038

And that's interesting, because we have a persistent path for metadata with this flag, so we are kind of going in circles. However, there might be a way for some hybrid model, where we:

  • Avoid metadata-wal-records (avoid the WAL part)
  • sort of move the metadata cache from the scrape cache to the series' memSeries metadata field (behind some feature flag)?

I bet we cannot "move" easily, because we need to build an extra mapping from metric family name -> seriesRef/meta for the old functionalities (PRW1 and the metadata APIs), but maybe we can do this additionally for these RW2 lookups?

Eventually, series.getByID like above (we would need to somehow add this meta WITHOUT the wal-records flag) is essentially reusing the map and potentially the metadata objects (if they are the same objects as in the scrape cache?), so it might be worth it!

Member


So yea the biggest question is how we inject those metadata entries.

Member


Right. So the simplest solution would be to also call Appender.UpdateMetadata if the new feature flag is enabled (we should decide on the name), but not log to the WAL without the metadata-in-WAL feature flag. This basically gives us a series-level cache for metadata, on top of the scrape cache for now - note that the OTEL collector also depends on the metadata cache at the moment. But that can also be fixed.

Member


Yes, I like this

In our future appender interface we will allow appending meta with Append, so this will work too.

Member


  • note that OTEL collector also depends on metadata cache at the moment. But that can also be fixed.

Yes, as are PRW1 and the metadata APIs, so we can think later about how to get those to use the series-level cache too 👍🏽

Member


It's so funny, it seems we can get more functionality (per-series mapping) at near (or exactly) 0 cost, by default 🙈 just using what we have already ;P

Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@krajorama krajorama self-assigned this Nov 11, 2025
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@pipiland2612
Contributor Author

pipiland2612 commented Nov 12, 2025

Hi @krajorama, thanks for your suggestion. I have addressed it in 5156481. Do you think this approach is good? If so, do you think we should now remove the ScrapeCache work?

@pipiland2612 pipiland2612 force-pushed the feature_parity branch 2 times, most recently from 57d1be3 to 72debc2 on November 19, 2025 09:28
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@pipiland2612
Contributor Author

Hi @bwplotka, @krajorama I have implemented the new feature. PTAL! As for the PR description and changelog, I decided to update them later, once there is a consensus on the name for this new feature.

-if len(b.metadata) > 0 {
+// Only log metadata to WAL if metadata-wal-records feature is enabled.
+// Metadata is always committed to memory (see commitMetadata), but WAL logging is optional.
+if len(b.metadata) > 0 && a.head.opts.AppendMetadata {
Member


Should we do this earlier? Do we need to build up and use memory for metadata batch in the first place?

Member


So it feels like it will work; the question is just whether this is efficient and clean. Should we have a totally separate path for metadata-wal-records vs the metadata cache update -- like a different field?

Anyway we can benchmark now and iterate.

c.web.AppendMetadata = true
c.tsdb.AppendMetadata = true
logger.Info("Experimental metadata records in WAL enabled")
case "metadata-series-cache":
Member


So to benchmark this, we need to enable this config on this branch.

We can either do a hack where we temporarily enable this in the PR (default "metadata-series-cache" to on), or we can adjust the benchmark as per the custom benchmark flow.

Contributor Author


Thank you for the suggestion. I will address this ASAP.

@bwplotka
Member

This is looking promising. While I wish we could make the flow simpler and faster, we can iterate and improve later behind the new flag. Do you mind addressing #17436 (comment) @pipiland2612? Then I can run prombench to see things in action (:

Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
@pipiland2612
Contributor Author

pipiland2612 commented Nov 24, 2025

Hi @bwplotka, I have temporarily enabled the feature. PTAL! Thanks

@bwplotka
Member

/prombench main

@prombot
Contributor

prombot commented Nov 25, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-17436 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@pipiland2612
Contributor Author

@bwplotka, I have checked out the benchmark. It seems like nothing has changed?

@bwplotka
Member

Can you elaborate on "nothing has changed", what do you mean? Have you checked memory and CPU? I saw a ~5% increase or so in the initial 1h, but it might have stabilized; please check, ideally with screenshots (:

@krajorama
Member

krajorama commented Nov 25, 2025

This is looking promising. While I wish we could make the flow simpler and faster, we can iterate and improve later behind the new flag. Do you mind addressing #17436 (comment) @pipiland2612? Then I can run prombench to see things in action (:

I've finally got back around to this PR (Tuesday is my OSS day).

I think there's one problem 😞 : if we look at Appender Commit, it first writes to the WAL and then writes to memory. This means that on the first scrape for a series, remote write might get the sample before the metadata is written into the memSeries, in which case we'll send the sample without metadata. For metadata in the WAL this would not happen, as we log metadata first.

I'm not sure what the solution is. Maybe combine the two solutions and use metadata-in-WAL when the series is new / the metadata changes?

@bwplotka
Member

Good point, although I wonder if we will practically hit that race. But indeed it's something we need to rethink.

I think the solution (A) is to have a separate flow for this feature that makes sense, irrespective of the UpdateMetadata flow. I don't understand why we cannot always update a memSeries that already exists in the head (

func (a *headAppender) getOrCreate(lset labels.Labels) (s *memSeries, created bool, err error) {

) with the newest type, unit and help. Do we need to wait for commit? Or even store any silly metadata update records?

In fact, we could merge as is and ITERATE with different implementations of this flag.

Also (A) might be easier with the interface change I'm working on right now (it will be easier with a unified method like Append(ref SeriesRef, ls labels.Labels, meta Metadata, st, t int64, v float64, h *histogram.Histogram, fh *histogram.FloatHistogram, es []exemplar.Exemplar) (SeriesRef, error)).

Looking at the numbers here, I see a slight increase in CPU (~6%) and memory (~10%). This is likely due to the aggregation of metadata records we don't really need.

[screenshots: CPU and memory usage graphs]

@bwplotka
Member

Also, something does not work @pipiland2612 - the receiver does not have metadata.

See logs: http://prombench.prometheus.io/grafana/explore?schemaVersion=1&panes=%7B%22yjf%22:%7B%22datasource%22:%22PEEA25C9CAAE5E714%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PEEA25C9CAAE5E714%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22%7Bcontainer_name%3D%5C%22prom-sink%5C%22,%20namespace%3D%5C%22prombench-17436%5C%22%7D%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-15m%22,%22to%22:%22now%22%7D%7D%7D&orgId=1

I'd suggest spinning up a quick test with the same containers using some bash or an e2e framework similar to https://github.com/thanos-io/thanos/blob/main/test/e2e/query_test.go to investigate this and play with it locally:

 2025-11-25T08:53:38.030545472Z stderr F time=2025-11-25T08:53:38.030Z level=WARN msg="received remote write request with some issues" source=prometheus-test-pr-17436 proto=io.prometheus.write.v2.Request series=2000 samples=2000 histograms=0 exemplars=0 details="series codelab_api_request_duration_seconds_bucket{instance=\"10.52.54.4:8080\", job=\"fake-webservers-49\", le=\"0.33252567300796504\", method=\"GET\", nodeName=\"gke-test-infra-nodes-17436-6b249330-xsqm\", path=\"/api/boom\", status=\"200\"}: no help,no unit,unspecified metric type\n,series codelab_api_request_duration_seconds_bucket{instance=\"10.52.54.4:8080\", job=\"fake-webservers-49\", le=\"0.49878850951194753\", method=\"GET\", nodeName=\"gke-test-infra-nodes-17436-6b249330-xsqm\", path=\"/api/boom\", status=\"200\"}: no help,no unit,unspecified metric type\n,series codelab_api_request_duration_seconds_bucket{instance=\"10.52.54.4:8080\", job=\"fake-webservers-49\", le=\"0.7481827642679213\", method=\"GET\", nodeName=\"gke-test-infra-nodes-17436-6b249330-xsqm\", path=\"/api/boom\", status=\"200\"}: no help,no unit,unspecified metric type\n,series codelab_api_request_duration_seconds_bucket{instance=\"10.52.54.4:8080\", job=\"fake-webservers-49\", le=\"1.122274146401882\", method=\"GET\", nodeName=\"gke-test-infra-nodes-17436-6b249330-xsqm\", path=\"/api/boom\", status=\"200\"}: no help\n" series-without-unit=2000 series-untyped=2000 series-without-help=2000

@bwplotka
Member

bwplotka commented Nov 25, 2025

Oh, and extra mem is definitely coming from

func commitMetadata(b *appendBatch) {

I think there's room to eventually use memSeries for all the metadata flow (including the APIs). I will play around with this, but we shouldn't block this feature flag (once it works e2e); we can iterate IMO.

@bwplotka
Member

bwplotka commented Dec 1, 2025

/prombench cancel

@prombot
Contributor

prombot commented Dec 1, 2025

Benchmark cancel is in progress.

@bwplotka
Member

bwplotka commented Dec 1, 2025

Cancelled prombench to save resources.

Results are not extremely bad; perhaps good enough for an experiment and a starting point. But we might want to try the longer-term solution too (#17619), likely after #17632.

[screenshots: benchmark results]

@krajorama krajorama changed the title from "[RW2.0] Enable fast metadata loop up for Remote Write 2.0" to "[RW2.0] Enable fast metadata look up for Remote Write 2.0" on Jan 6, 2026

Development

Successfully merging this pull request may close these issues.

[RW2] metadata.send feature parity with v1
