[RW2.0] Enable fast metadata lookup for Remote Write 2.0 #17436
pipiland2612 wants to merge 18 commits into prometheus:main
Conversation
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Do you mind adding some TL;DR on high-level choices?
bwplotka left a comment:
Great progress! Let's implement one of the options and see the overhead of supporting help (or type and unit as well, without the type-and-unit feature flag). Once you implement this, we can run prombench to test things out "easily".
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Thanks for pointing out the options; please check the latest two commits.
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
storage/remote/metadata_watcher.go
```go
// lookupMetadata performs the actual metadata lookup from active targets.
func (mw *MetadataWatcher) lookupMetadata(metricFamily string) *scrape.MetricMetadata {
	for _, tset := range mw.manager.TargetsActive() {
```
The loops make me a bit nervous about how well this would perform in a large Prometheus installation with a lot of targets.
Another option we could explore in a different PR behind feature flag:
- keep metadata in TSDB head memory per series, so that we have a map from storage.SeriesRef -> metadata
- in the queue_manager, when we receive a sample, we also get the series reference, so we could do a single lookup in TSDB to get the metadata right then
Questions:
- how much overhead would this be?
- is there some concurrency issue? Can it happen that the reference is wrong somehow by the time the queue manager gets it (reused somehow)?
(This solution could naturally evolve into persistent metadata, but that's another story @jesusvazquez )
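The per-series idea above can be sketched as a map from series reference to metadata. This is a simplified stand-in, not the actual Prometheus code: `SeriesRef` and `Metadata` here are hypothetical substitutes for `storage.SeriesRef` and the scrape metadata type, and `metadataCache` is an invented name.

```go
package main

import (
	"fmt"
	"sync"
)

// SeriesRef and Metadata are simplified stand-ins for storage.SeriesRef
// and the scrape metadata payload discussed in this thread.
type SeriesRef uint64

type Metadata struct {
	Type, Unit, Help string
}

// metadataCache is a hypothetical per-series cache: the scrape side writes
// entries as it appends samples, the queue manager reads them by series ref.
type metadataCache struct {
	mtx sync.RWMutex
	m   map[SeriesRef]Metadata
}

func newMetadataCache() *metadataCache {
	return &metadataCache{m: make(map[SeriesRef]Metadata)}
}

func (c *metadataCache) Set(ref SeriesRef, md Metadata) {
	c.mtx.Lock()
	c.m[ref] = md
	c.mtx.Unlock()
}

// Get is an O(1) lookup, replacing the O(targets x families) scan above.
func (c *metadataCache) Get(ref SeriesRef) (Metadata, bool) {
	c.mtx.RLock()
	md, ok := c.m[ref]
	c.mtx.RUnlock()
	return md, ok
}

func main() {
	c := newMetadataCache()
	c.Set(42, Metadata{Type: "counter", Unit: "seconds", Help: "Total seconds."})
	if md, ok := c.Get(42); ok {
		fmt.Println(md.Type, md.Unit)
	}
}
```

The RWMutex answers the concurrency question only partially: it makes the map safe, but not the question of whether a reused series ref can point at stale metadata, which would need a test against the real head.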
Hi @krajorama, thanks for the suggestion. Indeed, looping through every target is very expensive. I will try using a ref map to see if this approach is viable, and maybe also write a concurrency test for it.
Hi @bwplotka @krajorama, I have implemented the map-ref lookup at 7a2e2a5. PTAL, do you think this is a good approach? If so, I think I can delete the ScrapeCache approach and its unit test, and update the PR description. Also, the lint is failing because I added another if/else, but I will fix it once the final approach is decided.
Nice! First of all, can we add a benchmark or use an existing one?
We could use the build-write-request benchmark, but we would need to add a subcase for a large cache source. If we use this method for the existing direct mapping, maybe the special case is less relevant.
But can we?
We double-checked, and this metadata entry in a head series[ref].memSeries is only set when metadata-wal-records is enabled (((: it's only called then: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L1038
And that's interesting, because we already have a persistence path for metadata with this flag, so we are kind of going in circles. However, there might be a way toward a hybrid model, where we:
- avoid metadata-wal-records (avoid the WAL part)
- sort of move the metadata cache from the scrape cache to the series' memSeries metadata field (behind some feature flag)?
I bet we cannot "move" it easily, because we need to build an extra mapping from metric family name -> seriesRef/meta for the old functionality (PRW1 and the metadata APIs), but maybe we can do this additionally for the RW2 lookups?
Eventually, series.getByID like the above (we need to somehow add this metadata WITHOUT the wal-records flag) essentially reuses the map and potentially the metadata objects (if they're the same objects as in the scrape cache?), so it might be worth it!
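The "extra mapping" mentioned above can be pictured as a family-name index layered on top of the ref-keyed store, so the old family-keyed lookups (PRW1, metadata APIs) can still be served. Everything here is a hypothetical sketch: `familyIndex`, its fields, and the string-valued metadata are invented stand-ins, not the real head structures.

```go
package main

import "fmt"

// familyIndex is a hypothetical stand-in: a ref-keyed metadata store plus
// the extra family-name index that the old family-keyed APIs would need.
type familyIndex struct {
	byRef    map[uint64]string   // seriesRef -> metadata (help text here)
	byFamily map[string][]uint64 // metric family name -> series refs
}

func newFamilyIndex() *familyIndex {
	return &familyIndex{byRef: map[uint64]string{}, byFamily: map[string][]uint64{}}
}

// Add records metadata under the series ref and indexes the ref by family.
func (i *familyIndex) Add(family string, ref uint64, help string) {
	i.byRef[ref] = help
	i.byFamily[family] = append(i.byFamily[family], ref)
}

// LookupByFamily serves a family-keyed query from the ref-keyed store by
// resolving the family to its refs first (here: the first ref wins).
func (i *familyIndex) LookupByFamily(family string) (string, bool) {
	refs := i.byFamily[family]
	if len(refs) == 0 {
		return "", false
	}
	return i.byRef[refs[0]], true
}

func main() {
	idx := newFamilyIndex()
	idx.Add("http_requests_total", 1, "Total HTTP requests.")
	help, ok := idx.LookupByFamily("http_requests_total")
	fmt.Println(ok, help)
}
```

The cost of this second index is the reason "moving" the cache is not free: the RW2 path only needs the ref-keyed half, but the family index has to be maintained anyway for the older consumers.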
So yeah, the biggest question is how we inject those metadata entries.
Right. So the simplest solution would be to also call Appender.UpdateMetadata if the new feature flag is enabled (we should decide on the name), but not log to the WAL without the metadata-in-WAL feature flag. This basically gives us a series-level cache for metadata, on top of the scrape cache for now. Note that the OTel collector also depends on the metadata cache at the moment, but that can also be fixed.
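The proposed split can be sketched roughly as follows: always refresh the in-memory per-series metadata, and gate only the WAL record on the metadata-wal-records flag. The types and field names below are simplified stand-ins for the head appender, not the actual implementation.

```go
package main

import "fmt"

// seriesMeta and head are hypothetical stand-ins for the real head
// appender pieces discussed above.
type seriesMeta struct{ Type, Help string }

type head struct {
	appendMetadataToWAL bool                  // metadata-wal-records feature flag
	seriesCache         map[uint64]seriesMeta // series-level cache: feeds RW2 lookups
	wal                 []seriesMeta          // stand-in for WAL metadata records
}

// updateMetadata always refreshes the per-series cache, and only appends
// a WAL record when the metadata-wal-records flag is on.
func (h *head) updateMetadata(ref uint64, m seriesMeta) {
	h.seriesCache[ref] = m
	if h.appendMetadataToWAL {
		h.wal = append(h.wal, m) // persistence path stays opt-in
	}
}

func main() {
	h := &head{seriesCache: map[uint64]seriesMeta{}}
	h.updateMetadata(7, seriesMeta{Type: "counter", Help: "demo"})
	fmt.Println(len(h.seriesCache), len(h.wal))
}
```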
Yes, I like this. In our future appender interface we will allow appending metadata with Append, so this will work too.
- note that OTEL collector also depends on metadata cache at the moment. But that can also be fixed.
Yes, same as PRW1 and the metadata APIs, so we can think later about how to get those to use the series-level cache 👍🏽
It's so funny: it seems we can get more functionality (per-series mapping) at near (or exactly) zero cost by default 🙈, just using what we already have ;P
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed fcfc97d to 7a2e2a5
Hi @krajorama, thanks for your suggestion. I have addressed it at 5156481. Do you think this approach is good? If so, do you think we should now remove the ScrapeCache work?
Force-pushed 57d1be3 to 72debc2
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed 72debc2 to ccca244
Signed-off-by: pipiland2612 <nguyen.t.dang.minh@gmail.com>
Force-pushed fe9b2a7 to 80690b6
Hi @bwplotka, @krajorama, I have implemented the new feature. PTAL! As for the PR description and changelog, I decided to update them later, once there is consensus on the name for this new feature.
```diff
-		if len(b.metadata) > 0 {
+		// Only log metadata to WAL if metadata-wal-records feature is enabled.
+		// Metadata is always committed to memory (see commitMetadata), but WAL logging is optional.
+		if len(b.metadata) > 0 && a.head.opts.AppendMetadata {
```
Should we do this earlier? Do we need to build up and use memory for the metadata batch in the first place?
So it feels like it will work; the question is whether it's efficient and clean. Should we have a totally separate path for metadata-wal-records vs. the metadata cache update, like a different field?
Anyway, we can benchmark now and iterate.
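The "do this earlier" idea could look roughly like this: gate batch accumulation itself on the flag, rather than filtering at commit time, so a disabled flag never allocates the batch. Again a simplified stand-in sketch; `batch`, `metaRecord`, and `walEnabled` are invented names, not the appender's actual fields.

```go
package main

import "fmt"

// metaRecord stands in for a metadata WAL record; batch for the
// appender's per-commit metadata buffer discussed above.
type metaRecord struct{ Help string }

type batch struct {
	walEnabled bool // stand-in for the metadata-wal-records flag
	metadata   []metaRecord
}

// add only buffers the record if the WAL path will actually consume it,
// so a disabled flag costs no batch memory at all.
func (b *batch) add(m metaRecord) {
	if !b.walEnabled {
		return
	}
	b.metadata = append(b.metadata, m)
}

func main() {
	off := &batch{walEnabled: false}
	on := &batch{walEnabled: true}
	off.add(metaRecord{Help: "x"})
	on.add(metaRecord{Help: "x"})
	fmt.Println(len(off.metadata), len(on.metadata))
}
```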
```go
		c.web.AppendMetadata = true
		c.tsdb.AppendMetadata = true
		logger.Info("Experimental metadata records in WAL enabled")
	case "metadata-series-cache":
```
So to benchmark this, we need to enable this config on this branch.
We can either do a hack where we temporarily enable it in this PR (default "metadata-series-cache" to on), or adjust the benchmark as per the custom benchmark flow:
- https://github.com/prometheus/test-infra/blob/master/prombench/README.md#triggering-tests-via-github-comment
- https://github.com/prometheus/test-infra/tree/master/prombench/manifests/prombench#prombench-benchmark-scenario-configuration (with https://github.com/prometheus/test-infra/blob/master/prombench/manifests/prombench/benchmark/3_prometheus-test-pr_deployment.yaml changes)
Thank you for the suggestion. I will address this ASAP.
This is looking promising. While I wish we could make the flow simpler and faster, we can iterate and improve later behind the new flag. Do you mind addressing #17436 (comment) @pipiland2612? Then I can run prombench to see things in action (:
Hi @bwplotka, I have turned the temporary feature on. PTAL! Thanks
/prombench main
⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️
Compared versions:
After the successful deployment (check status here), the benchmarking results can be viewed at:
Available Commands:
@bwplotka, I have checked out the benchmark. It seems like nothing has changed?
Can you elaborate on "nothing has changed"? What do you mean? Have you checked memory and CPU? I saw a ~5% increase or so in the initial 1h, but it might have stabilized since; please check, ideally with screenshots (:
I've finally got back around to this PR (Tuesday is my OSS day). I think there's one problem 😞: if we look at Appender Commit, it first writes to the WAL and then writes to memory. That means that on the first scrape for a series, remote write might get the sample before the metadata is written into the memSeries, in which case we'll send the sample without metadata. With metadata in the WAL this would not happen, as we log metadata first. I'm not sure what the solution is. Maybe combine the two solutions and use metadata in the WAL when the series is new / metadata changes?
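The hybrid idea in the comment above could be sketched as a predicate deciding when a metadata WAL record is worth writing. This is an illustrative sketch only; `shouldLogToWAL`, `meta`, and the parameters are hypothetical names, not code from this PR.

```go
package main

import "fmt"

type meta struct{ Type, Help string }

// shouldLogToWAL sketches the hybrid proposal: write a metadata WAL record
// only when the series is new or its metadata changed, so the WAL entry
// lands before the first sample can race ahead of the in-memory write,
// while unchanged series keep relying on the series-level cache.
func shouldLogToWAL(prev *meta, next meta, newSeries bool) bool {
	return newSeries || prev == nil || *prev != next
}

func main() {
	m := meta{Type: "counter", Help: "x"}
	fmt.Println(shouldLogToWAL(nil, m, true))  // new series: log
	fmt.Println(shouldLogToWAL(&m, m, false))  // unchanged: skip
	changed := meta{Type: "counter", Help: "y"}
	fmt.Println(shouldLogToWAL(&m, changed, false)) // metadata changed: log
}
```

This keeps the WAL volume close to the metadata-wal-records path only on series churn, which is when the race described above can actually bite.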
Good point, although I wonder if we will practically hit that race. But indeed it's something we need to rethink. I think solution (A) is to have a separate flow for this feature that makes sense irrespective of the UpdateMetadata flow. I don't understand why we cannot always update a memSeries that already exists in the head (prometheus/tsdb/head_append.go line 528 in f50ff0a). In fact, we could merge as-is and ITERATE with different implementations of this flag. Also, (A) might be easier with the interface change I'm working on right now (it will be easier with a unified method). Looking at the numbers here, I see a slight increase in CPU (~6%) and memory (~10%). This is likely due to the aggregation of metadata records we don't really need.
Also, something does not work @pipiland2612: the receiver does not have metadata. I'd suggest spinning up a quick test with the same containers, using some bash or an e2e framework similar to https://github.com/thanos-io/thanos/blob/main/test/e2e/query_test.go, to investigate this and play with it locally.
Oh, and the extra memory is definitely coming from prometheus/tsdb/head_append.go line 1690 in e43f1ba. I think there's room to eventually use memSeries for the whole metadata flow (including APIs). I'll play around with this, but we shouldn't block this feature flag (once it works e2e); we can iterate IMO.
/prombench cancel
Benchmark cancel is in progress.




Which issue(s) does the PR fix:
High level architecture:
Does this PR introduce a user-facing change?