Allow metric metadata to be propagated via Remote Write.#6815

Merged
codesome merged 12 commits into prometheus:master from gotjosh:metadata-remote-write
Nov 19, 2020
Conversation

@gotjosh
Member

@gotjosh gotjosh commented Feb 13, 2020

Following up on #6395, we'd like to enable remote-write implementations with this API. To do so, we need to propagate the scraped metric's metadata via remote write.

This PR takes a stab at that by using an approach similar to the WAL watcher. We observe the scrapeCache at a specified interval and pull the available metadata to send to remote storage.

A high-level diagram of the process looks like:

[Mermaid diagram: high-level overview of the metadata propagation process (mermaid-diagram-20200213152256)]

Documentation that explains the thinking behind this design lives in this design doc. The first version of the design doc contained details about other parts not relevant to Prometheus, so I'd highly advise analysing only the current version.

@tomwilkie tomwilkie requested a review from cstyan February 13, 2020 15:50
@tomwilkie
Member

Thanks @gotjosh! PR looks very promising. One question in the comments above, one suggestion (don't do a code move) and one question here:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs? And how much bandwidth will this use to send? We need to be able to give guidelines here.

@tomwilkie
Member

Also to discuss:

  • config to enable / disable this, and whether this should be enabled by default (I think yes, but only if bandwidth is low enough)
  • config to specify the ratelimit / how often the data is sent.

@brian-brazil
Contributor

Link to design doc for those that missed this:

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

@tomwilkie re the questions of:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs?

Apologies beforehand as I'm not sure I follow a part of that question. Within Prometheus, metadata is not kept per series but rather per metric name in the scrape cache (and also we keep a scrape cache per target).

That being said, at Grafana Labs we currently have about ~7.67 MB of metadata (a total of ~139k entries) [1] across all targets. However, that number is mildly deceiving, as we only have ~1800 unique entries across those caches [2].

And how much bandwidth will this use to send?

With the current implementation, we keep an in-memory buffer that is a set of metadata entries, appending to the set only if we've not seen that metadata (a combination of help/type/name/unit) before. From there, we only send data on two criteria: a) we've reached the maximum number of entries, or b) we've hit the ticker deadline.

Currently, the maximum number of unique metadata entries to flush (similarly to remote-write samples) is 2500. At an average of 55 bytes per unique entry [1], albeit more realistically between 60 and 90 bytes, we're sending between 150 KB and 225 KB without compression every few minutes, as long as we don't tip over that maximum threshold.

In scenarios where the number of unique metadata entries exceeds the maximum, this is not perfect, as metadata is sent on a FIFO basis. There's a chance some metadata will never be sent, but this can be overcome by removing the limitation or by allowing the user to tune the maximum value.
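A minimal sketch of the dedup-and-flush buffer described above (type and field names are illustrative, not the actual Prometheus implementation; the real code also flushes on a ticker deadline):

```go
package main

// entryKey dedupes metadata on the combination of name/help/type/unit,
// as described above. Names here are hypothetical.
type entryKey struct {
	Metric, Type, Help, Unit string
}

// MetadataBuffer accumulates unique metadata entries and flushes them
// once maxEntries is reached (the ticker-deadline path is omitted here).
type MetadataBuffer struct {
	maxEntries int
	seen       map[entryKey]struct{}
	pending    []entryKey
	flush      func([]entryKey)
}

func NewMetadataBuffer(maxEntries int, flush func([]entryKey)) *MetadataBuffer {
	return &MetadataBuffer{
		maxEntries: maxEntries,
		seen:       make(map[entryKey]struct{}),
		flush:      flush,
	}
}

// Append records the entry only if this exact combination has not been
// seen before, then flushes once the maximum is reached.
func (b *MetadataBuffer) Append(e entryKey) {
	if _, ok := b.seen[e]; ok {
		return
	}
	b.seen[e] = struct{}{}
	b.pending = append(b.pending, e)
	if len(b.pending) >= b.maxEntries {
		b.Flush()
	}
}

// Flush sends whatever is pending and resets the pending list.
func (b *MetadataBuffer) Flush() {
	if len(b.pending) == 0 {
		return
	}
	b.flush(b.pending)
	b.pending = nil
}
```

With the defaults quoted above, a full flush of 2500 entries at 60–90 bytes each works out to roughly 150–225 KB uncompressed, matching the bandwidth estimate.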

Does that make sense? Have I missed something?

[1]
Screenshot 2020-02-13 at 16 54 47

[2]
Screenshot 2020-02-13 at 16 59 10

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@brian-brazil I don't think the document is highly relevant at this point as it started a while ago. It is not exclusively for this work but also the work already done as part of #6395.

Could we have the discussion as part of this PR?

@brian-brazil
Contributor

Could we have the discussion as part of this PR?

This is a hard problem, and per previous discussions I was led to believe we'd have a design doc first. Let's see...

The PR description and doc don't provide any detail on how this works. From a quick look at the code this is re-adding the coupling that we went to some effort to remove between scraping and remote write, plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Given the breadth of these changes I think that a design doc is in order, as these would be non-trivial architectural and semantic changes to both Prometheus in general and the remote write protocol more specifically.

I'm also seeing nothing here that couldn't be done in a sidecar, this smells a lot like the StackDriver use case.

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts. I don't see why we should make it so within remote write. The goal here is to make it a best-effort approach, similar to what we already do within Prometheus.

@brian-brazil
Contributor

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts.

We're talking about remote write here. If we add a feature to remote write, it should maintain the properties of remote write.

I look forward to your design doc, I've been hoping to see the novel ideas for this for a while now.

@gotjosh gotjosh force-pushed the metadata-remote-write branch 3 times, most recently from d781cbb to 92446e7 on February 21, 2020 12:20
Member

@cstyan cstyan left a comment

Looks good as far as I can tell, Friday eyes though. Made a few comments, but let's make sure we talk on Monday if there's anything you want to go over.

@gotjosh gotjosh force-pushed the metadata-remote-write branch 4 times, most recently from defc16a to e01b3b4 on February 24, 2020 11:55
@gotjosh
Member Author

gotjosh commented Feb 24, 2020

For those following the conversation along, I've made a separate design doc that includes the decision and thinking for only this part of the feature. I can't make it editable by default so I'm assuming those who wish to comment can request to do so.

@brian-brazil
Contributor

Can you make it world commentable? That's how we usually do things.

@gotjosh
Member Author

gotjosh commented Feb 24, 2020

I had to create a different copy for that. All the previous links should be up to date.

@gotjosh gotjosh force-pushed the metadata-remote-write branch from e01b3b4 to 7b96132 on February 24, 2020 20:20
@gotjosh
Member Author

gotjosh commented Feb 25, 2020

Just to make sure those following along didn't miss my last message: the design doc is now commentable and open to the world.

@brian-brazil
Contributor

I was just getting to reviewing that. The developers list is a better place to share such things, rather than buried in a PR.

@gotjosh
Member Author

gotjosh commented Feb 25, 2020

Thanks for the tip @brian-brazil, noted!

Member

@csmarchbanks csmarchbanks left a comment

First pass, generally I think it would be nice to have a smaller footprint in packages outside of remote such that when metadata is written to disk all the relevant code will be in one place and easy to change/rip out. Curious what you think of that though.

Related to the above, what would you think of using the HTTP API instead of the internal scrape manager for getting all of the metadata? At first glance that would remove the dependency between remote storage and scrape entirely.
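For context on that suggestion: Prometheus already exposes metadata grouped by metric name via its HTTP API at GET /api/v1/metadata. A minimal sketch of decoding that response shape (the struct names here are illustrative; only the JSON shape follows the documented API):

```go
package main

import "encoding/json"

// metadataResponse mirrors the response envelope of Prometheus's
// GET /api/v1/metadata endpoint: metadata entries grouped by metric name.
type metadataResponse struct {
	Status string                     `json:"status"`
	Data   map[string][]metadataEntry `json:"data"`
}

type metadataEntry struct {
	Type string `json:"type"`
	Help string `json:"help"`
	Unit string `json:"unit"`
}

// parseMetadata decodes a /api/v1/metadata response body into a map of
// metric name to its metadata entries.
func parseMetadata(body []byte) (map[string][]metadataEntry, error) {
	var r metadataResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	return r.Data, nil
}
```

Fetching from this endpoint rather than the internal scrape manager is what would decouple remote storage from the scrape package, at the cost of an HTTP round trip against the server itself.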

@gotjosh gotjosh force-pushed the metadata-remote-write branch from 7b96132 to e9e7170 on February 28, 2020 11:27
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the metadata-remote-write branch from 866842f to 532305d on November 17, 2020 21:53
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Member

@csmarchbanks csmarchbanks left a comment

👍 for me, nice work getting this together again!

@cstyan cstyan force-pushed the metadata-remote-write branch from 52a9891 to af164d8 on November 18, 2020 06:43
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the metadata-remote-write branch from af164d8 to 8eeaf0a on November 18, 2020 19:55
@codesome
Member

I would like to include this in v2.23. I guess only the metric names are yet to be finalised. Looking at it now.

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome
Member

I have fixed the metrics and the doc comments. I will merge it in some time before starting the release process.

@brian-brazil
Contributor

👍

I will merge it in some time before starting the release process.

I think we need to wait for @cstyan here, as we can't reasonably infer that the changes you just made are good with him.

Ganesh Vernekar added 3 commits November 19, 2020 20:18
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome codesome mentioned this pull request Nov 19, 2020
Member

@cstyan cstyan left a comment

LGTM assuming build passes for the latest commit

@codesome
Member

Merging it as tests are passing. Build will take forever.

9 participants