Allow metric metadata to be propagated via Remote Write.#6815

Merged
codesome merged 12 commits into prometheus:master from gotjosh:metadata-remote-write
Nov 19, 2020
Conversation

@gotjosh
Member

@gotjosh gotjosh commented Feb 13, 2020

Following up on #6395, we'd like to enable remote-write implementations with this API. To do so, we need to propagate the scraped metric's metadata via remote write.

This PR takes a stab at that by using an approach similar to the WAL watcher. We observe the scrapeCache at a specified interval and pull the available metadata to send to remote storage.

A high-level diagram of the process looks like:

[Mermaid diagram: high-level overview of the metadata propagation process (mermaid-diagram-20200213152256)]

Documentation that explains the thinking behind this design lives in this design doc. The first version of the design doc contained details about other parts not relevant to Prometheus, so I'd highly advise analysing only the current version.

@tomwilkie tomwilkie requested a review from cstyan February 13, 2020 15:50
@tomwilkie
Member

Thanks @gotjosh! PR looks very promising. One question in the comments above, one suggestion (don't do a code move) and one question here:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs? And how much bandwidth will this use to send? We need to be able to give guidelines here.

@tomwilkie
Member

Also to discuss:

  • config to enable / disable this, and whether this should be enabled by default (I think yes, but only if bandwidth is low enough)
  • config to specify the ratelimit / how often the data is sent.

@brian-brazil
Contributor

Link to design doc for those that missed this:

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

@tomwilkie re the questions of:

How much metadata do we have on average per series in our Prometheus servers at Grafana Labs?

Apologies beforehand as I'm not sure I follow a part of that question. Within Prometheus, metadata is not kept per series but rather per metric name in the scrape cache (and also we keep a scrape cache per target).

That being said, at Grafana Labs we currently have about ~7.67 MB of metadata (a total of ~139k entries) [1] across all targets. However, that number is mildly deceiving, as we only have ~1800 unique entries across those caches [2].

And how much bandwidth will this use to send?

With the current implementation, we keep an in-memory buffer that is a set of metadata entries, appending to the set only if we've not seen that metadata (a combination of help/type/name/unit) before. From there, we only send data on two criteria: a) we've reached the maximum number of entries, or b) we've hit the ticker deadline.

Currently, the maximum number of unique metadata entries to flush (similarly to remote-write samples) is 2500. At an average of 55 bytes per unique entry [1], albeit more realistically between 60 and 90 bytes, we're sending between 150 KB and 225 KB without compression every few minutes, as long as we don't tip over that maximum threshold.

In scenarios where the number of unique metadata entries exceeds the maximum, this is not perfect, as metadata is sent on a FIFO basis. There's a chance some metadata will never be sent, but this can be overcome by removing the limitation or by allowing the user to tune the maximum value.
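A minimal sketch of the dedup-and-flush buffer described above (type and field names are illustrative, not the actual Prometheus implementation; the real code also flushes on a ticker deadline):

```go
package main

// entryKey dedupes metadata on the combination of name/help/type/unit,
// as described above. Names here are hypothetical.
type entryKey struct {
	Metric, Type, Help, Unit string
}

// MetadataBuffer accumulates unique metadata entries and flushes them
// once maxEntries is reached (the ticker-deadline path is omitted here).
type MetadataBuffer struct {
	maxEntries int
	seen       map[entryKey]struct{}
	pending    []entryKey
	flush      func([]entryKey)
}

func NewMetadataBuffer(maxEntries int, flush func([]entryKey)) *MetadataBuffer {
	return &MetadataBuffer{
		maxEntries: maxEntries,
		seen:       make(map[entryKey]struct{}),
		flush:      flush,
	}
}

// Append records the entry only if this exact combination has not been
// seen before, then flushes once the maximum is reached.
func (b *MetadataBuffer) Append(e entryKey) {
	if _, ok := b.seen[e]; ok {
		return
	}
	b.seen[e] = struct{}{}
	b.pending = append(b.pending, e)
	if len(b.pending) >= b.maxEntries {
		b.Flush()
	}
}

// Flush sends whatever is pending and resets the pending list.
func (b *MetadataBuffer) Flush() {
	if len(b.pending) == 0 {
		return
	}
	b.flush(b.pending)
	b.pending = nil
}
```

With the defaults quoted above, a full flush of 2500 entries at 60–90 bytes each works out to roughly 150–225 KB uncompressed, matching the bandwidth estimate.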

Does that make sense? Have I missed something?

[1]
Screenshot 2020-02-13 at 16 54 47

[2]
Screenshot 2020-02-13 at 16 59 10

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

I believe this is the first time this document has ever been shared. We should probably open it up for review (and enable comments) before proceeding with implementation.

@brian-brazil I don't think the document is highly relevant at this point as it started a while ago. It is not exclusively for this work but also the work already done as part of #6395.

Could we have the discussion as part of this PR?

@brian-brazil
Contributor

Could we have the discussion as part of this PR?

This is a hard problem, and per previous discussions I was led to believe we'd have a design doc first. Let's see...

The PR description and doc don't provide any detail on how this works. From a quick look at the code this is re-adding the coupling that we went to some effort to remove between scraping and remote write, plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Given the breadth of these changes I think that a design doc is in order, as these would be non-trivial architectural and semantic changes to both Prometheus in general and the remote write protocol more specifically.

I'm also seeing nothing here that couldn't be done in a sidecar, this smells a lot like the StackDriver use case.

@gotjosh
Member Author

gotjosh commented Feb 13, 2020

plus removing the property that remote write is resilient to restarts, and adding the requirement that a remote write endpoint be stateful.

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts. I don't see why we should make it so within remote write. The goal here is to make it a best-effort approach, similar to what we already do within Prometheus.

@brian-brazil
Contributor

Isn't it slightly different? Even within Prometheus, metadata is not resilient to restarts.

We're talking about remote write here. If we add a feature to remote write, it should maintain the properties of remote write.

I look forward to your design doc, I've been hoping to see the novel ideas for this for a while now.

@gotjosh gotjosh force-pushed the metadata-remote-write branch 3 times, most recently from d781cbb to 92446e7 on February 21, 2020 12:20
Member

@cstyan cstyan left a comment

Looks good as far as I can tell, Friday eyes though. Made a few comments, but let's make sure we talk on Monday if there's anything you want to go over.

@gotjosh gotjosh force-pushed the metadata-remote-write branch 4 times, most recently from defc16a to e01b3b4 on February 24, 2020 11:55
@gotjosh
Member Author

gotjosh commented Feb 24, 2020

For those following the conversation along, I've made a separate design doc that includes the decision and thinking for only this part of the feature. I can't make it editable by default so I'm assuming those who wish to comment can request to do so.

@brian-brazil
Contributor

Can you make it world commentable? That's how we usually do things.

@gotjosh
Member Author

gotjosh commented Feb 24, 2020

I had to create a different copy for that. All the previous links should be up to date.

@gotjosh gotjosh force-pushed the metadata-remote-write branch from e01b3b4 to 7b96132 on February 24, 2020 20:20
@gotjosh
Member Author

gotjosh commented Feb 25, 2020

Just to make sure those following along didn't miss my last message: the design doc is now commentable and open to the world.

@brian-brazil
Contributor

I was just getting to reviewing that. The developers list is a better place to share such things, rather than buried in a PR.

@gotjosh
Member Author

gotjosh commented Feb 25, 2020

Thanks for the tip @brian-brazil, noted!

Member

@csmarchbanks csmarchbanks left a comment

First pass, generally I think it would be nice to have a smaller footprint in packages outside of remote such that when metadata is written to disk all the relevant code will be in one place and easy to change/rip out. Curious what you think of that though.

Related to the above, what would you think of using the HTTP API instead of the internal scrape manager for getting all of the metadata? At first glance that would remove the dependency between remote storage and scrape entirely.
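For context on that suggestion: Prometheus already exposes metadata grouped by metric name via its HTTP API at GET /api/v1/metadata. A minimal sketch of decoding that response shape (the struct names here are illustrative; only the JSON shape follows the documented API):

```go
package main

import "encoding/json"

// metadataResponse mirrors the response envelope of Prometheus's
// GET /api/v1/metadata endpoint: metadata entries grouped by metric name.
type metadataResponse struct {
	Status string                     `json:"status"`
	Data   map[string][]metadataEntry `json:"data"`
}

type metadataEntry struct {
	Type string `json:"type"`
	Help string `json:"help"`
	Unit string `json:"unit"`
}

// parseMetadata decodes a /api/v1/metadata response body into a map of
// metric name to its metadata entries.
func parseMetadata(body []byte) (map[string][]metadataEntry, error) {
	var r metadataResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	return r.Data, nil
}
```

Fetching from this endpoint rather than the internal scrape manager is what would decouple remote storage from the scrape package, at the cost of an HTTP round trip against the server itself.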

@gotjosh gotjosh force-pushed the metadata-remote-write branch from 7b96132 to e9e7170 on February 28, 2020 11:27
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the metadata-remote-write branch from 866842f to 532305d on November 17, 2020 21:53
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Member

@csmarchbanks csmarchbanks left a comment

👍 for me, nice work getting this together again!

@cstyan cstyan force-pushed the metadata-remote-write branch from 52a9891 to af164d8 on November 18, 2020 06:43
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the metadata-remote-write branch from af164d8 to 8eeaf0a on November 18, 2020 19:55
@codesome
Member

I would like to include this in v2.23. I guess only the metric names are yet to be finalised. Looking at it now.

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome
Member

I have fixed the metrics and the doc comments. I will merge it in some time before starting the release process.

@brian-brazil
Contributor

👍

I will merge it in some time before starting the release process.

I think we need to wait for @cstyan here, as we can't reasonably infer that the changes you just made are good with him.

Ganesh Vernekar added 3 commits November 19, 2020 20:18
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@codesome codesome mentioned this pull request Nov 19, 2020
Member

@cstyan cstyan left a comment

LGTM assuming build passes for the latest commit

@codesome
Member

Merging it as tests are passing. Build will take forever.

9 participants