Add metadata to WAL, expose to WAL reader and send with remote write#7771
robskillington wants to merge 19 commits into prometheus:main
Conversation
|
Is this backwards compatible? Or could that be enabled with a flag? This is a risky one, no matter the effort we put in, and I'd like people to be able to downgrade in case of issues without deleting their WAL. WDYT?
brian-brazil
left a comment
This is heading in the right direction, but looks like it'd have quite a severe performance impact on scraping in terms of both RAM and CPU. This needs a bit of a rethink there to keep things performant.
I'd like people to be able to downgrade in case of issues without deleting their WAL. WDYT?
We don't promise that you can downgrade, only that you can upgrade.
```go
type WALReader interface {
	Read(
		seriesf func([]record.RefSeries),
		metadataf func([]record.RefMetadata),
```
Does checkpointing need updating for the new data?

Ah yes, good catch. Updated (still pending tests though).
I am well aware of that, but if it comes at low cost I'd still like that. People sometimes downgrade for many different reasons. Telling them that the upgrade will not enable them to downgrade will slow the adoption of newer releases.
|
@roidelapluie unfortunately this is not possible: basically, any new record type added to the WAL will cause previous versions to return "Invalid" when they read that entry type. This is due to the decoder (prometheus/tsdb/record/record.go, lines 66 to 75 at 2b75c1b). Ready for another round of review before starting tests.
We could adjust this now to be more relaxed (catching 0 as an error is likely enough in practice), and give it a release.
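To illustrate the constraint being discussed, here is a simplified, hypothetical sketch (not the actual tsdb/record code; names assumed): the first byte of a WAL record selects a known record type, and anything else decodes as unknown, which older readers surface as an "invalid record" error.

```go
package main

import "fmt"

type Type uint8

const (
	Unknown Type = iota
	Series
	Samples
	Tombstones
	Metadata // added by this PR; readers built before it hit the default case
)

// recordType mirrors how a decoder classifies a record by its lead byte.
func recordType(rec []byte) (Type, error) {
	if len(rec) == 0 {
		return Unknown, fmt.Errorf("empty record")
	}
	switch t := Type(rec[0]); t {
	case Series, Samples, Tombstones, Metadata:
		return t, nil
	default:
		return Unknown, fmt.Errorf("invalid record type %d", rec[0])
	}
}

func main() {
	// An old reader (compiled without the Metadata constant) would take
	// the default branch for a Metadata record and report it as invalid.
	if _, err := recordType([]byte{42}); err != nil {
		fmt.Println("old reader:", err)
	}
}
```

This is why simply relaxing the error handling in a prior release would be needed before older versions could tolerate new record types.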
```go
help string
unit string
lastIter             uint64 // Last scrape iteration the entry was observed at.
lastIterObservedType uint64 // Last scrape iteration the type was observed at.
```
Won't these always be the same as lastIter?
Currently metadata is not actually zeroed when there is new metadata per scrape; each field (type/unit/help) is updated individually if and only if it is seen, and each setter bumps the lastIter field to the current iteration. Since we want setting any of these fields to effectively zero out the others when we next inform the TSDB of a metadata update, we need to track which fields were actually updated this scrape and treat the rest as zeroed.
The complexity with actually zeroing the other fields when any of type/unit/help is first set during a scrape is that the lock is only held while setting that single field, so zeroing the others at that point would make their metadata go missing temporarily. For example, you set type and clear unit/help since it's the first update for this iteration; once you release the lock the old unit/help values are gone, even though the very next line of the scrape you parse could be restoring them to exactly what they were before you cleared them.
So for consistency I kept the current behavior for existing metadata consumers ("don't zero the other fields, so metadata only moves forward and is never reset"), but added enough information to know which fields were actually present when we observed a change, and used that to zero the unobserved fields when passing metadata along to the TSDB, including only the fields that were set during the scrape iteration in which the metadata changed.
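The per-field tracking described above can be sketched as follows (a hypothetical simplification with assumed names, not the PR's actual code): each setter records the scrape iteration it was observed at, and the TSDB-facing snapshot includes only fields observed in the given iteration, treating the rest as zeroed.

```go
package main

import (
	"fmt"
	"sync"
)

type metaEntry struct {
	mtx             sync.Mutex
	typ, unit, help string
	// Per-field iteration stamps: which scrape iteration last observed
	// each field. The lock is held only per single-field update, as in
	// the discussion above.
	iterType, iterUnit, iterHelp uint64
}

func (m *metaEntry) setType(t string, iter uint64) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.typ, m.iterType = t, iter
}

func (m *metaEntry) setUnit(u string, iter uint64) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.unit, m.iterUnit = u, iter
}

// snapshot returns only the fields observed during iteration iter;
// fields from older iterations are reported as empty, so the TSDB
// sees them as zeroed without the stored values ever being cleared.
func (m *metaEntry) snapshot(iter uint64) (typ, unit, help string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if m.iterType == iter {
		typ = m.typ
	}
	if m.iterUnit == iter {
		unit = m.unit
	}
	if m.iterHelp == iter {
		help = m.help
	}
	return
}

func main() {
	var m metaEntry
	m.setType("counter", 1)
	m.setUnit("seconds", 1)
	m.setType("counter", 2) // iteration 2 only re-observes the type
	typ, unit, _ := m.snapshot(2)
	fmt.Printf("type=%q unit=%q\n", typ, unit) // unit treated as zeroed
}
```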
```
│ ├─────────────────────┴──────────────────┤ │
│ │ len(unit) <uvarint> │ unit <bytes>     │ │
│ ├─────────────────────┴──────────────────┤ │
│ │ len(help) <uvarint> │ help <bytes>     │ │
```
Should we allow for additional metadata in future?
That sounds good, I can add a versioning scheme to record types if you like. It would add some complexity and doesn't exist today, but if that's the direction we want to move towards I can attempt an implementation.
Otherwise we'd have to change the record type altogether (i.e. MetadataV2), which means that on a version rollback all new MetadataV2 records would be skipped. That would not be ideal; you'd want to partially read the fields you know about and just skip the new fields you don't.
Don't add strong versioning, that just means keeping multiple parsers around. I'd add the number of entries that follow.
After catching up, went with writing the size of the encoded fields upfront, so the decoder can skip any fields it doesn't know about once it has decoded all the fields it does know.
I've added a unit test to cover this case.
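The size-upfront scheme agreed here can be sketched like this (a simplified, hypothetical illustration, not the actual record format): the encoder writes the total byte length of the fields first, so a decoder that understands only some of them can read those and skip the remainder.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putString appends a length-prefixed string, as the record format does.
func putString(b []byte, s string) []byte {
	b = binary.AppendUvarint(b, uint64(len(s)))
	return append(b, s...)
}

// encode writes: len(payload) <uvarint> | typ | unit | help.
// Future versions may append extra fields to the payload.
func encode(typ, unit, help string) []byte {
	var payload []byte
	payload = putString(payload, typ)
	payload = putString(payload, unit)
	payload = putString(payload, help)
	rec := binary.AppendUvarint(nil, uint64(len(payload)))
	return append(rec, payload...)
}

func readString(b []byte) (string, []byte) {
	n, w := binary.Uvarint(b)
	b = b[w:]
	return string(b[:n]), b[n:]
}

// decodeOld models a decoder that only knows the typ and unit fields:
// it reads those, then uses the up-front size to skip anything newer.
func decodeOld(rec []byte) (string, string, []byte) {
	size, w := binary.Uvarint(rec)
	payload, rest := rec[w:w+int(size)], rec[w+int(size):]
	var typ, unit string
	typ, payload = readString(payload)
	unit, payload = readString(payload)
	// payload may still hold fields unknown to this decoder; they are
	// safely ignored because the record boundary is known from size.
	return typ, unit, rest
}

func main() {
	rec := encode("counter", "seconds", "Total requests served.")
	typ, unit, rest := decodeOld(rec)
	fmt.Printf("type=%q unit=%q trailing=%d\n", typ, unit, len(rest))
}
```

The key property is that an old decoder never needs to parse fields it doesn't understand; it only needs the size prefix to find the end of the record.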
```
│ ┌────────────────────────────────────────┐ │
│ │ id <uvarint>                           │ │
│ ├─────────────────────┬──────────────────┤ │
│ │ len(type) <uvarint> │ type <bytes>     │ │
```
type should be an int
Sounds good, I'll update storage.Metadata to use a uint8 as well.
```go
case record.Metadata:
	metadata, err = dec.Metadata(rec, metadata)
	if err != nil {
		return nil, errors.Wrap(err, "decode series")
```
Ah good call, yes will change this to "decode metadata".
cstyan
left a comment
We can't get around having to cache metadata in the queue_manager but we should also get some #'s related to memory usage increase with this change, in addition to the bandwidth changes.
I don't see the inclusion of metadata in write requests. Just want to confirm that's still todo and I'm not missing something?
storage/remote/queue_manager.go (outdated)

```go
}

func internMetadata(meta storage.Metadata) storage.Metadata {
	meta.Type = textparse.MetricType(interner.intern(string(meta.Type)))
```
this doesn't seem correct? should it not be just interner.intern(string(meta.Type))?
```go
meta := false
s.Lock()
if s.typ != m.Type {
```
this block will result in whatever the last seen metadata for a series (based on labelset hash) was being used as the metadata for an entire block, is that right?
It's hidden in queue_manager.go, the last diff.
```go
// seriesEntry holds meta information about an active series.
type seriesEntry struct {
	labels labels.Labels
	meta   storage.Metadata
```
This should be a pointer I think
```go
// Metadata available and metadata changed this scrape
// iteration, be sure to only set the metadata that was
// actually specified this iteration since the other
// fields are treated as zeroed since they were not specified
```
Shouldn't they be cleared then? Why can't we use the metaEntry directly?
```go
if metaUpdated {
	// Metadata has changed this iteration.
	_, err = app.Add(ce.lset, meta, t, v)
```
Why isn't the ref being captured? It might be simple to fall through to the Add case, rather than duplicating it here.
|
Here are the results from testing.

Using the following remote write config:

```yaml
remote_write:
  - url: http://localhost:3030/remote/write
    remote_timeout: 30s
    queue_config:
      capacity: 10000
      max_shards: 10
      min_shards: 3
      max_samples_per_send: 5000
      batch_send_deadline: 1m
      min_backoff: 50ms
      max_backoff: 1s
```
|
Here's the increase with a default config; the difference is relatively less measurable, likely because the previous config sent larger batches (a longer send deadline with a max samples per send of 5,000):

Default config:

```yaml
remote_write:
  - url: http://localhost:3030/remote/write
    remote_timeout: 30s
```
```go
if s.ref == nil {
	ref, err := app.Add(s.labels, ts, float64(s.value))
	ref, err := app.Add(s.labels, storage.Metadata{}, ts, float64(s.value))
```
Why not just make it a pointer and pass nil?
|
I'm still on the fence about a 20-30% network usage increase by default; for large users in public cloud that's going to be a big increase in a network egress bill that's probably already a decent chunk of money. If we can move towards periodic inclusion of all metadata (like Josh's PR) with an option to always include all metadata, we can get closer to merging something.
|
@cstyan there are some attributes that downstream systems always require, for instance TYPE for the Stackdriver/GCP monitoring API:
Also, some systems work differently and cannot consume the metadata periodically like Josh's PR does. I would like the project to make sure it considers all consumers of the API when making decisions of this nature, rather than catering to one or two specific implementations. Accompanying the metadata with the samples themselves is definitely more straightforward for a wider set of consumers, so I don't see a reason to block such an implementation. Providing alternatives seems OK, though it does mean added complexity, with multiple code paths/components to update when things change, but I leave it to more senior contributors to advise on the best path forward for Prometheus.
```
│ ┌────────────────────────────────────────┐ │
│ │ id <uvarint>                           │ │
│ ├────────────────────────────────────────┤ │
│ │ size <uvarint>                         │ │
```
We usually do the length as a fixed 4b first in our formats.
```go
sizeAfterSeriesRef := buf.Len()

// Reserve space to write the size of the metadata fields
```
This seems a bit over-complicated, and not how we approach this elsewhere. Write the metadata to a temporary buffer, gets its size and then work from there.
Are you ok if I make the signature take two buffers, each independent and reused?

i.e. before:

```go
func (e *Encoder) Metadata(series []RefMetadata, b []byte) []byte
```

after:

```go
func (e *Encoder) Metadata(series []RefMetadata, b1, b2 []byte) ([]byte, []byte)
```

It will be kind of strange, however, since callers will only want to use the first result "b1" ("b2" is only for intermediary buffering on subsequent calls to "Metadata(...)").
The other path I can go down is putting the buffers on the Encoder itself, e.g.:

```go
type Encoder struct {
	buffer         []byte
	metadataBuffer []byte
}

func NewEncoder(buffer, metadataBuffer []byte) Encoder {
	return Encoder{buffer: buffer, metadataBuffer: metadataBuffer}
}

func (e *Encoder) Bytes() []byte {
	return e.buffer
}

func (e *Encoder) Series(series []RefSeries) {
	// Existing implementation, but write to e.buffer after clearing it;
	// the caller then gets the bytes with e.Bytes().
}

func (e *Encoder) Metadata(metadata []RefMetadata) {
	// Existing implementation, but write to e.buffer after clearing it,
	// using e.metadataBuffer as a separate scratch buffer. That is
	// necessary because the size of the entire record must be written to
	// e.buffer before its contents, and is only known after encoding to
	// e.metadataBuffer. The caller then gets the bytes with e.Bytes().
}
```
@robskillington I think we're in agreement here that the metadata implementation shouldn't be blocking your use case, but for others I think the 20-30% network increase is a bit much for "always include all metadata" to be the default. That's why I suggested
|
I don't think we should have a default that only works for one particular remote storage, and more generally I'd aim for something that doesn't require any configuration at all. |
|
@brian-brazil regardless of what the default is, without a way for users to rein in network usage with the inclusion of metadata I can't accept this PR. A 20-30% increase for users who are charged for network egress without an option to reduce that in any way is not acceptable. That being said, I have no intention of blocking something that solves @robskillington's use case. |
|
I finally had some time to try to deploy this to a couple of dev clusters. I saw a pretty large variance in remote write bandwidth, one cluster only had a 5% increase, while another was more like Rob's findings of 20%. I am also a bit hesitant to force all users into accepting an extra 20-30% increase in egress cost for little or no benefit in many systems. Configuration would be nice, at least for the HELP text part IMO. Also, could there be a toggle around writing the new record type to the WAL, defaulted to off? That would allow users to downgrade if they haven't turned it on yet. I think the last major WAL change was compression which had a toggle around it. |
Remote write was added as the simplest thing that works, not the most efficient possible solution. I'm wary of complicating things just because some remote storages don't support every possible feature - especially as this is something that the general intention is to make more common across the ecosystem.
What's in the WAL is really unrelated. The WAL file size change is going to be in the noise. |
|
@brian-brazil you're talking about things as if Cortex and M3 are the only remote storage options.
To my knowledge Rob and M3 are the only remote storage implementers who have come forward and said Josh's PR would explicitly not work for them, so saying that periodic inclusion of metadata would only work for one particular remote storage is a bit of a reach.
Complicating things is part of the natural evolution of software whose original design didn't solve valid future use cases. There's also nothing in this discussion that's been said that could mean 'some remote storages don't support every possible feature'. This isn't a question of whether Cortex or M3 can ingest all metadata for samples within a remote write request, but whether users of remote storage systems will accept a 20-30% increase in network traffic that they have no way of opting out of. Chris and I have both stated our reluctance to not allow users to reduce remote write metadata traffic. I've also stated that wouldn't accept a remote write metadata PR that doesn't allow users to do so. If we can't move forward with something that solves both problems I suggest we discuss further with the rest of -team. |
True; on the other hand, no other remote storage has come forward and said that it'd work either. We have, however, had a previous use case where Josh's PR would not have worked. The more general point is that remote write should be generic, and possible for everyone to use correctly out of the box, even if it's not 100% efficient. I don't think it makes sense to have a limited approach implemented in parallel with a full approach, as that's just going to end up as vestigial complication and technical debt, particularly when in due course Cortex will presumably also require the full approach. Thus, with Rob's PR on the table, I don't believe a periodic sending of amorphous metadata similar to Josh's PR should be a configurable option, let alone the default.
Pulling up some old numbers, I figure that network costs for the remote write portion (presuming the worst case, where you pay for both egress and ingress) are at most about a quarter of the overall ingestion and storage costs. Thus the financial impact on overall remote storage costs is going to be nearer 5%, likely a fair bit lower once querying and other overheads (e.g. spare capacity for growth) are considered. With that context, do you consider this acceptable? Failing that, I think we should look at ways to make this PR a bit more judicious in when it sends metadata.
|
We already have ways to limit ingestion and storage costs with relabelling and retention periods, and to reduce remote write network usage, again with relabelling. There are also other remote write users such as https://github.com/grafana/agent, where the cost percentage from remote write network usage is much higher. We need to include some way to allow users to limit the impact of the inclusion of metadata on their network bill. I'm not hung up on how we implement that, so if your issue is with having both Rob's and Josh's approaches available to users, then we can just drop that part of the discussion and move on to something more productive.
All those things will continue to work.
Unless I'm missing something that doesn't make a difference, as it'd be the same remote write request on the wire. You ultimately have to pay for not just the network for the request itself, but also the CPU/RAM/disk for the entire system which is processing and storing it on the other end too.
My point is that you can't look at one specific part of the bill in isolation, but rather at the total cost of the whole remote storage system. In that light does doing thing the naive way look okay?
Yes, that'd be my main issue. |
|
I am going to push on this once more: I don't think it is a good pattern to intentionally release a version of Prometheus that cannot be rolled back without manual intervention and data loss. I understand that it is not required, but I think all of Beorn's comments starting with prometheus-junkyard/tsdb#609 (comment) ring true here as well. Is it significantly harder than I think to provide a flag that controls writing metadata to the WAL? Such a flag would also alleviate the concerns around increasing bandwidth for everyone until we can add some sort of user control like what Callum mentions. For example, I would love the ability to use relabeling (or something else) to modify or remove the help text like I can for abusive labels. |
It's not, however it doesn't feel right either to be controlling remote write behaviour by disabling/enabling features at the TSDB level. We could put out a release that supports both, so you can go back one minor.
Relabelling applies to labelsets, not metadata. If there's abusive help text you'd have to drop all the samples that have that metadata. In that sense we already have that control (which incidentally Josh's PR would not have - but bandwidth there is much smaller). |
|
It's really unproductive to have this lengthy of a discussion about whether to include something (Josh's metadata RW implementation) we as -team had already agreed to merge as part of the last dev summit. I'm all for another solution that accommodates both Rob's use case and other users wishes to be able to limit metadata sent via remote write, but at the moment we have Josh's implementation that could likely be re-implemented on top of Robs changes with the addition of a config option that has the metadata sent either periodically or in addition to the samples that are being sent. Chris and I have already stated our desire to see an implementation that addresses our concerns regarding the 20-30% network usage increase. If you still disagree that a config option of some kind should be included to do something useful here we don't have a path forward, and I'll add this PR to the agenda for the next -team dev summit. |
This is true, but to add context that was as a stop gap. Given that a full solution has come to us (and much sooner than I personally was expecting), I think we should be looking at getting that full solution in rather than adding something that could be removed again in a release or two.
I generally prefer to avoid configuration where we can do without it, to keep things simpler for users, and as I've indicated above there are approaches that don't require any configuration. For example, what if the metadata was not sent when the current sample's timestamp is less than 5m after the last sample with which we sent the metadata?
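The configuration-free throttle floated above could look something like this minimal sketch (all names hypothetical; not part of the PR): attach metadata to an outgoing sample only when at least a fixed interval has elapsed since metadata was last sent for that series.

```go
package main

import (
	"fmt"
	"time"
)

const resendInterval = 5 * time.Minute

// lastMetadataSent tracks, per series ref, the timestamp (in ms) of the
// last sample that carried metadata.
var lastMetadataSent = map[uint64]int64{}

// shouldSendMetadata reports whether the sample at ts (milliseconds)
// for series ref should have its metadata attached. Metadata always
// accompanies the first sample of a series, then at most once per
// resendInterval thereafter.
func shouldSendMetadata(ref uint64, ts int64) bool {
	last, ok := lastMetadataSent[ref]
	if ok && ts-last < resendInterval.Milliseconds() {
		return false
	}
	lastMetadataSent[ref] = ts
	return true
}

func main() {
	base := int64(0)
	fmt.Println(shouldSendMetadata(1, base))          // first sample: send
	fmt.Println(shouldSendMetadata(1, base+60_000))   // 1m later: throttled
	fmt.Println(shouldSendMetadata(1, base+6*60_000)) // 6m later: send again
}
```

Under such a scheme the steady-state overhead would be one metadata copy per series per interval, rather than on every sample, which is the trade-off being debated above.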
@robskillington can you confirm whether or not this implementation would solve your use case? |
|
Do you have an idea as to when this PR will be ready for review? Given that this is taking a while, and that #6815 was considered an acceptable stopgap while waiting for a more complete solution, I propose that we aim to get something in for the 2.23 release, and if this isn't in a largely mergeable state in 2 weeks that we go back to reviewing #6815.
|
@robskillington, are you still planning on moving forward with this PR? Even with #6815 merged, we still have a use case at New Relic: we convert counters to delta values during ingest, and #6815 doesn't really give us what we are looking for because the time series metadata can arrive after the time series data itself. With #6815 available, I wonder if paring this down to just including type information would be sufficient for others' remaining use cases? It seems like doing so would address the concerns around network utilization and the need for extra configuration.
|
@cmacnaug yes was hoping to get to this over the holidays, let's see how that goes. Also we might have someone else take it over if that fails. Ideally this lands in the next 4-8 weeks. @brian-brazil @csmarchbanks + others, does this sound sane? |
|
Yes, gathering metadata from the WAL and including metadata alongside sample data (with some kind of filtering options) is still something we'd like to include. |
|
@robskillington I've read through the discussions, design doc and present code on this, nice work bringing it all together! I'm thinking I might take a stab and use your work here as a base, to try and make this work on a new PR. Would you be alright with this? |
|
@tpaschalis I had not seen your comment, that would be a really wonderful thing - let me know if there's anything I can do to support, I'll look and see if I can offer anything of value in feedback on the PR, great to see the progress 👍 |
|
We have looked at this pull request during our bug scrub. It seems like metadata is appended to the WAL now, but not yet to the remote write. Given the age of the PR and the amount of work to rewrite it and get to a consensus about remote write #10312 we have decided to close this PR. Thank you for your contribution. |


Mailing list discussion:
https://groups.google.com/g/prometheus-developers/c/w-eotGenBGg/m/UZWdxzcBBwAJ
Design doc:
https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
Have tested this end-to-end with a modified remote write backend:
Background:
There are already efforts such as PR #6815 and the corresponding design document, which experiments with syncing the entire set of metadata kept by a Prometheus server for the current active targets in memory every 15 seconds (as mentioned by the doc). It is great that we can unblock experiments like this and foster development, but some would like to see a slightly longer-term solution that is by design not sync based and can guarantee no loss of metadata in the regular case where a remote write consumer tails the Prometheus WAL.
TODO: