prw: Remote Write 2.0 CT per Sample/Histogram#17036
Conversation
5e68754 to
c8bc644
Compare
c8bc644 to
79e3237
Compare
Relates to #16944 (comment) Signed-off-by: bwplotka <bwplotka@gmail.com>
79e3237 to
97413db
Compare
cstyan
left a comment
There was a problem hiding this comment.
This looks reasonable to me, just a few questions in comments here + should we be updating the spec. documentation as well.
| // The canonical Content-Type request header value for this message is | ||
| // "application/x-protobuf;proto=io.prometheus.write.v2.Request" | ||
| // | ||
| // Version: v2.0-rc.4 |
There was a problem hiding this comment.
Why 2.1?
We are 2.0-rc.3 now -> https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/ (I will update spec in separate PR)
| // This field is reserved for backward compatibility with the deprecated fields; | ||
| // previously present in the experimental remote write period. | ||
| reserved 6; |
There was a problem hiding this comment.
should we include that this was previously created timestamp?
There was a problem hiding this comment.
Is there a need? Anyone can use git to check it, otherwise if we can, we potentially want to make the proto as clean as possible before we make it finalized/stable. If we were at 2.0 we wouldn't be able to remove it/move to reserved.
There was a problem hiding this comment.
One could argue, given experimental period, that we don't even need reserved (: But let's be nice for potential indeterministic errors if someone forgets to upgrade some PRW dep
|
I'm worried about two things:
|
|
Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info. |
There is working happening to make this possible #16046 |
|
Thanks for a feedback @krajorama 💪🏽
Practically zero for now.
If in future we will use the buffering (more than one sample per series/request), you might see some impact (good question how much) vs the rest of the payload (labels). One could argue that custom marshal/unmarshal can reduce it if it's becoming a problem too.
Hm... I would challenge this. Mimir clearly maintains its own proto by hand, so authors can adjust it as you need. If you want to handle CT in a different way that Prometheus (not per sample), you can create This breaking change will do cause your current version of Mimir to NOT see CT per series (because it's not going to be send by Prometheus). Do you depend on CT now, handle it in a certain way where this would cause a problem for your users? @krajorama?
We can, we do that with histograms samples. However that would imply we will add This is actually an interesting and tempting idea. Might help with future iterations that might alter samples and its in-memory layouts (even if empty fields). Question is... do you think it's worth changing now in 2.0? It would be doable, although last minute change. To sum upRealistically speaking, I wonder if the overhead (currently none, with buffering maybe slightly) is significant enough to block the potential consistency and simplicity of this change 🤔 Switching to the world, similar to client Prom proto model where we would have: ... might actually make sense. What would be pros & cons of moving to typed sample values now? 🤔 @cstyan @krajorama |
|
For now, I think I would keep float vs histograms semantics. This also fits the direction TSDB is on right now #17071 (float vs histogram derivatives) although one could argue we currently are in state where we have separate WAL records / chunks for almost all types... (not the state we want to be) |
|
@krajorama is on PTO, but I would merge it given approval and the fact I don't see how this blocks Mimir here 🤔 Maybe no harm in waiting a bit for @krajorama to come back; maybe @bboreham or @aknuds1 could chime in too. I will prepare spec change PR too in the meantime. |
It will soon 🤞🏽 e.g. #16046 |
Neither do I, hopefully Krajo can provide a bit more detail. If it is a performance hit, I'm sure there's a workaround given that mimir is already reimplementing things at the proto level themselves. |
I wish to argue against re-using a thing named "creation time" to mean "start time for a delta sample". ISTM we could have a "start time" field to mean that the other thing. |
|
That's fair. In fact, CT was called CT because there was no delta plans around. In other, delta-temporality-first systems first (e.g. Otel) the same feature was named Start time and it's reused for CT meaning for cumulatives 🙈 I proposed to do other way around since we already decided to call it CT and Prometheus is cumulative first. What's your proposal @bboreham then? Keep CT per series and add ST (start time) now or later for delta within sample? This is tempting, because it's perhaps more clear, and potentially it offers separate Prometheus storage logic (e.g. WAL record format, tsdb chunks) for CT vs ST. In theory CT will duplicate much more across the same series, so it offers theoretical efficiency gains. However, practically speaking, in my early experiments, having separate code logic and storage formats for those does not give us much and only makes code more complex and arguably, slower:
I can do proper experiment to visualize the potential overhead for tsdb chunks and wal but so far the lazy approach for adding "ST per sample" logic to storage and remote write and reusing this for delta and cumulatives (and histogram resets!) makes the most sense to me. You did have a good idea to name it StartTime though.. while we can (: What's your preference, thought @bboreham? What we would like to know more before deciding? Should this PR add StartTime per sample (while removing CT per series) in Remote Write 2? With storage only doing ST per sample (if we agree on that) CT per series will be cumbersome for Prometheus too. |
|
The benefit of keeping "CT" naming in Prometheus is that it's clear it's Prometheus semantics and the fact it's already introduced and known in the community. |
|
I don't really care about creation-time per series vs per sample in the protocol; Prometheus only sends one sample per series in most cases. You can have some niche case like "here are five samples and the third one was a reset so has a new creation time", fine. I haven't given any thought to storing this information. Previously we used creation time to add a zero into TSDB, but I guess we could do more, with a lot of code changes in TSDB. I didn't follow what you were saying there. If you want to send the "start time of a single delta sample" (which is what I understand OpenTelemetry to mean by startTime on delta temporality) then don't name it "creation time". That's confusing. My proposal is to name that one "start time".
I have no idea how this relates to "For delta, per sample CT is a must have." There are no Prometheus semantics for deltas. |
Yes, the plan is to deprecate zero-adding as it's not ideal; and implement proper storage support. I would argue it's relative simple to implement (1k lines?).
What if it's beneficial (arguments) to reuse the same structure for creation time too? Plus already systems are doing this (e.g. Otel). What if it's worth to have a tiny bit of this confusion that can be managed with a one sentence commentary?
Have you seen planned semantics for delta? We have low cost opportunity to prepare protocol, while solving some current issues for CT (handling of worst case for CT - so frequently updated CT). What am I missing when considering that? |
|
OK, we discussed this a bit with @fionaliao, also around some ideas for delta type. She made a good point that this this protocol change somewhat depends on ct-per-sample feature implementation (is it per series, is it encoded as "st" per sample) might need some small proposal -- it's not as obvious as I initially thought. I'm confident in this protocol change and we got some approvals already. The only question I have -- if we want to rename this field from Let's park this PR for now until we know more about implementation, this can wait a few weeks (: Thanks everybody for the discussion! Moving to draft. |
But we plan to populate created time usually from scrape, so it's not going to be zero, which means it is going to be written most of the time (for counters). I guess no extra overhead over current RW2 is still true due to @bboreham 's argument about having a single sample per series per request. So we're good.
We don't depend on Prometheus CT yet. We have our own solution (a special NaN value) to support OTLP start time in Mimir. We wanted to switch to upstream Prometheus CT and this will add some extra work to get there. Again , not a blocker for you.
We already have pretty complicated code and issues with a lot of switches due to floats vs native histograms samples, so actually I take it back, this is probably a bad idea. We should maybe consider the opposite and merge floats and native histograms if we can somehow :) |
|
SGTM @krajorama thanks for pointers! FYI: Started proposal prometheus/proposals#60 to unblock this and storage implementation (WIP) |
|
Thanks for discussion! This PR is now replaced by a new PR which also renames the CT to ST as per the devsummit consensus: #17411 |
Given the recent movement for Prometheus native support of ST ([PROM-60](prometheus/proposals#60)) and plans for delta temporality ([PROM-48](prometheus/proposals#48)) it might be beneficial to make (hopefully) last change to Remote Write 2.0 before stabilizing, so: * Raname Created Timestamp to Start Timestamp * Move CT/ST from TimeSeries to Sample and Histogram messages. * Clarified optionality (0 value meaning unset) See implementation change that will follow: prometheus/prometheus#17411. Notice that only receiver part was implemented for CT/ST. Given no sending part was done we expect this feature (ST/CT) not being used, thus breakage impact is minimal. This has been confirmed with early adopters like Mimir (Grafana), Chronosphere, Thanos, Cortex and Google. See previous discussions and 3 expilcit approvals: prometheus/prometheus#17036 Additionally: * I updated link to proto * Updated links to new compliance tests * Update native histogram spec link Signed-off-by: bwplotka <bwplotka@gmail.com>
…ST (Start Timestamp) (#2762) * prw2(breaking): Move 2.0 CT to Sample; Rename to ST (Start Timestamp) Given the recent movement for Prometheus native support of ST ([PROM-60](prometheus/proposals#60)) and plans for delta temporality ([PROM-48](prometheus/proposals#48)) it might be beneficial to make (hopefully) last change to Remote Write 2.0 before stabilizing, so: * Raname Created Timestamp to Start Timestamp * Move CT/ST from TimeSeries to Sample and Histogram messages. * Clarified optionality (0 value meaning unset) See implementation change that will follow: prometheus/prometheus#17411. Notice that only receiver part was implemented for CT/ST. Given no sending part was done we expect this feature (ST/CT) not being used, thus breakage impact is minimal. This has been confirmed with early adopters like Mimir (Grafana), Chronosphere, Thanos, Cortex and Google. See previous discussions and 3 expilcit approvals: prometheus/prometheus#17036 Additionally: * I updated link to proto * Updated links to new compliance tests * Update native histogram spec link Signed-off-by: bwplotka <bwplotka@gmail.com> * Update docs/specs/prw/remote_write_spec_2_0.md Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> * Update docs/specs/prw/remote_write_spec_2_0.md Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> --------- Signed-off-by: bwplotka <bwplotka@gmail.com> Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com> Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
This proposes an insignificant "breaking change" in the experimental RW 2.0 (that's why buf lint CI is expectedly red). Insignificant, because:
reserveit's number and add it somewhere else).Relates to #16944 (comment) discussion.
I also pushed this PR's proto to custom
bwplotka-experiment1buf.build label if anyone wants to test it out.Rationales
Currently CT is per TimeSeries. For cumulatives that's reasonable as CT represents a creation/start time for a cumulative "stream" of samples. In the unlikely case where we send more than one sample and CT changes for a single series within a PRW batch, second series has to be created in the message.
The proposal is to move the CT to Sample message instead. This is because:
Those reasons might be not super strong to break users; we don't have many (if any) at this point with experimental PRW 2.0, and for sure zero with CT feature (it's not yet implemented in the mainstream). Specifically, given Prometheus uses PRW in a
single sample multiple seriesbatches, there's no other different than consistency and semantics. However if we ever change and use PRW for backfilling or longer sample batches sending, the protocol will be naturally better fitted with CT per sample.Why not special semantics like "only TimeSeries.Samples[0] needs to hold CT"
We could reduce some network load if send/receive semantic allow assuming CT by the first Sample.CT in the TimeSeries. My assumption is that it's not worth the complexity, plus we can do that later (TODO benchmark and explain).
Why not
optionalkeyword?I propose we skip
optionalkeywords here, like we avoided it in the current version.Funny enough there used to be a warning about adding optional to primitive fields, that's why we never used
optionalkeywords. However it seems it was removed (missing link now: https://cloud.google.com/apis/design/design_patterns.md#optional_primitive_fields). Nevertheless I tried and for int64 it's a ridiculous (readability, complexity, efficiency) price on code side: