
prw: Remote Write 2.0 CT per Sample/Histogram#17036

Closed
bwplotka wants to merge 1 commit into main from prw2ctpersample

Conversation

@bwplotka
Member

@bwplotka bwplotka commented Aug 11, 2025

This proposes an insignificant "breaking change" in the experimental RW 2.0 (that's why the buf lint CI is expectedly red). Insignificant, because:

  • Proto on the network will still work, given proto backward and forward compatibility (we remove the field, reserve its number, and add it somewhere else).
  • It's for the new feature (CT), which is not yet supported by Prometheus (sending).
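
The remove-and-reserve pattern can be sketched as follows (a minimal illustration; only the reserved number 6 is taken from this PR, the surrounding shape is a placeholder, not the actual PRW 2.0 schema):

```protobuf
syntax = "proto3";

message TimeSeries {
  // ... other fields keep their existing numbers ...

  // Field 6 previously carried the per-series created timestamp.
  // Reserving the number prevents an out-of-date PRW dependency from
  // decoding unrelated data into the removed field.
  reserved 6;
}
```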

Relates to #16944 (comment) discussion.

I also pushed this PR's proto to custom bwplotka-experiment1 buf.build label if anyone wants to test it out.

Rationales

Currently CT is per TimeSeries. For cumulatives that's reasonable, as CT represents a creation/start time for a cumulative "stream" of samples. In the unlikely case where we send more than one sample and CT changes for a single series within a PRW batch, a second series has to be created in the message.

The proposal is to move the CT to Sample message instead. This is because:

  1. The Prometheus ecosystem is experimenting with the delta type. For delta, a per-sample CT is a must-have, so the first reason for this change is future-proofing. The counter-arguments:
  • We can add CT per sample in 2.1+ (though we won't be able to remove TimeSeries.CreatedTimestamp or move it to reserved then!)
  • PRW support for the delta type is low priority (Prometheus is not expected to be a source of deltas, but rather to consume them).
  2. Practically everywhere else we will likely have CT per sample. Scrape protocols technically pass CT per sample. For storage (e.g. the WAL), we will likely put CT per sample too. It's actually extra code to try to put it per time series and split time series, etc. Otel also has it per sample.
  3. Performance-wise:
    • Practically, for the Prometheus writing implementation, there's no difference, given Prometheus always batches series with one sample each.
    • Theoretical buffering cases (receive + send): TODO benchmark.
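
A rough before/after sketch of the proposed move (field numbers other than the reserved 6 are illustrative, not necessarily the actual PRW 2.0 numbering):

```protobuf
// Before: one CT for the whole series.
message TimeSeries {
  repeated Sample samples = 2;
  int64 created_timestamp = 6;
}

// After: the CT travels with each sample (and, analogously, each histogram).
message TimeSeries {
  repeated Sample samples = 2;
  reserved 6;  // old per-series created_timestamp
}

message Sample {
  double value = 1;
  int64 timestamp = 2;
  int64 created_timestamp = 3;  // 0 means unset
}
```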

Those reasons might not be super strong on their own, but we also wouldn't break many users: we don't have many (if any) at this point with experimental PRW 2.0, and for sure zero with the CT feature (it's not yet implemented in the mainstream). Specifically, given Prometheus uses PRW in single-sample, multiple-series batches, there's no difference other than consistency and semantics. However, if we ever change this and use PRW for backfilling or sending longer sample batches, the protocol will be naturally better fitted with CT per sample.

Why not special semantics like "only TimeSeries.Samples[0] needs to hold CT"?

We could reduce some network load if the send/receive semantics allowed assuming the CT from the first Sample.CT in the TimeSeries. My assumption is that it's not worth the complexity, plus we can do that later (TODO benchmark and explain).

Why not optional keyword?

I propose we skip the optional keyword here, like we avoided it in the current version.

Funny enough, there used to be a warning about adding optional to primitive fields, which is why we never used the optional keyword. However, it seems it was removed (link now missing: https://cloud.google.com/apis/design/design_patterns.md#optional_primitive_fields). Nevertheless, I tried it, and for int64 it carries a ridiculous price (readability, complexity, efficiency) on the code side:

[screenshots of the generated code omitted]
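
The price can be sketched in proto terms (a sketch; the generated-Go shapes described in the comments are the standard protoc-gen-go behavior for proto3 explicit presence; the field number is illustrative):

```protobuf
// Plain proto3 scalar: the generated Go field is a plain int64, with the
// zero value doubling as "unset". No pointer, no allocation, no presence bit.
int64 created_timestamp = 3;

// With `optional`, the scalar gains explicit presence: protoc-gen-go emits
// *int64 instead, so every read pays a nil check and every set may allocate,
// which is the readability/complexity/efficiency price discussed above.
optional int64 created_timestamp = 3;
```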

@bwplotka bwplotka changed the title prw: Move Remote Write 2.0 CT to Sample prw: Remote Write 2.0 CT per Sample Aug 11, 2025
@bwplotka bwplotka force-pushed the prw2ctpersample branch 2 times, most recently from 5e68754 to c8bc644 Compare August 11, 2025 11:37
Relates to
#16944 (comment)

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka bwplotka changed the title prw: Remote Write 2.0 CT per Sample prw: Remote Write 2.0 CT per Sample/Histogram Aug 11, 2025
Member

@cstyan cstyan left a comment


This looks reasonable to me, just a few questions in the comments here. Also, should we be updating the spec documentation as well?

```protobuf
// The canonical Content-Type request header value for this message is
// "application/x-protobuf;proto=io.prometheus.write.v2.Request"
//
// Version: v2.0-rc.4
```
Member


should this be 2.1-rc.x?

Member Author


Why 2.1?

We are at 2.0-rc.3 now: https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/ (I will update the spec in a separate PR)

Comment on lines +82 to +84:

```protobuf
// This field is reserved for backward compatibility with the deprecated fields;
// previously present in the experimental remote write period.
reserved 6;
```
Member


should we include that this was previously created timestamp?

Member Author


Is there a need? Anyone can use git to check it. Otherwise, if we can, we want to make the proto as clean as possible before we finalize/stabilize it. If we were at 2.0, we wouldn't be able to remove it or move it to reserved.

Member Author


One could argue, given the experimental period, that we don't even need reserved (: But let's be nice and guard against potential nondeterministic errors if someone forgets to upgrade some PRW dep.

Member

@ArthurSens ArthurSens left a comment


LGTM

@krajorama
Member

krajorama commented Aug 14, 2025

I'm worried about a few things:

  • What will be the bandwidth and memory implications? Float samples are the most common.
  • I realize this isn't a blocker for you, but in Mimir we did already implement RW2 + created timestamp support. We convert into our RW1 implementation + created timestamp field, so such a change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results easily from distributed queriers to their frontend, where we take advantage of the similar types. See https://github.com/grafana/mimir/blob/dd0fdb56205512dd2dda42934b66a5a1160b225f/pkg/mimirpb/mimir.proto#L91 (mimir.proto). So this change potentially has a huge impact :(
  • I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field, and say that a Timeseries cannot have both at the same time.

@krajorama
Member

Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

@ArthurSens
Member

> Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

There is work happening to make this possible: #16046

@bwplotka
Member Author

Thanks for the feedback @krajorama 💪🏽

> What will be the bandwidth and memory implications? Float samples are the most common.

Practically zero for now.

  • In terms of the network, sending an empty field has zero overhead; it's simply not written to the wire.
  • In terms of memory, for 100% of the current known cases, it's zero. This is because Prometheus and the ecosystem always send and receive a single sample, so it's the same memory footprint as putting it in the request.

If in the future we use buffering (more than one sample per series/request), you might see some impact (good question how much) relative to the rest of the payload (labels). One could argue that custom marshal/unmarshal can reduce it if that becomes a problem.

> I realize this isn't blocker for you but in Mimir we did implement RW2+created timestamp support already. We convert into our RW1 implementation +created timestamp field. So such change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results (...) So this change has huge impact potentially :(

Hm... I would challenge this. Mimir maintains its own proto by hand, so its authors can adjust it as needed. If you want to handle CT in a different way than Prometheus (not per sample), you can create a Sample2 for PRW2 and decode the sample CT into the series one. However, the truth is that Prometheus code leans toward treating CT per sample in all its APIs (including the TSDB storage one, one day), so holding on to per-series in Mimir might not be ideal for the future anyway.

This breaking change does cause your current version of Mimir to NOT see CT per series (because it's not going to be sent by Prometheus). Do you depend on CT now, or handle it in a certain way where this would cause a problem for your users, @krajorama?

> I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field? And say that a Timeseries cannot have both at the same time.

We can; we do that with histogram samples. However, that would imply adding CounterSample and renaming Sample to GaugeSample, to stay consistent. As a result, what you propose is more in the direction of the Prometheus scrape-side proto.

This is actually an interesting and tempting idea. It might help with future iterations that alter samples and their in-memory layouts (even if only via empty fields). The question is... do you think it's worth changing now, in 2.0? It would be doable, although a last-minute change.

To sum up

Realistically speaking, I wonder if the overhead (currently none; with buffering, maybe slight) is significant enough to block the consistency and simplicity this change offers 🤔

Switching to a world similar to the client Prom proto model, where we would have:

```protobuf
message TimeSeries {
  // ...
  repeated Gauge gauges = 2;
  repeated Histogram histograms = 3;
  repeated Counter counters = 7;
  repeated Delta deltas = 8;
}

message Gauge {
  double value = 1;
  int64 timestamp = 2;
}

message Counter {
  double value = 1;
  int64 timestamp = 2;
  int64 created_timestamp = 3;
}

message Delta {
  double value = 1;
  int64 timestamp = 2;
  int64 start_timestamp = 3;
}

message Histogram {
  // ...
}
```

... might actually make sense. What would be the pros & cons of moving to typed sample values now? 🤔 @cstyan @krajorama


@kgoudeaux kgoudeaux left a comment


LGTM

@bwplotka
Member Author

bwplotka commented Aug 22, 2025

For now, I think I would keep the float vs histogram semantics. This also fits the direction TSDB is heading right now (#17071, float vs histogram derivatives), although one could argue we are currently in a state where we have separate WAL records/chunks for almost all types... (not the state we want to be in).

@bwplotka bwplotka marked this pull request as ready for review August 22, 2025 14:07
@bwplotka bwplotka requested a review from tomwilkie as a code owner August 22, 2025 14:08
@bwplotka
Member Author

@krajorama is on PTO, but I would merge this given the approvals and the fact that I don't see how it blocks Mimir 🤔

Maybe no harm in waiting a bit for @krajorama to come back; maybe @bboreham or @aknuds1 could chime in too.

I will prepare spec change PR too in the meantime.

@bwplotka
Member Author

> Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

It will soon 🤞🏽 e.g. #16046

@cstyan
Member

cstyan commented Aug 22, 2025

> I would merge it given approval and the fact I don't see how this blocks Mimir here

Neither do I; hopefully Krajo can provide a bit more detail. If it is a performance hit, I'm sure there's a workaround, given that Mimir is already reimplementing things at the proto level.

@bboreham
Member

> represents a creation/start time

I wish to argue against reusing a thing named "creation time" to mean "start time for a delta sample".
For me, creation time is when the whole time series was created.

ISTM we could have a "start time" field to mean the other thing.

@bwplotka
Member Author

That's fair.

In fact, CT was called CT because there were no delta plans around then. In other, delta-temporality-first systems (e.g. Otel), the same feature was named start time, and it's reused with the CT meaning for cumulatives 🙈 I proposed doing it the other way around, since we already decided to call it CT and Prometheus is cumulative-first.

What's your proposal then, @bboreham? Keep CT per series and add ST (start time), now or later, for delta within the sample?

This is tempting, because it's perhaps clearer, and it potentially allows separate Prometheus storage logic (e.g. WAL record format, TSDB chunks) for CT vs ST. In theory, CT will duplicate much more across the same series, so it offers theoretical efficiency gains.

However, practically speaking, in my early experiments, having separate code logic and storage formats for those does not give us much and only makes the code more complex and, arguably, slower:

  • WAL records and remote write are used with 1-sample, multiple-series batches; there's no benefit in per-series CT logic there. Making separate records or messages for different metric types is clunky too (you can't really batch multiple samples into one record anymore).
  • I expect TSDB chunk encoding to compress well enough that the advantage is minimal. Also, the "chunk per CT" approach is already causing some issues with native histogram chunks, as far as I remember (reset hints).
  • CT can be updated at any time across a series (instrumentation resetting a counter), so in the worst case it can even be updated on every scrape, risking that our "CT per series"-optimized storage will actually be much worse than an ST one reused for CT.

I can do a proper experiment to visualize the potential overhead for TSDB chunks and the WAL, but so far the lazy approach of adding "ST per sample" logic to storage and remote write, and reusing it for deltas and cumulatives (and histogram resets!), makes the most sense to me. You did have a good idea to name it StartTime though... while we still can (:

What's your preference, though, @bboreham? What would we like to know more about before deciding? Should this PR add StartTime per sample (while removing CT per series) in Remote Write 2? With storage only doing ST per sample (if we agree on that), CT per series would be cumbersome for Prometheus too.

@bwplotka
Member Author

The benefit of keeping the "CT" naming in Prometheus is that it clearly carries Prometheus semantics, and it's already introduced and known in the community.

@bboreham
Member

I don't really care about creation-time per series vs per sample in the protocol; Prometheus only sends one sample per series in most cases. You can have some niche case like "here are five samples and the third one was a reset so has a new creation time", fine.

I haven't given any thought to storing this information. Previously we used creation time to add a zero into TSDB, but I guess we could do more, with a lot of code changes in TSDB. I didn't follow what you were saying there.

If you want to send the "start time of a single delta sample" (which is what I understand OpenTelemetry to mean by startTime on delta temporality) then don't name it "creation time". That's confusing. My proposal is to name that one "start time".

> The benefit of keeping "CT" naming in Prometheus is that it's clear it's Prometheus semantics and the fact it's already introduced and known in the community.

I have no idea how this relates to "For delta, per sample CT is a must have." There are no Prometheus semantics for deltas.

@bwplotka
Member Author

> I haven't given any thought to storing this information. Previously we used creation time to add a zero into TSDB, but I guess we could do more, with a lot of code changes in TSDB. I didn't follow what you were saying there.

Yes, the plan is to deprecate zero-adding, as it's not ideal, and implement proper storage support. I would argue it's relatively simple to implement (1k lines?).

> If you want to send the "start time of a single delta sample" (which is what I understand OpenTelemetry to mean by startTime on delta temporality) then don't name it "creation time". That's confusing. My proposal is to name that one "start time".

What if it's beneficial (there are arguments) to reuse the same structure for creation time too? Plus, existing systems already do this (e.g. Otel). What if it's worth having a tiny bit of this confusion, which can be managed with a one-sentence comment?

> I have no idea how this relates to "For delta, per sample CT is a must have." There are no Prometheus semantics for deltas.

Have you seen the planned semantics for delta? We have a low-cost opportunity to prepare the protocol, while also solving some current issues for CT (handling the worst case of a frequently updated CT). What am I missing in that consideration?

@bwplotka
Member Author

OK, we discussed this a bit with @fionaliao, also around some ideas for delta type.

She made a good point that this protocol change somewhat depends on the ct-per-sample feature implementation (is it per series, is it encoded as "st" per sample), which might need a small proposal -- it's not as obvious as I initially thought.

I'm confident in this protocol change and we already have some approvals. The only question I have is whether we want to rename this field from created_timestamp to start_timestamp to indicate a shared purpose, given @bboreham's great points.

Let's park this PR for now until we know more about the implementation; this can wait a few weeks (:

Thanks everybody for the discussion! Moving to draft.

@bwplotka bwplotka marked this pull request as draft August 29, 2025 03:43
@krajorama
Member

> Thanks for a feedback @krajorama 💪🏽
>
> > What will be the bandwidth and memory implication ? Float samples are the most common.
>
> Practically zero for now.
>
> * In terms of the network, sending empty field gives zero overhead, it's simply not written to the wire.

But we plan to populate the created time, usually from scrape, so it's not going to be zero; it is going to be written most of the time (for counters). I guess "no extra overhead over current RW2" is still true, due to @bboreham's argument about having a single sample per series per request. So we're good.

> * In terms of memory, for 100% of the current, known cases, it's zero. This is because Prometheus and ecosystem always send and receive a single sample, so it's the same memory footprint vs putting it in request.
>
> If in future we will use the buffering (more than one sample per series/request), you might see some impact (good question how much) vs the rest of the payload (labels). One could argue that custom marshal/unmarshal can reduce it if it's becoming a problem too.
>
> > I realize this isn't blocker for you but in Mimir we did implement RW2+created timestamp support already. We convert into our RW1 implementation +created timestamp field. So such change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results (...) So this change has huge impact potentially :(
>
> Hm... I would challenge this. Mimir clearly maintains its own proto by hand, so authors can adjust it as you need. If you want to handle CT in a different way that Prometheus (not per sample), you can create Sample2 for PRW2 and decode sample CT into series one. However, the truth is that the Prometheus code leans on treating CT per sample in all its APIs (including TSDB storage one day), so holding to per series in Mimir might not be ideal for future anyway.
>
> This breaking change will do cause your current version of Mimir to NOT see CT per series (because it's not going to be send by Prometheus). Do you depend on CT now, handle it in a certain way where this would cause a problem for your users? @krajorama?

We don't depend on Prometheus CT yet. We have our own solution (a special NaN value) to support OTLP start time in Mimir. We wanted to switch to upstream Prometheus CT, and this will add some extra work to get there. Again, not a blocker for you.

> > I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field? And say that a Timeseries cannot have both at the same time.
>
> We can, we do that with histograms samples. However that would imply we will add CounterSample and rename Sample to GaugeSample, to stay consistent. As a result, what you propose is more in the direction of Prometheus scrape-side proto.
>
> This is actually an interesting and tempting idea. Might help with future iterations that might alter samples and its in-memory layouts (even if empty fields). Question is... do you think it's worth changing now in 2.0? It would be doable, although last minute change.
>
> To sum up
>
> Realistically speaking, I wonder if the overhead (currently none, with buffering maybe slightly) is significant enough to block the potential consistency and simplicity of this change 🤔
>
> Switching to the world, similar to client Prom proto model where we would have:
>
> ```protobuf
> message TimeSeries {
>   // ...
>   repeated Gauge gauges = 2;
>   repeated Histogram histograms = 3;
>   repeated Counter counters = 7;
>   repeated Delta deltas = 8;
> }
>
> message Gauge {
>   double value = 1;
>   int64 timestamp = 2;
> }
>
> message Counter {
>   double value = 1;
>   int64 timestamp = 2;
>   int64 created_timestamp = 3;
> }
>
> message Delta {
>   double value = 1;
>   int64 timestamp = 2;
>   int64 start_timestamp = 3;
> }
>
> message Histogram {
>   // ...
> }
> ```
>
> ... might actually make sense. What would be pros & cons of moving to typed sample values now? 🤔 @cstyan @krajorama

We already have pretty complicated code and issues with a lot of switches due to float vs native-histogram samples, so actually I take it back; this is probably a bad idea. We should maybe consider the opposite and merge floats and native histograms if we somehow can :)

@bwplotka
Member Author

SGTM @krajorama thanks for pointers!

FYI: Started proposal prometheus/proposals#60 to unblock this and storage implementation (WIP)

@bwplotka
Member Author

Thanks for the discussion! This PR is now replaced by a new PR, which also renames CT to ST as per the dev summit consensus: #17411

@bwplotka bwplotka closed this Oct 28, 2025
bwplotka added a commit to prometheus/docs that referenced this pull request Oct 28, 2025
Given the recent movement for Prometheus native support of ST
([PROM-60](prometheus/proposals#60)) and plans
for delta temporality
([PROM-48](prometheus/proposals#48)) it might be
beneficial to make the (hopefully) last change to Remote Write 2.0 before
stabilizing:

* Rename Created Timestamp to Start Timestamp
* Move CT/ST from TimeSeries to Sample and Histogram messages.
* Clarify optionality (0 value means unset)

See implementation change that will follow:
prometheus/prometheus#17411. Notice that only the
receiver part was implemented for CT/ST. Given no sending part was done,
we expect this feature (ST/CT) not to be in use, so the breakage impact is
minimal. This has been confirmed with early adopters like Mimir
(Grafana), Chronosphere, Thanos, Cortex and Google.

See previous discussions and 3 explicit approvals:
prometheus/prometheus#17036

Additionally:
* Updated the link to the proto
* Updated links to the new compliance tests
* Updated the native histogram spec link

Signed-off-by: bwplotka <bwplotka@gmail.com>
bwplotka added a commit to prometheus/docs that referenced this pull request Nov 4, 2025
…ST (Start Timestamp) (#2762)

* prw2(breaking): Move 2.0 CT to Sample; Rename to ST (Start Timestamp)

Given the recent movement for Prometheus native support of ST
([PROM-60](prometheus/proposals#60)) and plans
for delta temporality
([PROM-48](prometheus/proposals#48)) it might be
beneficial to make the (hopefully) last change to Remote Write 2.0 before
stabilizing:

* Rename Created Timestamp to Start Timestamp
* Move CT/ST from TimeSeries to Sample and Histogram messages.
* Clarify optionality (0 value means unset)

See implementation change that will follow:
prometheus/prometheus#17411. Notice that only the
receiver part was implemented for CT/ST. Given no sending part was done,
we expect this feature (ST/CT) not to be in use, so the breakage impact is
minimal. This has been confirmed with early adopters like Mimir
(Grafana), Chronosphere, Thanos, Cortex and Google.

See previous discussions and 3 explicit approvals:
prometheus/prometheus#17036

Additionally:
* Updated the link to the proto
* Updated links to the new compliance tests
* Updated the native histogram spec link

Signed-off-by: bwplotka <bwplotka@gmail.com>

* Update docs/specs/prw/remote_write_spec_2_0.md

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Update docs/specs/prw/remote_write_spec_2_0.md

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

---------

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>


6 participants