
prw: Remote Write 2.0 CT per Sample/Histogram#17036

Closed
bwplotka wants to merge 1 commit into main from prw2ctpersample

Conversation

@bwplotka
Member

@bwplotka bwplotka commented Aug 11, 2025

This proposes an insignificant "breaking change" in the experimental RW 2.0 (that's why the buf lint CI is expectedly red). Insignificant, because:

  • Proto on the network will still work, given proto backward and forward compatibility (we remove the field, reserve its number, and add it somewhere else).
  • It's for the new feature (CT), which is not yet supported by Prometheus (sending).
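
The remove-and-reserve pattern can be sketched as follows (a minimal illustration; only the reserved number 6 is taken from this PR, the surrounding shape is a placeholder, not the actual PRW 2.0 schema):

```protobuf
syntax = "proto3";

message TimeSeries {
  // ... other fields keep their existing numbers ...

  // Field 6 previously carried the per-series created timestamp.
  // Reserving the number prevents an out-of-date PRW dependency from
  // decoding unrelated data into the removed field.
  reserved 6;
}
```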

Relates to #16944 (comment) discussion.

I also pushed this PR's proto to custom bwplotka-experiment1 buf.build label if anyone wants to test it out.

Rationales

Currently CT is per TimeSeries. For cumulatives that's reasonable, as CT represents a creation/start time for a cumulative "stream" of samples. In the unlikely case where we send more than one sample and CT changes for a single series within a PRW batch, a second series has to be created in the message.

The proposal is to move the CT to Sample message instead. This is because:

  1. The Prometheus ecosystem is experimenting with the delta type. For delta, a per-sample CT is a must-have, so the first reason for this change is future-proofing. The counter-arguments:
  • We can add CT per sample in 2.1+ (though we won't be able to remove TimeSeries.CreatedTimestamp or move it to reserved then!)
  • PRW support for the delta type is low priority (Prometheus is not expected to be a source of deltas, but rather to consume them).
  2. Practically everywhere else we will likely have CT per sample. Scrape protocols technically pass CT per sample. For storage (e.g. the WAL), we will likely put CT per sample too. It's actually extra code to try to put it per time series and split time series, etc. Otel also has it per sample.
  3. Performance-wise:
    • Practically, for the Prometheus writing implementation, there's no difference, given Prometheus always batches series with one sample each.
    • Theoretical buffering cases (receive + send): TODO benchmark.
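
A rough before/after sketch of the proposed move (field numbers other than the reserved 6 are illustrative, not necessarily the actual PRW 2.0 numbering):

```protobuf
// Before: one CT for the whole series.
message TimeSeries {
  repeated Sample samples = 2;
  int64 created_timestamp = 6;
}

// After: the CT travels with each sample (and, analogously, each histogram).
message TimeSeries {
  repeated Sample samples = 2;
  reserved 6;  // old per-series created_timestamp
}

message Sample {
  double value = 1;
  int64 timestamp = 2;
  int64 created_timestamp = 3;  // 0 means unset
}
```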

Those reasons might not be super strong on their own, but we also wouldn't break many users: we don't have many (if any) at this point with experimental PRW 2.0, and for sure zero with the CT feature (it's not yet implemented in the mainstream). Specifically, given Prometheus uses PRW in single-sample, multiple-series batches, there's no difference other than consistency and semantics. However, if we ever change this and use PRW for backfilling or sending longer sample batches, the protocol will be naturally better fitted with CT per sample.

Why not special semantics like "only TimeSeries.Samples[0] needs to hold CT"?

We could reduce some network load if the send/receive semantics allowed assuming the CT from the first Sample.CT in the TimeSeries. My assumption is that it's not worth the complexity, plus we can do that later (TODO benchmark and explain).

Why not optional keyword?

I propose we skip the optional keyword here, like we avoided it in the current version.

Funny enough, there used to be a warning about adding optional to primitive fields, which is why we never used the optional keyword. However, it seems it was removed (link now missing: https://cloud.google.com/apis/design/design_patterns.md#optional_primitive_fields). Nevertheless, I tried it, and for int64 it carries a ridiculous price (readability, complexity, efficiency) on the code side:

[screenshots of the generated code omitted]
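
The price can be sketched in proto terms (a sketch; the generated-Go shapes described in the comments are the standard protoc-gen-go behavior for proto3 explicit presence; the field number is illustrative):

```protobuf
// Plain proto3 scalar: the generated Go field is a plain int64, with the
// zero value doubling as "unset". No pointer, no allocation, no presence bit.
int64 created_timestamp = 3;

// With `optional`, the scalar gains explicit presence: protoc-gen-go emits
// *int64 instead, so every read pays a nil check and every set may allocate,
// which is the readability/complexity/efficiency price discussed above.
optional int64 created_timestamp = 3;
```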

@bwplotka bwplotka changed the title prw: Move Remote Write 2.0 CT to Sample prw: Remote Write 2.0 CT per Sample Aug 11, 2025
@bwplotka bwplotka force-pushed the prw2ctpersample branch 2 times, most recently from 5e68754 to c8bc644 Compare August 11, 2025 11:37
Relates to
#16944 (comment)

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka bwplotka changed the title prw: Remote Write 2.0 CT per Sample prw: Remote Write 2.0 CT per Sample/Histogram Aug 11, 2025
Member

@cstyan cstyan left a comment


This looks reasonable to me, just a few questions in the comments here. Also, should we be updating the spec documentation as well?

```protobuf
// The canonical Content-Type request header value for this message is
// "application/x-protobuf;proto=io.prometheus.write.v2.Request"
//
// Version: v2.0-rc.4
```
Member


should this be 2.1-rc.x?

Member Author


Why 2.1?

We are at 2.0-rc.3 now: https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/ (I will update the spec in a separate PR)

Comment on lines +82 to +84:

```protobuf
// This field is reserved for backward compatibility with the deprecated fields;
// previously present in the experimental remote write period.
reserved 6;
```
Member


should we include that this was previously created timestamp?

Member Author


Is there a need? Anyone can use git to check it. Otherwise, if we can, we want to make the proto as clean as possible before we finalize/stabilize it. If we were at 2.0, we wouldn't be able to remove it or move it to reserved.

Member Author


One could argue, given the experimental period, that we don't even need reserved (: But let's be nice and guard against potential nondeterministic errors if someone forgets to upgrade some PRW dep.

Member

@ArthurSens ArthurSens left a comment


LGTM

@krajorama
Member

krajorama commented Aug 14, 2025

I'm worried about a few things:

  • What will be the bandwidth and memory implications? Float samples are the most common.
  • I realize this isn't a blocker for you, but in Mimir we did already implement RW2 + created timestamp support. We convert into our RW1 implementation + created timestamp field, so such a change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results easily from distributed queriers to their frontend, where we take advantage of the similar types. See https://github.com/grafana/mimir/blob/dd0fdb56205512dd2dda42934b66a5a1160b225f/pkg/mimirpb/mimir.proto#L91 (mimir.proto). So this change potentially has a huge impact :(
  • I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field, and say that a Timeseries cannot have both at the same time.

@krajorama
Member

Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

@ArthurSens
Member

> Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

There is work happening to make this possible: #16046

@bwplotka
Member Author

Thanks for the feedback @krajorama 💪🏽

> What will be the bandwidth and memory implications? Float samples are the most common.

Practically zero for now.

  • In terms of the network, sending an empty field has zero overhead; it's simply not written to the wire.
  • In terms of memory, for 100% of the current known cases, it's zero. This is because Prometheus and the ecosystem always send and receive a single sample, so it's the same memory footprint as putting it in the request.

If in the future we use buffering (more than one sample per series/request), you might see some impact (good question how much) relative to the rest of the payload (labels). One could argue that custom marshal/unmarshal can reduce it if that becomes a problem.

> I realize this isn't blocker for you but in Mimir we did implement RW2+created timestamp support already. We convert into our RW1 implementation +created timestamp field. So such change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results (...) So this change has huge impact potentially :(

Hm... I would challenge this. Mimir maintains its own proto by hand, so its authors can adjust it as needed. If you want to handle CT in a different way than Prometheus (not per sample), you can create a Sample2 for PRW2 and decode the sample CT into the series one. However, the truth is that Prometheus code leans toward treating CT per sample in all its APIs (including the TSDB storage one, one day), so holding on to per-series in Mimir might not be ideal for the future anyway.

This breaking change does cause your current version of Mimir to NOT see CT per series (because it's not going to be sent by Prometheus). Do you depend on CT now, or handle it in a certain way where this would cause a problem for your users, @krajorama?

> I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field? And say that a Timeseries cannot have both at the same time.

We can; we do that with histogram samples. However, that would imply adding CounterSample and renaming Sample to GaugeSample, to stay consistent. As a result, what you propose is more in the direction of the Prometheus scrape-side proto.

This is actually an interesting and tempting idea. It might help with future iterations that alter samples and their in-memory layouts (even if only via empty fields). The question is... do you think it's worth changing now, in 2.0? It would be doable, although a last-minute change.

To sum up

Realistically speaking, I wonder if the overhead (currently none; with buffering, maybe slight) is significant enough to block the consistency and simplicity this change offers 🤔

Switching to a world similar to the client Prom proto model, where we would have:

```protobuf
message TimeSeries {
  // ...
  repeated Gauge gauges = 2;
  repeated Histogram histograms = 3;
  repeated Counter counters = 7;
  repeated Delta deltas = 8;
}

message Gauge {
  double value = 1;
  int64 timestamp = 2;
}

message Counter {
  double value = 1;
  int64 timestamp = 2;
  int64 created_timestamp = 3;
}

message Delta {
  double value = 1;
  int64 timestamp = 2;
  int64 start_timestamp = 3;
}

message Histogram {
  // ...
}
```

... might actually make sense. What would be the pros & cons of moving to typed sample values now? 🤔 @cstyan @krajorama


@kgoudeaux kgoudeaux left a comment


LGTM

@bwplotka
Member Author

bwplotka commented Aug 22, 2025

For now, I think I would keep the float vs histogram semantics. This also fits the direction TSDB is heading right now (#17071, float vs histogram derivatives), although one could argue we are currently in a state where we have separate WAL records/chunks for almost all types... (not the state we want to be in).

@bwplotka bwplotka marked this pull request as ready for review August 22, 2025 14:07
@bwplotka bwplotka requested a review from tomwilkie as a code owner August 22, 2025 14:08
@bwplotka
Member Author

@krajorama is on PTO, but I would merge this given the approvals and the fact that I don't see how it blocks Mimir 🤔

Maybe no harm in waiting a bit for @krajorama to come back; maybe @bboreham or @aknuds1 could chime in too.

I will prepare spec change PR too in the meantime.

@bwplotka
Member Author

> Also, how will you generate such remote write requests? I don't think the WAL/WBL has that info.

It will soon 🤞🏽 e.g. #16046

@cstyan
Member

cstyan commented Aug 22, 2025

> I would merge it given approval and the fact I don't see how this blocks Mimir here

Neither do I; hopefully Krajo can provide a bit more detail. If it is a performance hit, I'm sure there's a workaround, given that Mimir is already reimplementing things at the proto level.

@bboreham
Member

> represents a creation/start time

I wish to argue against reusing a thing named "creation time" to mean "start time for a delta sample".
For me, creation time is when the whole time series was created.

ISTM we could have a "start time" field to mean the other thing.

@bwplotka
Member Author

That's fair.

In fact, CT was called CT because there were no delta plans around then. In other, delta-temporality-first systems (e.g. Otel), the same feature was named start time, and it's reused with the CT meaning for cumulatives 🙈 I proposed doing it the other way around, since we already decided to call it CT and Prometheus is cumulative-first.

What's your proposal then, @bboreham? Keep CT per series and add ST (start time), now or later, for delta within the sample?

This is tempting, because it's perhaps clearer, and it potentially allows separate Prometheus storage logic (e.g. WAL record format, TSDB chunks) for CT vs ST. In theory, CT will duplicate much more across the same series, so it offers theoretical efficiency gains.

However, practically speaking, in my early experiments, having separate code logic and storage formats for those does not give us much and only makes the code more complex and, arguably, slower:

  • WAL records and remote write are used with 1-sample, multiple-series batches; there's no benefit in per-series CT logic there. Making separate records or messages for different metric types is clunky too (you can't really batch multiple samples into one record anymore).
  • I expect TSDB chunk encoding to compress well enough that the advantage is minimal. Also, the "chunk per CT" approach is already causing some issues with native histogram chunks, as far as I remember (reset hints).
  • CT can be updated at any time across a series (instrumentation resetting a counter), so in the worst case it can even be updated on every scrape, risking that our "CT per series"-optimized storage will actually be much worse than an ST one reused for CT.

I can do a proper experiment to visualize the potential overhead for TSDB chunks and the WAL, but so far the lazy approach of adding "ST per sample" logic to storage and remote write, and reusing it for deltas and cumulatives (and histogram resets!), makes the most sense to me. You did have a good idea to name it StartTime though... while we still can (:

What's your preference, though, @bboreham? What would we like to know more about before deciding? Should this PR add StartTime per sample (while removing CT per series) in Remote Write 2? With storage only doing ST per sample (if we agree on that), CT per series would be cumbersome for Prometheus too.

@bwplotka
Member Author

The benefit of keeping the "CT" naming in Prometheus is that it clearly carries Prometheus semantics, and it's already introduced and known in the community.

@bboreham
Member

I don't really care about creation-time per series vs per sample in the protocol; Prometheus only sends one sample per series in most cases. You can have some niche case like "here are five samples and the third one was a reset so has a new creation time", fine.

I haven't given any thought to storing this information. Previously we used creation time to add a zero into TSDB, but I guess we could do more, with a lot of code changes in TSDB. I didn't follow what you were saying there.

If you want to send the "start time of a single delta sample" (which is what I understand OpenTelemetry to mean by startTime on delta temporality) then don't name it "creation time". That's confusing. My proposal is to name that one "start time".

> The benefit of keeping "CT" naming in Prometheus is that it's clear it's Prometheus semantics and the fact it's already introduced and known in the community.

I have no idea how this relates to "For delta, per sample CT is a must have." There are no Prometheus semantics for deltas.

@bwplotka
Member Author

> I haven't given any thought to storing this information. Previously we used creation time to add a zero into TSDB, but I guess we could do more, with a lot of code changes in TSDB. I didn't follow what you were saying there.

Yes, the plan is to deprecate zero-adding, as it's not ideal, and implement proper storage support. I would argue it's relatively simple to implement (1k lines?).

> If you want to send the "start time of a single delta sample" (which is what I understand OpenTelemetry to mean by startTime on delta temporality) then don't name it "creation time". That's confusing. My proposal is to name that one "start time".

What if it's beneficial (there are arguments) to reuse the same structure for creation time too? Plus, existing systems already do this (e.g. Otel). What if it's worth having a tiny bit of this confusion, which can be managed with a one-sentence comment?

> I have no idea how this relates to "For delta, per sample CT is a must have." There are no Prometheus semantics for deltas.

Have you seen the planned semantics for delta? We have a low-cost opportunity to prepare the protocol, while also solving some current issues for CT (handling the worst case of a frequently updated CT). What am I missing in that consideration?

@bwplotka
Member Author

OK, we discussed this a bit with @fionaliao, also around some ideas for delta type.

She made a good point that this protocol change somewhat depends on the ct-per-sample feature implementation (is it per series, is it encoded as "st" per sample), which might need a small proposal -- it's not as obvious as I initially thought.

I'm confident in this protocol change and we already have some approvals. The only question I have is whether we want to rename this field from created_timestamp to start_timestamp to indicate a shared purpose, given @bboreham's great points.

Let's park this PR for now until we know more about the implementation; this can wait a few weeks (:

Thanks everybody for the discussion! Moving to draft.

@bwplotka bwplotka marked this pull request as draft August 29, 2025 03:43
@krajorama
Member

> Thanks for a feedback @krajorama 💪🏽
>
> > What will be the bandwidth and memory implication ? Float samples are the most common.
>
> Practically zero for now.
>
> * In terms of the network, sending empty field gives zero overhead, it's simply not written to the wire.

But we plan to populate the created time, usually from scrape, so it's not going to be zero; it is going to be written most of the time (for counters). I guess "no extra overhead over current RW2" is still true, due to @bboreham's argument about having a single sample per series per request. So we're good.

> * In terms of memory, for 100% of the current, known cases, it's zero. This is because Prometheus and ecosystem always send and receive a single sample, so it's the same memory footprint vs putting it in request.
>
> If in future we will use the buffering (more than one sample per series/request), you might see some impact (good question how much) vs the rest of the payload (labels). One could argue that custom marshal/unmarshal can reduce it if it's becoming a problem too.
>
> > I realize this isn't blocker for you but in Mimir we did implement RW2+created timestamp support already. We convert into our RW1 implementation +created timestamp field. So such change would impact RW1 resource usage as well. And there's also a bit where we forward PromQL results (...) So this change has huge impact potentially :(
>
> Hm... I would challenge this. Mimir clearly maintains its own proto by hand, so authors can adjust it as you need. If you want to handle CT in a different way that Prometheus (not per sample), you can create Sample2 for PRW2 and decode sample CT into series one. However, the truth is that the Prometheus code leans on treating CT per sample in all its APIs (including TSDB storage one day), so holding to per series in Mimir might not be ideal for future anyway.
>
> This breaking change will do cause your current version of Mimir to NOT see CT per series (because it's not going to be send by Prometheus). Do you depend on CT now, handle it in a certain way where this would cause a problem for your users? @krajorama?

We don't depend on Prometheus CT yet. We have our own solution (a special NaN value) to support OTLP start time in Mimir. We wanted to switch to upstream Prometheus CT, and this will add some extra work to get there. Again, not a blocker for you.

> > I'm wondering if we could have a separate DeltaSamples []DeltaSample optional field? And say that a Timeseries cannot have both at the same time.
>
> We can, we do that with histograms samples. However that would imply we will add CounterSample and rename Sample to GaugeSample, to stay consistent. As a result, what you propose is more in the direction of Prometheus scrape-side proto.
>
> This is actually an interesting and tempting idea. Might help with future iterations that might alter samples and its in-memory layouts (even if empty fields). Question is... do you think it's worth changing now in 2.0? It would be doable, although last minute change.
>
> To sum up
>
> Realistically speaking, I wonder if the overhead (currently none, with buffering maybe slightly) is significant enough to block the potential consistency and simplicity of this change 🤔
>
> Switching to the world, similar to client Prom proto model where we would have:
>
> ```protobuf
> message TimeSeries {
>   // ...
>   repeated Gauge gauges = 2;
>   repeated Histogram histograms = 3;
>   repeated Counter counters = 7;
>   repeated Delta deltas = 8;
> }
>
> message Gauge {
>   double value = 1;
>   int64 timestamp = 2;
> }
>
> message Counter {
>   double value = 1;
>   int64 timestamp = 2;
>   int64 created_timestamp = 3;
> }
>
> message Delta {
>   double value = 1;
>   int64 timestamp = 2;
>   int64 start_timestamp = 3;
> }
>
> message Histogram {
>   // ...
> }
> ```
>
> ... might actually make sense. What would be pros & cons of moving to typed sample values now? 🤔 @cstyan @krajorama

We already have pretty complicated code and issues with a lot of switches due to float vs native-histogram samples, so actually I take it back; this is probably a bad idea. We should maybe consider the opposite and merge floats and native histograms if we somehow can :)

@bwplotka
Member Author

SGTM @krajorama thanks for pointers!

FYI: Started proposal prometheus/proposals#60 to unblock this and storage implementation (WIP)

@bwplotka
Member Author

Thanks for the discussion! This PR is now replaced by a new PR, which also renames CT to ST as per the dev summit consensus: #17411

@bwplotka bwplotka closed this Oct 28, 2025
bwplotka added a commit to prometheus/docs that referenced this pull request Oct 28, 2025
Given the recent movement for Prometheus native support of ST
([PROM-60](prometheus/proposals#60)) and plans
for delta temporality
([PROM-48](prometheus/proposals#48)) it might be
beneficial to make the (hopefully) last change to Remote Write 2.0 before
stabilizing:

* Rename Created Timestamp to Start Timestamp
* Move CT/ST from TimeSeries to Sample and Histogram messages.
* Clarify optionality (0 value means unset)

See implementation change that will follow:
prometheus/prometheus#17411. Notice that only the
receiver part was implemented for CT/ST. Given no sending part was done,
we expect this feature (ST/CT) not to be in use, so the breakage impact is
minimal. This has been confirmed with early adopters like Mimir
(Grafana), Chronosphere, Thanos, Cortex and Google.

See previous discussions and 3 explicit approvals:
prometheus/prometheus#17036

Additionally:
* Updated the link to the proto
* Updated links to the new compliance tests
* Updated the native histogram spec link

Signed-off-by: bwplotka <bwplotka@gmail.com>
bwplotka added a commit to prometheus/docs that referenced this pull request Nov 4, 2025
…ST (Start Timestamp) (#2762)

* prw2(breaking): Move 2.0 CT to Sample; Rename to ST (Start Timestamp)

Given the recent movement for Prometheus native support of ST
([PROM-60](prometheus/proposals#60)) and plans
for delta temporality
([PROM-48](prometheus/proposals#48)) it might be
beneficial to make the (hopefully) last change to Remote Write 2.0 before
stabilizing:

* Rename Created Timestamp to Start Timestamp
* Move CT/ST from TimeSeries to Sample and Histogram messages.
* Clarify optionality (0 value means unset)

See implementation change that will follow:
prometheus/prometheus#17411. Notice that only the
receiver part was implemented for CT/ST. Given no sending part was done,
we expect this feature (ST/CT) not to be in use, so the breakage impact is
minimal. This has been confirmed with early adopters like Mimir
(Grafana), Chronosphere, Thanos, Cortex and Google.

See previous discussions and 3 explicit approvals:
prometheus/prometheus#17036

Additionally:
* Updated the link to the proto
* Updated links to the new compliance tests
* Updated the native histogram spec link

Signed-off-by: bwplotka <bwplotka@gmail.com>

* Update docs/specs/prw/remote_write_spec_2_0.md

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Update docs/specs/prw/remote_write_spec_2_0.md

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

---------

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>


6 participants