tsdb: Fix commit order for mixed-typed series #17071

Merged
beorn7 merged 5 commits into main from beorn7/tsdb
Sep 18, 2025

Conversation

@beorn7
Member

@beorn7 beorn7 commented Aug 21, 2025

Fixes #15177

The basic idea here is to divide the samples to be committed into (sub-)batches whenever we detect that the same series receives a sample of a type different from the previous one. We then commit those batches one after another, and we log them to the WAL one after another, so that we hit both birds with the same stone. The cost of the stone is that we have to track the sample type of each series in a map. Given the number of things we already track in the appender, I hope that it won't make a dent. Note that this even addresses the NHCB special case in the WAL.
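
A rough sketch of that sub-batching idea (hypothetical minimal types and names, not the actual Prometheus implementation):

```go
package main

import "fmt"

// sampleType is a hypothetical stand-in for the sample kinds the head
// appender distinguishes (float, integer histogram, float histogram).
type sampleType int

const (
	floatSample sampleType = iota
	histogramSample
	floatHistogramSample
)

type sample struct {
	seriesRef uint64
	typ       sampleType
	t         int64
	v         float64
}

// splitIntoBatches walks the appended samples in order and starts a new
// sub-batch whenever a series receives a sample of a different type than
// its previous one in the current batch. Each batch can then be logged
// to the WAL and committed before the next, preserving append order
// across type changes.
func splitIntoBatches(samples []sample) [][]sample {
	var (
		batches  [][]sample
		current  []sample
		lastType = map[uint64]sampleType{} // per-series type within the current batch
	)
	for _, s := range samples {
		if prev, ok := lastType[s.seriesRef]; ok && prev != s.typ {
			// Type change for this series: seal the current batch.
			batches = append(batches, current)
			current = nil
			// The new batch starts fresh, so reset the tracking map.
			lastType = map[uint64]sampleType{}
		}
		lastType[s.seriesRef] = s.typ
		current = append(current, s)
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	mixed := []sample{
		{1, floatSample, 1, 0},
		{1, histogramSample, 2, 0},
		{1, floatSample, 3, 0},
	}
	fmt.Println(len(splitIntoBatches(mixed))) // 3: the type flips twice
}
```

Samples of different series never force a split; only a type flip within the same series seals a batch, so the common single-type case still yields exactly one batch.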

This does a few other things that I could not resist picking up along the way:

  • It adds more zeropool.Pools and uses the existing ones more consistently. My understanding is that this was merely an oversight. Maybe the additional pool usage will compensate for the increased memory demand of the map.

  • Create the synthetic zero sample for histograms a bit more carefully. So far, we created a sample that always went into its own chunk. Now we create a sample that is compatible enough with the following sample to go into the same chunk. This changed the test results quite a bit. But IMHO it makes much more sense now.

  • Continuing past efforts, I renamed more instances of Samples to Floats to keep things consistent and less confusing. (Histogram samples are also samples.) I still avoided changing names in other packages.

  • I added a few shortcuts h := a.head, saving many characters.
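
The synthetic zero sample point above can be sketched roughly as follows (a hypothetical minimal histogram type; the real appender works on the full native-histogram structs):

```go
package main

import "fmt"

// histogram is a hypothetical minimal stand-in for a native histogram;
// only the fields relevant to chunk compatibility are modeled here.
type histogram struct {
	Schema        int32
	ZeroThreshold float64
	Count         uint64
	Sum           float64
}

// syntheticZero builds a zero sample that is compatible enough with the
// following sample to land in the same chunk: it copies the schema and
// zero threshold instead of using hard-coded defaults that would force
// a chunk cut.
func syntheticZero(next *histogram) *histogram {
	return &histogram{
		Schema:        next.Schema,
		ZeroThreshold: next.ZeroThreshold,
		// Count and Sum stay zero: this is the synthetic zero sample.
	}
}

func main() {
	next := &histogram{Schema: 3, ZeroThreshold: 1e-128, Count: 42, Sum: 100}
	zero := syntheticZero(next)
	fmt.Println(zero.Schema == next.Schema && zero.Count == 0) // true
}
```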

TODOs:

  • Make sure this isn't blowing up memory consumption too much.
  • Address @krajorama's TODOs about commit order and staleness handling.

Which issue(s) does the PR fix:

Fixes #15177

Does this PR introduce a user-facing change?

[BUGFIX] Correctly handle appending mixed-typed samples to the same series. #17071

@beorn7 beorn7 requested a review from jesusvazquez as a code owner August 21, 2025 23:21
@beorn7 beorn7 marked this pull request as draft August 21, 2025 23:24
@beorn7 beorn7 requested review from krajorama and removed request for jesusvazquez August 21, 2025 23:24
@beorn7
Member Author

beorn7 commented Aug 21, 2025

/cc @NeerajGartia21 @juliusmh

@bwplotka
Member

Continuing past efforts, I changed more namings of Samples into Floats to keep things consistent and less confusing. (Histogram samples are also samples.) I still avoided changing names in other packages.

I wished we did that on PRW 2.0 too (Samples -> Floats) 😱 (https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/#protobuf-message).. should we still do this while we can? (:

Also related #17036 (comment)

Member

@krajorama krajorama left a comment

Looks like a good solution.

Please reuse promql/promqltest/testdata/native_histograms.test from #16823.

Regarding the TODOs: I've commented about the memory usage in the appropriate function.

Regarding staleness: since you ensure the commit order, we should be able to just use the memSeries last(Float)HistogramValue on commit, as in my PR. The caveat is that the staleness marker is always a float, so it would create a new batch for histograms all the time. So I think you need the check for when to convert a staleness NaN to a histogram NaN in two places: 1. when adding the staleness marker to the batch (an optimization to avoid starting a new batch all the time); 2. on commit (if the series was in an older batch, and also as a failsafe in case 1. was wrong).

@beorn7
Member Author

beorn7 commented Sep 4, 2025

@bwplotka WRT "I wished we did that on PRW 2.0 too (Samples -> Floats)": Note that the naming changes here are just internal variable names. I have so far not changed anything in protocols, JSON serialization etc. That's a whole other story. (We can discuss it, of course, but the changes here are just continuing internal name changes and should not be taken as a strong reason to change protocols.)

@beorn7
Member Author

beorn7 commented Sep 4, 2025

@krajorama test cases have been added now (I indeed forgot to add that file). And 🎉 the tests are passing.

@beorn7
Member Author

beorn7 commented Sep 4, 2025

WRT staleness: I think I have understood @krajorama's point. But I need more time to work on it. Probably next week…

In the meantime, I'll rebase this on top of main and run PromBench so that we see if this goes horribly wrong.

@beorn7
Member Author

beorn7 commented Sep 4, 2025

/prombench main

@prombot
Contributor

prombot commented Sep 4, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-17071 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@beorn7
Member Author

beorn7 commented Sep 5, 2025

/prombench cancel

@prombot
Contributor

prombot commented Sep 5, 2025

Benchmark cancel is in progress.

@beorn7
Member Author

beorn7 commented Sep 5, 2025

I don't see any significant differences in CPU or memory usage patterns.

@beorn7 beorn7 force-pushed the beorn7/tsdb branch 2 times, most recently from d3e07b9 to 3360d93 Compare September 9, 2025 17:03
@beorn7 beorn7 marked this pull request as ready for review September 9, 2025 17:24
@beorn7 beorn7 requested a review from roidelapluie as a code owner September 9, 2025 17:24
@beorn7
Member Author

beorn7 commented Sep 9, 2025

@krajorama I hope I have implemented the staleness handling as you envisioned – and more importantly in the generally correct way. Tests are passing, at least. 🦆

@beorn7 beorn7 removed the request for review from roidelapluie September 9, 2025 17:25
Member

@krajorama krajorama left a comment

Almost there, I've commented on the staleness handling.
To prove those cases you could and probably should add tests where you have two parallel commits.

Member Author

@beorn7 beorn7 left a comment

As explained in the comment, I'm not sure if the detailed check for the previous sample is worth it, as it is only an optimization (isn't it?!?), and we are dealing with an exceedingly rare case here.

I have added code comments to explain that.

I'm not sure yet how to create a test that tests concurrent appends in a meaningful way. Let's first find out if my approach of not converting in case of concurrent appends will work.

Member

@krajorama krajorama left a comment

one more question

@beorn7 beorn7 force-pushed the beorn7/tsdb branch 2 times, most recently from 4e4c7d8 to 484f115 Compare September 17, 2025 17:18
This exposes the omission of float histograms from the rollback.

Signed-off-by: beorn7 <beorn@grafana.com>
Fixes #15177

The basic idea here is to divide the samples to be committed into (sub-)
batches whenever we detect that the same series receives a sample of a
type different from the previous one. We then commit those batches one
after another, and we log them to the WAL one after another, so that
we hit both birds with the same stone. The cost of the stone is that
we have to track the sample type of each series in a map. Given the
amount of things we already track in the appender, I hope that it
won't make a dent. Note that this even addresses the NHCB special case
in the WAL.

This does a few other things that I could not resist picking up along
the way:

- It adds more zeropool.Pools and uses the existing ones more
  consistently. My understanding is that this was merely an oversight.
  Maybe the additional pool usage will compensate for the increased
  memory demand of the map.

- Create the synthetic zero sample for histograms a bit more
  carefully. So far, we created a sample that always went into its own
  chunk. Now we create a sample that is compatible enough with the
  following sample to go into the same chunk. This changed the test
  results quite a bit. But IMHO it makes much more sense now.

- Continuing past efforts, I renamed more instances of `Samples` to
  `Floats` to keep things consistent and less confusing. (Histogram
  samples are also samples.) I still avoided changing names in other
  packages.

- I added a few shortcuts `h := a.head`, saving many characters.

TODOs:

- Address @krajorama's TODOs about commit order and staleness handling.

Signed-off-by: beorn7 <beorn@grafana.com>
Regression test for:
- #14172
- #15177

Test cases are by @krajorama, taken from commit b48bc9d.

Signed-off-by: beorn7 <beorn@grafana.com>
With the fixed commit order, we can now handle the conversion of float
staleness markers to histogram staleness markers in a more direct way.

Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7
Member Author

beorn7 commented Sep 17, 2025

The plan for the agent appender: Let's first get this merged, and then fix it separately (if needed at all).

Member

@krajorama krajorama left a comment

LGTM

@beorn7
Member Author

beorn7 commented Sep 18, 2025

Summary of our thoughts about the appender in agent mode (discussed with @krajorama):

  • The agent appender only appends to WAL and does not care about staleness marker optimization.
  • As it is not writing to chunks, issue #14172 ("promql engine does not return expected results with mixed floats+histograms") cannot happen.
  • The problem with WAL writing is still relevant, in principle, but since the agent is only scraping (and not ingesting OTLP or PRW), the case of multiple samples with different sample types for the same series in the same commit doesn't really happen.

For now, we have concluded that we don't need to change anything for agent mode.

Note for posterity: I could see the theoretical possibility of triggering the problem by mixing float histograms and integer histograms, as some exposition formats allow mixing both. But that's really an intersection of a number of circumstances where every single one is already highly unusual:

  • Scrape from a target that exposes multiple samples for the same series with explicit timestamps. (This was, for a long time, considered an invalid exposition and only worked "by accident". It was formalized in OM, but is rarely used.)
  • The samples in question must include at least one float histogram. Float histograms are rarely exposed in scrapes. There are very few edge cases where they make sense in instrumentation. They are usually the product of recording rules and only show up in expositions via federation. Currently, instrumentation libraries don't even support float histograms (which might change, to support aforementioned edge cases). OM v1 does not support float histograms at all.
  • The samples in question must also include at least one integer histogram. Integer histograms are common, of course, but the case that a float histogram (rare on its own) is mixed with an integer histogram for the same series would be super weird.
  • Finally, the float histogram must come earlier than the integer histogram to trigger the issue. (Which is not rare, just 50/50, but that still halves the odds once more.)

@beorn7 beorn7 merged commit d5cc5e2 into main Sep 18, 2025
46 checks passed
@beorn7 beorn7 deleted the beorn7/tsdb branch September 18, 2025 11:55
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 9, 2025
There is an optimization in
prometheus/prometheus#17071
that ensures that the injected zero histogram sample has the same schema
as the next sample.

It turns out that if there are at least three samples:
a normal sample (schema>0), a zero sample (schema=0), a normal sample (schema>0),
then the histogram rate function will find the lowest schema and normalize
all samples to it, meaning that the normal samples will be downscaled to
schema 0. In dashboards it will look as if we lost resolution.

See https://github.com/prometheus/prometheus/blob/9e4d23ddafcdc00021cd8630e78bb819e84ccac9/promql/functions.go#L344

So the optimization is actually needed to not lose resolution on
read-back. Alternatively, we could use per-sample CT as metadata
instead, but that would be a big rewrite of TSDB and PromQL.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
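
The downscaling effect this commit message describes can be illustrated with a minimal sketch (not the actual rate implementation, just the min-schema rule it applies across the samples in a range):

```go
package main

import "fmt"

// minSchema mirrors what the histogram rate calculation effectively
// does: all histogram samples in the range are normalized down to the
// lowest (coarsest) schema present. An injected zero sample at schema 0
// would therefore drag its schema-3 neighbours down to schema 0.
func minSchema(schemas []int32) int32 {
	m := schemas[0]
	for _, s := range schemas[1:] {
		if s < m {
			m = s
		}
	}
	return m
}

func main() {
	// normal (schema 3), naive zero sample (schema 0), normal (schema 3):
	fmt.Println(minSchema([]int32{3, 0, 3})) // 0: resolution lost on read-back
	// with the optimization, the zero sample copies the next sample's schema:
	fmt.Println(minSchema([]int32{3, 3, 3})) // 3: resolution preserved
}
```
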
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 9, 2025
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 10, 2025
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 10, 2025
krajorama added a commit that referenced this pull request Nov 25, 2025
The original implementation in #9705 for native histograms carried a
piece of technical debt (#15177): samples were committed ordered by
type, not by their append order. This was fixed in #17071, but this
docstring was not updated.

I've also taken the liberty to mention that we do not order by
timestamp either, so it is possible to append out-of-order samples.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit that referenced this pull request Nov 25, 2025
wbollock pushed a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Successfully merging this pull request may close these issues.

TSDB Commit should honor ordering between floats and native histograms

4 participants