tsdb: Fix commit order for mixed-typed series #17071

Merged
beorn7 merged 5 commits into main from beorn7/tsdb
Sep 18, 2025

Conversation

@beorn7
Member

@beorn7 beorn7 commented Aug 21, 2025

Fixes #15177

The basic idea here is to divide the samples to be committed into (sub-)batches whenever we detect that the same series receives a sample of a type different from the previous one. We then commit those batches one after another, and we log them to the WAL one after another, so that we hit both birds with the same stone. The cost of the stone is that we have to track the sample type of each series in a map. Given the number of things we already track in the appender, I hope that it won't make a dent. Note that this even addresses the NHCB special case in the WAL.
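
A rough sketch of that sub-batching idea (hypothetical minimal types and names, not the actual Prometheus implementation):

```go
package main

import "fmt"

// sampleType is a hypothetical stand-in for the sample kinds the head
// appender distinguishes (float, integer histogram, float histogram).
type sampleType int

const (
	floatSample sampleType = iota
	histogramSample
	floatHistogramSample
)

type sample struct {
	seriesRef uint64
	typ       sampleType
	t         int64
	v         float64
}

// splitIntoBatches walks the appended samples in order and starts a new
// sub-batch whenever a series receives a sample of a different type than
// its previous one in the current batch. Each batch can then be logged
// to the WAL and committed before the next, preserving append order
// across type changes.
func splitIntoBatches(samples []sample) [][]sample {
	var (
		batches  [][]sample
		current  []sample
		lastType = map[uint64]sampleType{} // per-series type within the current batch
	)
	for _, s := range samples {
		if prev, ok := lastType[s.seriesRef]; ok && prev != s.typ {
			// Type change for this series: seal the current batch.
			batches = append(batches, current)
			current = nil
			// The new batch starts fresh, so reset the tracking map.
			lastType = map[uint64]sampleType{}
		}
		lastType[s.seriesRef] = s.typ
		current = append(current, s)
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	mixed := []sample{
		{1, floatSample, 1, 0},
		{1, histogramSample, 2, 0},
		{1, floatSample, 3, 0},
	}
	fmt.Println(len(splitIntoBatches(mixed))) // 3: the type flips twice
}
```

Samples of different series never force a split; only a type flip within the same series seals a batch, so the common single-type case still yields exactly one batch.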

This does a few other things that I could not resist picking up along the way:

  • It adds more zeropool.Pools and uses the existing ones more consistently. My understanding is that this was merely an oversight. Maybe the additional pool usage will compensate for the increased memory demand of the map.

  • Create the synthetic zero sample for histograms a bit more carefully. So far, we created a sample that always went into its own chunk. Now we create a sample that is compatible enough with the following sample to go into the same chunk. This changed the test results quite a bit. But IMHO it makes much more sense now.

  • Continuing past efforts, I renamed more instances of Samples to Floats to keep things consistent and less confusing. (Histogram samples are also samples.) I still avoided changing names in other packages.

  • I added a few shortcuts h := a.head, saving many characters.
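
The synthetic zero sample point above can be sketched roughly as follows (a hypothetical minimal histogram type; the real appender works on the full native-histogram structs):

```go
package main

import "fmt"

// histogram is a hypothetical minimal stand-in for a native histogram;
// only the fields relevant to chunk compatibility are modeled here.
type histogram struct {
	Schema        int32
	ZeroThreshold float64
	Count         uint64
	Sum           float64
}

// syntheticZero builds a zero sample that is compatible enough with the
// following sample to land in the same chunk: it copies the schema and
// zero threshold instead of using hard-coded defaults that would force
// a chunk cut.
func syntheticZero(next *histogram) *histogram {
	return &histogram{
		Schema:        next.Schema,
		ZeroThreshold: next.ZeroThreshold,
		// Count and Sum stay zero: this is the synthetic zero sample.
	}
}

func main() {
	next := &histogram{Schema: 3, ZeroThreshold: 1e-128, Count: 42, Sum: 100}
	zero := syntheticZero(next)
	fmt.Println(zero.Schema == next.Schema && zero.Count == 0) // true
}
```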

TODOs:

  • Make sure this isn't blowing up memory consumption too much.
  • Address @krajorama's TODOs about commit order and staleness handling.

Which issue(s) does the PR fix:

Fixes #15177

Does this PR introduce a user-facing change?

[BUGFIX] Correctly handle appending mixed-typed samples to the same series. #17071

@beorn7 beorn7 requested a review from jesusvazquez as a code owner August 21, 2025 23:21
@beorn7 beorn7 marked this pull request as draft August 21, 2025 23:24
@beorn7 beorn7 requested review from krajorama and removed request for jesusvazquez August 21, 2025 23:24
@beorn7
Member Author

beorn7 commented Aug 21, 2025

/cc @NeerajGartia21 @juliusmh

@bwplotka
Member

Continuing past efforts, I changed more namings of Samples into Floats to keep things consistent and less confusing. (Histogram samples are also samples.) I still avoided changing names in other packages.

I wished we did that on PRW 2.0 too (Samples -> Floats) 😱 (https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/#protobuf-message).. should we still do this while we can? (:

Also related #17036 (comment)

Member

@krajorama krajorama left a comment

Looks like a good solution.

Please reuse promql/promqltest/testdata/native_histograms.test from #16823.

Regarding the TODOs: I've commented about the memory usage in the appropriate function.

Regarding staleness: since you ensure the commit order, we should be able to just use the memSeries last(Float)HistogramValue on commit, as in my PR. The caveat is that the staleness marker is always a float, so it would create a new batch for histograms all the time. So I think you need the check for when to convert a staleness NaN to a histogram NaN in two places: 1. when adding the staleness marker to the batch (an optimization to avoid starting a new batch all the time); 2. on commit (if the series was in an older batch, and also as a failsafe in case 1. was wrong).

@beorn7
Member Author

beorn7 commented Sep 4, 2025

@bwplotka WRT "I wished we did that on PRW 2.0 too (Samples -> Floats)": Note that the naming changes here are just internal variable names. I have so far not changed anything in protocols, JSON serialization etc. That's a whole other story. (We can discuss it, of course, but the changes here are just continuing internal name changes and should not be taken as a strong reason to change protocols.)

@beorn7
Member Author

beorn7 commented Sep 4, 2025

@krajorama test cases have been added now (I indeed forgot to add that file). And 🎉 the tests are passing.

@beorn7
Member Author

beorn7 commented Sep 4, 2025

WRT staleness: I think I have understood @krajorama's point. But I need more time to work on it. Probably next week…

In the meantime, I'll rebase this on top of main and run PromBench so that we see if this goes horribly wrong.

@beorn7
Member Author

beorn7 commented Sep 4, 2025

/prombench main

@prombot
Contributor

prombot commented Sep 4, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-17071 and main

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart main
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@beorn7
Member Author

beorn7 commented Sep 5, 2025

/prombench cancel

@prombot
Contributor

prombot commented Sep 5, 2025

Benchmark cancel is in progress.

@beorn7
Member Author

beorn7 commented Sep 5, 2025

I don't see any significant differences in CPU or memory usage patterns.

@beorn7 beorn7 force-pushed the beorn7/tsdb branch 2 times, most recently from d3e07b9 to 3360d93 Compare September 9, 2025 17:03
@beorn7 beorn7 marked this pull request as ready for review September 9, 2025 17:24
@beorn7 beorn7 requested a review from roidelapluie as a code owner September 9, 2025 17:24
@beorn7
Member Author

beorn7 commented Sep 9, 2025

@krajorama I hope I have implemented the staleness handling as you envisioned – and more importantly in the generally correct way. Tests are passing, at least. 🦆

@beorn7 beorn7 removed the request for review from roidelapluie September 9, 2025 17:25
Member

@krajorama krajorama left a comment

Almost there, I've commented on the staleness handling.
To prove those cases you could and probably should add tests where you have two parallel commits.

Member Author

@beorn7 beorn7 left a comment

As explained in the comment, I'm not sure if the detailed check for the previous sample is worth it, as it is only an optimization (isn't it?!?), and we are dealing with an exceedingly rare case here.

I have added code comments to explain that.

I'm not sure yet how to create a test that tests concurrent appends in a meaningful way. Let's first find out if my approach of not converting in case of concurrent appends will work.

Member

@krajorama krajorama left a comment

one more question

@beorn7 beorn7 force-pushed the beorn7/tsdb branch 2 times, most recently from 4e4c7d8 to 484f115 Compare September 17, 2025 17:18
This exposes the omission of float histograms from the rollback.

Signed-off-by: beorn7 <beorn@grafana.com>
Fixes #15177

The basic idea here is to divide the samples to be committed into (sub-)
batches whenever we detect that the same series receives a sample of a
type different from the previous one. We then commit those batches one
after another, and we log them to the WAL one after another, so that
we hit both birds with the same stone. The cost of the stone is that
we have to track the sample type of each series in a map. Given the
amount of things we already track in the appender, I hope that it
won't make a dent. Note that this even addresses the NHCB special case
in the WAL.

This does a few other things that I could not resist picking up along
the way:

- It adds more zeropool.Pools and uses the existing ones more
  consistently. My understanding is that this was merely an oversight.
  Maybe the additional pool usage will compensate for the increased
  memory demand of the map.

- Create the synthetic zero sample for histograms a bit more
  carefully. So far, we created a sample that always went into its own
  chunk. Now we create a sample that is compatible enough with the
  following sample to go into the same chunk. This changed the test
  results quite a bit. But IMHO it makes much more sense now.

- Continuing past efforts, I renamed more instances of `Samples` to
  `Floats` to keep things consistent and less confusing. (Histogram
  samples are also samples.) I still avoided changing names in other
  packages.

- I added a few shortcuts `h := a.head`, saving many characters.

TODOs:

- Address @krajorama's TODOs about commit order and staleness handling.

Signed-off-by: beorn7 <beorn@grafana.com>
Regression test for:
- #14172
- #15177

Test cases are by @krajorama, taken from commit b48bc9d.

Signed-off-by: beorn7 <beorn@grafana.com>
With the fixed commit order, we can now handle the conversion of float
staleness markers to histogram staleness markers in a more direct way.

Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7
Member Author

beorn7 commented Sep 17, 2025

The plan for the agent appender: Let's first get this merged, and then fix it separately (if needed at all).

Member

@krajorama krajorama left a comment

LGTM

@beorn7
Member Author

beorn7 commented Sep 18, 2025

Summary of our thoughts about the appender in agent mode (discussed with @krajorama):

  • The agent appender only appends to WAL and does not care about staleness marker optimization.
  • As it is not writing to chunks, issue #14172 ("promql engine does not return expected results with mixed floats+histograms") cannot happen.
  • The problem with WAL writing is still relevant, in principle, but since the agent is only scraping (and not ingesting OTLP or PRW), the case of multiple samples with different sample types for the same series in the same commit doesn't really happen.

For now, we have concluded that we don't need to change anything for agent mode.

Note for posterity: I could see the theoretical possibility of triggering the problem by mixing float histograms and integer histograms, as some exposition formats allow mixing both. But that's really an intersection of a number of circumstances where every single one is already highly unusual:

  • Scrape from a target that exposes multiple samples for the same series with explicit timestamps. (This was, for a long time, considered an invalid exposition and only worked "by accident". It was formalized in OM, but is rarely used.)
  • The samples in question must include at least one float histogram. Float histograms are rarely exposed in scrapes. There are very few edge cases where they make sense in instrumentation. They are usually the product of recording rules and only show up in expositions via federation. Currently, instrumentation libraries don't even support float histograms (which might change, to support aforementioned edge cases). OM v1 does not support float histograms at all.
  • The samples in question must also include at least one integer histogram. Integer histograms are common, of course, but the case that a float histogram (rare on its own) is mixed with an integer histogram for the same series would be super weird.
  • Finally, the float histogram must come earlier than the integer histogram to trigger the issue. (Which is not rare, just 50/50, but that still halves the odds once more.)

@beorn7 beorn7 merged commit d5cc5e2 into main Sep 18, 2025
46 checks passed
@beorn7 beorn7 deleted the beorn7/tsdb branch September 18, 2025 11:55
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 9, 2025
There is an optimization in
prometheus/prometheus#17071
that ensures that the injected zero histogram sample has the same schema
as the next sample.

It turns out that if there are at least three samples:
a normal sample (schema>0), a zero sample (schema=0), a normal sample (schema>0),
then the histogram rate function will find the lowest schema and normalize
all samples to it, meaning that the normal samples will be downscaled to
schema 0. In dashboards it will look as if we lost resolution.

See https://github.com/prometheus/prometheus/blob/9e4d23ddafcdc00021cd8630e78bb819e84ccac9/promql/functions.go#L344

So the optimization is actually needed to not lose resolution on
read-back. Alternatively, we could use per-sample CT as metadata
instead, but that would be a big rewrite of TSDB and PromQL.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
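
The downscaling effect this commit message describes can be illustrated with a minimal sketch (not the actual rate implementation, just the min-schema rule it applies across the samples in a range):

```go
package main

import "fmt"

// minSchema mirrors what the histogram rate calculation effectively
// does: all histogram samples in the range are normalized down to the
// lowest (coarsest) schema present. An injected zero sample at schema 0
// would therefore drag its schema-3 neighbours down to schema 0.
func minSchema(schemas []int32) int32 {
	m := schemas[0]
	for _, s := range schemas[1:] {
		if s < m {
			m = s
		}
	}
	return m
}

func main() {
	// normal (schema 3), naive zero sample (schema 0), normal (schema 3):
	fmt.Println(minSchema([]int32{3, 0, 3})) // 0: resolution lost on read-back
	// with the optimization, the zero sample copies the next sample's schema:
	fmt.Println(minSchema([]int32{3, 3, 3})) // 3: resolution preserved
}
```
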
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 9, 2025
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 10, 2025
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Oct 10, 2025
krajorama added a commit that referenced this pull request Nov 25, 2025
The original implementation in #9705 for native histograms carried a
piece of technical debt (#15177): samples were committed ordered by
type, not by their append order. This was fixed in #17071, but this
docstring was not updated.

I've also taken the liberty to mention that we do not order by
timestamp either, so it is possible to append out-of-order samples.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit that referenced this pull request Nov 25, 2025
wbollock pushed a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Successfully merging this pull request may close these issues.

TSDB Commit should honor ordering between floats and native histograms

4 participants