Skip to content

tsdb: add support for OOO exemplars in CircularExemplarStorage#17469

Merged
krajorama merged 9 commits intoprometheus:mainfrom
juliusmh:jmh/ooo_exemplars
Jan 7, 2026
Merged

tsdb: add support for OOO exemplars in CircularExemplarStorage#17469
krajorama merged 9 commits intoprometheus:mainfrom
juliusmh:jmh/ooo_exemplars

Conversation

@juliusmh
Copy link
Contributor

@juliusmh juliusmh commented Nov 3, 2025

What this PR does

Support out-of-order exemplar ingestion in the CircularExemplarStorage. The out-of-order window is controlled by the regular OOOTimeWindow flag for samples.

AddExemplar

During ingestion, exemplars are appended to / discarded from the circular buffer as usual, but their position in the (doubly) linked list is controlled to maintain temporal ordering. Adding elements to the head or the tail of the linked list is trivial (and optimized), adding elements in the middle requires us to find an insertion point, which is achieved by traversing the linked list. A back link was introduced (doubly linked list) which dramatically speeds up finding the insertion point (out-of-order exemplars lie usually very close to the newest exemplar).

Resize

Grow extends the buffer to the right and rearranges entries. Shrink the buffer if it has enough space to accommodate the new size by trimming it. Otherwise, we calculate a specific range to delete starting from ce.nextIndex to (ce.nextIndex + (oldSize - newSize)) % oldSize. Entries at these indices are removed and the remaining one are rearranged. Resizing to 0 (disable exemplars) returns early.

When rearranging buffer entries we adjust next/prev and index.newest/oldest pointers to point to the correct entries. This is handled in copyExemplarRanges.

Benchmarks

BenchmarkResizeExemplars
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Apple M4 Pro
                                            │ /tmp/bench_resize_old │       /tmp/bench_resize_new        │
                                            │        sec/op         │   sec/op     vs base               │
ResizeExemplars/grow-100000-to-200000-14              2276.8µ ± 14%   850.0µ ± 2%  -62.67% (p=0.002 n=6)
ResizeExemplars/shrink-100000-to-50000-14             1019.3µ ±  1%   281.1µ ± 3%  -72.42% (p=0.002 n=6)
ResizeExemplars/grow-1000000-to-2000000-14             22.86m ±  3%   11.09m ± 3%  -51.49% (p=0.002 n=6)
ResizeExemplars/shrink-1000000-to-500000-14           10.064m ±  1%   3.040m ± 1%  -69.80% (p=0.002 n=6)
geomean                                                4.807m         1.685m       -64.95%

                                            │ /tmp/bench_resize_old │       /tmp/bench_resize_new        │
                                            │         B/op          │     B/op      vs base              │
ResizeExemplars/grow-100000-to-200000-14               11.12Mi ± 0%   12.21Mi ± 0%  +9.82% (p=0.002 n=6)
ResizeExemplars/shrink-100000-to-50000-14              2.784Mi ± 0%   3.055Mi ± 0%  +9.74% (p=0.002 n=6)
ResizeExemplars/grow-1000000-to-2000000-14             113.6Mi ± 0%   122.1Mi ± 0%  +7.43% (p=0.002 n=6)
ResizeExemplars/shrink-1000000-to-500000-14            28.45Mi ± 0%   30.52Mi ± 0%  +7.30% (p=0.002 n=6)
geomean                                                17.79Mi        19.31Mi       +8.56%

                                            │ /tmp/bench_resize_old │       /tmp/bench_resize_new       │
                                            │       allocs/op       │ allocs/op   vs base               │
ResizeExemplars/grow-100000-to-200000-14              1036.000 ± 0%   3.000 ± 0%  -99.71% (p=0.002 n=6)
ResizeExemplars/shrink-100000-to-50000-14              511.000 ± 0%   2.000 ± 0%  -99.61% (p=0.002 n=6)
ResizeExemplars/grow-1000000-to-2000000-14           10517.000 ± 0%   4.000 ± 0%  -99.96% (p=0.002 n=6)
ResizeExemplars/shrink-1000000-to-500000-14           5131.000 ± 0%   2.000 ± 0%  -99.96% (p=0.002 n=6)
geomean                                                 2.312k        2.632       -99.89%

BenchmarkAddExemplar
goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb
cpu: Apple M4 Pro
                                                      │ /tmp/bench_add_ex_old │       /tmp/bench_add_ex_new2       │
                                                      │        sec/op         │    sec/op     vs base              │
AddExemplar/10000/1000-14                                         413.1µ ± 2%   421.1µ ± 10%  +1.96% (p=0.015 n=6)
AddExemplar/100000/1000-14                                        4.225m ± 1%   4.133m ±  5%       ~ (p=0.065 n=6)
AddExemplar/1000000/1000-14                                       42.93m ± 1%   41.94m ±  0%  -2.30% (p=0.002 n=6)
AddExemplar/10000/10000-14                                        440.2µ ± 0%   430.9µ ±  0%  -2.12% (p=0.002 n=6)
AddExemplar/100000/10000-14                                       4.243m ± 0%   4.149m ±  1%  -2.21% (p=0.002 n=6)
AddExemplar/1000000/10000-14                                      43.90m ± 0%   43.41m ±  5%       ~ (p=0.065 n=6)
AddExemplar/10000/100000-14                                       485.9µ ± 1%   462.8µ ±  0%  -4.75% (p=0.002 n=6)
AddExemplar/100000/100000-14                                      4.407m ± 2%   4.192m ±  0%  -4.89% (p=0.002 n=6)
AddExemplar/1000000/100000-14                                     45.43m ± 0%   44.45m ±  1%  -2.16% (p=0.002 n=6)
AddExemplar_OutOfOrder/empty/reverse-14                                         39.27m ±  1%
AddExemplar_OutOfOrder/empty/out-of-order-14                                    44.95m ±  0%
AddExemplar_OutOfOrder/empty/multi-in-order-14                                  459.8µ ±  1%
AddExemplar_OutOfOrder/empty/multi-reverse-14                                   461.1µ ±  0%
AddExemplar_OutOfOrder/empty/multi-out-of-order-14                              605.1µ ±  1%
AddExemplar_OutOfOrder/empty/in-order-14                                        38.86m ±  0%
AddExemplar_OutOfOrder/full-one/out-of-order-14                                 44.92m ±  0%
AddExemplar_OutOfOrder/full-one/multi-in-order-14                               462.7µ ±  1%
AddExemplar_OutOfOrder/full-one/multi-reverse-14                                462.9µ ±  0%
AddExemplar_OutOfOrder/full-one/multi-out-of-order-14                           608.0µ ±  1%
AddExemplar_OutOfOrder/full-one/in-order-14                                     38.88m ±  0%
AddExemplar_OutOfOrder/full-one/reverse-14                                      39.29m ±  0%
AddExemplar_OutOfOrder/full-multiple/multi-reverse-14                           467.9µ ±  1%
geomean                                                           4.384m        4.004m        -2.22%

                                                      │ /tmp/bench_add_ex_old │       /tmp/bench_add_ex_new2        │
                                                      │         B/op          │     B/op       vs base              │
AddExemplar/10000/1000-14                                        9.735Ki ± 0%    9.735Ki ± 0%       ~ (p=0.636 n=6)
AddExemplar/100000/1000-14                                       98.80Ki ± 0%    98.80Ki ± 0%       ~ (p=0.054 n=6)
AddExemplar/1000000/1000-14                                     1012.9Ki ± 0%   1012.9Ki ± 0%  -0.00% (p=0.032 n=6)
AddExemplar/10000/10000-14                                       9.740Ki ± 0%    9.740Ki ± 0%       ~ (p=0.675 n=6)
AddExemplar/100000/10000-14                                      98.80Ki ± 0%    98.81Ki ± 0%       ~ (p=0.591 n=6)
AddExemplar/1000000/10000-14                                    1012.9Ki ± 0%   1012.9Ki ± 0%       ~ (p=0.126 n=6)
AddExemplar/10000/100000-14                                      9.742Ki ± 0%    9.743Ki ± 0%       ~ (p=0.411 n=6)
AddExemplar/100000/100000-14                                     98.80Ki ± 0%    98.80Ki ± 0%       ~ (p=0.065 n=6)
AddExemplar/1000000/100000-14                                   1012.9Ki ± 0%   1012.9Ki ± 0%  -0.00% (p=0.048 n=6)
AddExemplar_OutOfOrder/empty/reverse-14                                          39.06Ki ± 0%
AddExemplar_OutOfOrder/empty/out-of-order-14                                     39.06Ki ± 0%
AddExemplar_OutOfOrder/empty/multi-in-order-14                                   312.0Ki ± 0%
AddExemplar_OutOfOrder/empty/multi-reverse-14                                    312.0Ki ± 0%
AddExemplar_OutOfOrder/empty/multi-out-of-order-14                               352.5Ki ± 0%
AddExemplar_OutOfOrder/empty/in-order-14                                         39.06Ki ± 0%
AddExemplar_OutOfOrder/full-one/out-of-order-14                                  39.06Ki ± 0%
AddExemplar_OutOfOrder/full-one/multi-in-order-14                                312.0Ki ± 0%
AddExemplar_OutOfOrder/full-one/multi-reverse-14                                 312.0Ki ± 0%
AddExemplar_OutOfOrder/full-one/multi-out-of-order-14                            352.5Ki ± 0%
AddExemplar_OutOfOrder/full-one/in-order-14                                      39.06Ki ± 0%
AddExemplar_OutOfOrder/full-one/reverse-14                                       39.06Ki ± 0%
AddExemplar_OutOfOrder/full-multiple/multi-reverse-14                            311.8Ki ± 0%
geomean                                                          99.15Ki         112.0Ki       +0.00%

                                                      │ /tmp/bench_add_ex_old │       /tmp/bench_add_ex_new2        │
                                                      │       allocs/op       │  allocs/op   vs base                │
AddExemplar/10000/1000-14                                          499.0 ± 0%    499.0 ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/100000/1000-14                                        4.999k ± 0%   4.999k ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/1000000/1000-14                                       50.00k ± 0%   50.00k ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/10000/10000-14                                         499.0 ± 0%    499.0 ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/100000/10000-14                                       4.999k ± 0%   4.999k ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/1000000/10000-14                                      50.00k ± 0%   50.00k ± 0%       ~ (p=1.000 n=6)
AddExemplar/10000/100000-14                                        499.0 ± 0%    499.0 ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/100000/100000-14                                      4.999k ± 0%   4.999k ± 0%       ~ (p=1.000 n=6) ¹
AddExemplar/1000000/100000-14                                     50.00k ± 0%   50.00k ± 0%       ~ (p=0.242 n=6)
AddExemplar_OutOfOrder/empty/reverse-14                                         5.000k ± 0%
AddExemplar_OutOfOrder/empty/out-of-order-14                                    5.000k ± 0%
AddExemplar_OutOfOrder/empty/multi-in-order-14                                  19.90k ± 0%
AddExemplar_OutOfOrder/empty/multi-reverse-14                                   19.91k ± 0%
AddExemplar_OutOfOrder/empty/multi-out-of-order-14                              21.67k ± 0%
AddExemplar_OutOfOrder/empty/in-order-14                                        5.000k ± 0%
AddExemplar_OutOfOrder/full-one/out-of-order-14                                 5.000k ± 0%
AddExemplar_OutOfOrder/full-one/multi-in-order-14                               19.90k ± 0%
AddExemplar_OutOfOrder/full-one/multi-reverse-14                                19.91k ± 0%
AddExemplar_OutOfOrder/full-one/multi-out-of-order-14                           21.67k ± 0%
AddExemplar_OutOfOrder/full-one/in-order-14                                     5.000k ± 0%
AddExemplar_OutOfOrder/full-one/reverse-14                                      5.000k ± 0%
AddExemplar_OutOfOrder/full-multiple/multi-reverse-14                           19.90k ± 0%
geomean                                                           4.996k        7.818k       -0.00%
¹ all samples are equal

Which issue(s) does the PR fix:

Closes #13577

Does this PR introduce a user-facing change?

[FEATURE] Circular exemplar buffer supports out-of-order exemplars. 

@juliusmh juliusmh force-pushed the jmh/ooo_exemplars branch 9 times, most recently from fc8c11a to 7325e9b Compare November 14, 2025 08:56
@juliusmh juliusmh marked this pull request as ready for review November 14, 2025 10:08
return nil
}
return storage.ErrOutOfOrderExemplar
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for samples we have ErrOutOfBounds i wonder if we shouldn't add another error type now that out of order is technically not a problem

exemplars []exemplar.Exemplar
resize int64
wantExemplars []exemplar.Exemplar
wantNextIndex int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but this is a bit of an implementation detail of the buffer, might make these tests brittle if the implemetation changes

@juliusmh
Copy link
Contributor Author

@dimitarvdimitrov I consolidated the shrink and grow functions around copyExemplarRanges which handles copying exemplars and applying shifts.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, left a few more comments, but otherwise I think this is good. Solid work, thanks @juliusmh!

@juliusmh juliusmh force-pushed the jmh/ooo_exemplars branch 3 times, most recently from a6201a6 to a916ad9 Compare November 18, 2025 23:16
@juliusmh
Copy link
Contributor Author

Summary of resize operation on the circular buffer (_ indicating empty entries):

Growing:

  • aaa<next>bbbbb -> bbbbbaaa<next>____

Shrinking:

  • aaa<next>____<next+diff> -> aaa<next>__ (if there is enough room to shrink)
  • aaa<next>bbb<next+diff>cc -> ccaaa<next>__ (if (next + diff) < l)
  • aaa<next+diff>bb<next>ccc -> bb<next>__ (else)

@juliusmh juliusmh force-pushed the jmh/ooo_exemplars branch 2 times, most recently from e9aebe3 to 7b3a6b7 Compare November 25, 2025 13:56
@krajorama krajorama self-assigned this Nov 26, 2025
@krajorama
Copy link
Member

/prombench v3.7.3

@prombot
Copy link
Contributor

prombot commented Nov 26, 2025

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-17469 and v3.7.3

After the successful deployment (check status here), the benchmarking results can be viewed at:

Available Commands:

  • To restart benchmark: /prombench restart v3.7.3
  • To stop benchmark: /prombench cancel
  • To print help: /prombench help

@krajorama
Copy link
Member

/prombench cancel

@juliusmh juliusmh force-pushed the jmh/ooo_exemplars branch 4 times, most recently from 24987ea to 69dbaeb Compare December 8, 2025 16:09
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't detect any faults in the code, but it does occur to me that we now have a fair amount of complexity supporting some uncommon operations.

Growing the buffer we might expect to happen, but should be fairly simple since we will likely have to reallocate and copy all the data.

Shrinking should only happen when someone made a mistake or gave up on exemplars, so we don't really need to do it efficiently or avoid dropping much of the data.

I'll approve to unblock, but interested in your thoughts.

@juliusmh
Copy link
Contributor Author

[..] it does occur to me that we now have a fair amount of complexity supporting some uncommon operations.

When growing or shrinking, we have to copy exemplar ranges and remap index and buffer entries. IMO, it's the only way to combine a linked-list and ensure circular behavior. Grow/shrink operations are very similar in their implementation: both rely on copyExemplarRange which holds most of the complexity, is 48 lines and well-tested.

@juliusmh juliusmh force-pushed the jmh/ooo_exemplars branch 2 times, most recently from c860e17 to f605211 Compare January 2, 2026 11:42
if indexExists {
outOfOrder := e.Ts >= ce.exemplars[idx.oldest].exemplar.Ts && e.Ts < ce.exemplars[idx.newest].exemplar.Ts
if outOfOrder {
insertionIndex = ce.findInsertionIndex(e, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that for every native histogram we'd run findInsertionIndex for every exemplar, so O(n^2) complexity.

To optimize this away, maybe we could add a cache in the Appender (headAppenderBase) that keeps the index of the last found duplicate exemplar and pass that to AddExemplar. When we go into this out-or-order code, we'd check if the cached index entry's ref equals the current entry and start iteration from the cached entry if yes.

And of course AddExemplar will have to return the last found duplicate's index to the Appender for caching.

WDYT? I can imagine doing it in a separate PR as well because it would be a quite self contained optimization + benchmark needed.

I'm assuming that since appender is single thread, we always write exemplars in order to the commit queue and WAL so this cache would work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a great improvement, but if possible I'd like to keep it out of this (already lengthy) PR. Such a cache would also have to deal with resize operations though (which transform buffer and index pointers), so perhaps it would be best to built it into the CircularExemplarBuffer or store it along side the index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the cache needs to be per appender/WAL reader, because the circular buffer has multiple parallel users, so the cache would make sense per user. But yes, it's a bit complicated - as cache invalidation always is. But conversely it means that when we shrink/grow the storage, we don't have to update the cache, just invalidate it somehow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resize operations are incredibly rare; there is no need to preserve a cache across them.

@juliusmh juliusmh requested a review from krajorama January 3, 2026 13:47
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with comment:

I'm unsure of the performance impact due to native histograms having multiple exemplars. The default number of exemplars for native histograms is 10 (at least in client_golang), and not all metrics are native histograms so it's probably ok? And technically exemplar storage is experimental , so we can experiment, but it would be nice to either implement the caching we talked about or a feature flag or prove from measurements there's no big degradation, before the next Prometheus release.

I guess we'll see in Grafana Labs measurements the impact - and there we can disable this in our fork if needed.

Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, I'm ok to go ahead with this but have noted some more points that came to me on a further read-through.

tsdb/exemplar.go Outdated
@@ -262,7 +271,7 @@ func (ce *CircularExemplarStorage) validateExemplar(idx *indexEntry, e exemplar.
// to check for that would be expensive (versus just comparing with the most recent one) especially
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may need updating? It seems overly talkative to me; telling the story of how we got here should be in the commit message not the code, but explaining what we do and don't need to check for is relevant.
Is "during the scrape" relevant at all? This code is also used when exemplars are received via remote-write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to the following docstring:

	// Reject exemplars older than the OOO time window relative to the newest exemplar.
	// Exemplars with the same timestamp are ordered by value then label hash to detect
	// duplicates without iterating through all stored exemplars, which would be too
	// expensive under lock. Exemplars with equal timestamps but different values or
	// labels are allowed to support multiple buckets of native histograms.

if indexExists {
outOfOrder := e.Ts >= ce.exemplars[idx.oldest].exemplar.Ts && e.Ts < ce.exemplars[idx.newest].exemplar.Ts
if outOfOrder {
insertionIndex = ce.findInsertionIndex(e, idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resize operations are incredibly rare; there is no need to preserve a cache across them.

if (e.Ts < newestExemplar.Ts && e.Ts <= newestExemplar.Ts-ce.oooTimeWindowMillis) ||
(e.Ts == newestExemplar.Ts && e.Value < newestExemplar.Value) ||
(e.Ts == newestExemplar.Ts && e.Value == newestExemplar.Value && e.Labels.Hash() < newestExemplar.Labels.Hash()) {
if appended {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the error message "out of order exemplar" and the counter prometheus_tsdb_exemplar_out_of_order_exemplars_total now seem mis-named; they count exemplars which are outside the OOO window, or ones with the same timestamp and different value or exemplar labels.
The help text for the counter is a little more correct - "Total number of out of order exemplar ingestion failed attempts."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair the naming was wrong even before this PR. We already said OOO for an exemplar that had the same timestamp but different value / labels.

Technically we can change it to something else since this is an experimental feature, but I think it's enough to have one big change this in this PR,

Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
…rcular exemplar storage

Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
… deleting its last exemplar

Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
…ter explain exemplar rejection; remove unrelated changes

Signed-off-by: Julius Hinze <julius.hinze@grafana.com>
@krajorama krajorama merged commit 22463b1 into prometheus:main Jan 7, 2026
44 of 46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for ingesting out of order exemplars

5 participants