tsdb: add support for OOO exemplars in CircularExemplarStorage #17469
krajorama merged 9 commits into prometheus:main
Conversation
	return nil
}
return storage.ErrOutOfOrderExemplar
}
for samples we have ErrOutOfBounds i wonder if we shouldn't add another error type now that out of order is technically not a problem
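A hedged sketch of what such a split could look like, mirroring the sample path's distinction between "out of order" and "out of bounds". The names `ErrExemplarOutOfBounds` and `classifyRejection` are illustrative assumptions, not the actual Prometheus API:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sentinel errors: only an "out of order exemplar" error exists
// in the storage package today; ErrExemplarOutOfBounds is hypothetical.
var (
	ErrOutOfOrderExemplar  = errors.New("out of order exemplar")
	ErrExemplarOutOfBounds = errors.New("exemplar too old for the out-of-order time window")
)

// classifyRejection distinguishes an exemplar that is merely behind the
// newest one (now insertable in principle) from one that falls entirely
// outside the OOO window and must always be rejected.
func classifyRejection(ts, newestTs, oooWindowMillis int64) error {
	if ts <= newestTs-oooWindowMillis {
		return ErrExemplarOutOfBounds
	}
	return ErrOutOfOrderExemplar
}

func main() {
	fmt.Println(classifyRejection(10, 1000, 500))  // older than the window
	fmt.Println(classifyRejection(900, 1000, 500)) // inside the window
}
```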
exemplars     []exemplar.Exemplar
resize        int64
wantExemplars []exemplar.Exemplar
wantNextIndex int
nitpick, but this is a bit of an implementation detail of the buffer; it might make these tests brittle if the implementation changes
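One way to reduce that brittleness, sketched here under assumptions (`resizeKeepNewest` and the `int64` stand-ins are hypothetical, not the PR's test code), is to phrase table-driven cases against logical contents rather than internal buffer positions like `wantNextIndex`:

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-in for the behavior under test: resizing keeps the
// newest entries that fit into the new size (assumption, not the PR's code).
func resizeKeepNewest(exemplars []int64, newSize int) []int64 {
	if newSize >= len(exemplars) {
		return append([]int64(nil), exemplars...)
	}
	return append([]int64(nil), exemplars[len(exemplars)-newSize:]...)
}

func main() {
	// Cases assert only on logical contents (wantExemplars), so they keep
	// passing even if the buffer's internal layout changes.
	cases := []struct {
		exemplars     []int64
		resize        int
		wantExemplars []int64
	}{
		{[]int64{1, 2, 3, 4}, 2, []int64{3, 4}}, // shrink drops the oldest
		{[]int64{1, 2}, 4, []int64{1, 2}},       // grow keeps everything
	}
	for _, c := range cases {
		got := resizeKeepNewest(c.exemplars, c.resize)
		fmt.Println(reflect.DeepEqual(got, c.wantExemplars))
	}
}
```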
@dimitarvdimitrov I consolidated the shrink and grow functions around
dimitarvdimitrov left a comment:
SGTM, left a few more comments, but otherwise I think this is good. Solid work, thanks @juliusmh!
Summary of the resize operation on the circular buffer (_ indicating empty entries), covering both growing and shrinking.
/prombench v3.7.3

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

/prombench cancel
bboreham left a comment:
I didn't detect any faults in the code, but it does occur to me that we now have a fair amount of complexity supporting some uncommon operations.
Growing the buffer is something we might expect to happen, but it should be fairly simple, since we will likely have to reallocate and copy all the data anyway.
Shrinking should only happen when someone made a mistake or gave up on exemplars, so we don't really need to do it efficiently or avoid dropping much of the data.
I'll approve to unblock, but interested in your thoughts.
When growing or shrinking, we have to copy exemplar ranges and remap index and buffer entries. IMO, it's the only way to combine a linked list with circular-buffer behavior. Grow and shrink are very similar in their implementation: both rely on
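The range-copy at the heart of both operations can be illustrated with a toy circular buffer. This is a minimal sketch, not the PR's code: `resizeCircular` only moves values, while the real implementation also remaps next/prev links and per-series index pointers.

```go
package main

import "fmt"

// resizeCircular copies the newest entries of a circular buffer into a
// freshly sized one, preserving logical (oldest-to-newest) order and
// returning the new write position. Resizing to 0 disables the buffer.
func resizeCircular(buf []int64, nextIndex, newSize int) ([]int64, int) {
	if newSize == 0 { // resizing to 0 (disable exemplars) returns early
		return nil, 0
	}
	old := len(buf)
	keep := old
	if newSize < keep {
		keep = newSize // shrink: drop the oldest (oldSize - newSize) entries
	}
	out := make([]int64, newSize)
	for i := 0; i < keep; i++ {
		// Index of the i-th newest entry, walking backwards from
		// nextIndex-1 with wraparound.
		src := ((nextIndex-1-i)%old + old) % old
		out[keep-1-i] = buf[src]
	}
	// Next write position wraps when the new buffer is already full.
	return out, keep % newSize
}

func main() {
	// Physical layout [4 5 1 2 3] with nextIndex=2 holds logical order 1..5.
	out, next := resizeCircular([]int64{4, 5, 1, 2, 3}, 2, 3)
	fmt.Println(out, next) // the newest three survive: [3 4 5] 0
}
```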
if indexExists {
	outOfOrder := e.Ts >= ce.exemplars[idx.oldest].exemplar.Ts && e.Ts < ce.exemplars[idx.newest].exemplar.Ts
	if outOfOrder {
		insertionIndex = ce.findInsertionIndex(e, idx)
My only concern is that for every native histogram we'd run findInsertionIndex for every exemplar, so O(n^2) complexity.
To optimize this away, maybe we could add a cache in the Appender (headAppenderBase) that keeps the index of the last found duplicate exemplar and pass that to AddExemplar. When we go into this out-of-order code, we'd check whether the cached index entry's ref equals the current entry and, if so, start iteration from the cached entry.
And of course AddExemplar would have to return the last found duplicate's index to the Appender for caching.
WDYT? I can imagine doing it in a separate PR as well, because it would be a quite self-contained optimization, and a benchmark would be needed.
I'm assuming that since the appender is single-threaded, we always write exemplars in order to the commit queue and WAL, so this cache would work.
That sounds like a great improvement, but if possible I'd like to keep it out of this (already lengthy) PR. Such a cache would also have to deal with resize operations, though (which transform buffer and index pointers), so perhaps it would be best to build it into the CircularExemplarBuffer or store it alongside the index.
I believe the cache needs to be per appender/WAL reader, because the circular buffer has multiple parallel users, so the cache would make sense per user. But yes, it's a bit complicated - as cache invalidation always is. But conversely it means that when we shrink/grow the storage, we don't have to update the cache, just invalidate it somehow.
Resize operations are incredibly rare; there is no need to preserve a cache across them.
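Putting the thread's ideas together, a per-appender cache could look roughly like the sketch below. All names (`exemplarSearchCache`, the generation counter) are assumptions for illustration; the discussion above deliberately left the design open, and invalidation via a generation bump is just one way to handle resizes.

```go
package main

import "fmt"

// Hypothetical per-appender cache: remember where the last duplicate
// exemplar was found and reuse it as the starting point for the next
// search. A storage-side generation counter, bumped on every resize,
// invalidates stale entries instead of remapping them.
type exemplarSearchCache struct {
	seriesRef  uint64
	lastIndex  int
	generation uint64
	valid      bool
}

// startIndex returns the cached position if it is still usable for this
// series and storage generation, otherwise the caller's fallback.
func (c *exemplarSearchCache) startIndex(ref, storageGen uint64, fallback int) int {
	if c.valid && c.seriesRef == ref && c.generation == storageGen {
		return c.lastIndex
	}
	return fallback
}

// store records the last found duplicate's index for the next lookup.
func (c *exemplarSearchCache) store(ref, storageGen uint64, idx int) {
	*c = exemplarSearchCache{seriesRef: ref, lastIndex: idx, generation: storageGen, valid: true}
}

func main() {
	var c exemplarSearchCache
	fmt.Println(c.startIndex(7, 1, -1)) // cold cache: fallback -1
	c.store(7, 1, 42)
	fmt.Println(c.startIndex(7, 1, -1)) // hit: 42
	fmt.Println(c.startIndex(7, 2, -1)) // resize bumped the generation: -1
}
```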
krajorama left a comment:
Approved with comment:
I'm unsure of the performance impact due to native histograms having multiple exemplars. The default number of exemplars for native histograms is 10 (at least in client_golang), and not all metrics are native histograms, so it's probably OK? And technically exemplar storage is experimental, so we can experiment, but it would be nice to either implement the caching we talked about, add a feature flag, or prove from measurements that there's no big degradation before the next Prometheus release.
I guess we'll see in Grafana Labs measurements the impact - and there we can disable this in our fork if needed.
bboreham left a comment:
As before, I'm ok to go ahead with this but have noted some more points that came to me on a further read-through.
tsdb/exemplar.go (outdated)

@@ -262,7 +271,7 @@ func (ce *CircularExemplarStorage) validateExemplar(idx *indexEntry, e exemplar.
// to check for that would be expensive (versus just comparing with the most recent one) especially
This comment may need updating? It seems overly talkative to me; telling the story of how we got here belongs in the commit message, not the code, but explaining what we do and don't need to check for is relevant.
Is "during the scrape" relevant at all? This code is also used when exemplars are received via remote-write.
I updated it to the following docstring:
// Reject exemplars older than the OOO time window relative to the newest exemplar.
// Exemplars with the same timestamp are ordered by value then label hash to detect
// duplicates without iterating through all stored exemplars, which would be too
// expensive under lock. Exemplars with equal timestamps but different values or
// labels are allowed to support multiple buckets of native histograms.
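The ordering rule from that docstring can be written out as a self-contained predicate. This is a sketch: the `exemplar` struct and its `LabelHash` field are simplified stand-ins for the real type, where the hash comes from `e.Labels.Hash()`.

```go
package main

import "fmt"

// Simplified exemplar: LabelHash stands in for e.Labels.Hash().
type exemplar struct {
	Ts        int64
	Value     float64
	LabelHash uint64
}

// rejected mirrors the docstring above: reject exemplars older than the
// OOO window relative to the newest exemplar, and order equal-timestamp
// exemplars by value, then label hash, so equal-timestamp exemplars with
// different values or labels (native histogram buckets) still pass.
func rejected(e, newest exemplar, oooWindowMillis int64) bool {
	return (e.Ts < newest.Ts && e.Ts <= newest.Ts-oooWindowMillis) ||
		(e.Ts == newest.Ts && e.Value < newest.Value) ||
		(e.Ts == newest.Ts && e.Value == newest.Value && e.LabelHash < newest.LabelHash)
}

func main() {
	newest := exemplar{Ts: 1000, Value: 2, LabelHash: 7}
	fmt.Println(rejected(exemplar{Ts: 100}, newest, 500))                          // too old: true
	fmt.Println(rejected(exemplar{Ts: 1000, Value: 3, LabelHash: 1}, newest, 500)) // higher value at same ts: false
}
```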
if (e.Ts < newestExemplar.Ts && e.Ts <= newestExemplar.Ts-ce.oooTimeWindowMillis) ||
	(e.Ts == newestExemplar.Ts && e.Value < newestExemplar.Value) ||
	(e.Ts == newestExemplar.Ts && e.Value == newestExemplar.Value && e.Labels.Hash() < newestExemplar.Labels.Hash()) {
	if appended {
Both the error message "out of order exemplar" and the counter prometheus_tsdb_exemplar_out_of_order_exemplars_total now seem mis-named; they count exemplars which are outside the OOO window, or ones with the same timestamp and different value or exemplar labels.
The help text for the counter is a little more correct - "Total number of out of order exemplar ingestion failed attempts."
There was a problem hiding this comment.
To be fair, the naming was wrong even before this PR: we already said OOO for an exemplar that had the same timestamp but different value/labels.
Technically we could change it to something else since this is an experimental feature, but I think one big change in this PR is enough.
What this PR does

Support out-of-order exemplar ingestion in the CircularExemplarStorage. The out-of-order window is controlled by the regular OOOTimeWindow flag for samples.

AddExemplar

During ingestion, exemplars are appended to or discarded from the circular buffer as usual, but their position in the (doubly) linked list is controlled to maintain temporal ordering. Adding elements at the head or the tail of the linked list is trivial (and optimized); adding elements in the middle requires finding an insertion point by traversing the linked list. A back link was introduced (making it a doubly linked list), which dramatically speeds up finding the insertion point, since out-of-order exemplars usually lie very close to the newest exemplar.
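The backwards search described above can be sketched with a plain doubly linked list. This is illustrative only: the real storage links buffer indices rather than heap nodes, and `insertFromTail` is a hypothetical name.

```go
package main

import "fmt"

// node is a doubly linked list element ordered by timestamp; prev is the
// back link that makes searching from the newest entry cheap.
type node struct {
	ts         int64
	prev, next *node
}

// insertFromTail splices ts into the ascending list, searching backwards
// from the tail, and returns the (possibly new) tail.
func insertFromTail(tail *node, ts int64) *node {
	n := &node{ts: ts}
	if tail == nil {
		return n
	}
	if ts >= tail.ts { // fast path: append as the new newest entry
		n.prev, tail.next = tail, n
		return n
	}
	// Out-of-order: walk back until the predecessor is no longer newer.
	cur := tail
	for cur.prev != nil && cur.prev.ts > ts {
		cur = cur.prev
	}
	// Splice n in just before cur.
	n.next, n.prev = cur, cur.prev
	if cur.prev != nil {
		cur.prev.next = n
	}
	cur.prev = n
	return tail
}

func main() {
	var tail *node
	for _, ts := range []int64{10, 30, 20, 40} { // 20 arrives out of order
		tail = insertFromTail(tail, ts)
	}
	head := tail
	for head.prev != nil {
		head = head.prev
	}
	for n := head; n != nil; n = n.next {
		fmt.Print(n.ts, " ") // 10 20 30 40
	}
	fmt.Println()
}
```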
Resize

Grow extends the buffer to the right and rearranges entries. Shrink trims the buffer if it has enough empty space to accommodate the new size. Otherwise, we calculate a specific range to delete, starting from ce.nextIndex to (ce.nextIndex + (oldSize - newSize)) % oldSize. Entries at these indices are removed and the remaining ones are rearranged. Resizing to 0 (disabling exemplars) returns early. When rearranging buffer entries, we adjust next/prev and index.newest/oldest pointers to point to the correct entries. This is handled in copyExemplarRanges.

Benchmarks
BenchmarkResizeExemplars
BenchmarkAddExemplar
Which issue(s) does the PR fix:
Closes #13577
Does this PR introduce a user-facing change?