refactor(appenderV2)[PART1]: add AppenderV2 interface; add TSDB AppenderV2 implementation#17629
1d6ad03 to cb83cf5
saswatamcode
left a comment
Generally looks really good, will try this out! 🚀
Some small nits
cb83cf5 to 54e51c8
Ran a quick benchmark. We expect:
EDIT: It was a benchmark issue, not a code issue. Updated the PR description with the latest numbers (we are now faster).
3e22da5 to a1d088e
krajorama
left a comment
Looking good, I've added a couple of comments.
Test coverage is pretty good. I think we need an extra test in head_append_v2_test.go for the metadata in WAL case.
Also, maybe have db_append_v2_test.go? We'll have to convert those eventually anyway.
For tests only, we had various ways of opening the DB. Reduced to one instead of:
* Open
* newTestDB
* newTestDBOpts
* openTestDB

This is so that #17629 is smaller and a bit easier. Also for test maintainability and consistency.

Signed-off-by: bwplotka <bwplotka@gmail.com>
96308f4 to 82b783f
Addressed all, PTAL! @krajorama
82b783f to e363adb
* promql: fix histogram_fraction issue when lower falls within the first bucket (#17424) Signed-off-by: Mohammad Alavi <m.alavi1986@gmail.com> * prepare release 3.8.0-rc.0 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * test: skip TestRemoteWrite_ReshardingWithoutDeadlock temporarily as flaky (#17534) (#17543) (cherry picked from commit 35c3232) Signed-off-by: machine424 <ayoubmrini424@gmail.com> Signed-off-by: Jan Fajerski <jfajersk@redhat.com> Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com> * chore(deps): bump prometheus/promci from 0.4.7 to 0.5.0 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * chore(deps): bump prometheus/promci from 0.5.0 to 0.5.1 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * chore(deps): bump prometheus/promci from 0.5.1 to 0.5.2 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * chore(deps): bump prometheus/promci from 0.5.2 to 0.5.3 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * prw2: Move Remote Write 2.0 CT to be per Sample; Rename to ST (start timestamp) (#17411) Relates to prometheus/prometheus#16944 (comment) Signed-off-by: bwplotka <bwplotka@gmail.com> (cherry picked from commit cefefc6) * chore: prepare 3.8.0-rc.1 entry Signed-off-by: bwplotka <bwplotka@gmail.com> * [chore]: bump common dep to support RFC7523 3.1 Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz> * Update Prometheus Agent doc (#17591) * Add a nav title to fix docs website generator. * Make it more clear that "Prometheus Agent" is a mode, not a seaparate service. * Add to index. * Cleanup some wording. * Add a downsides section. 
Signed-off-by: SuperQ <superq@gmail.com> (cherry picked from commit d0d2699) * chore(deps): bump github.com/prometheus/common from 0.67.3 to 0.67.4 (#17594) Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * prepare release v3.8.0-rc.1 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * prepare release v3.8.0 Signed-off-by: Jan Fajerski <jfajersk@redhat.com> * chore: Fix function name typo in createBatchSpan comment Signed-off-by: zjumathcode <pai314159@2980.com> * feat: Add flag that blocks lvl 1 compactions until upload is confirmed in an external JSON file (#17435) * Delay compactions until Thanos uploads all blocks Using Thanos sidecar with Prometheus requires us to disable TSDB compactions on Prometheus side by setting --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to the same value. See https://thanos.io/tip/components/sidecar.md. The main problem this avoids is that Prometheus might compact given block before Thanos uploads it, creating a gap in Thanos metrics. Thanos does not upload compacted blocks because that would upload the same sample multiple times. You can tell Thanos to upload compacted blocks but that is aimed at one time migrations. This patch creates a bridge between Thanos and Prometheus by allowing Prometheus to read the shipper file Thanos creates, where it tracks which blocks were already uploaded, and using that data delays compaction of blocks until they are marked as uploaded by Thanos. Thanks to this both services can coordinate with each other (in a way) and we can stop disabling compaction on Prometheus side when Thanos uploads are enabled. The reason to have this is that disabling compactions have very dramatic performance cost. Since most time series exist for longer than a single block duration (2h by default) large chunks of block index will reference the same series, so 10 * 2h blocks will each have an index that is usually fairly big and is almost the same for all 10 blocks. 
Compaction de-duplicates the index so merging 10 blocks together would leave us with a single index that is around the same size as each of these 10 2h blocks would have (plus some extra for series that only exists in some blocks, but not all). Every range query that iterates over all 10 blocks would then have to read each index and so we're doing 10x more work then if we had a single compacted block. Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> * Rename structs and functions to make this more generic Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> * Address review comments Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> * Cache UploadMeta for 1 minute Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> --------- Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> * RW2: Allow custom scope in azuread (#17483) Signed-off-by: Ben Edmunds <sammybenblue2@gmail.com> * docs: Describe how time() is set to start at 0 in unit tests The return value of functions relating to the current time, e.g. time(), is set by promtool to start at timestamp 0 at the start of a test's evaluation. This has the very nice consequence that tests can run reliably without depending on when they are run. It does, however, mean that tests will give out results that can be unexpected by users. If this behaviour is documented, then users will be empowered to write tests for their rules that use time-dependent functions. (Closes: prometheus/docs#1464) Signed-off-by: Gabriel Filion <lelutin@torproject.org> * refactor(tsdb): use one test newTestDB constructor (#17638) For tests only, we had various ways of opening DB. Reduced to one instead of: * Open * newTestDB * newTestDBOpts * openTestDB This so prometheus/prometheus#17629 is smaller and bit easier. Also for test maintainability and consistency. Signed-off-by: bwplotka <bwplotka@gmail.com> * Add start_timestamp field for unit tests. 
This commit adds support for configuring a custom start timestamp for Prometheus unit tests, allowing tests to use realistic timestamps instead of starting at Unix epoch 0. Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com> * Fix serialization for empty `ignoring()` in combination with `group_x()` Currently both the backend and frontend printers/formatters/serializers incorrectly transform the following expression: ``` up * ignoring() group_left(__name__) node_boot_time_seconds ``` ...into: ``` up * node_boot_time_seconds ``` ...which yields a different result (including the metric name in the result vs. no metric name). We need to keep empty `ignoring()` modifiers if there is a grouping modifier present. Signed-off-by: Julius Volz <julius.volz@gmail.com> * Simplify StartTime assignment in unit test setup. Remove redundant IsZero check since promqltest.LazyLoader already handles zero StartTime by defaulting to Unix epoch. Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com> * Update golangci-lint and add modernize check (#17640) * add modernize check Signed-off-by: dongjiang1989 <dongjiang1989@126.com> * fix golangci lint Signed-off-by: dongjiang1989 <dongjiang1989@126.com> --------- Signed-off-by: dongjiang1989 <dongjiang1989@126.com> * fix lint --------- Signed-off-by: Mohammad Alavi <m.alavi1986@gmail.com> Signed-off-by: Jan Fajerski <jfajersk@redhat.com> Signed-off-by: machine424 <ayoubmrini424@gmail.com> Signed-off-by: bwplotka <bwplotka@gmail.com> Signed-off-by: Jorge Turrado <jorge.turrado@mail.schwarz> Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: zjumathcode <pai314159@2980.com> Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com> Signed-off-by: Ben Edmunds <sammybenblue2@gmail.com> Signed-off-by: Gabriel Filion <lelutin@torproject.org> Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com> Signed-off-by: Julius Volz <julius.volz@gmail.com> Signed-off-by: dongjiang1989 
<dongjiang1989@126.com> Co-authored-by: Mohammad Alavi <m.alavi1986@gmail.com> Co-authored-by: Jan Fajerski <jfajersk@redhat.com> Co-authored-by: Jan Fajerski <jan--f@users.noreply.github.com> Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com> Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com> Co-authored-by: Jorge Turrado <jorge.turrado@mail.schwarz> Co-authored-by: Ben Kochie <superq@gmail.com> Co-authored-by: zjumathcode <pai314159@2980.com> Co-authored-by: Łukasz Mierzwa <l.mierzwa@gmail.com> Co-authored-by: Ben Edmunds <Tigger2014@users.noreply.github.com> Co-authored-by: Julien <291750+roidelapluie@users.noreply.github.com> Co-authored-by: Gabriel Filion <lelutin@torproject.org> Co-authored-by: Julius Volz <julius.volz@gmail.com> Co-authored-by: dongjiang <dongjiang1989@126.com> Co-authored-by: Jeanette Tan <jeanette.tan@grafana.com>
krajorama
left a comment
Epic work! I've checked the test files side by side, but not every line, just whether the new test uses the new appender or not.
I'm fine with not so many tests on metadata, but I feel like the testing on the zero sample injection is maybe too light? Not sure, it's a lot of tests.
Signed-off-by: bwplotka <bwplotka@gmail.com>
… (starting point) Signed-off-by: bwplotka <bwplotka@gmail.com>
…ting point) Signed-off-by: bwplotka <bwplotka@gmail.com>
…(starting point) Signed-off-by: bwplotka <bwplotka@gmail.com>
6e2ef5f to 5d05dcf
Signed-off-by: bwplotka <bwplotka@gmail.com> tmp Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
5d05dcf to e7e4509
krajorama
left a comment
LGTM, usual caveat from me: I'm not TSDB maintainer/generic maintainer.
@krajorama adding you #17663
@jesusvazquez @codesome @bboreham any objections to merge this/on this AppenderV2 shape?
jesusvazquez
left a comment
LGTM a good chunk of work went on here. It's true that in the past 3 years between OOO and Native Histograms the appender code kind of grew uncontrolled so I appreciate you taking the time to improve it.
Thanks everyone for review 💪🏽
	if isStale {
		// For stale values we never attempt to process metadata/exemplars, claim the success.
		return ref, nil
@bwplotka While vendoring this change into Mimir, Cursor pointed out that this might not be quite right:
When appending a stale sample, the function returns the input `ref` parameter instead of `storage.SeriesRef(s.ref)`. If the caller passed `ref=0` (unknown reference) and a series was created or looked up via `getOrCreate`, the actual series reference in `s.ref` would be lost. The normal return path at line 231 correctly returns `storage.SeriesRef(s.ref)`, making this early return inconsistent and potentially breaking series reference caching for callers.
I also notice that on line 170 and 174 ref is used rather than s.ref, which seems like the same issue? (I'm no expert on this logic though.)
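To make the concern concrete, here is a minimal, self-contained sketch of the two return styles. It is not the TSDB code: `head`, `memSeries`, `appendStaleInputRef` and `appendStaleSeriesRef` are hypothetical stand-ins, and only the names `ref`, `s.ref` and `getOrCreate` come from the comment above.

```go
package main

import "fmt"

// SeriesRef is a stand-in for storage.SeriesRef (assumption for this sketch).
type SeriesRef uint64

// memSeries is a hypothetical, stripped-down series that only holds its reference.
type memSeries struct{ ref SeriesRef }

// head is a hypothetical store mapping a label string to its series.
type head struct {
	byLabels map[string]*memSeries
	nextRef  SeriesRef
}

func newHead() *head {
	return &head{byLabels: map[string]*memSeries{}}
}

// getOrCreate returns the existing series for lbls or creates a new one.
func (h *head) getOrCreate(lbls string) *memSeries {
	if s, ok := h.byLabels[lbls]; ok {
		return s
	}
	h.nextRef++
	s := &memSeries{ref: h.nextRef}
	h.byLabels[lbls] = s
	return s
}

// appendStaleInputRef mirrors the quoted early return: it echoes the caller's ref,
// so a caller that passed 0 never learns the real reference.
func (h *head) appendStaleInputRef(ref SeriesRef, lbls string) SeriesRef {
	_ = h.getOrCreate(lbls)
	return ref
}

// appendStaleSeriesRef returns the reference of the resolved series instead.
func (h *head) appendStaleSeriesRef(ref SeriesRef, lbls string) SeriesRef {
	s := h.getOrCreate(lbls)
	return s.ref
}

func main() {
	h := newHead()
	fmt.Println(h.appendStaleInputRef(0, `{job="a"}`))  // prints 0: the created ref is lost
	fmt.Println(h.appendStaleSeriesRef(0, `{job="a"}`)) // prints 1: the ref can be cached
}
```

In this sketch, a caller that caches the returned reference would keep passing 0 with the first variant and pay for a lookup on every append.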
This is a first PR towards moving to a new, unified, cleaner, more flexible `AppenderV2` interface, for simpler review. See
NOTE: This PR does NOT change any production flow (nothing uses the new AppenderV2 logic that TSDB now implements). As a result, there's no point in running `prombench` on this PR.

Changes
Appender.
Appender.
Appender inside various helpers (e.g. createBlock). I decided not to port those, as they likely don't test appending itself. We will port them once we switch prod to AppenderV2.

I suggest reviewing commit by commit, especially if you are curious how the Appender and AppenderV2 implementations (and tests) differ.
On top of that, I removed `BenchmarkHead_WalCommit` instead of adding a V2 flow to it. This benchmark is broken on main and was odd. It feels like we should write one from scratch if we need it:

Benchmarks

For the 3-exemplar case (histogram case), the new interface is clearly faster (one append vs 4).
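The "one append vs 4" count can be illustrated with a small sketch. It assumes the older flow needs one call for the sample plus one call per exemplar, while the new flow passes the exemplars along with the sample in a single call. All types and method names here (`countingAppender`, `appendSample`, `appendExemplar`, `appendV2`, `appendOpts`) are hypothetical and are not the Prometheus signatures.

```go
package main

import "fmt"

// exemplar is a simplified stand-in for an exemplar attached to a sample.
type exemplar struct {
	value float64
	ts    int64
}

// appendOpts is a hypothetical options struct carrying per-sample extras.
type appendOpts struct {
	exemplars []exemplar
}

// countingAppender records how many appender calls a caller makes.
type countingAppender struct{ calls int }

// V1-style: the sample and each exemplar go through separate methods.
func (a *countingAppender) appendSample(series string, t int64, v float64) { a.calls++ }
func (a *countingAppender) appendExemplar(series string, e exemplar)       { a.calls++ }

// V2-style: one method takes the sample together with its exemplars.
func (a *countingAppender) appendV2(series string, t int64, v float64, o appendOpts) { a.calls++ }

// callsV1 returns the number of appender calls needed with the split methods.
func callsV1(es []exemplar) int {
	a := &countingAppender{}
	a.appendSample("s", 1, 1.0)
	for _, e := range es {
		a.appendExemplar("s", e)
	}
	return a.calls
}

// callsV2 returns the number of appender calls needed with the single method.
func callsV2(es []exemplar) int {
	a := &countingAppender{}
	a.appendV2("s", 1, 1.0, appendOpts{exemplars: es})
	return a.calls
}

func main() {
	es := []exemplar{{1, 1}, {2, 2}, {3, 3}}
	fmt.Println(callsV1(es), callsV2(es)) // prints: 4 1
}
```

The sketch only counts calls; it says nothing about where the measured speedup in the real implementation comes from.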
Does this PR introduce a user-facing change?