
refactor(remote)[PART4]: switch remote receivers (RW, OTEL) to AppenderV2#17675

Closed
bwplotka wants to merge 13 commits into main from bwplotka/a2-remote

Conversation

@bwplotka (Member) commented Dec 12, 2025

Related to #17632

Depends on:

This PR:

  • Adds support for AppenderV2 to our appender mock.
  • Adds a common way of handling Append errors and statistics.
  • Switches storage/remote to storage.AppenderV2. This means OTLP and RW receiving will now use strictly the AppenderV2 TSDB implementation.
  • Removes otlptranslator/prometheusremotewrite/CombinedAppender. This is a breaking change for Mimir (we ensured with @krajorama that Mimir can now nicely use the standard AppenderV2). I think this is the point when we could test the Mimir conversion (cc @aknuds1 as I think you wanted to try it out 🤗).
  • Removes samplesAppendedWithoutMetadata because no such errors can happen anymore (the only error it was guarding against was a mismatched series ref, which couldn't really happen in the pre-PR handler flow either, I think 🤔). Unless this metric was supposed to reflect OTLP samples without Prometheus metadata (that is not what was implemented). cc @aknuds1
samplesAppendedWithoutMetadata: promauto.With(reg).NewCounter(prometheus.CounterOpts{
	Namespace: "prometheus",
	Subsystem: "api",
	Name:      "otlp_appended_samples_without_metadata_total",
	Help:      "The total number of samples ingested from OTLP without corresponding metadata.",
}),

Does this PR introduce a user-facing change?

NONE

@bwplotka bwplotka changed the title refactor(remote)[PART3]: switch remote receivers (RW, OTEL) to AppenderV2 refactor(remote)[PART4]: switch remote receivers (RW, OTEL) to AppenderV2 Dec 12, 2025
@bwplotka bwplotka force-pushed the bwplotka/a2-remote branch 2 times, most recently from 8121456 to dc5c344 on December 13, 2025 06:22
@bwplotka bwplotka marked this pull request as ready for review December 13, 2025 06:25
@bwplotka bwplotka requested review from krajorama and ywwg and removed request for tomwilkie December 13, 2025 06:30
@bwplotka bwplotka force-pushed the bwplotka/a2-remote branch 7 times, most recently from f544bf9 to 61a888f on December 15, 2025 10:31
@bwplotka bwplotka force-pushed the bwplotka/scrapeloop-ref branch 4 times, most recently from a8e2fad to d6712cd on December 15, 2025 11:42
…tstorage.Appender mock

Signed-off-by: bwplotka <bwplotka@gmail.com>

debug
@bwplotka bwplotka force-pushed the bwplotka/scrapeloop-ref branch from d6712cd to aea8139 on December 16, 2025 09:33
Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka bwplotka force-pushed the bwplotka/scrapeloop-ref branch from d3f8c3f to 10624b9 on December 16, 2025 13:54
@bwplotka bwplotka force-pushed the bwplotka/a2-remote branch 4 times, most recently from 7121a95 to 5d55839 on December 17, 2025 09:09
// Used to ensure we only update metadata and created timestamps once, and to share storage.SeriesRefs.
// To detect hash collisions it also stores the labels.
// There is no overflow/conflict list; the TSDB will handle that part.
refs map[uint64]seriesRef
@bwplotka (Member Author):

I am not sure I fully understood the purpose of it. Was it only to cache refs for efficiency, for separate Append vs UpdateMetadata/AppendExemplar calls on the same series? If yes, then it's safe to remove.

cc @krajorama

Signed-off-by: bwplotka <bwplotka@gmail.com>

comm

Signed-off-by: bwplotka <bwplotka@gmail.com>
…erV2

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka (Member Author):

This should be ready for a PTAL cc @aknuds1 @krajorama

But no rush, if you are busy we can tackle this after xmas 🤗

@ywwg (Member) left a comment:

I would feel a lot better if there were a way to select which appender is in use while we are in transition, rather than doing a hard cutover. There is so much changing in all of these PRs that it seems plausible there will be bugs (some have already been found in review), and we want it to be easy to roll back to V1 in the event of an emergency. If the two appenders can co-exist, as seems implied by the code I've read, it should be possible to put V2 behind a default-on feature flag that can be flipped.

h.logger.Debug("Error while adding exemplar in AppendExemplar", "series", ls.String(), "exemplar", fmt.Sprintf("%+v", e), "err", err)
}
}
es = append(es, e)
@ywwg (Member):
(why not make es a slice with a preallocated length of len(ts.Exemplars)?)

@bwplotka (Member Author):
Good question!

Ideally we reuse the previous slice, which should be enough for the typical 1-10 elements. But we could optimize further:

es = es[:0]
if cap(es) < len(ts.Exemplars) {
    es = make([]exemplar.Exemplar, 0, len(ts.Exemplars))
}

Benchmarks would tell if this really matters, so I went with the most readable yet efficient-enough solution for now... I would postpone this optimization for later; can add a TODO, wdyt?

factoryAr := func(context.Context) api_v1.AlertmanagerRetriever { return h.notifier }
factoryRr := func(context.Context) api_v1.RulesRetriever { return h.ruleManager }

var app storage.Appendable
@ywwg (Member):

Here is an example where it looks like we could use either one appender or the other -- the perfect place to have a switchable selection for rollbacks.

@bwplotka (Member Author) commented Dec 18, 2025:

It's a good place IF we want to double down on the tech debt of ~3k extra LOC of potentially-prod code to keep (most of the code I am removing in this PR would need to stay). I acknowledge the risk, but I propose we don't do a dual path anywhere other than scrape and TSDB. We can revert this PR/commit if needed.


@bwplotka bwplotka requested a review from ywwg December 18, 2025 08:06
@bwplotka (Member Author) commented Dec 18, 2025

Dual Appender Path or Switch

@ywwg: I would feel a lot better if there were a way to select which appender is in use while we are in transition, rather than doing a hard cutover. There is so much changing in all of these PRs that it seems plausible there will be bugs (some have already been found in review), and we want it to be easy to roll back to V1 in the event of an emergency. If the two appenders can co-exist, as seems implied by the code I've read, it should be possible to put V2 behind a default-on feature flag that can be flipped.

Thanks for pointing this out, understandable opinion.

Dual appenders here and in other pieces would be possible, but it's a big ask. Keeping two paths is enormous work. We are talking about ~15k LOC of dual-path tech debt we already added to critical code (TSDB, agent), and doing a dual path for more components likely adds ~10k+ more. Ideally we switch straight away for less complex/less imported functionality. Otherwise, the moment you fix/develop something in dual-appender-enabled code, which path do you fix? Which path do you read to debug? How much confusion do we create? Both? New only? Who will maintain the dual bug-porting process?

Initially with @krajorama (see the Proposal section on #17632 (comment)) we were thinking of keeping the dual appender path only for scrape and the head and db implementations, and then immediately removing v1:

  • From scrape, the moment we get an ack from Otel/Alloy for moving to v2.
  • From TSDB/agent, the moment we get an ack from Cortex/Thanos/Mimir/Alloy.

The receiver implementation is important, but practically speaking I'd say 1000x fewer people use the receiver than scrape, so it gives us a good % of users for a "canary" switch in 3.9 or so. We fix all the issues until we get it right. I propose we balance the risk of switching against doing extra work for a dual path and risking maintainability issues here.

Concrete dual path tech debt problem

Let me give you an example of why it's worth switching immediately. The reason the switch PR also simplifies some code is that the new AppenderV2 has one method. For example, the receiverAppender abstractions would be really hard to pull out otherwise, because repeating the same code across ~6 methods would defeat the point of DRY-ing here. With a dual path we are talking about a second abstraction for AppenderV2 too.

To sum up, the AppenderV2 switch not only changes the flow of appends to be one append instead of 3 (sample, meta, exemplar), but also unlocks great consistency and simplifications, achieved easily.

One could argue that doing both the switch to AppenderV2 and some DRY and consistency abstractions is bad, as it makes the PR bigger. There's some truth to that, but the alternative yields way bigger PRs and more work for everyone, because the DRY-ing to some extent IS making the review easier (less code to review, easier to fix bugs). And because we switch, we don't need to leave old code behind that is unused and only serves as a potential backup.

Proposal

For the reasons above, I propose we keep the original plan of a full switch for all parts other than scrape and TSDB/agent. We swarm on it and derisk by switching individual components one by one (remote, PromQL tests, scrape, etc.), revertible on demand. But we keep the balance of tech debt vs risk.

This is meant to be a refactor, so no significant functionality should change for the user. We risk changes and bugs, but we also take a risk by leaving the messy AppenderV1 alive for long, and especially by supporting 2 paths.

Adding a feature flag for a refactor feels like adding a feature flag to every bigger PR that changes prod code without breaking user functionality. It would be safer, but would it be productive? 🙃

Signed-off-by: bwplotka <bwplotka@gmail.com>
@ywwg (Member) commented Dec 18, 2025

I don't really agree that this is a simple refactor -- there is a huge amount of code changing, a bunch of new logic to detect what kind of append is being asked for, and massively rewritten tests. In my experience, a "noop refactor" usually contains extremely minimal changes above the layer of the refactor, whereas components using this new API also seem to be changing a lot. I am pushing back on this characterization because just saying "it's a refactor, don't worry" is very unconvincing to me.

I would prefer to have a solid plan in place for how to handle a situation where we get a bug report that something major is broken with this rewrite. If we can't have a simple flag, then which PRs need to be reverted? Do we need to roll back 4 PRs totalling more than 20,000 lines of code? Or can we construct a final switch-over PR that gives us a clear path and breathing room to fix any problems that come up?

There is a significant risk here in creating this much churn that if there are problems, downstream projects and users will be stuck on the last version of Prometheus released before this rewrite, waiting for us to work out the issues before being able to resync with HEAD. I am not saying this is an inevitability, I am saying that it feels wise to have a clear plan for that possibility.

@bwplotka (Member Author):

I acknowledge your pushback, but I also never said this is a "simple" refactor. I wonder if your worries are coming from some place of fear (how familiar are you with this component?) that exaggerates some of the risks here.

I would prefer to have a solid plan in place for how to handle a situation where we get a bug report that something major is broken with this rewrite. If we can't have a simple flag, then which PRs need to be reverted? Do we need to roll back 4 PRs totalling more than 20,000 lines of code?

No, you would need to roll back this PR in the worst case (~1500 removed LOC total).

Or can we construct a final switch-over PR that gives us a clear path and breathing room to fix any problems that come up?

This is the final switch-over PR for the remote component. (:

I wonder how this PR is different from a similar recent PR that switches OTEL to a new interface (#16951). This PR is essentially reverting that whole thing, plus doing the same for RW. And this PR changes more or less the same amount of LOC (actually fewer lines). Why was that one reviewed and accepted then? 🙃

@bwplotka (Member Author):

To sum up, together with #17631 there are a few high-level feedback points around the way we change the API in various places:

A) @ywwg wishes for components to support a dual-appender path for the migration period, with a runtime switch flag.
B) @ywwg: if we can't do a dual appender path, the PR that switches should be as small as possible and only do the switching. Essentially, a concrete plan for what to do if the AppenderV2 path is buggy.
C) @krajorama prefers a limited diff/PR size, even if it means splitting trivial refactors (e.g. changing deprecated vars on the way, renames) into separate PRs.

(A) is unrealistic, and I would need to spend 10h to show you by adding another ~8k LOC PR that copies the appenderv1 paths and ports them to v2 next to v1. Much more stuff to review, only to finally do the switch and remove the whole thing ASAP to avoid big tech debt. Abstractions and usage have to change, unless we build a complex common abstraction, but that would change prod appenderv1 code, defeating the point.

For (B) I agree, but this is literally it (#16951 is an example "final switch" for OTLP). It's not going to be a lot smaller than this PR. I could try to split the Remote Write receive vs OTLP receive switch to make things smaller along the lines of (B). Would that help?

(C) is a matter of taste, but I can try to minimize unrelated fixes; it's just that some of them are trivial and it's really hard to ignore trivial trash to fix (: I can try.

Would some split towards (B) and (C) work for you here?

@aknuds1 (Contributor) left a comment:

I think it looks mostly good after reviewing the Grafana Mimir vendoring, but I'm fairly certain I spotted a bug.

My main criticism otherwise is there are too many stylistic changes in the OTLP code. These are better proposed separately from this refactor, so the diff can be minimized.

Comment on lines +315 to +316
if rollbackErr := appender.Rollback(); rollbackErr != nil {
f.logger.Error("Squashed rollback error on commit", "err", rollbackErr)
@aknuds1 (Contributor) commented Dec 19, 2025:

[Nit]

Suggested change
  if rollbackErr := appender.Rollback(); rollbackErr != nil {
-     f.logger.Error("Squashed rollback error on commit", "err", rollbackErr)
+     f.logger.Error("Squashed secondary rollback error on commit", "err", rollbackErr)

@bwplotka (Member Author):

Will address in separate PR, trying to limit unrelated changes as per feedback.

case err == nil:
err = rollbackErr
case rollbackErr != nil:
f.logger.Error("Squashed rollback error on rollback", "err", rollbackErr)
@aknuds1 (Contributor):

Isn't the following more descriptive?

Suggested change
- f.logger.Error("Squashed rollback error on rollback", "err", rollbackErr)
+ secondaryErr := appender.Rollback()
+ switch {
+ case err == nil:
+     err = secondaryErr
+ case secondaryErr != nil:
+     f.logger.Error("Squashed secondary error on rollback", "err", rollbackErr)

@bwplotka (Member Author):

Why do you propose to roll back twice? Something is off with this suggestion 🤔

// Best effort timestamp check within a single request. Doing this across messages would incur overhead.
// We are doing this since the storage doesn't check ordering within a single Append.
err = validateTimestamp(hp.Timestamp, seriesLastTs)
if err == nil {
@aknuds1 (Contributor):

Can you merge this and the following if err == nil block? It would simplify the code.

for _, e := range pErr.ExemplarErrors {
if errors.Is(err, storage.ErrOutOfOrderExemplar) {
a.oooExemplarErrCount++
a.logger.Debug("Out of order exemplar", "series", ls.String(), "timestamp", t, "sampleType", sTyp, "exemplars", opts.Exemplars, "err", e.Error())
@aknuds1 (Contributor):

Should this be debug or warn level?

@bwplotka (Member Author):

On this level I would say debug. It mimics the RW 1.0 flow.

// Partial error due to exemplars, account for that.
a.stats.Exemplars += len(opts.Exemplars) - len(pErr.ExemplarErrors)
for _, e := range pErr.ExemplarErrors {
if errors.Is(err, storage.ErrOutOfOrderExemplar) {
@aknuds1 (Contributor):

This looks like a bug to me.

Suggested change
- if errors.Is(err, storage.ErrOutOfOrderExemplar) {
+ if errors.Is(e, storage.ErrOutOfOrderExemplar) {

@ywwg (Member) commented Dec 19, 2025

For (B) I agree, but this is literally it (#16951 is an example "final switch" for OTLP). It's not going to be a lot smaller than this PR. I could try to split the Remote Write receive vs OTLP receive switch to make things smaller along the lines of (B). Would that help?

Thank you for the thoughtful responses. It is true that I am not very familiar with the code being touched, but I also have too many horror stories of things breaking because a PR was "too big to roll back". The last thing I want is a panic fix-forward as we try to plug holes in the dam.

Yes I think splitting out the final switch PRs would help make this easier, thank you. I agree that maintaining dual appender interfaces for any significant period of time is unsustainable.

@bwplotka (Member Author):

Splitting this PR into smaller ones per the feedback points. All comments, e.g. @aknuds1's, are being addressed in subsequent PRs.

For now:
