
feat(storage): add new appender interface and compatibility layer#17104

Closed
bwplotka wants to merge 1 commit into main from combined-appender

Conversation

@bwplotka
Member

@bwplotka bwplotka commented Aug 29, 2025

This PR proposes a new storage interface, AppenderV2, for TSDB, along with a compatibility layer to translate from and to Appender.

The interface is taken from @krajorama's PR; he already did amazing research on various possibilities.

// Metadata stores a series' metadata information.
type Metadata struct {
	metadata.Metadata

	// MetricFamilyName (optional) stores the metric family name of the series
	// that this metadata relates to. If the client of the AppenderV2 has this
	// information (e.g. from scrape), it's recommended to pass it to the appender.
	// Some implementations store metadata per metric family, and this information
	// allows them to avoid slow and error-prone metric family detection.
	MetricFamilyName string
}

// AppenderV2 provides batched appends against a storage for either float or histogram samples.
// It must be completed with a call to Commit or Rollback and must not be reused afterwards.
//
// Operations on the Appender interface are not goroutine-safe.
//
// The type of samples (float64, histogram, etc.) appended for a given series must remain the same within an Appender.
// The behaviour is undefined if samples of different types are appended to the same series in a single Commit().
type AppenderV2 interface {
	// AppendSample appends a float sample and related exemplars, metadata, and created timestamp to the storage.
	AppendSample(ref SeriesRef, ls labels.Labels, meta Metadata, ct, t int64, v float64, es []exemplar.Exemplar) (SeriesRef, error)

	// AppendHistogram appends a float or int histogram sample and related exemplars, metadata, and created timestamp to the storage.
	AppendHistogram(ref SeriesRef, ls labels.Labels, meta Metadata, ct, t int64, h *histogram.Histogram, fh *histogram.FloatHistogram, es []exemplar.Exemplar) (SeriesRef, error)

	// Commit submits the collected samples and purges the batch. If Commit
	// returns a non-nil error, it also rolls back all modifications made in
	// the appender so far, as Rollback would do. In any case, an Appender
	// must not be used anymore after Commit has been called.
	Commit() error

	// Rollback rolls back all modifications made in the appender so far.
	// Appender has to be discarded after rollback.
	Rollback() error

	// SetOptions configures the appender with specific append options such as
	// discarding out-of-order samples even if out-of-order is enabled in the TSDB.
	SetOptions(opts *AppendV2Options)
}
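The fan-out from one combined call into the old interface's separate calls can be sketched roughly as below. This is a simplified illustration using stand-in types (the real code uses labels.Labels, exemplar.Exemplar, metadata.Metadata), not the actual AsAppender converter from this PR; the error handling shown (only core failures reported) reflects the A+B contract discussed later in the thread.

```go
package main

import "fmt"

// Simplified stand-ins for the real Prometheus types; illustrative only.
type SeriesRef uint64
type Labels map[string]string
type Metadata struct{ MetricFamilyName string }
type Exemplar struct{ Value float64 }

// OldAppender models the relevant subset of the existing storage.Appender.
type OldAppender interface {
	Append(ref SeriesRef, ls Labels, t int64, v float64) (SeriesRef, error)
	AppendExemplar(ref SeriesRef, ls Labels, e Exemplar) (SeriesRef, error)
	UpdateMetadata(ref SeriesRef, ls Labels, m Metadata) (SeriesRef, error)
	AppendCTZeroSample(ref SeriesRef, ls Labels, t, ct int64) (SeriesRef, error)
}

// asAppenderV2 adapts an old Appender: one combined AppendSample call fans
// out into several old-interface calls. Only a failure of the core sample
// append is reported; auxiliary failures are ignored here for brevity.
type asAppenderV2 struct{ app OldAppender }

func (a *asAppenderV2) AppendSample(ref SeriesRef, ls Labels, meta Metadata, ct, t int64, v float64, es []Exemplar) (SeriesRef, error) {
	if ct != 0 && ct < t {
		if r, err := a.app.AppendCTZeroSample(ref, ls, t, ct); err == nil {
			ref = r // best-effort CT-zero injection
		}
	}
	ref, err := a.app.Append(ref, ls, t, v)
	if err != nil {
		return ref, err // core sample failed
	}
	_, _ = a.app.UpdateMetadata(ref, ls, meta) // auxiliary: error ignored
	for _, e := range es {
		_, _ = a.app.AppendExemplar(ref, ls, e) // auxiliary: error ignored
	}
	return ref, nil
}

// countingAppender records how many calls of each kind it received.
type countingAppender struct{ samples, exemplars, meta, ctZeros int }

func (c *countingAppender) Append(ref SeriesRef, _ Labels, _ int64, _ float64) (SeriesRef, error) {
	c.samples++
	return ref, nil
}
func (c *countingAppender) AppendExemplar(ref SeriesRef, _ Labels, _ Exemplar) (SeriesRef, error) {
	c.exemplars++
	return ref, nil
}
func (c *countingAppender) UpdateMetadata(ref SeriesRef, _ Labels, _ Metadata) (SeriesRef, error) {
	c.meta++
	return ref, nil
}
func (c *countingAppender) AppendCTZeroSample(ref SeriesRef, _ Labels, _, _ int64) (SeriesRef, error) {
	c.ctZeros++
	return ref, nil
}

func main() {
	counts := &countingAppender{}
	v2 := &asAppenderV2{app: counts}
	_, err := v2.AppendSample(0, Labels{"__name__": "http_requests_total"},
		Metadata{MetricFamilyName: "http_requests"}, 5, 10, 1.0,
		[]Exemplar{{Value: 0.3}})
	fmt.Println(err, counts.samples, counts.exemplars, counts.meta, counts.ctZeros)
	// prints: <nil> 1 1 1 1
}
```

One combined call thus replaces up to four calls on the old interface, which is the batching win the rationale list below refers to.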

We seem to generally agree on what we want from this interface, so I propose we add it properly AND slowly migrate to it, instead of adding it only for the OTLP endpoint.

Rationales for AppenderV2:

  • It makes metadata and exemplars explicitly connected to a sample/histogram. This is how every ingestion protocol uses that information, and it's compatible with how we use it in TSDB and the Prometheus ecosystem.
  • It unblocks OTLP ingestion that avoids the efficiency penalty of re-translation to PRW, WHILE not breaking Mimir.
  • It unblocks any work around proper CT handling (or delta type handling) (see Created Timestamp: Store CT in WAL #14218 and ST: Prometheus propagates ST in Remote-Write 2.0 #14220).
  • It's significantly easier to use and implement (6 well-scoped methods instead of 9 weirdly interconnected ones, where e.g. you need to call AppendSample first/after others).

In the chained PRs I will show (without breaking Prometheus) how we can slowly use and migrate to the new interface in:

  • Scrape
  • TSDB
  • Agent
  • ...etc

I did diverge a bit from @krajorama's interface here, notably:

  • metricName is not relevant in this context, ref is (or is it important?)
  • I did swap t, ct with ct, t because ct has to be before, conceptually.

Does this PR introduce a user-facing change?

NONE


@bboreham
Member

"CombinedAppender" seems to be a name that has meaning if you know the history.
Personally I am not clear what is being combined.

It would be better to have a name that has meaning at the time someone reads it.
Is this a case for v2 ?

@bwplotka
Member Author

bwplotka commented Aug 29, 2025

Agree, v2 is fair. I was thinking about a storagev2 package for this too, but we don't really want to change the read interface much, so I discarded that idea.

Happy to update it to AppenderV2 if that is OK with everyone.

EDIT: It's now AppenderV2

@bwplotka bwplotka changed the title feat(storage): add new CombinedAppender interface and compatibility layer feat(storage): add new appender interface and compatibility layer Aug 29, 2025
@krajorama
Member

I did diverge a bit from @krajorama interface here, notably:

* metricName is not relevant in this context, ref is (or is it important?)

When the storage is remote storage over remote-write 1.0, then the metadata is keyed on metric family name, not series __name__ label. See metric_family_name. Also the storage might store metadata per metric family for other reasons as well. And also scrape stores metadata per family as well.

So basically this is needed for backwards compatibility. Some implementations can just ignore it.

* I did swap `t, ct` with `ct, t` because ct has to be before, conceptually.

👍

Does this PR introduce a user-facing change?

NONE

@bwplotka
Member Author

bwplotka commented Sep 1, 2025

When the storage is remote storage over remote-write 1.0, then the metadata is keyed on metric family name, not series name label. See metric_family_name. Also the storage might store metadata per metric family for other reasons as well. And also scrape stores metadata per family as well. So basically this is needed for backwards compatibility. Some implementations can just ignore it.

Got it. This is fair, but for backward compatibility, why not offer the old MetadataUpdater interface for those cases instead? Similar to how the AsAppender compatibility converter is added in this PR 🤔

@krajorama
Member

MetadataUpdater

That interface also doesn't receive the metric family name, so that's not helpful either.

We could just keep AppenderV2 and what we use in the OTLP endpoint separate: actually call the OTLP CombinedAppender OTLPAppender. In Prometheus it could call AppenderV2 and just drop the metric family name, and others could still get the metric family name.

krajorama added a commit that referenced this pull request Sep 3, 2025
…stamp

Following discussion in #17104

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@bwplotka
Member Author

bwplotka commented Sep 3, 2025

Copying good questions from Slack, by @ringerc:

Craig Ringer
Yesterday at 22:51
Out of interest, can this appender potentially handle the float vs native histogram ordering issue (#14172, #17025) better?
And given mention of the pain that was caused by native histograms being introduced, is it intended to be more flexible in the face of new types? E.g. the proposal for native summaries here #16949 and the old discussion of one day possibly having string-valued series #9139 ?

Bartek Plotka
2 minutes ago
Good points! Good question whether the bug can be fixed by the interface change. Looking at @beorn's fix (https://github.com/prometheus/prometheus/pull/17071/files#diff-c3ae701c35bb1de34870207029d1fb3390613fbba65e6f2f9bffe059bff89001) I see it's more about how we handle things after the append. I wonder if a single Append for both floats and histograms would help for this case (and it would be useful for clients, e.g. scrape).

Bartek Plotka
1 minute ago
The latter is a good question -> other than ordering challenges, I think it will be as easy as adding a new method for each "sampleType".

@bwplotka
Member Author

bwplotka commented Sep 3, 2025

That interface also doesn't receive the metric family name, so that's not helpful either.
We could just keep the AppenderV2 and what we use in OTLP endpoint separate. Call CombinedAppender for OTLP actually OTLPAppender. In Prometheus it could call AppenderV2 and just drop the metric family name. And others could still get the metric family name.

I don't entirely understand why the OTLP client has to calculate the metric family name during OTLP ingestion (I don't think it has this info ready, right?) to ingest data into Prometheus natively. Do you know why? @krajorama

When the storage is remote storage over remote-write 1.0, then the metadata is keyed on metric family name, not series name label. See metric_family_name.

When would Prometheus append directly to remote write 1.0? Also, official remote write does not have metadata support 🙈 Sounds like this problem is only applicable to Mimir. I'm sorry to be brutal and I'm obviously biased here (I don't work for Grafana), but it feels like the fact that Mimir is still using the unofficial, soon-to-be-deprecated remote write v1, with experimental flows that were never released in a spec, is slowly affecting Prometheus codebase decisions more and more.

What would motivate Mimir to stop depending on this v1 flow, so we can remove the experimental code from Prometheus?

Also the storage might store metadata per metric family for other reasons as well.

I don't understand it. With the current direction of the OM2 discussions and type and unit labels, we already know we need to support metadata per series.

And also scrape stores metadata per family as well.

I'm not sure I agree -- this flow is only used for scrape target purposes AFAIK. This is not needed in OTLP ingestion, right?

Summary

Indeed, two interfaces for Mimir purposes would be fine for now, but let's be explicit in the comment that we do that for Mimir code (so we know we can remove it when Mimir moves to a non-v1 flow).

Otherwise... we could also add ctx as a first parameter for extensibility here. Or (maybe more dangerous for accidental API scope extension) we could add ...Option.

context.Context is a good idea nevertheless here -- for anything that might come later. Should we add this? @krajorama

@krajorama
Member

I don't entirely understand why OTLP client has to calculate metric family name from OTLP ingestion (I don't think it has this info ready, right?) to ingest data to Prometheus natively. Do you know why? @krajorama

That's a good question. If it only happens for RW1.0 sake and otherwise zero use in Prometheus AND we have all the information to calculate it inside our AppenderV2 implementation, we could move the logic to Mimir and drop the code/parameter in Prometheus right away. Needs a short POC on my part, give me a couple of days.

@krajorama
Member

krajorama commented Sep 4, 2025

MetadataUpdater

That interface also doesn't receive the metric family name, so that's not helpful either.

We could just keep the AppenderV2 and what we use in OTLP endpoint separate. Call CombinedAppender for OTLP actually OTLPAppender. In Prometheus it could call AppenderV2 and just drop the metric family name. And others could still get the metric family name.

I don't entirely understand why OTLP client has to calculate metric family name from OTLP ingestion (I don't think it has this info ready, right?) to ingest data to Prometheus natively. Do you know why? @krajorama

That's a good question. If it only happens for RW1.0 sake and otherwise zero use in Prometheus AND we have all the information to calculate it inside our AppenderV2 implementation, we could move the logic to Mimir and drop the code/parameter in Prometheus right away. Needs a short POC on my part, give me a couple of days.

I've started to look into this. For all metrics, the OTLP endpoint already calculates the Prometheus metric name (both in the baseline and in this PR). In the case of classic histograms in particular, this is used to generate the classic histogram series by appending the (_bucket, _count, _sum) suffixes. So I don't see an immediate gain in moving the calculation to Mimir.

On the other hand, Mimir is implementing https://prometheus.io/docs/prometheus/latest/querying/api/#querying-metric-metadata which is kind of a weird endpoint (in both Prom and Mimir) because it returns the metadata keyed on the metric family name (most of the time, except counters, where it returns the metric name with _total 🥇). So it's not really about RW1; it goes deeper than that. In fact, in Prometheus we serve the endpoint from the scrape cache, so appender.UpdateMetadata has nothing to do with it.

So I think the huge performance gain of the #16951 PR more than makes up for passing an extra argument that we calculate anyway.

Indeed, two interfaces for Mimir purposes would be fine for now, but let's be explicit in the comment that we do that for Mimir code (so we know we can remove it when Mimir moves to a non-v1 flow).

We can either do that (keep the metric family name in AppenderV2) or, as I suggested, keep AppenderV2 clean (without the extra arg) but have an OTLPAppender with the extra param until we figure out what to do with metadata in general.

Otherwise... we could also add ctx as a first parameter for extensibility here. Or (maybe more dangerous for accidental API scope extension) we could add ...Option.

context.Context is a good idea nevertheless here -- for anything that might come later. Should we add this? @krajorama

Hmm. I'm not a fan of adding something that might never get used, but it's hard to argue against context. Up to you.

@bwplotka
Member Author

bwplotka commented Sep 4, 2025

We discussed on Slack with @krajorama, but tl;dr: sounds like the metric family name is useful for Mimir in some contexts, and also if we keep per-metric-family metadata in Prometheus going forward.

We ruled out explicit arguments, as they would be hard to remove later. ctx and variadic options are on the table, but perhaps we could extend the metadata struct to include the metric family name. We have that in some other metadata representations (API?)

@krajorama
Member

We discussed on Slack with @krajorama, but tl;dr: sounds like the metric family name is useful for Mimir in some contexts, and also if we keep per-metric-family metadata in Prometheus going forward.

We ruled out explicit arguments, as they would be hard to remove later. ctx and variadic options are on the table, but perhaps we could extend the metadata struct to include the metric family name. We have that in some other metadata representations (API?)

I'll make some measurements with new metadata struct passed by value and reference.

@krajorama
Member

According to the first measurement with #16951 using pass by value:

+// Metadata extends metadata.Metadata with the metric family name.
+// The metric family name may be empty.
+type Metadata struct {
+       metadata.Metadata
+       MetricFamilyName string
+}
+
 // CombinedAppender is similar to storage.Appender, but combines updates to
 // metadata, created timestamps, exemplars and samples into a single call.
 type CombinedAppender interface {
        // AppendSample appends a sample and related exemplars, metadata, and
        // created timestamp to the storage.
-       AppendSample(metricFamilyName string, ls labels.Labels, meta metadata.Metadata, ct, t int64, v float64, es []exemplar.Exemplar) error
+       AppendSample(ls labels.Labels, meta Metadata, ct, t int64, v float64, es []exemplar.Exemplar) error
        // AppendHistogram appends a histogram and related exemplars, metadata, and
        // created timestamp to the storage.
-       AppendHistogram(metricFamilyName string, ls labels.Labels, meta metadata.Metadata, ct, t int64, h *histogram.Histogram, es []exemplar.Exemplar) error
+       AppendHistogram(ls labels.Labels, meta Metadata, ct, t int64, h *histogram.Histogram, es []exemplar.Exemplar) error
 }
goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote/otlptranslator/prometheusremotewrite
cpu: Intel(R) Core(TM) Ultra 7 155U
                                                                                                                                                           │ baseline.txt │        passbyvalue.txt        │
                                                                                                                                                           │    sec/op    │   sec/op     vs base          │
PrometheusConverter_FromMetrics/resource_attribute_count:_0/histogram_count:_1000/non-histogram_count:_1000/labels_per_metric:_2/exemplars_per_series:_0-4   11.51m ± 10%   11.68m ± 3%  ~ (p=0.393 n=10)

                                                                                                                                                           │ baseline.txt │        passbyvalue.txt         │
                                                                                                                                                           │     B/op     │     B/op      vs base          │
PrometheusConverter_FromMetrics/resource_attribute_count:_0/histogram_count:_1000/non-histogram_count:_1000/labels_per_metric:_2/exemplars_per_series:_0-4   6.369Mi ± 0%   6.369Mi ± 0%  ~ (p=0.971 n=10)

                                                                                                                                                           │ baseline.txt │        passbyvalue.txt        │
                                                                                                                                                           │  allocs/op   │  allocs/op   vs base          │
PrometheusConverter_FromMetrics/resource_attribute_count:_0/histogram_count:_1000/non-histogram_count:_1000/labels_per_metric:_2/exemplars_per_series:_0-4    82.48k ± 0%   82.48k ± 0%  ~ (p=0.419 n=10)

@bwplotka
Member Author

bwplotka commented Sep 5, 2025

Yea, so I think this means metadata fits on the stack, am I understanding this correctly?

@bwplotka
Member Author

bwplotka commented Sep 5, 2025

Updated my PR @krajorama

Member

@krajorama krajorama left a comment


This is nicely taking shape; we're getting down to the nitty-gritty details.

// The type of samples (float64, histogram, etc) appended for a given series must remain same within an Appender.
// The behaviour is undefined if samples of different types are appended to the same series in a single Commit().
type AppenderV2 interface {
// AppendSample appends a float sample and related exemplars, metadata, and created timestamp to the storage.
Member


I think we need to define how errors should be handled; what is the contract?

So far we have the OTLP endpoint (Prom+Mimir) and scrape (Prom) use cases, and I can think of Remote Write as well.

I think they all share that, just because an exemplar fails, we'd still like to update the metadata.

Probably the same for the CT zero sample insert: you'd still want to ingest everything else.

See the implementation in #16951.

Member Author


Yes, this is a good point to agree on the initial error setup. I think we have three choices:

A) Don't mention specific partial error guarantees on the interface level.
B) Mention the current (likely) implementation: this is problematic because tomorrow it might change, plus it definitely changes with different Appender config options (e.g. enableCTzero) -- the same input would produce a different outcome.
C) Return complex errors that give ultimate control, and mention that we try to append all information and return PartialError{sampleErr, metaErr, zeroCTErr, exemplarErr} when one of the pieces failed. This feels unnecessary given our clients drop non-core errors anyway.

So what about a mix of A+B:

Suggested change
// AppendSample appends a float sample and related exemplars, metadata, and created timestamp to the storage.
// AppendSample appends a float sample and related exemplars, metadata, and created timestamp to the storage.
//
// Implementations MUST return an error if the core sample data (value and timestamp) was not successfully appended. Implementations MAY return a nil error (success) if the core data was appended but any of the auxiliary data appends, like exemplars, CT as zero, or metadata, failed.

I think this might be a good balance of the client expectations vs implementation changing. I think with this it's obvious that exemplar failure should not block other append attempts. WDYT? @krajorama

Member


I like the mix of A+B

Member Author

@bwplotka bwplotka Sep 6, 2025


Ok, I added the A+B and fixed the behaviour.

Unfortunately, scrape still depends on some information from the failed auxiliary appends (e.g. exemplar appends), to add a warning log line and increment some scrape metrics; see the TODOs in the compatibility layer.

We need to double-check whether we want to move this log/metric to the implementation (TSDB) or keep it in the clients that want this logic (only scrape for now, but likely OTLP and RW should follow the same, no?) with option (C) (special error handling).

Given it's the TSDB's (or the implementation's) role to append the sample together with exemplars, I think we should do that warning logging (AND metric) on the TSDB side. We could keep the same metric name (scrape namespace) for compatibility. Thoughts?

Member


Given it's the TSDB's (or the implementation's) role to append the sample together with exemplars, I think we should do that warning logging (AND metric) on the TSDB side. We could keep the same metric name (scrape namespace) for compatibility. Thoughts?

The TSDB interface will have two kinds of users: scrape and remote write/OTLP (server). Scrape is totally internal, so that one can naturally move error handling and logging to TSDB. However remote write (especially 2.0)/OTLP needs to return some information to the user (partial error 4xx and statistics).

Just from the numbers above, we have 1 scrape vs 2 remote cases :)

We could solve returning errors with https://pkg.go.dev/github.com/hashicorp/go-multierror or that PartialError (CombinedError 😄), and by having a Statistics() function to get counters from the interface at any point, even after Commit. But then you have to check for those errors on the caller side. That could still be more efficient than separate calls to the Appender with hash lookups. Needs a benchmark, probably.

A good interface is hard to do, so I think we need to bite the bullet, do option C, and measure. If it's still more efficient for scrape than the current approach, we're good.

Member Author


Thanks, I was thinking along those lines today too. C is also much cleaner for everyone IMO (especially when TSDB changes, or for other people).

I also think we should drop the CT zero injection feature from the appender. This should be (and can be) done on the caller side. It would simplify error handling and our code too.

Member Author


hm ok, maybe encapsulating CT is easier.

I also noticed you implemented A+B for OTLP ingestion in your combined appender WITHOUT caring about propagating errors ;p

TSDB interface will have two kinds of users: scrape and remote write/OTLP (server). Scrape is internal totally so that can naturally move error handling and logging to TSDB. However remote write (especially 2.0)/OTLP needs to return some information to the user (partial error 4xx and statistics).

I assume you would like to fix that later on (or in this PR)

Member Author

@bwplotka bwplotka Sep 11, 2025


Ah no, you implemented a sort of C (returning the last important errors as they are) -- cool, will match that.

@bwplotka bwplotka force-pushed the combined-appender branch 2 times, most recently from 96dd7bf to 850c164 on September 6, 2025 at 13:01
krajorama added a commit that referenced this pull request Sep 7, 2025
To be more in line with the future direction in #17104.
Also use some helper variables in the appender tests to make them more
readable and shorter.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@bwplotka bwplotka force-pushed the combined-appender branch 2 times, most recently from 0c88201 to b4f470f on September 11, 2025 at 08:21
bwplotka added a commit that referenced this pull request Sep 11, 2025
Depends on #17104

Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka
Member Author

Added a few TODOs and started looking at how the OTLP ingestion combined appender would look as a first user: #17175 (draft)

bwplotka added a commit that referenced this pull request Sep 11, 2025
Depends on #17104

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka

This comment was marked as outdated.

@dashpole
Contributor

FWIW, the metric family name would also be useful in the OTel collector's appender implementation in the prometheus receiver. Today, we trim common suffixes to get the family name.
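The suffix-trimming approach @dashpole mentions can be sketched as follows. This is an illustration of the idea, not the collector's actual code; the suffix list is an assumption based on common Prometheus naming conventions, and as noted earlier in the thread, some consumers (e.g. the metadata API for counters) deliberately keep the _total suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// metricFamilyName strips well-known Prometheus type suffixes from a series
// name to approximate the metric family name. The suffix list is
// illustrative, not exhaustive.
func metricFamilyName(name string) string {
	for _, suffix := range []string{"_bucket", "_count", "_sum", "_total", "_created"} {
		// strings.CutSuffix (Go 1.20+) returns the name without the suffix
		// and whether the suffix was present; refuse to trim to empty.
		if trimmed, ok := strings.CutSuffix(name, suffix); ok && trimmed != "" {
			return trimmed
		}
	}
	return name
}

func main() {
	fmt.Println(metricFamilyName("http_request_duration_seconds_bucket"))
	// prints: http_request_duration_seconds
	fmt.Println(metricFamilyName("up"))
	// prints: up
}
```

Passing the family name through the appender, as this PR proposes, would make such heuristic trimming unnecessary wherever the scrape or OTLP layer already knows it.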

@bwplotka
Member Author

Replacement: #17610
