Description
Delivery
Putting delivery section on top for a clear status, see proposal below.
#17610 is the "preview" PR with all changes so far. Once buildable, we can prombench and see the effects (e.g. import from downstream code).
M1
- PART1: Add AppenderV2 interface; add TSDB implementation (next to Appender): refactor(appenderV2)[PART1]: add AppenderV2 interface; add TSDB AppenderV2 implementation #17629
- PART2: Refactor scrape package; add adapters for multi-version Appender use: refactor(scrape)[PART2]: simplified scrapeLoop constructors & tests; add teststorage.Appendable mock #17631
- PART3: Add AppenderV2 interface to TSDB Agent: refactor(tsdb/agent)[PART3]: add AppenderV2 support to agent #17677
- PART4: Switch OTLP, RW receiving:
  - feat(teststorage)[PART4a]: Add AppendableV2 support for mock Appendable #17834
  - feat(storage)[PART4b]: add AppenderV2 to the rest of storage.Storage implementations + mock exemplar fix #17835
  - OTLP refactor: move OTLP handler to separate file #17990
  - OTLP tests: move OTLP handler tests to teststorage.Appendable; increase test coverage. #17992
  - refactor: switch OTLP handler to AppendableV2 #17996
  - [ ]
- Switch scrape to use AppenderV2 fully, allow Appender mode for importers (e.g. Otel).
- Switch rule-manager.
- Switch everything else (e.g. promqltest etc).
- Understand the impact for Thanos, Mimir, Cortex, Otel (upgrade PRs on those repos).
- Migrate Otel collector prometheusreceiver to AppenderV2.
  - WIP; initial feedback is that AppenderV2 helps a lot.
M2
- Remove storage.Appender (deprecate first with a clear ETA e.g. Q2 2026). Not removing/delaying means supporting unused ~10k LOC with tests.
Proposal
Mentioned in various discussions (OTLP CombinedAppender, early PR, PROM-60, Slack), but this deserves a dedicated issue.
We propose a replacement for our existing storage.Appender in the form of storage.AppenderV2 that:
- is simpler (3 vs 9 methods).
- has transactional exemplars, metadata and ST attachments.
- allows extensibility with hybrid param + struct parameters.
- is efficient enough.
- does not break Prometheus binary guarantees/behaviour.
No feature flag is planned, given complexity of supporting both paths for the whole codebase, plus it should not break Prometheus users.
We propose to add a dual (v1 and v2) appender path to scrape and TSDB/agent only. The rest of the code switches to v2 atomically (PR). Once Prometheus has fully switched to v2, we remove v1:
- From scrape the moment we get an ack from Otel/Alloy for moving to v2.
- From TSDB/agent the moment we get an ack from Cortex/Thanos/Mimir/Alloy.
storage.AppenderV2 interface
The interface is inspired by @krajorama's PR, which already did amazing research on the various possibilities.
// AppendV2Options provides optional, auxiliary data and configuration for AppenderV2.Append.
type AppendV2Options struct {
	// MetricFamilyName (optional) provides metric family name for the appended sample's
	// series. If the client of the AppenderV2 has this information
	// (e.g. from scrape) it's recommended to pass it to the appender.
	//
	// Provided string bytes are unsafe to reuse, they only live for the duration of the Append call.
	//
	// Some implementations use this to avoid slow and error-prone metric family detection for:
	// * Metadata per metric family storages (e.g. Prometheus metadata WAL/API/RW1)
	// * Strictly complex types storages (e.g. OpenTelemetry Collector).
	//
	// NOTE(krajorama): Example purpose is highlighted in OTLP ingestion: OTLP calculates the
	// metric family name for all metrics and uses it for generating summary,
	// histogram series by adding the magic suffixes. The metric family name is
	// passed down to the appender in case the storage needs it for metadata updates.
	// Known user of this is Mimir that implements /api/v1/metadata and uses
	// Remote-Write 1.0 for this. Might be removed later if no longer
	// needed by any downstream project.
	// NOTE(bwplotka): Long term, once Prometheus uses complex types on storage level
	// the MetricFamilyName can be removed as MetricFamilyName will always equal __name__.
	MetricFamilyName string

	// Metadata (optional) attached to the appended sample.
	// Metadata strings are safe for reuse.
	// IMPORTANT: Appender v1 was only providing update. This field MUST be
	// set (if known) even if it didn't change since the last iteration.
	// This moves the responsibility for metadata storage options to TSDB.
	Metadata metadata.Metadata

	// Exemplars (optional) attached to the appended sample.
	// Exemplar slice MUST be sorted by Exemplar.TS.
	// Exemplar slice is unsafe for reuse.
	Exemplars []exemplar.Exemplar

	// RejectOutOfOrder tells implementation that this append should not be out
	// of order. An OOO append MUST be rejected with storage.ErrOutOfOrderSample
	// error.
	RejectOutOfOrder bool
}

// AppenderV2 provides batched appends against a storage for all types of samples.
// It must be completed with a call to Commit or Rollback and must not be reused afterwards.
//
// Operations on the AppenderV2 interface are not goroutine-safe.
//
// The order of samples appended via the AppenderV2 is preserved within each
// series. I.e. samples are not reordered per timestamp, or by float/histogram
// type.
type AppenderV2 interface {
	AppenderTransaction // Commit and Rollback

	// Append appends a sample and related exemplars, metadata, and (st) start timestamp to the storage.
	//
	// ref (optional) represents the stable ID for the given series identified by ls (excluding metadata).
	// Callers MAY provide back the ref to help implementation avoid ls -> ref computation, otherwise ref MUST be 0 (unknown).
	//
	// ls represents labels for the sample's series.
	//
	// st (optional) represents sample start timestamp. 0 means unknown. Implementations
	// are responsible for any potential ST storage logic (e.g. ST zero injections).
	//
	// t represents sample timestamp.
	//
	// v, h, fh represent the sample value for each sample type.
	// Callers MUST only provide one of the sample types (either v, h or fh).
	// Implementations can detect the type of the sample with the following switch:
	//
	//	switch {
	//	case fh != nil: It's a float histogram append.
	//	case h != nil: It's a histogram append.
	//	default: It's a float append.
	//	}
	// TODO(bwplotka): We plan to experiment on using generics for complex sampleType, but do it after we unify interface (derisk) and before we add native summaries.
	//
	// Implementations MUST attempt to append sample even if metadata, exemplar or (st) start timestamp appends fail.
	// Implementations MAY return AppendPartialError as an error. Use errors.As to detect.
	// Implementations MUST return valid SeriesRef that represents ls.
	// NOTE(bwplotka): Given OTLP and native histograms and the relaxation of the requirement for
	// type and unit suffixes in metric names we start to hit cases of ls being not enough for id
	// of the series (metadata matters). Current solution is to enable 'type-and-unit-label' features for those cases, but we may
	// start to extend the id with metadata one day.
	Append(ref SeriesRef, ls labels.Labels, st, t int64, v float64, h *histogram.Histogram, fh *histogram.FloatHistogram, opts AppendV2Options) (SeriesRef, error)
}

Requirements
A. [MUST] Support attaching additional data to sample/series in one "transaction/append", notably metadata and exemplars.
B. [MUST] Allow always-on metadata passing (instead of updates only).
C. [MUST] Support ST natively (for PROM-60 and delta PROM-48).
D. [MUST] Clean existing tech debt for sustainability. Minimal amount of methods, simple and reliable to use and implement.
E. [MUST] Not change Prometheus binary core TSDB semantics.
F. [SHOULD] Enable future Prometheus evolution for complex samples only (i.e. no classic histogram and summaries one day); allows flexibility for future changes.
G. [SHOULD] Allow passing metric family name (RWv1, OTLP, Mimir)
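To make requirements (A) to (C) concrete, here is a minimal sketch of what a single "transactional" append could look like from the caller's side. Everything except the Append signature and the AppendV2Options field names is an assumption made for illustration: the types are simplified stand-ins for the real Prometheus ones (for example, Labels is just a string here), and memAppender is a hypothetical toy storage, not the TSDB implementation.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Prometheus types in the proposal.
type (
	SeriesRef      uint64
	Labels         string // Stand-in for labels.Labels.
	Histogram      struct{}
	FloatHistogram struct{}
	Metadata       struct{ Type, Unit, Help string }
	Exemplar       struct {
		Value float64
		TS    int64
	}
)

type AppendV2Options struct {
	MetricFamilyName string
	Metadata         Metadata
	Exemplars        []Exemplar
	RejectOutOfOrder bool // Not enforced by this toy appender.
}

// sample is everything the toy storage records for a single Append call.
type sample struct {
	kind  string // "float", "histogram" or "floathistogram".
	st, t int64
	v     float64
	opts  AppendV2Options
}

// memAppender is a toy in-memory appender with the proposed three methods.
type memAppender struct {
	refs            map[Labels]SeriesRef
	pending, stored map[SeriesRef][]sample
}

func newMemAppender() *memAppender {
	return &memAppender{map[Labels]SeriesRef{}, map[SeriesRef][]sample{}, map[SeriesRef][]sample{}}
}

func (a *memAppender) Append(ref SeriesRef, ls Labels, st, t int64, v float64, h *Histogram, fh *FloatHistogram, opts AppendV2Options) (SeriesRef, error) {
	if ref == 0 { // Unknown ref: resolve it from labels, creating the series if needed.
		var ok bool
		if ref, ok = a.refs[ls]; !ok {
			ref = SeriesRef(len(a.refs) + 1)
			a.refs[ls] = ref
		}
	}
	// Detect the sample type with the switch suggested in the interface comment.
	kind := "float"
	switch {
	case fh != nil:
		kind = "floathistogram"
	case h != nil:
		kind = "histogram"
	}
	a.pending[ref] = append(a.pending[ref], sample{kind: kind, st: st, t: t, v: v, opts: opts})
	return ref, nil
}

// Commit makes all pending samples (with their attachments) visible at once.
func (a *memAppender) Commit() error {
	for ref, s := range a.pending {
		a.stored[ref] = append(a.stored[ref], s...)
	}
	a.pending = map[SeriesRef][]sample{}
	return nil
}

// Rollback drops everything appended since the appender was created.
func (a *memAppender) Rollback() error {
	a.pending = map[SeriesRef][]sample{}
	return nil
}

func main() {
	app := newMemAppender()
	ls := Labels(`{__name__="http_requests_total"}`)
	opts := AppendV2Options{
		MetricFamilyName: "http_requests_total",
		Metadata:         Metadata{Type: "counter", Help: "Total HTTP requests."},
		Exemplars:        []Exemplar{{Value: 1, TS: 1000}},
	}
	// One call carries the sample, its start timestamp, metadata and exemplars.
	ref, err := app.Append(0, ls, 500, 1000, 42, nil, nil, opts)
	if err != nil {
		panic(err)
	}
	// Pass the ref back to skip the labels lookup; metadata is sent again even if unchanged.
	if _, err := app.Append(ref, ls, 500, 2000, 43, nil, nil, opts); err != nil {
		panic(err)
	}
	if err := app.Commit(); err != nil {
		panic(err)
	}
	fmt.Println("stored samples:", len(app.stored[ref]))
}
```

With the v1 interface the same scrape iteration would need separate calls for the sample, the exemplar, the metadata update and the start timestamp; here the storage sees all of it in one place and can decide what to persist.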
Motivation
The current storage.Appender has organically grown to 9 more or less duplicated methods, 5 sub-interfaces and weird stateful options (SetOptions). This happened because breaking changes (e.g. refactoring, adding more parameters) are extremely time-consuming to roll out: the interface is heavily used (163 uses in Prometheus alone, many more across the wider ecosystem like Otel, Thanos, Mimir, Cortex etc) and it sits in non-trivial DB hot paths. As a result it was easier to just add "one more method". For this reason, when designing PROM-60 it became clear we need to step in and clean it up.
Let's go through the above goals and motivate them.
- Re (A): It makes metadata and exemplars explicitly connected to a sample/histogram. This is how every ingestion protocol is structured (RW, scrape, OTLP) and it's compatible with the known implementations (TSDB). Specifically, while TSDB currently stores exemplars and metadata in "separate" storages/structures, they are tightly connected to the sample and series. We did a lot of work to ensure they are properly and efficiently connected (e.g. don't commit sample before metadata etc), which could be avoided in the next iteration (fewer races, fewer edge cases, fewer ref/lset lookups).
- Re (B): Together with (A) it allows better metadata support now and in the future. Clients no longer need to care whether metadata changed or not, which simplifies a lot (e.g. on RW and OTel receivers), and for TSDB it's trivial (!) to check if metadata changed (we literally have a field for it if needed).
- Re (C): Enables proper ST handling and the delta type. Lack of ST support also forced us to add ST features (e.g. zero injection) on the client side, obscuring the TSDB API and causing a lot of unnecessary code, where TSDB could easily implement this.
- Re (D): The large API surface of Appender is multiplied by the wide usage we have now. I counted ~11 wrappers and at least ~4 test mocks. It's time to reuse some of them, but also to reduce the number of methods needed for those wrappers. It's an extreme tech debt we pay in Prometheus and the ecosystem.
- Re (F): It prepares us for future use of generics for complex types, especially with incoming native summaries.
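The claim in Re (B), that change detection is cheap on the storage side once metadata arrives on every append, can be sketched roughly as below. This is only an illustration under assumptions: metadataStore and observe are hypothetical names, Metadata and SeriesRef are simplified stand-ins, and the real TSDB may track this differently (e.g. directly on the in-memory series).

```go
package main

import "fmt"

// Hypothetical stand-ins for storage.SeriesRef and metadata.Metadata.
type SeriesRef uint64

type Metadata struct{ Type, Unit, Help string }

// metadataStore keeps the last seen metadata per series and counts how many
// metadata records a storage would actually need to persist.
type metadataStore struct {
	last    map[SeriesRef]Metadata
	updates int
}

func newMetadataStore() *metadataStore {
	return &metadataStore{last: map[SeriesRef]Metadata{}}
}

// observe is meant to be called on every append. It reports whether the
// metadata differs from the last value seen for ref.
func (s *metadataStore) observe(ref SeriesRef, m Metadata) bool {
	if m == (Metadata{}) {
		return false // Metadata unknown for this append; keep what we have.
	}
	if prev, ok := s.last[ref]; ok && prev == m {
		return false // Unchanged: nothing to persist.
	}
	s.last[ref] = m
	s.updates++
	return true
}

func main() {
	s := newMetadataStore()
	counter := Metadata{Type: "counter", Help: "Total HTTP requests."}
	// The client passes the same metadata on every append without tracking changes.
	for i := 0; i < 3; i++ {
		s.observe(1, counter)
	}
	// Only an actual change (here: a new help text) causes another update.
	s.observe(1, Metadata{Type: "counter", Help: "Total requests."})
	fmt.Println("appends: 4, persisted metadata updates:", s.updates)
}
```

In this sketch four appends lead to two persisted metadata records; the client code stays stateless with respect to metadata, which is the simplification (B) is after.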
Risks & Mitigations
- Otel (coding) breaking change is deferred, but we need to test if Otel CAN switch to a new interface. cc @dashpole
- Other downstream users will be broken straight away. We need to understand impact (and usefulness) for downstream Thanos, Cortex, Mimir cc @saswatamcode @krajorama @alanprot
- Exemplar per series vs per sample.