promql: implement integral() function#14440
Conversation
integral function → integral() function
CC @beorn7 because he usually has an opinion on anything math-y (no idea yet if we would want to add this function) :)
Thanks @juliusv! Just look at this awesome example I just added to the PR description and examples.md (then try to get back the same with existing functions, with a non-contrived PromQL ;) -> For those "boolean" metrics that produce `1`s or `0`s, integrating them over time gives the time in seconds that the value was `1`.
I have this on my (long) review queue and will have a detailed look ASAP. At first glance, this looks like `avg_over_time` with proper weighting, which is something I always wanted. On the other hand, there are a million concerns in my head (not only those raised in the linked issues and discussions). Just as a disclaimer ahead: I generally would like something like this and am generally positive, but I have to run this against a number of criteria in detail.
Force-pushed 33fa69e to ff4aa52 (compare)
Force-pushed dff8c4f to 528d281 (compare)
We have reached the beginning of my (unplugged) vacation. I'll come back to this when I'm back (~2w), but feedback from other domain experts would be very welcome in the meantime.
Force-pushed 9351300 to 6235ca2 (compare)
Ref: some previous discussion at prometheus#1335 and https://groups.google.com/g/prometheus-developers/c/3EuF6ffNt_k.

## What

Implement `integral(v range-vector, strategy=2 scalar)` function.

## Why

There are use cases where `sum_over_time()` or `avg_over_time()` multiplied by the length of the whole interval are cumbersome to use (need to factor in the length of the interval and/or sampling period), or plainly N/A (e.g. if you had periods of uneven scraping intervals).

A couple of relevant use cases (added to examples.md):

* Calculate the total cost over the past week given the per-node hourly cost: `integral(hourly_cost[7d]) / 3600`
* Calculate the total energy consumed in Joules over the last day, given the instant power consumption in Watts: `integral(power_rate[1d])`

## How

Formula-wise the implementation is mostly straightforward, but the details are important:

* implements three strategies to calculate the integral via trapezoidal area aggregation: mid-rule (aka _Riemann_'s, default for `strategy=2`), left-wise (`strategy=0`) and right-wise (`strategy=1`); this is done to let the user choose the best one to use depending on their values' "alignment" with the integrated interval.
* there's specific _NaN_ handling (mapped to zero) to simplify handling of the above strategies, mid-rule in particular.
* the result needs to support being wrapped inside another aggregate like `sum_over_time()`, i.e. interval limit handling is very important to avoid double-aggregating these cases.

## Testing

Extensive test cases added to `functions.test`.

### Strategy

Excerpt from functions.go:

```go
// Discrete integral using the simple (trapezoidal) rule.
switch strategy {
case 0:
	// Left-wise rectangle rule, interval [t1, t5): each sub-interval is
	// charged at the earlier sample's value (NaN treated as zero), e.g.:
	//   integral(metric) @t4 = v1*(t2-t1) + v2*(t3-t2) + 0
	value = prevVal
case 1:
	// Right-wise rectangle rule, interval (t1, t5]: each sub-interval is
	// charged at the later sample's value (NaN treated as zero), e.g.:
	//   integral(metric) @t4 = v2*(t2-t1) + 0 + v4*(t4-t3)
	value = currVal
case 2:
	// Mid-point rule (default), interval [t1, t5]. With NaN as zero,
	// "neighboring" non-zero values are aggregated (halved at each
	// interval eval), e.g.:
	//   integral(metric) @t4 = (v1+v2)/2*(t2-t1) + (v2+0)/2*(t3-t2) + (0+v4)/2*(t4-t3)
	if prevVal != 0 || currVal != 0 {
		value = (currVal + prevVal) / 2
	}
}
```

Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
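For intuition, the per-interval accumulation above can be sketched as a standalone function over explicit (timestamp, value) pairs. This is a simplified model, not Prometheus's actual code; `Sample`, `integral`, and the sample data are all illustrative:

```go
package main

import "fmt"

// Sample is an illustrative (timestamp in seconds, value) pair, not a
// Prometheus type.
type Sample struct {
	T float64
	V float64
}

// integral sketches the three strategies from the PR over explicit samples:
//   0: left-wise  (each interval charged at the earlier sample's value)
//   1: right-wise (charged at the later sample's value)
//   2: mid-point  (charged at the average of both, the default)
func integral(samples []Sample, strategy int) float64 {
	var sum float64
	for i := 1; i < len(samples); i++ {
		prev, curr := samples[i-1], samples[i]
		dt := curr.T - prev.T
		var value float64
		switch strategy {
		case 0:
			value = prev.V
		case 1:
			value = curr.V
		case 2:
			value = (prev.V + curr.V) / 2
		}
		sum += value * dt
	}
	return sum
}

func main() {
	// Unevenly spaced samples: no fixed scrape interval is assumed.
	s := []Sample{{0, 1}, {10, 3}, {30, 3}, {35, 0}}
	fmt.Println(integral(s, 0)) // 1*10 + 3*20 + 3*5 = 85
	fmt.Println(integral(s, 1)) // 3*10 + 3*20 + 0*5 = 90
	fmt.Println(integral(s, 2)) // 2*10 + 3*20 + 1.5*5 = 87.5
}
```

Note how the three strategies only differ for samples whose neighbors have different values; on a flat series they all agree.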
* use _left-point_ and _right-point_ to document `strategy=0,1` (better idiom, I guess?), also matching the well-known _mid-point_ wording
* add `integral(absent(metric_name)[1d:5m])` example \m/
* `s/Joules/kWh/` (PR description and `examples.md`), as it's more familiar and likely SEO-friendlier ;)

Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
…en fix integral(metricbignum, 2) case to give a small non-zero number \o/ Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
…> `[1m:200ms]`, also adjust eval range parameters to only go thru period once (1m to 2m step 10s) Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
…c integral case Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
Force-pushed 6235ca2 to d0143e0 (compare)
beorn7
left a comment
Tiny code comment. A more high-level comment will follow soon.
```go
return aggrOverTime(vals, enh, func(s Series) float64 {
	var sum, c float64
```
Maybe `c` → `cSum` for consistency with `value`, `cValue` below.
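The `sum, c` pair in the snippet under review is the compensated (Kahan/Neumaier) summation pattern. A minimal sketch of why the compensation term `c` matters; the name `kahanSumInc` follows the helper mentioned in the PR description, but this standalone version is illustrative, not Prometheus's code:

```go
package main

import "fmt"

// kahanSumInc sketches Neumaier-style compensated summation: c accumulates
// the low-order bits lost when adding inc to sum.
func kahanSumInc(inc, sum, c float64) (newSum, newC float64) {
	t := sum + inc
	if abs(sum) >= abs(inc) {
		c += (sum - t) + inc // low-order bits of inc were lost
	} else {
		c += (inc - t) + sum // low-order bits of sum were lost
	}
	return t, c
}

func abs(x float64) float64 {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	var naive, sum, c float64 = 1e16, 1e16, 0
	for i := 0; i < 1000; i++ {
		naive += 0.1 // each 0.1 vanishes entirely next to 1e16
		sum, c = kahanSumInc(0.1, sum, c)
	}
	fmt.Println(naive - 1e16)   // 0: the thousand increments were lost
	fmt.Println(sum + c - 1e16) // 100: recovered by the compensation term
}
```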
First of all, thanks for the high-quality PR with all the test cases, comprehensive documentation, and thoughtful explanations. I have a bunch of considerations at various levels of detail.

Let's start with the simplest one: This needs to be behind the experimental functions feature flag.

The next one is about UX: The strategy argument is quite a source of confusion, first of all in terms of readability. Like all code, PromQL is more often read than written, and a reader seeing a bare numeric strategy argument has little chance of guessing what it means.

Which gets us to the use cases. Let's iterate through the examples you have given above.
I think … Similarly, I would ask where the … I think the true magic of … As hinted at above, I always wanted a "weighted" version of …

One interesting byproduct of the sub-query approach is that it deals quite well with a series disappearing or springing into existence during the interval, especially if it happens multiple times, mostly because it implicitly takes staleness into account. The implementation in this PR handles this in a possibly problematic way. I would need ASCII art now, or a photograph of my whiteboard, for a proper explanation, but let's try: If the …

This is just the current state of my stream of consciousness. My semi-educated gut feeling currently tells me we might just be better off with sub-queries as a workaround, because that expresses pretty clearly what we are doing in terms of existing PromQL primitives. It's a bit clunky but doesn't come with surprises for PromQL experts.

The points against this gut feeling are maybe the following: If the use case is more than just niche, we might argue that it should not rely on something like a workaround via sub-queries. What would have even more significance (IMHO) would be if we actually needed "more magic" to handle the case of radically different scrape intervals and series dropping in and out, i.e. something we could not emulate with sub-queries. But for that, the current implementation would need more tweaks in any case. Note that the current implementation of …

I'll stop here. As said, this is incomplete and in no way conclusive. Looking forward to further thoughts.
Makes sense, but just note that range vector selectors currently don't get populated with any staleness markers (staleness markers are only considered for instant vector selectors, and that's it), so this would have to change in the engine. And then all other functions that currently rely on range vectors not containing staleness markers would have to take them into account.
Yeah, that's kind of what troubles me. I think for rate calculation, it's deliberate not to take staleness markers into account. As said above, that's the beauty of counters: we just check the count of events. We don't care if the series stopped existing in the middle of the range. (We do care about starting after the beginning and stopping before the end of the range, and for that we have a heuristic in place, and it's consistent in how it handles both ends. We could utilize a staleness marker for the end, and we could even use it to refine our heuristics to calculate the average sample interval. However, I wouldn't apply those subtle changes right now. I'd rather include those considerations in the pending re-thinking of the whole rate calculation (cf. the "xrate debate").)

However, for this integral function we are in a somewhat different position, because now gaps in the middle of the range are really important. Also, we do the whole exercise here to accommodate irregular scrape intervals, including gaps. In cases where those changes aren't happening, we are fine with the current …

Implementation-wise, we could make the staleness marker available to functions using ranges if they need it. It's nasty, because so far we can abstract away the staleness marker very early in the stack, essentially in the TSDB query, and PromQL doesn't have to be aware of it at all.

Increasingly often, it feels to me that we need a better model of the start and end and sampling interval of each (contiguous piece of a) series. This could help with an actually improved rate calculation, and would also allow a clean implementation of an integral function.
I don't agree with this part. I'm only forced to integrate using `sum_over_time` when I can't get to the original counter metric. This is typically because I'm downstream of some recording rule, which is doing aggregation and relabelling. This being the case, missing samples are not gaps in the underlying data, but gaps in the recording rule. My assumption is that the rule failed to evaluate in the time allowed for evaluation, hence the gap. Meaning I believe a good assumption for the value where the gap lies would not be a zero rate, but a rate identical to the immediately preceding datapoint in that rate metric. Any thoughts @jjo? As the original PR author, I'd love to hear your take on this. I think you understand the real-world conditions under which this function would be used better than most people.
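For what it's worth, the PR's left-wise rule (`strategy=0`) already behaves the way described above across a gap: each sub-interval is charged at the preceding sample's value, so the datapoint seen just before a gap keeps contributing through it. A tiny sketch (hypothetical standalone code, not the PR's implementation; names are mine):

```go
package main

import "fmt"

// leftPointIntegral sketches strategy=0: each sub-interval is charged at
// the preceding sample's value, i.e. the value observed just before a gap
// is carried through the gap.
func leftPointIntegral(ts, vs []float64) float64 {
	var sum float64
	for i := 1; i < len(ts); i++ {
		sum += vs[i-1] * (ts[i] - ts[i-1])
	}
	return sum
}

func main() {
	// A recording-rule gap from t=60 to t=90: the rate 5 seen at t=60
	// keeps contributing across the gap.
	fmt.Println(leftPointIntegral([]float64{0, 60, 90}, []float64{5, 5, 5})) // 5*60 + 5*30 = 450
}
```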
One more thought: I agree with Bjorn that I'd like to just add one function to start with (I'm fine with midpoint) and see if that is enough (YAGNI, "you ain't gonna need it", for the other two). And my naming preference is for "integral". A "weighted sum over time" isn't very meaningful to me personally. I think it's more than just weighted: "weighted" only implies giving more weight to the samples that are more spaced out; it doesn't imply scaling up the result for all samples.
How about Monte Carlo integration? Shouldn't that be somewhat robust towards missing samples?
As discussed with @MichaHoffmann, Monte Carlo integration works with a random distribution of samples that is representative of the full time series. However, what we are trying to solve here is precisely the problem that the samples in the TSDB are not always representative. If the scrape interval changes within the range (the most common problematic case), we systematically over-sample the part where the scrape interval is shorter. If samples are missing (because of scrape problems), we systematically under-sample that part. One could argue that you already get Monte Carlo integration if just using …
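The over-sampling bias described above can be demonstrated numerically. The sketch below (illustrative standalone code; `unweightedMean` stands in for what a plain `avg_over_time` does, `timeWeightedMean` for a mid-point trapezoidal average) shows a series that is 10 for the densely sampled first minute and 0 for the sparsely sampled second minute:

```go
package main

import "fmt"

type sample struct{ T, V float64 }

// unweightedMean mirrors a plain average: every sample counts equally,
// no matter how much wall-clock time it represents.
func unweightedMean(ss []sample) float64 {
	var sum float64
	for _, s := range ss {
		sum += s.V
	}
	return sum / float64(len(ss))
}

// timeWeightedMean: mid-point trapezoidal area divided by the range length.
func timeWeightedMean(ss []sample) float64 {
	var area float64
	for i := 1; i < len(ss); i++ {
		area += (ss[i].V + ss[i-1].V) / 2 * (ss[i].T - ss[i-1].T)
	}
	return area / (ss[len(ss)-1].T - ss[0].T)
}

func main() {
	// Value 10 sampled every 1s for the first 60s (dense), then 0 sampled
	// every 30s for the next 60s (sparse): the dense part dominates the
	// unweighted mean.
	var ss []sample
	for t := 0.0; t <= 60; t++ {
		ss = append(ss, sample{t, 10})
	}
	ss = append(ss, sample{90, 0}, sample{120, 0})
	fmt.Printf("unweighted=%.2f time-weighted=%.2f\n", unweightedMean(ss), timeWeightedMean(ss))
	// unweighted=9.68 time-weighted=6.25
}
```

The unweighted mean is pulled far toward 10 by the oversampled first minute, while the time-weighted mean stays close to the true time average.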
If I had to decide this now, and if I had to decide alone, I would say I don't see sufficient use for this addition in its current state. But I don't feel I understand this deeply enough to really make the call (or to come up with a better solution). I would love to get other opinions, but so far very little has materialized. I submitted the dev-summit agenda item in the hope of getting more feedback. (I don't think the dev-summit is the suitable format for a deep dive into all the gory details I partially dug up above, but it's a good way of finding out if others have had similar thoughts and how much interest there is in anything like this.) However, it didn't get a lot of votes and never made it to the top of the agenda. So I refer back to what I said above:
So the one message we get from the (lack of) feedback is that there is not a lot of interest in this feature among the active developers. Which reinforces my gut feeling that it might not be worth going down the rabbit hole of finding out how to implement this feature in a useful way.
I think such an involved new function should always be introduced behind the experimental functions feature flag. Everything I said above was under the assumption that we would do so. The recurring theme with these "hard" problems (integral, but also "xrate") is that we need a better understanding of the beginning and end of a series. The fundamentalist point of view would be to solve that first and then come back to the problems at hand. However, as usual, I don't want to block anything that others want with my maybe too fundamental and fatalistic view. But it needs somebody else to champion and shepherd this.
Hi, I don't have anything to contribute to this excellent discussion; I more just wanted to voice my support for this feature. The integral feature has many uses, which the primary author notes. I would add my own use case, which is to integrate compute usage to get CPU-hours and GPU-hours for a user on a large-scale cluster. A regular sum does not do the right thing, nor does it have useful units; a proper integral would solve this problem. Thanks for your hard work on this, it's fascinating to see all the details involved. I hope to see this adopted soon.
Almost a year after adding this as an agenda item for the dev-summit, it was finally discussed. See notes. Sadly, the outcome I had hoped for didn't happen. Ideally, one or more people really deeply into this topic would have popped up, joining forces with @jjo to get this across the finish line. As I said above, I don't think my current understanding of this topic is sufficient to accept or reject this feature. But it was also pretty obvious that nobody else at the dev-summit had a deeper understanding than me. As also said above, I don't want to block this, but I would like to see someone with a good understanding of this champion it before we merge this in whatever form.

Another take-away from the dev-summit is IMHO that there is no burning desire to have this feature (as already assumed above). It would have meant something if a bunch of people at the dev-summit had stated that they don't understand the details but really need a feature like this. (Whereas my own inner state currently is that I have something close to a burning desire, but I feel the current state of this approach isn't there yet to really satisfy it.)

I currently lack the time to deepen my understanding here. Just by writing this comment, I have to resist the urge to get nerd-sniped by this. After re-reading all of the above, my current gut feeling is that we are in a not-too-bad state, because we can use the sub-query workaround in the meantime.

What I would like to drop here (as I did at the dev-summit): There is currently work ongoing on an improved rate calculation (see this proposal and all the stuff linked from it). At the same time, there is active work on a more "native" treatment of metrics with delta temporality. All of these things touch some common ground, and I expect we will learn a lot from these efforts, how they are used in real-world scenarios, and what will work and what won't. I could see these learnings also informing how to approach an integral function.
(I have a few more detailed thoughts, but I won't dump them into this already too long comment.)
While this looks useful, some of the use cases would also be addressed by permitting …
Hello from the bug scrub! This seems to be still under discussion. Might be interesting for @roidelapluie to have a look at #16965.

Promql: implement `integral(v range-vector, strategy=2 scalar)` function.

Refs: some previous discussion at #1335 and https://groups.google.com/g/prometheus-developers/c/3EuF6ffNt_k.

## What

Implement `integral(v range-vector, strategy=2 scalar)` function.

## Why

There are use cases where `sum_over_time()` or `avg_over_time()` multiplied by the length of the whole interval are cumbersome to use (need to factor in the length of the interval and/or sampling period), or plainly N/A (e.g. if you had periods of uneven scraping intervals).

Some relevant use cases (added to examples.md):

* Calculate the total cost over the past week given the per-node hourly cost: `integral(hourly_cost[7d]) / 3600`
* Calculate the total energy consumed in kWh over the last day, given the instant power consumption in Watts: `integral(power_rate[1d]) / 1000 / 3600`
* For those "boolean" metrics that produce `1`s or `0`s, integrating them over time will give the time in seconds that the value was `1`; for example, to get the amount of time a particular metric has been absent during the past day (sampled every 5 minutes): `integral(absent(metric_name)[1d:5m])`
* As another case for measuring an amount of time: a boolean metric representing a service level agreement (SLA) being met, divided by the total time, will give the SLA ratio: `integral(service_sla_ok[7d]) / (3600 * 24 * 7)`

## How

Formula-wise the implementation is mostly straightforward, but the details are important:

* implements three strategies to calculate the integral via trapezoidal area aggregation: mid-point (aka _Riemann_'s, default for `strategy=2`), left-point (`strategy=0`) and right-point (`strategy=1`); this is done to let the user choose the best one to use depending on their values' "alignment" with the integrated interval.
* there's specific _NaN_ handling (mapped to zero) to simplify implementing the above strategies, mid-point in particular.
* mid-point averaging (i.e. `(prevVal+currVal)/2`) is implemented with `kahanSumInc()` to keep big-vs-small numbers support for neighboring samples.
* the result needs to support being wrapped inside another aggregate like `sum_over_time()`, i.e. interval limit handling is very important to avoid double-aggregating the edge(s).

## Testing

Extensive test cases added to `functions.test`.

### Strategy

Excerpt from functions.go:

Signed-off-by: JuanJo Ciarlante <juanjosec@gmail.com>
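The unit conversion in the energy example above can be checked with a small worked example. Assuming a flat load (a hypothetical 1500 W sustained over a full day), `integral()` of a power series in watts yields watt-seconds (joules), and the `/ 1000 / 3600` factor converts that to kWh; `joulesToKWh` is an illustrative helper, not part of the PR:

```go
package main

import "fmt"

// joulesToKWh converts watt-seconds (joules) to kWh, matching the
// `/ 1000 / 3600` factor in the example above.
func joulesToKWh(j float64) float64 { return j / 1000 / 3600 }

func main() {
	// Illustrative flat load: 1500 W sustained over a full day.
	const watts, seconds = 1500.0, 24 * 3600.0
	joules := watts * seconds // what integral(power_rate[1d]) would return
	fmt.Println(joulesToKWh(joules)) // 1500 W for a day = 36 kWh
}
```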