PromQL engine: Delay deletion of __name__ label to the end of the query evaluation#14477
Conversation
|
/prombench main |
|
/prombench cancel |
|
Benchmark cancel is in progress. |
|
@bboreham I have pushed a change that removes what I believe are unnecessary duplicate-detection function calls. This is now done in one pass only at the end of the query evaluation. The rationale is the following:
I would expect this to help with CPU usage. I have run some benchmarks and I don't see significant differences compared to main. Would it make sense to run another round of prombench to see how it behaves? In the meantime I will try to reproduce the original issue to make sure we are not introducing a regression here. also cc @beorn7 in case I am missing something important here |
|
I have tried to reproduce the issue in #4562 by renaming a metric and running similar queries. I picked an internal prometheus metric ( metric_relabel_configs:
- source_labels: [__name__]
regex: '(.*go_memstats_frees_total.*)'
replacement: '${1}_new'
target_label: __name__The rate still gives an error as expected: The aggregation works and does not show gaps during the transition: |
c445dc2 to
c3c9d5e
Compare
|
The feature is now behind a feature flag and this PR is ready for review 🙇 |
aa3e625 to
43a4b0a
Compare
|
My (unplugged) vacations start ~now. I didn't get to this in time, sorry for that. I guess we should run prombench on this with the latest improvements (but of course, for that you have to hardcode setting the feature flag first so that prombench actually uses the new code). Happy to do the detailed review once I'm back, but if @bboreham or anyone else beats me to it, even better. |
|
I'm finally back from vacations and the subsequent catch-up… Note that we had some interesting discussion on the original issue starting here. The concerns are valid but we decided to go forward with the approach in this PR as we want to find out the practical implications and maybe get some inspiration how to deal with them. That's after all the reason why we are doing this as an experimental feature. |
|
@jcreixell could you (temporarily) hard-wire the feature flag to "set" in this PR so that I can trigger another Prombench run? |
|
Note for later: https://github.com/prometheus/prometheus/blob/main/docs/feature_flags.md needs an update, too. |
3ee2c72 to
aef7424
Compare
|
@beorn7 I have now rebased and flipped the flag, feel free to trigger prombench. I have also added some tentative content to the feature flags doc. |
|
/prombench main |
|
/prombench cancel |
|
Benchmark cancel is in progress. |
|
Prombench looks good. If at all, I can only see a slightly lower CPU usage with this PR, no other noticeable differences. Of course, Pormbench isn't perfectly measuring everything, but I think we can greenlight this from a performance perspective. (It's behind a feature flag, so if we see regressions in the wild, we can just rollback.) @jcreixell could you re-enable the normal feature flag behavior? @bboreham do you have any other concerns? I will put this on my list for a detailed code-level review, provided there are no other objections against doing this at all. |
|
@beorn7 ready on my side |
|
Will review ASAP (more likely early next week than this week…). |
beorn7
left a comment
There was a problem hiding this comment.
Looks very good overall. My comments are just about comments… :o)
Very general thought about naming: Why "ShouldDropName" and not just "DropName"? The "Should" sounds like there are code paths where it is not dropped or something. That feels misleading to me. Plus, the name seems unnecessarily long.
781e680 to
658be86
Compare
658be86 to
d21dc53
Compare
…ry evaluation - This change allows optionally preserving the `__name__` label via the `label_replace` and `label_join` functions, and helps prevent the dreaded "vector cannot contain metrics with the same labelset" error. - The implementation extends the `Series` and `Sample` structs with a boolean flag indicating whether the `__name__` label should be deleted at the end of the query evaluation. - The `label_replace` and `label_join` functions can still access the value of the `__name__` label, even if it has been previously marked for deletion. If `__name__` is used as target label, it won't be dropped at the end of the query evaluation. - Fixes prometheus#11397 - See #2 for previous discussion, including the decision to create this PR and benchmark it before considering other alternatives (like refactoring `labels.Labels`). - See #1 for an alternative implementation using a special label instead of boolean flags. - Note: a feature flag `promql-delayed-name-removal` has been added as it changes the behavior of some "weird" queries (see prometheus#11397 (comment)) Example (this always fails, as `__name__` is being dropped by `count_over_time`): ``` count_over_time({__name__!=""}[1m]) => Error executing query: vector cannot contain metrics with the same labelset ``` Before: ``` label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)") => Error executing query: vector cannot contain metrics with the same labelset ``` After: ``` label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)") => count_go_gc_cycles_automatic_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1 count_go_gc_cycles_forced_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1 ... ``` Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
d21dc53 to
a708458
Compare
|
@beorn7 I addressed your last comment and squashed the commits, this should be ready to go! |
Signed-off-by: Björn Rabenstein <github@rabenste.in>
beorn7
left a comment
There was a problem hiding this comment.
Will merge on green. Thank you very much.
|
Should #10921 be closed? |
|
Maybe. I'll comment on that issue. |
#### What this PR does Implement prometheus/prometheus#14477 in MQE. #### Which issue(s) this PR fixes or relates to Fixes grafana/mimir-squad#3065 #### Checklist - [x] Tests updated. - [x] Documentation added. - [x] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry is not needed, please add the `changelog-not-needed` label to the PR. - [x] [`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md) updated with experimental features. --------- Co-authored-by: Charles Korn <charleskorn@users.noreply.github.com>


__name__label via thelabel_replaceandlabel_joinfunctions, and helps prevent the dreaded "vector cannot contain metrics with the same labelset" error.SeriesandSamplestructs with a boolean flag indicating whether the__name__label should be deleted at the end of the query evaluation.label_replaceandlabel_joinfunctions can still access the value of the__name__label, even if it has been previously marked for deletion. If__name__is used as target label, it won't be dropped at the end of the query evaluation.vector cannot contain metrics with the same labelset#11397labels.Labels).promql-delayed-name-removalhas been added as it changes the behavior of some "weird" queries (see Stop removing __name__ from results: preventvector cannot contain metrics with the same labelset#11397 (comment))Example (this always fails, as
__name__is being dropped bycount_over_time):Before:
After: