Skip to content

PromQL engine: Delay deletion of __name__ label to the end of the query evaluation#14477

Merged
beorn7 merged 2 commits intoprometheus:mainfrom
jcreixell:delete-name-label-deletion
Aug 29, 2024
Merged

PromQL engine: Delay deletion of __name__ label to the end of the query evaluation#14477
beorn7 merged 2 commits intoprometheus:mainfrom
jcreixell:delete-name-label-deletion

Conversation

@jcreixell
Copy link
Contributor

@jcreixell jcreixell commented Jul 15, 2024

Example (this always fails, as __name__ is being dropped by count_over_time):

count_over_time({__name__!=""}[1m])

=> Error executing query: vector cannot contain metrics with the same labelset

Before:

label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)")

=> Error executing query: vector cannot contain metrics with the same labelset

After:

label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)")

=> 
count_go_gc_cycles_automatic_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1
count_go_gc_cycles_forced_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1
...

@jcreixell jcreixell requested a review from roidelapluie as a code owner July 15, 2024 19:26
@bboreham
Copy link
Member

/prombench main

@prombot
Copy link
Contributor

prombot commented Jul 16, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-14477 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@bboreham
Copy link
Member

/prombench cancel

@prombot
Copy link
Contributor

prombot commented Jul 18, 2024

Benchmark cancel is in progress.

@jcreixell
Copy link
Contributor Author

@bboreham I have pushed a change that removes what I believe are unnecessary duplicate-detection function calls. This is now done in one pass only at the end of the query evaluation. The rationale is the following:

  • Duplicate detection was initially introduced in Add duplicate-labelset check for range/instant vectors #4589 to prevent issues with duplicated metrics in certain scenarios like renaming an exporter metric and using a regexp to match both the old and new name, in combination with name-dropping operations.
  • Since the __name__ label is used to calculate the hash that is used to detect duplicates, and it's now not being dropped until the very end, it will never detect duplicates until .DropMetricName() is executed, so the other places where duplicate detection is called are mostly a no-op.

I would expect this to help with CPU usage. I have run some benchmarks and I don't see significant differences compared to main. Would it make sense to run another round of prombench to see how it behaves?

In the meantime I will try to reproduce the original issue to make sure we are not introducing a regression here.

also cc @beorn7 in case I am missing something important here

@jcreixell
Copy link
Contributor Author

I have tried to reproduce the issue in #4562 by renaming a metric and running similar queries. I picked an internal prometheus metric (go_memstats_frees_total), collected some data, then added the following rule:

    metric_relabel_configs:
    - source_labels: [__name__]
      regex: '(.*go_memstats_frees_total.*)'
      replacement: '${1}_new'
      target_label: __name__

The rate still gives an error as expected:
Screenshot from 2024-07-19 10-57-21

The aggregation works and does not show gaps during the transition:

Screenshot from 2024-07-19 10-18-41

@jcreixell jcreixell force-pushed the delete-name-label-deletion branch 3 times, most recently from c445dc2 to c3c9d5e Compare July 22, 2024 15:49
@jcreixell
Copy link
Contributor Author

The feature is now behind a feature flag and this PR is ready for review 🙇

@jcreixell jcreixell force-pushed the delete-name-label-deletion branch 2 times, most recently from aa3e625 to 43a4b0a Compare July 22, 2024 17:43
@beorn7
Copy link
Member

beorn7 commented Jul 25, 2024

My (unplugged) vacations start ~now. I didn't get to this in time, sorry for that. I guess we should run prombench on this with the latest improvements (but of course, for that you have to hardcode setting the feature flag first so that prombench actually uses the new code). Happy to do the detailed review once I'm back, but if @bboreham or anyone else beats me to it, even better.

@beorn7
Copy link
Member

beorn7 commented Aug 8, 2024

I'm finally back from vacations and the subsequent catch-up…

Note that we had some interesting discussion on the original issue starting here. The concerns are valid but we decided to go forward with the approach in this PR as we want to find out the practical implications and maybe get some inspiration how to deal with them. That's after all the reason why we are doing this as an experimental feature.

@beorn7
Copy link
Member

beorn7 commented Aug 8, 2024

@jcreixell could you (temporarily) hard-wire the feature flag to "set" in this PR so that I can trigger another Prombench run?
At that same time, it might also be good to rebase on top of main so that we have a valid comparison (and the merge conflicts would also be gone).

@beorn7
Copy link
Member

beorn7 commented Aug 8, 2024

Note for later: https://github.com/prometheus/prometheus/blob/main/docs/feature_flags.md needs an update, too.

@jcreixell jcreixell force-pushed the delete-name-label-deletion branch 2 times, most recently from 3ee2c72 to aef7424 Compare August 12, 2024 11:31
@jcreixell
Copy link
Contributor Author

jcreixell commented Aug 12, 2024

@beorn7 I have now rebased and flipped the flag, feel free to trigger prombench.

I have also added some tentative content to the feature flags doc.

@beorn7
Copy link
Member

beorn7 commented Aug 13, 2024

/prombench main

@prombot
Copy link
Contributor

prombot commented Aug 13, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-14477 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@beorn7
Copy link
Member

beorn7 commented Aug 14, 2024

/prombench cancel

@prombot
Copy link
Contributor

prombot commented Aug 14, 2024

Benchmark cancel is in progress.

@beorn7
Copy link
Member

beorn7 commented Aug 14, 2024

Prombench looks good. If at all, I can only see a slightly lower CPU usage with this PR, no other noticeable differences.

Of course, Pormbench isn't perfectly measuring everything, but I think we can greenlight this from a performance perspective. (It's behind a feature flag, so if we see regressions in the wild, we can just rollback.)

@jcreixell could you re-enable the normal feature flag behavior?

@bboreham do you have any other concerns?

I will put this on my list for a detailed code-level review, provided there are no other objections against doing this at all.

@jcreixell
Copy link
Contributor Author

@beorn7 ready on my side

@beorn7
Copy link
Member

beorn7 commented Aug 15, 2024

Will review ASAP (more likely early next week than this week…).

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good overall. My comments are just about comments… :o)

Very general thought about naming: Why "ShouldDropName" and not just "DropName"? The "Should" sounds like there are code paths where it is not dropped or something. That feels misleading to me. Plus, the name seems unnecessarily long.

…ry evaluation

  - This change allows optionally preserving the `__name__` label via the `label_replace` and `label_join` functions, and helps prevent the dreaded "vector cannot contain metrics with the same labelset" error.
  - The implementation extends the `Series` and `Sample` structs with a boolean flag indicating whether the `__name__` label should be deleted at the end of the query evaluation.
  - The `label_replace` and `label_join` functions can still access the value of the `__name__` label, even if it has been previously marked for deletion. If  `__name__` is used as target label, it won't be dropped at the end of the query evaluation.
  - Fixes prometheus#11397
  - See #2 for previous discussion, including the decision to create this PR and benchmark it before considering other alternatives (like refactoring `labels.Labels`).
  - See #1 for an alternative implementation using a special label instead of boolean flags.
  - Note: a feature flag  `promql-delayed-name-removal` has been added as it changes the behavior of some "weird" queries (see prometheus#11397 (comment))

Example (this always fails, as `__name__` is being dropped by `count_over_time`):

```
count_over_time({__name__!=""}[1m])

=> Error executing query: vector cannot contain metrics with the same labelset
```

Before:

```
label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)")

=> Error executing query: vector cannot contain metrics with the same labelset
```

After:

```
label_replace(count_over_time({__name__!=""}[1m]), "__name__", "count_$1", "__name__", "(.+)")

=>
count_go_gc_cycles_automatic_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1
count_go_gc_cycles_forced_gc_cycles_total{instance="localhost:9090", job="prometheus"} 1
...
```

Signed-off-by: Jorge Creixell <jcreixell@gmail.com>
@jcreixell jcreixell force-pushed the delete-name-label-deletion branch from d21dc53 to a708458 Compare August 29, 2024 09:53
@jcreixell
Copy link
Contributor Author

@beorn7 I addressed your last comment and squashed the commits, this should be ready to go!

Signed-off-by: Björn Rabenstein <github@rabenste.in>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will merge on green. Thank you very much.

@beorn7 beorn7 merged commit e9e3d64 into prometheus:main Aug 29, 2024
@bboreham
Copy link
Member

bboreham commented Sep 2, 2024

Should #10921 be closed?

@beorn7
Copy link
Member

beorn7 commented Sep 3, 2024

Maybe. I'll comment on that issue.

zenador added a commit to grafana/mimir that referenced this pull request Sep 9, 2025
#### What this PR does

Implement prometheus/prometheus#14477 in MQE.

#### Which issue(s) this PR fixes or relates to

Fixes grafana/mimir-squad#3065

#### Checklist

- [x] Tests updated.
- [x] Documentation added.
- [x] `CHANGELOG.md` updated - the order of entries should be
`[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry
is not needed, please add the `changelog-not-needed` label to the PR.
- [x]
[`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md)
updated with experimental features.

---------

Co-authored-by: Charles Korn <charleskorn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stop removing __name__ from results: prevent vector cannot contain metrics with the same labelset

4 participants