Skip to content

rules: allow using alternative PromQL engines for rule evaluation#13713

Merged
bwplotka merged 2 commits intoprometheus:mainfrom
charleskorn:query-engine-interface
Mar 13, 2024
Merged

rules: allow using alternative PromQL engines for rule evaluation#13713
bwplotka merged 2 commits intoprometheus:mainfrom
charleskorn:query-engine-interface

Conversation

@charleskorn
Copy link
Contributor

This PR makes it possible to use alternative PromQL query engines for rule evaluation in projects like Mimir or Thanos, similar to the changes introduced in #10050 and #12516.

I've moved the QueryEngine interface introduced in #10050 to the promql package and changed the web/api/v1 and rules packages to use it. Another option would be to introduce a QueryEngine interface specifically for the rules package that just contains NewInstantQuery(), as that's the only method used in the rules package - open to feedback on which approach is preferable.

Signed-off-by: Charles Korn <charles.korn@grafana.com>
@beorn7
Copy link
Member

beorn7 commented Mar 6, 2024

I guess it would be good to get a review from somebody not working at Grafana to avoid bias.
@bwplotka @roidelapluie maybe one of you?

@fpetkovski does this maybe also help Thanos?

@fpetkovski
Copy link
Contributor

fpetkovski commented Mar 6, 2024

I don't think this changes anything for Thanos at the moment, but it should enable reusing more code from Prometheus in the future. Happy to see an interface instead a the concrete class.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo @juliusv comment

Signed-off-by: Charles Korn <charles.korn@grafana.com>
// QueryEngine defines the interface for the *promql.Engine, so it can be replaced, wrapped or mocked.
type QueryEngine interface {
NewInstantQuery(ctx context.Context, q storage.Queryable, opts QueryOpts, qs string, ts time.Time) (Query, error)
NewRangeQuery(ctx context.Context, q storage.Queryable, opts QueryOpts, qs string, start, end time.Time, interval time.Duration) (Query, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something completely outside of the scope of the PR, but I wonder why the interface is coupled to the prometheus storage format. This parameter looks like something that should be part of the constructor of a concrete implementation. Not having storage.Queryable here would allow us to query different data sources which are not necessarily related to Prometheus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, storage.Queryable is just an abstract interface that is not tied to the Prometheus storage format (and that is also used with other storage backends in other systems already). Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is true and flexible to an extent, but we are still limited by the parameters of that interface. For example, there is no good way to request only certain labels for a series since such a parameter does not exist in the Select statement. In Thanos we now have a way to run PromQL in a distributed manner, and telling a remote engine to execute a query is not possible through that interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is possible through context 🙈 (we already do that in Thanos, and it's 💩 )

On serious note, we probably need some good repro example on where is the problem e.g. test/one go file and I think it could make sense to move queryable out of this interface for flexibility (or some other way like Select with some more generic params 🤔 ). Not for this PR for sure

@bboreham
Copy link
Member

Did you see one of the CI tests failed?

@charleskorn
Copy link
Contributor Author

Did you see one of the CI tests failed?

Yep, looks like that test is flaky - it sets up a rule to run every 1s, waits 2s and expects that at least two more queries have been logged. The test can likely be made more robust, but this seems outside the scope of this PR.

@beorn7
Copy link
Member

beorn7 commented Mar 13, 2024

I have massaged the flaky tests to pass now. @juliusv @bwplotka @bboreham could any of you verify that this is good to go?

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

@bwplotka bwplotka merged commit 312e3fd into prometheus:main Mar 13, 2024
@charleskorn charleskorn deleted the query-engine-interface branch March 14, 2024 00:08
charleskorn added a commit to grafana/mimir that referenced this pull request Apr 9, 2024
* Use forked Prometheus module

* Use promql.QueryEngine rather than *promql.Engine, add config option to select engine

* Add parallel query path with streaming engine.

* Add IntelliJ debugging configuration for streaming querier.

* Add Grafana datasource for streaming querier

* Use correct query-scheduler port and enable debugging

* Set up load test.

* Fix p99 latency panel

* Commit settings used in previous test.

* Enable chunks streaming.

* Disable load generator.

* Create separate load testing query paths to ensure meta-monitoring queries don't interfere with load test results.

* Add initial k6-based load test script

* Add script to run load tests against both engines.

* Add todo

* Extract friendly name to hostname logic to separate script

* Rename k6 output file

* Check both engines produce the same result.

* Summarise results after each test run.

* Enable streaming chunks from store-gateways to queriers.

* Refactor test script, print max and average CPU and memory after test

* Reorder variables to match output order

* Don't include query name in k6 output file name.

* Add initial script to generate data.

* Rename script

* Add script to generate data generation config.

* Make it easier to test different queries.

* Rename PromQL engine CLI flag.

* Add Thanos' engine.

* Add Thanos engine to benchmarking setup.

* Set default test duration to 5 minutes.

* Rework data generation setup to work around limitation in prometheus-toolbox.

* Load test data faster.

* Record test config for first test

* Record test config for second test

* Record test config for third test

* Disable gRPC compression for ingester client

* Record test config for re-run of second test

* Record test config for re-run of first test

* Record test config for re-run of third test

* Log URL used for test.

* Record test config for fourth test

* Record test config for fifth test

* Record test config for 10k series test

* Record test config for 1k series test

* Record test config for 50k series test

* Fix configuration issue that does not affect outcome of testing.

This does not affect the outcome of testing because all tests use
queriers directly (ie. do not use query-frontends or query-schedulers)

* Record test config for seventh test

* Move production code to main Mimir code tree

* Copy test files into repo

* Add license and provenance comments

* Add a readme.

* Update package path in scripts and remove unnecessary testing script

* Add more notes to code

* Explain benefit of streaming engine.

* Mention benchmarking tools in readme.

* Remove references to Thanos' engine.

* Undo unnecessary go.mod changes

* Remove benchmarking tool for now

We can add something more robust in the future if we need it.

* Remove docker-compose changes

* Reset whitespace changes

* Remove readme

* Vendor in branch of mimir-prometheus with changes from prometheus/prometheus#13713

* Fix issues introduced with recent Prometheus changes and fix linting issues

* Detect and reject configuration not supported by streaming engine

* Remove TODOs tracked elsewhere

* Bring in latest changes to `QueryEngine` interface

* Remove more TODOs tracked elsewhere

* Change signature of `Next()` to return an error when the stream is exhausted

* Ignore warnings when comparing results from Prometheus and streaming engines.

* Correct naming

* Rename `Operator` to clarify that it produces instant vectors.

* Rename test file

* Reorganise test file

* Bring in latest benchmark changes from upstream.

* Run tests and benchmarks with isolation disabled, to mirror what Mimir itself does

* Fix script permissions

* Simplify test setup and don't bother testing 2000 series case to speed up tests.

* Use assertion helpers in benchmark

* Improve engine benchmark comparison script: fix issue with package name containing "streaming", clearly label results

* Remove unnecessary files

* Add note to engine benchmark comparison script

* Make test names clearer

* Ensure both engines produce the same result before benchmarking

* Add tests for unsupported expressions and improve error messages returned when unsupported expression used.

* Fix linting warnings

* Add notes for invalid cases that should be caught by the PromQL parser.

* Implement `Query.Statement()`

* Give variable a better name

* Remove Pool interface

* Remove unnecessary types

* Update comments

* Remove unused lookback delta for range vector selector

* Extract selector logic to a shared type

* Add more TODOs

* Use Selector for range vector selectors.

* Compute values used on every call to Next() once in instant and range vector selectors.

This gives a 0-7% latency improvement for queries with many series at
small absolute cost to queries with few series.

Memory consumption is slightly higher (tens of bytes), but not
concerning.

* Remove unnecessary script and update comment

* Validate query time range

* Enforce that range query expression produces a vector or scalar result.

* Add CLI flag to experimental features list in docs

* Add changelog entry

* Fix linting warnings

* Move common PromQL engine configuration to its own package

* Don't bother with 10 series benchmarks.

They're not very interesting, and the benchmarks take too long to run.

* Sort imports

* Add missing license header.

* Use bucketed pool based on zeropool.

This greatly improves latency and allocations, with latency
improvements between 0 and 15% on our benchmarks.

* Fix import sorting

* Set upper limits for sizes of pooled slices.

* Set size for SeriesBatch pool

* Add integration test.

* Move RingBuffer out of util package and pool slices used.

This improves latency of rate-related benchmarks by between 0 and 11%.

* Add tests for ring buffer type.

* Return RingBuffer point slices to pool when no longer required.

* Fix linting

* Remove TODOs about pooling group slices and maps.

Pooling these slices did not have a substantial performance impact.

* Pool groups in aggregation operator.

This reduces latency of aggregation operations by 0-2% in our
benchmarks.

* Remove TODOs about pooling label builders.

Pooling these had a negative performance impact.

* Remove more TODOs, clarify interface expectations

* Release memory sooner after query execution

* Fix typo in comment

* Remove unecessary context cancellation checks from operators.

Context cancellation is handled by the underlying queryable, at least
for the PromQL expressions we currently support.

* Fix linting warnings

* Address remaining TODOs in Selector

* Fix integration test flakiness

* Don't panic when encountering a native histogram

* Fix issue where bucketed pool could return a slice with capacity less than the requested size.

* Remove redundant checks

* Add support for @ in instant and range vector selectors.

This is used extensively by Prometheus' test scripts, so it's easier
to just implement support for this rather than try to workaround it
or disable it.

* Run Prometheus' test cases against streaming engine.

There are still some failing test cases, but these appear to be
genuine issues that need to be investigated.

* Fix failing staleness test: don't emit empty series for instant query

* Upgrade mimir-prometheus

* Add test case for stale markers in a range query.

* Add test cases for @ modifier with range queries

* Check for errors while reading test script

* Fix whitespace

* Remove duplicate timestamp computation in Selector and operators

* Update `rate` calculation to match behaviour in prometheus/prometheus#13725

* Add an overview of the engine.

* Fix linting issues and clarify docs.

* Add test case for scenario where input to aggregation returns no results or no points, and remove stale TODOs

* Implement `Query.String()`, and remove stale TODOs.

* Clarify BucketedPool contract, and add test case for case where min size is greater than 1.

* Expand test coverage to include scenarios not covered by upstream tests or those uniquely likely to cause issues in the streaming engine.

* Elaborate on comment

* Expand example in docs and fix typo

* Remove unnecessary sorting and use non-experimental slices package in comparison tests.

* Add missing comments.

* Clean up temporary files in benchmark comparison script, and improve formatting of output

* Add note explaining purpose of script.

* Address PR feedback

* Replace use of `DurationMilliseconds` with `d.Milliseconds()`

* Rename `Series()` to `SeriesMetadata()`

* Include names of acceptable values in CLI flag description

* Remove TODOs

* Add explanation for RangeVectorSelectorWithTransformation

* Move to top-level `streamingpromql` package

* Move common query engine config back to original location (revert e3933a2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants