feat: Add MergeMaxByAggregations optimizer to combine multiple max_by calls with same second arg by maniloya · Pull Request #26873 · prestodb/presto

maniloya · 2025-12-29T21:08:37Z

Description

This PR introduces a new Presto optimizer rule MergeMaxByAggregations that merges multiple max_by aggregations sharing the same comparison key into a single max_by aggregation with a ROW argument.

Key Changes

Added MergeMaxByAggregations optimizer rule in presto-main-base
Added session property merge_max_by_aggregations (default: false) to control the optimization
Created custom plan matchers (DereferenceMatcher, RowConstructorMatcher) for testing DEREFERENCE and ROW_CONSTRUCTOR expressions
Added dereference(), dereferenceInRange(), and rowConstructor() helper methods to PlanMatchPattern

Motivation and Context

When a query contains multiple max_by aggregations with the same comparison key, each aggregation independently scans and compares values. By merging them into a single max_by(ROW(...), key) call, we:

Reduce computation: Only one comparison operation per row instead of N comparisons
Improve memory efficiency: Single aggregation state instead of N separate states
Reduce plan complexity: Fewer aggregation function calls to process

Impact

Public API: New session property merge_max_by_aggregations (boolean, default false)
Performance: Queries with multiple max_by aggregations on the same key will see reduced CPU and memory usage when the optimization is enabled
No breaking changes: The optimization is disabled by default and must be explicitly enabled

Test Plan

Added comprehensive unit tests in TestMergeMaxByAggregations and e2e tests in AbstractTestQueries

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

sourcery-ai · 2025-12-29T21:08:45Z

Reviewer's Guide

Introduces a new MergeMaxByAggregations optimizer rule that, when enabled via a new session/config property, rewrites multiple compatible MAX_BY aggregations sharing the same comparison key into a single MAX_BY over a ROW plus dereference projections, along with new plan matchers and tests to validate the behavior at rule and query levels.

Sequence diagram for applying MergeMaxByAggregations during planning

sequenceDiagram
    participant Client
    participant Coordinator
    participant Planner
    participant Session
    participant Rule as MergeMaxByAggregations

    Client->>Coordinator: submit SQL query
    Coordinator->>Planner: create logical plan
    Planner->>Session: get system properties
    Session-->>Planner: MERGE_MAX_BY_AGGREGATIONS value

    Planner->>Rule: isEnabled(session)
    Rule->>Session: SystemSessionProperties.isMergeMaxByAggregationsEnabled(session)
    Session-->>Rule: boolean enabled
    Rule-->>Planner: enabled

    alt rule enabled
        Planner->>Rule: apply(aggregationNode, captures, context)
        Rule->>Rule: collect MAX_BY aggregations with two args
        Rule->>Rule: group by MaxByKey (comparisonKey, filter, mask, distinct)
        Rule->>Rule: identify mergeable groups (size >= 2)
        alt mergeable groups exist
            Rule->>Rule: build bottom ProjectNode with ROW_CONSTRUCTOR expressions
            Rule->>Rule: build new merged MAX_BY calls over ROW
            Rule->>Rule: build AggregationNode with newAggregations
            Rule->>Rule: build top ProjectNode with DEREFERENCE projections
            Rule-->>Planner: Result.ofPlanNode(topProjectNode)
        else no mergeable groups
            Rule-->>Planner: Result.empty()
        end
    else rule disabled
        Rule-->>Planner: not applied
    end

    Planner->>Coordinator: optimized plan
    Coordinator-->>Client: execute query with optimized aggregations

Class diagram for MergeMaxByAggregations optimizer rule

classDiagram
    class MergeMaxByAggregations {
        - String MAX_BY
        - FunctionResolution functionResolution
        - Pattern aggregationPattern
        + MergeMaxByAggregations(FunctionAndTypeManager functionAndTypeManager)
        + boolean isEnabled(Session session)
        + Pattern getPattern()
        + Result apply(AggregationNode node, Captures captures, Context context)
        - boolean isMaxByAggregationWithTwoArgs(Aggregation aggregation)
    }

    class MaxByKey {
        - RowExpression comparisonKey
        - Optional filter
        - Optional mask
        - boolean isDistinct
        + MaxByKey(Aggregation aggregation)
        + boolean equals(Object other)
        + int hashCode()
    }

    class AggregationNode {
        + Map getAggregations()
        + List getOutputVariables()
        + PlanNode getSource()
        + List getGroupingSets()
        + List getPreGroupedVariables()
        + Step getStep()
        + Optional getHashVariable()
        + Optional getGroupIdVariable()
        + Optional getAggregationId()
    }

    class Aggregation {
        + RowExpression getCall()
        + Optional getFilter()
        + Optional getOrderBy()
        + boolean isDistinct()
        + Optional getMask()
    }

    class CallExpression {
        + String getDisplayName()
        + List getArguments()
        + Type getType()
        + FunctionHandle getFunctionHandle()
    }

    class ProjectNode {
        + ProjectNode(SourceLocation sourceLocation, PlanNodeId id, PlanNode source, Assignments assignments, Locality locality)
    }

    class FunctionResolution {
        + FunctionResolution(FunctionAndTypeResolver resolver)
        + boolean isMaxByFunction(FunctionHandle functionHandle)
        + FunctionHandle lookupBuiltInFunction(String name, List types)
    }

    class FunctionAndTypeManager {
        + FunctionAndTypeResolver getFunctionAndTypeResolver()
    }

    class Context {
        + VariableAllocator getVariableAllocator()
        + PlanNodeIdAllocator getIdAllocator()
    }

    class VariableAllocator {
        + VariableReferenceExpression newVariable(String hint, Type type)
        + VariableReferenceExpression newVariable(CallExpression callExpression)
    }

    class SystemSessionProperties {
        + static boolean isMergeMaxByAggregationsEnabled(Session session)
    }

    MergeMaxByAggregations --> AggregationNode : Rule~AggregationNode~
    MergeMaxByAggregations --> FunctionResolution
    MergeMaxByAggregations --> SystemSessionProperties : uses
    MergeMaxByAggregations o-- MaxByKey : groups_by
    MergeMaxByAggregations --> ProjectNode : builds
    MergeMaxByAggregations --> Aggregation : merges
    MergeMaxByAggregations --> CallExpression : inspects
    MergeMaxByAggregations --> Context : uses
    Context --> VariableAllocator
    FunctionResolution --> FunctionAndTypeManager

File-Level Changes

Change	Details	Files
Add MergeMaxByAggregations optimizer rule to combine compatible MAX_BY aggregations into a single MAX_BY over a ROW plus projection dereferences.	Implement a Rule that detects multiple two-arg max_by calls with the same comparison key, filter, mask, and distinctness and groups them by key. Rewrite each mergeable group by adding a bottom Project that builds a ROW_CONSTRUCTOR over value arguments, replacing per-value max_by aggregations with a single max_by(ROW(...), key) aggregation, and a top Project that exposes original outputs via DEREFERENCE on the merged result. Gate the rule behind a session property check isMergeMaxByAggregationsEnabled and ensure it only fires for built-in max_by with exactly two arguments.	`presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java`
Expose configuration and session property to control the MergeMaxByAggregations optimization.	Add mergeMaxByAggregationsEnabled flag to FeaturesConfig with corresponding @config and accessor methods. Introduce MERGE_MAX_BY_AGGREGATIONS system/session property wired to the FeaturesConfig default and helper accessor isMergeMaxByAggregationsEnabled(Session).	`presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java` `presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java`
Extend plan test infrastructure with matchers for DEREFERENCE and ROW_CONSTRUCTOR and helper methods in PlanMatchPattern.	Add DereferenceMatcher to match DEREFERENCE special form expressions, with support for single index and index ranges, and integrate it via dereference()/dereferenceInRange() helpers. Add RowConstructorMatcher to match ROW_CONSTRUCTOR special form expressions in a projection, checking that a set of expected fields appears (order-agnostic), and expose it via rowConstructor(...) helpers in PlanMatchPattern.	`presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/DereferenceMatcher.java` `presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/RowConstructorMatcher.java` `presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/PlanMatchPattern.java`
Add unit tests for MergeMaxByAggregations rule and end-to-end query tests to validate correctness with the new optimization toggled.	Create TestMergeMaxByAggregations to cover positive rewrites (multiple max_by with same key, grouped cases, mixed aggregations, multiple key-groups) and negative cases (single max_by, differing keys, filters, distinct, extra arguments, disabled session). Add AbstractTestQueries#testMergeMaxByAggregations that runs representative SQL queries with MERGE_MAX_BY_AGGREGATIONS enabled and disabled and asserts identical results in various scenarios including GROUP BY, mixed aggregations, multiple key groups, and non-mergeable keys.	`presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestMergeMaxByAggregations.java` `presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 4 issues, and left some high level feedback:

In MergeMaxByAggregations.PATTERN you detect candidates via getCall().getDisplayName().equals("max_by"), while the main logic uses functionResolution.isMaxByFunction(...); consider using the same function-resolution-based check in the pattern to avoid missing or misidentifying max_by calls (e.g., if naming or resolution changes).
The MaxByKey grouping key currently considers only the comparison expression, filter, mask, and distinct flag; if Aggregation.getOrderBy() is non-empty, merging aggregations with different ORDER BY clauses could change semantics, so it may be safer to include orderBy (and any other behavior-affecting properties) in MaxByKey.equals/hashCode.
The session property registration for ADD_DISTINCT_BELOW_SEMI_JOIN_BUILD was changed to use the four-argument booleanProperty(..., false) overload when adding MERGE_MAX_BY_AGGREGATIONS; double-check that the visibility/experimental flags for the existing property are intentional and unchanged in behavior.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `MergeMaxByAggregations.PATTERN` you detect candidates via `getCall().getDisplayName().equals("max_by")`, while the main logic uses `functionResolution.isMaxByFunction(...)`; consider using the same function-resolution-based check in the pattern to avoid missing or misidentifying max_by calls (e.g., if naming or resolution changes).
- The `MaxByKey` grouping key currently considers only the comparison expression, filter, mask, and distinct flag; if `Aggregation.getOrderBy()` is non-empty, merging aggregations with different ORDER BY clauses could change semantics, so it may be safer to include `orderBy` (and any other behavior-affecting properties) in `MaxByKey.equals`/`hashCode`.
- The session property registration for `ADD_DISTINCT_BELOW_SEMI_JOIN_BUILD` was changed to use the four-argument `booleanProperty(..., false)` overload when adding `MERGE_MAX_BY_AGGREGATIONS`; double-check that the visibility/experimental flags for the existing property are intentional and unchanged in behavior.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:70-77` </location>
<code_context>
+public class MergeMaxByAggregations
+        implements Rule<AggregationNode>
+{
+    private static final String MAX_BY = "max_by";
+
+    private final FunctionResolution functionResolution;
+
+    private static final Pattern<AggregationNode> PATTERN = aggregation()
+            .matching(MergeMaxByAggregations::hasMultipleMaxBy);
+
+    private static boolean hasMultipleMaxBy(AggregationNode aggregationNode)
+    {
+        return aggregationNode.getAggregations().values().stream()
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Avoid relying on displayName and a hard-coded string to detect MAX_BY in the pattern predicate.

The pattern uses `agg.getCall().getDisplayName().equals(MAX_BY)` while `apply()` uses `functionResolution.isMaxByFunction(...)`. This mismatch can cause the pattern to fire differently than the logic in `apply()` (e.g., with case, qualification, or future renames). Please align the predicate with the function-resolution-based approach (e.g., use the function handle rather than the display name) so the match condition and core logic stay in sync.

Suggested implementation:

```java
    private final FunctionResolution functionResolution;
    private final Pattern<AggregationNode> pattern;

    private boolean hasMultipleMaxBy(AggregationNode aggregationNode)
    {
        return aggregationNode.getAggregations().values().stream()
                .filter(agg -> functionResolution.isMaxByFunction(agg.getCall().getFunctionHandle()))
                .count() > 1;
    }

    public MergeMaxByAggregations(FunctionAndTypeManager functionAndTypeManager)
    {
        requireNonNull(functionAndTypeManager, "functionAndTypeManager is null");
        this.functionResolution = new FunctionResolution(functionAndTypeManager.getFunctionAndTypeResolver());
        this.pattern = aggregation()
                .matching(this::hasMultipleMaxBy);
    }

```

There is likely a `getPattern()` method later in the class that currently returns the static `PATTERN` field. That method should be updated to return the new instance field instead:

1. Change `return PATTERN;` to `return pattern;` in `getPattern()`.
2. Remove the now-unused static `PATTERN` field declaration if it appears elsewhere in the file.
</issue_to_address>

### Comment 2
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:255-257` </location>
<code_context>
+    private static class MaxByKey
+    {
+        private final RowExpression comparisonKey;
+        private final Optional<RowExpression> filter;
+        private final Optional<VariableReferenceExpression> mask;
+        private final boolean isDistinct;
+
+        public MaxByKey(Aggregation aggregation)
</code_context>

<issue_to_address>
**issue (bug_risk):** Consider including ORDER BY in MaxByKey to avoid merging aggregations with different ordering.

MaxByKey currently keys aggregations by comparisonKey, filter, mask, and distinct, but not `Aggregation.getOrderBy()`. The merged aggregation then uses `firstAggregation.getOrderBy()`. If two MAX_BY aggregations share those fields but differ in ORDER BY, they will be merged and all will effectively use the first aggregation’s ORDER BY, changing semantics. If this can occur, either add orderBy to MaxByKey’s key fields or explicitly reject non-empty/differing ORDER BY to avoid incorrect rewrites.
</issue_to_address>

### Comment 3
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:104-106` </location>
<code_context>
+    @Override
+    public Result apply(AggregationNode node, Captures captures, Context context)
+    {
+        Map<VariableReferenceExpression, Aggregation> maxByAggregations = node.getAggregations().entrySet().stream()
+                .filter(entry -> isMaxByAggregationWithTwoArgs(entry.getValue()))
+                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+
+        if (maxByAggregations.size() < 2) {
</code_context>

<issue_to_address>
**suggestion:** Use a map implementation that preserves insertion order explicitly instead of relying on Collectors.toMap defaults.

This stream currently collects into a default `HashMap` via `Collectors.toMap`, so the iteration order is unspecified. Since `groupedByKey` is later a `LinkedHashMap` to keep grouping order, consider collecting directly into a `LinkedHashMap` (e.g., `Collectors.toMap(..., LinkedHashMap::new, (a, b) -> a)`) or using `ImmutableMap.toImmutableMap` to ensure deterministic ordering and plan stability.

```suggestion
        Map<VariableReferenceExpression, Aggregation> maxByAggregations = node.getAggregations().entrySet().stream()
                .filter(entry -> isMaxByAggregationWithTwoArgs(entry.getValue()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,
                        LinkedHashMap::new));
```
</issue_to_address>

### Comment 4
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestMergeMaxByAggregations.java:42-51` </location>
<code_context>
+                                .withAlias("max_v3", dereferenceInRange("max_by", 0, 2)));
+    }
+
+    @Test
+    public void testDoesNotFireWithSingleMaxBy()
+    {
+        tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
+                .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
+                .on(p -> p.aggregation(af -> {
+                    p.variable("v1", BIGINT);
+                    p.variable("k", BIGINT);
+                    af.globalGrouping()
+                            .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(v1, k)"))
+                            .source(p.values(p.variable("v1"), p.variable("k")));
+                }))
+                .doesNotFire();
+    }
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a positive-plan test for merging DISTINCT MAX_BY aggregations

You already have a negative test (`testDoesNotMergeMaxByWithDistinct`) for mixed DISTINCT/non-DISTINCT `MAX_BY`, which is good. Since the rule also groups by `isDistinct` in `MaxByKey`, please add a positive test where both `MAX_BY` aggregations are DISTINCT, share the same comparison key and filter, and are verified via `PlanMatchPattern` to merge into a single DISTINCT `max_by(ROW(...), key)` aggregation. This will help catch regressions where the DISTINCT flag is mishandled in the all-DISTINCT case.

Suggested implementation:

```java
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.aggregation;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.column;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.globalAggregation;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.node;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.project;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.rowConstructor;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.singleGroupingSet;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.values;

```

```java
     @Test
     public void testDoesNotFireWithSingleMaxBy()
     {
         tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
                 .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
                 .on(p -> p.aggregation(af -> {
                     p.variable("v1", BIGINT);
                     p.variable("k", BIGINT);
                     af.globalGrouping()
                             .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(v1, k)"))
                             .source(p.values(p.variable("v1"), p.variable("k")));
                 }))
                 .doesNotFire();
     }

     @Test
     public void testMergesDistinctMaxByWithSameKeyAndFilter()
     {
         tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
                 .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
                 .on(p -> p.aggregation(af -> {
                     p.variable("v1", BIGINT);
                     p.variable("v2", BIGINT);
                     p.variable("k", BIGINT);
                     p.variable("f", BOOLEAN);

                     af.globalGrouping()
                             // both DISTINCT, same key and same filter
                             .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(DISTINCT v1, k) FILTER (WHERE f)"))
                             .addAggregation(p.variable("max_v2"), p.rowExpression("max_by(DISTINCT v2, k) FILTER (WHERE f)"))
                             .source(p.values(p.variable("v1"), p.variable("v2"), p.variable("k"), p.variable("f")));
                 }))
                 .matches(
                         // Expect a single DISTINCT max_by over a ROW(v1, v2) and key k
                         project(
                                 node(AggregationNode.class,
                                         globalAggregation(
                                                 singleGroupingSet(),
                                                 ImmutableMap.of(
                                                         Optional.empty(),
                                                         aggregation(
                                                                 "max_by",
                                                                 ImmutableList.of(
                                                                         rowConstructor(column("v1"), column("v2")),
                                                                         column("k")))
                                                                 .distinct()))))));
     }


```

Depending on the exact PlanMatchPattern APIs and existing conventions in this test class, you might need to:
1. Adjust the `singleGroupingSet()` call in the matcher to explicitly specify the grouping key symbol (for example, `singleGroupingSet("k")` or `singleGroupingSet(column("k"))`) to match how other tests in this file describe grouping sets.
2. Confirm that `Optional.empty()` is an acceptable key in the `ImmutableMap` used by `globalAggregation`. If other tests key aggregations by symbol name (e.g., `Optional.of("max")`), align with that convention and use the appropriate alias.
3. If other tests use a different style for plan matching (for example, they use `aggregation(...)` directly inside `node(AggregationNode.class, ...)` without `globalAggregation(...)`), rewrite the `.matches(...)` section to mirror that pattern while still asserting:
   - there is exactly one `max_by` aggregation,
   - its first argument is `ROW(v1, v2)`,
   - its second argument is `k`,
   - and it is `.distinct()`.
These tweaks will make the new positive DISTINCT test consistent with the rest of the file and ensure it remains robust against minor plan-shape differences.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-12-29T21:15:34Z

+        private final Optional<RowExpression> filter;
+        private final Optional<VariableReferenceExpression> mask;
+        private final boolean isDistinct;


issue (bug_risk): Consider including ORDER BY in MaxByKey to avoid merging aggregations with different ordering.

MaxByKey currently keys aggregations by comparisonKey, filter, mask, and distinct, but not Aggregation.getOrderBy(). The merged aggregation then uses firstAggregation.getOrderBy(). If two MAX_BY aggregations share those fields but differ in ORDER BY, they will be merged and all will effectively use the first aggregation’s ORDER BY, changing semantics. If this can occur, either add orderBy to MaxByKey’s key fields or explicitly reject non-empty/differing ORDER BY to avoid incorrect rewrites.

kaikalur

Also add MIN_BY in the same PR. They work the same way. And if they have the 3rd arg, make sure to merge only those with the same 3rd arg.

kaikalur · 2025-12-29T21:30:51Z

 Enable optimization to combine multiple :func:`!approx_distinct` function calls on expressions
 of the same type into a single aggregation using ``set_agg`` with array operations (``array_constructor``, ``array_transpose``).
+
+``optimizer.merge_max_by_aggregations``


Should also work the same way for min_by aggs. Change the option name to: ...max_by_and_min_by_...

steveburnett

LGTM! (docs)

Pull branch, local doc build, looks good. Thank you!

…mbine multiple max_by with same second argument, controlled by session parameter

kaikalur · 2026-01-13T18:09:57Z

You should not need this. Once you check the name, just the function is different. Everything else is same. So just reuse the same class and call it MinByAndMaxBy. Look at the other optimization rewrite_min_max_by_to_top_n for ideas on simplifying this code.

@kaikalur

…comparison key (#27417) ## Description Adds a new `MergeMinMaxByAggregations` optimizer rule that merges multiple `max_by` or `min_by` aggregations sharing the same comparison key into a single aggregation with a `ROW` argument and `DEREFERENCE` projections. This is a takeover of #26873, addressing all reviewer feedback from @kaikalur and Sourcery. ### Key Changes - Unified `MergeMinMaxByAggregations` class handling both `max_by` and `min_by` (per kaikalur review) - Session property renamed to `merge_max_by_and_min_by_aggregations` (per kaikalur review) - `AggregationKey` includes `orderBy` to prevent merging aggregations with different ordering (per Sourcery review) - Instance-based pattern using `FunctionResolution` instead of string comparison (per Sourcery review) - `LinkedHashMap` for deterministic plan ordering (per Sourcery review) - 17 unit tests + 10 e2e integration tests ### Example ```sql -- Before optimization: SELECT max_by(v1, k), max_by(v2, k), max_by(v3, k) FROM t -- After optimization (internal rewrite): SELECT merged[0], merged[1], merged[2] FROM (SELECT max_by(ROW(v1, v2, v3), k) AS merged FROM t) ``` ## Motivation and Context When a query contains multiple `max_by`/`min_by` aggregations with the same comparison key, each independently scans and compares values. By merging into a single `max_by(ROW(...), key)` call: 1. **Reduced computation**: One comparison per row instead of N 2. **Better memory efficiency**: Single aggregation state instead of N separate states 3. **Simpler plan**: Fewer aggregation function calls ## Impact - New session property `merge_max_by_and_min_by_aggregations` (boolean, default `false`) - No breaking changes — optimization is disabled by default ## Test Plan - 17 unit tests in `TestMergeMinMaxByAggregations` covering positive rewrites (same key, GROUP BY, mixed aggs, multiple groups, both max_by+min_by, matching filters, different value types, DISTINCT) and negative cases (single agg, different keys, disabled, 3-arg, no max_by/min_by, different filters, mixed distinct) - 10 e2e integration tests in `AbstractTestQueries#testMergeMinMaxByAggregations` comparing results with optimization enabled vs disabled - All 523 TestLocalQueries pass ## Contributor checklist - [x] Complies with contributing guide - [x] PR description addresses the issue accurately - [x] Documented new session property - [x] Adequate tests added - [x] CI passed locally ## Release Notes ``` == NO RELEASE NOTE == ``` ## Summary by Sourcery Introduce an optimizer rule to merge compatible max_by and min_by aggregations sharing the same comparison key into a single aggregation and wire it behind a configurable session property. New Features: - Add MergeMinMaxByAggregations rule to rewrite multiple max_by or min_by aggregations with the same comparison key into a single ROW-based aggregation. - Introduce the merge_max_by_and_min_by_aggregations session property and corresponding features configuration flag. Documentation: - Document the new merge_max_by_and_min_by_aggregations session property in the admin session properties documentation. Tests: - Add rule unit tests covering positive and negative cases for merging max_by/min_by aggregations, including filters, GROUP BY, DISTINCT, and mixed types. - Add integration tests ensuring queries produce identical results with the merge_max_by_and_min_by_aggregations optimization enabled or disabled.

kaikalur · 2026-04-24T21:26:49Z

Closing as abandoned — superseded by #27417 which was merged.

@kaikalur

…comparison key (prestodb#27417) ## Description Adds a new `MergeMinMaxByAggregations` optimizer rule that merges multiple `max_by` or `min_by` aggregations sharing the same comparison key into a single aggregation with a `ROW` argument and `DEREFERENCE` projections. This is a takeover of prestodb#26873, addressing all reviewer feedback from @kaikalur and Sourcery. ### Key Changes - Unified `MergeMinMaxByAggregations` class handling both `max_by` and `min_by` (per kaikalur review) - Session property renamed to `merge_max_by_and_min_by_aggregations` (per kaikalur review) - `AggregationKey` includes `orderBy` to prevent merging aggregations with different ordering (per Sourcery review) - Instance-based pattern using `FunctionResolution` instead of string comparison (per Sourcery review) - `LinkedHashMap` for deterministic plan ordering (per Sourcery review) - 17 unit tests + 10 e2e integration tests ### Example ```sql -- Before optimization: SELECT max_by(v1, k), max_by(v2, k), max_by(v3, k) FROM t -- After optimization (internal rewrite): SELECT merged[0], merged[1], merged[2] FROM (SELECT max_by(ROW(v1, v2, v3), k) AS merged FROM t) ``` ## Motivation and Context When a query contains multiple `max_by`/`min_by` aggregations with the same comparison key, each independently scans and compares values. By merging into a single `max_by(ROW(...), key)` call: 1. **Reduced computation**: One comparison per row instead of N 2. **Better memory efficiency**: Single aggregation state instead of N separate states 3. **Simpler plan**: Fewer aggregation function calls ## Impact - New session property `merge_max_by_and_min_by_aggregations` (boolean, default `false`) - No breaking changes — optimization is disabled by default ## Test Plan - 17 unit tests in `TestMergeMinMaxByAggregations` covering positive rewrites (same key, GROUP BY, mixed aggs, multiple groups, both max_by+min_by, matching filters, different value types, DISTINCT) and negative cases (single agg, different keys, disabled, 3-arg, no max_by/min_by, different filters, mixed distinct) - 10 e2e integration tests in `AbstractTestQueries#testMergeMinMaxByAggregations` comparing results with optimization enabled vs disabled - All 523 TestLocalQueries pass ## Contributor checklist - [x] Complies with contributing guide - [x] PR description addresses the issue accurately - [x] Documented new session property - [x] Adequate tests added - [x] CI passed locally ## Release Notes ``` == NO RELEASE NOTE == ``` ## Summary by Sourcery Introduce an optimizer rule to merge compatible max_by and min_by aggregations sharing the same comparison key into a single aggregation and wire it behind a configurable session property. New Features: - Add MergeMinMaxByAggregations rule to rewrite multiple max_by or min_by aggregations with the same comparison key into a single ROW-based aggregation. - Introduce the merge_max_by_and_min_by_aggregations session property and corresponding features configuration flag. Documentation: - Document the new merge_max_by_and_min_by_aggregations session property in the admin session properties documentation. Tests: - Add rule unit tests covering positive and negative cases for merging max_by/min_by aggregations, including filters, GROUP BY, DISTINCT, and mixed types. - Add integration tests ensuring queries produce identical results with the merge_max_by_and_min_by_aggregations optimization enabled or disabled.

maniloya requested review from a team, elharo, feilong-liu, jaystarshot and vivek-bharathan as code owners December 29, 2025 21:08

prestodb-ci added the from:Meta PR from Meta label Dec 29, 2025

sourcery-ai Bot reviewed Dec 29, 2025

View reviewed changes

maniloya force-pushed the mani-merge-maxby branch from a532ced to 332b90e Compare December 29, 2025 21:24

maniloya requested a review from steveburnett as a code owner December 29, 2025 21:24

kaikalur requested changes Dec 29, 2025

View reviewed changes

maniloya force-pushed the mani-merge-maxby branch from 332b90e to 62c1e4d Compare December 29, 2025 21:33

steveburnett previously approved these changes Jan 8, 2026

View reviewed changes

Add MergeMaxByAggregations and MergeMinByAggregations optimizer to co…

ab558fd

…mbine multiple max_by with same second argument, controlled by session parameter

maniloya dismissed steveburnett’s stale review via ab558fd January 13, 2026 17:58

maniloya force-pushed the mani-merge-maxby branch from 62c1e4d to ab558fd Compare January 13, 2026 17:58

kaikalur requested changes Jan 13, 2026

View reviewed changes

kaikalur mentioned this pull request Mar 23, 2026

feat(optimizer): Merge multiple max_by/min_by aggregations with same comparison key #27417

Merged

5 tasks

mbasmanova mentioned this pull request Apr 24, 2026

Optimize multiple max_by/min_by aggregations on the same ordering column facebookincubator/velox#17336

Open

kaikalur closed this Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add MergeMaxByAggregations optimizer to combine multiple max_by calls with same second arg#26873

feat: Add MergeMaxByAggregations optimizer to combine multiple max_by calls with same second arg#26873
maniloya wants to merge 1 commit into
prestodb:masterfrom
maniloya:mani-merge-maxby

maniloya commented Dec 29, 2025

Uh oh!

sourcery-ai Bot commented Dec 29, 2025 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Dec 29, 2025

Uh oh!

Uh oh!

kaikalur left a comment

Uh oh!

kaikalur Dec 29, 2025

Uh oh!

steveburnett left a comment

Uh oh!

kaikalur Jan 13, 2026

Uh oh!

kaikalur commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

maniloya commented Dec 29, 2025

Description

Key Changes

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Uh oh!

sourcery-ai Bot commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for applying MergeMaxByAggregations during planning

Class diagram for MergeMaxByAggregations optimizer rule

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kaikalur left a comment

Choose a reason for hiding this comment

Uh oh!

kaikalur Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

steveburnett left a comment

Choose a reason for hiding this comment

Uh oh!

kaikalur Jan 13, 2026

Choose a reason for hiding this comment

Uh oh!

kaikalur commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sourcery-ai Bot commented Dec 29, 2025 •

edited

Loading