Skip to content

feat: Add MergeMaxByAggregations optimizer to combine multiple max_by calls with same second arg#26873

Closed
maniloya wants to merge 1 commit into
prestodb:masterfrom
maniloya:mani-merge-maxby
Closed

feat: Add MergeMaxByAggregations optimizer to combine multiple max_by calls with same second arg#26873
maniloya wants to merge 1 commit into
prestodb:masterfrom
maniloya:mani-merge-maxby

Conversation

@maniloya

Copy link
Copy Markdown
Contributor

Description

This PR introduces a new Presto optimizer rule MergeMaxByAggregations that merges multiple max_by aggregations sharing the same comparison key into a single max_by aggregation with a ROW argument.

Key Changes

  • Added MergeMaxByAggregations optimizer rule in presto-main-base
  • Added session property merge_max_by_aggregations (default: false) to control the optimization
  • Created custom plan matchers (DereferenceMatcher, RowConstructorMatcher) for testing DEREFERENCE and ROW_CONSTRUCTOR expressions
  • Added dereference(), dereferenceInRange(), and rowConstructor() helper methods to PlanMatchPattern

Motivation and Context

When a query contains multiple max_by aggregations with the same comparison key, each aggregation independently scans and compares values. By merging them into a single max_by(ROW(...), key) call, we:

  1. Reduce computation: Only one comparison operation per row instead of N comparisons
  2. Improve memory efficiency: Single aggregation state instead of N separate states
  3. Reduce plan complexity: Fewer aggregation function calls to process

Impact

  1. Public API: New session property merge_max_by_aggregations (boolean, default false)
  2. Performance: Queries with multiple max_by aggregations on the same key will see reduced CPU and memory usage when the optimization is enabled
  3. No breaking changes: The optimization is disabled by default and must be explicitly enabled

Test Plan

Added comprehensive unit tests in TestMergeMaxByAggregations and e2e tests in AbstractTestQueries

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

@sourcery-ai

sourcery-ai Bot commented Dec 29, 2025

Copy link
Copy Markdown
Contributor

Reviewer's Guide

Introduces a new MergeMaxByAggregations optimizer rule that, when enabled via a new session/config property, rewrites multiple compatible MAX_BY aggregations sharing the same comparison key into a single MAX_BY over a ROW plus dereference projections, along with new plan matchers and tests to validate the behavior at rule and query levels.

Sequence diagram for applying MergeMaxByAggregations during planning

sequenceDiagram
    participant Client
    participant Coordinator
    participant Planner
    participant Session
    participant Rule as MergeMaxByAggregations

    Client->>Coordinator: submit SQL query
    Coordinator->>Planner: create logical plan
    Planner->>Session: get system properties
    Session-->>Planner: MERGE_MAX_BY_AGGREGATIONS value

    Planner->>Rule: isEnabled(session)
    Rule->>Session: SystemSessionProperties.isMergeMaxByAggregationsEnabled(session)
    Session-->>Rule: boolean enabled
    Rule-->>Planner: enabled

    alt rule enabled
        Planner->>Rule: apply(aggregationNode, captures, context)
        Rule->>Rule: collect MAX_BY aggregations with two args
        Rule->>Rule: group by MaxByKey (comparisonKey, filter, mask, distinct)
        Rule->>Rule: identify mergeable groups (size >= 2)
        alt mergeable groups exist
            Rule->>Rule: build bottom ProjectNode with ROW_CONSTRUCTOR expressions
            Rule->>Rule: build new merged MAX_BY calls over ROW
            Rule->>Rule: build AggregationNode with newAggregations
            Rule->>Rule: build top ProjectNode with DEREFERENCE projections
            Rule-->>Planner: Result.ofPlanNode(topProjectNode)
        else no mergeable groups
            Rule-->>Planner: Result.empty()
        end
    else rule disabled
        Rule-->>Planner: not applied
    end

    Planner->>Coordinator: optimized plan
    Coordinator-->>Client: execute query with optimized aggregations
Loading

Class diagram for MergeMaxByAggregations optimizer rule

classDiagram
    class MergeMaxByAggregations {
        - String MAX_BY
        - FunctionResolution functionResolution
        - Pattern aggregationPattern
        + MergeMaxByAggregations(FunctionAndTypeManager functionAndTypeManager)
        + boolean isEnabled(Session session)
        + Pattern getPattern()
        + Result apply(AggregationNode node, Captures captures, Context context)
        - boolean isMaxByAggregationWithTwoArgs(Aggregation aggregation)
    }

    class MaxByKey {
        - RowExpression comparisonKey
        - Optional filter
        - Optional mask
        - boolean isDistinct
        + MaxByKey(Aggregation aggregation)
        + boolean equals(Object other)
        + int hashCode()
    }

    class AggregationNode {
        + Map getAggregations()
        + List getOutputVariables()
        + PlanNode getSource()
        + List getGroupingSets()
        + List getPreGroupedVariables()
        + Step getStep()
        + Optional getHashVariable()
        + Optional getGroupIdVariable()
        + Optional getAggregationId()
    }

    class Aggregation {
        + RowExpression getCall()
        + Optional getFilter()
        + Optional getOrderBy()
        + boolean isDistinct()
        + Optional getMask()
    }

    class CallExpression {
        + String getDisplayName()
        + List getArguments()
        + Type getType()
        + FunctionHandle getFunctionHandle()
    }

    class ProjectNode {
        + ProjectNode(SourceLocation sourceLocation, PlanNodeId id, PlanNode source, Assignments assignments, Locality locality)
    }

    class FunctionResolution {
        + FunctionResolution(FunctionAndTypeResolver resolver)
        + boolean isMaxByFunction(FunctionHandle functionHandle)
        + FunctionHandle lookupBuiltInFunction(String name, List types)
    }

    class FunctionAndTypeManager {
        + FunctionAndTypeResolver getFunctionAndTypeResolver()
    }

    class Context {
        + VariableAllocator getVariableAllocator()
        + PlanNodeIdAllocator getIdAllocator()
    }

    class VariableAllocator {
        + VariableReferenceExpression newVariable(String hint, Type type)
        + VariableReferenceExpression newVariable(CallExpression callExpression)
    }

    class SystemSessionProperties {
        + static boolean isMergeMaxByAggregationsEnabled(Session session)
    }

    MergeMaxByAggregations --> AggregationNode : Rule~AggregationNode~
    MergeMaxByAggregations --> FunctionResolution
    MergeMaxByAggregations --> SystemSessionProperties : uses
    MergeMaxByAggregations o-- MaxByKey : groups_by
    MergeMaxByAggregations --> ProjectNode : builds
    MergeMaxByAggregations --> Aggregation : merges
    MergeMaxByAggregations --> CallExpression : inspects
    MergeMaxByAggregations --> Context : uses
    Context --> VariableAllocator
    FunctionResolution --> FunctionAndTypeManager
Loading

File-Level Changes

Change Details Files
Add MergeMaxByAggregations optimizer rule to combine compatible MAX_BY aggregations into a single MAX_BY over a ROW plus projection dereferences.
  • Implement a Rule that detects multiple two-arg max_by calls with the same comparison key, filter, mask, and distinctness and groups them by key.
  • Rewrite each mergeable group by adding a bottom Project that builds a ROW_CONSTRUCTOR over value arguments, replacing per-value max_by aggregations with a single max_by(ROW(...), key) aggregation, and a top Project that exposes original outputs via DEREFERENCE on the merged result.
  • Gate the rule behind a session property check isMergeMaxByAggregationsEnabled and ensure it only fires for built-in max_by with exactly two arguments.
presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java
Expose configuration and session property to control the MergeMaxByAggregations optimization.
  • Add mergeMaxByAggregationsEnabled flag to FeaturesConfig with corresponding @config and accessor methods.
  • Introduce MERGE_MAX_BY_AGGREGATIONS system/session property wired to the FeaturesConfig default and helper accessor isMergeMaxByAggregationsEnabled(Session).
presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
Extend plan test infrastructure with matchers for DEREFERENCE and ROW_CONSTRUCTOR and helper methods in PlanMatchPattern.
  • Add DereferenceMatcher to match DEREFERENCE special form expressions, with support for single index and index ranges, and integrate it via dereference()/dereferenceInRange() helpers.
  • Add RowConstructorMatcher to match ROW_CONSTRUCTOR special form expressions in a projection, checking that a set of expected fields appears (order-agnostic), and expose it via rowConstructor(...) helpers in PlanMatchPattern.
presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/DereferenceMatcher.java
presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/RowConstructorMatcher.java
presto-main-base/src/test/java/com/facebook/presto/sql/planner/assertions/PlanMatchPattern.java
Add unit tests for MergeMaxByAggregations rule and end-to-end query tests to validate correctness with the new optimization toggled.
  • Create TestMergeMaxByAggregations to cover positive rewrites (multiple max_by with same key, grouped cases, mixed aggregations, multiple key-groups) and negative cases (single max_by, differing keys, filters, distinct, extra arguments, disabled session).
  • Add AbstractTestQueries#testMergeMaxByAggregations that runs representative SQL queries with MERGE_MAX_BY_AGGREGATIONS enabled and disabled and asserts identical results in various scenarios including GROUP BY, mixed aggregations, multiple key groups, and non-mergeable keys.
presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestMergeMaxByAggregations.java
presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • In MergeMaxByAggregations.PATTERN you detect candidates via getCall().getDisplayName().equals("max_by"), while the main logic uses functionResolution.isMaxByFunction(...); consider using the same function-resolution-based check in the pattern to avoid missing or misidentifying max_by calls (e.g., if naming or resolution changes).
  • The MaxByKey grouping key currently considers only the comparison expression, filter, mask, and distinct flag; if Aggregation.getOrderBy() is non-empty, merging aggregations with different ORDER BY clauses could change semantics, so it may be safer to include orderBy (and any other behavior-affecting properties) in MaxByKey.equals/hashCode.
  • The session property registration for ADD_DISTINCT_BELOW_SEMI_JOIN_BUILD was changed to use the four-argument booleanProperty(..., false) overload when adding MERGE_MAX_BY_AGGREGATIONS; double-check that the visibility/experimental flags for the existing property are intentional and unchanged in behavior.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `MergeMaxByAggregations.PATTERN` you detect candidates via `getCall().getDisplayName().equals("max_by")`, while the main logic uses `functionResolution.isMaxByFunction(...)`; consider using the same function-resolution-based check in the pattern to avoid missing or misidentifying max_by calls (e.g., if naming or resolution changes).
- The `MaxByKey` grouping key currently considers only the comparison expression, filter, mask, and distinct flag; if `Aggregation.getOrderBy()` is non-empty, merging aggregations with different ORDER BY clauses could change semantics, so it may be safer to include `orderBy` (and any other behavior-affecting properties) in `MaxByKey.equals`/`hashCode`.
- The session property registration for `ADD_DISTINCT_BELOW_SEMI_JOIN_BUILD` was changed to use the four-argument `booleanProperty(..., false)` overload when adding `MERGE_MAX_BY_AGGREGATIONS`; double-check that the visibility/experimental flags for the existing property are intentional and unchanged in behavior.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:70-77` </location>
<code_context>
+public class MergeMaxByAggregations
+        implements Rule<AggregationNode>
+{
+    private static final String MAX_BY = "max_by";
+
+    private final FunctionResolution functionResolution;
+
+    private static final Pattern<AggregationNode> PATTERN = aggregation()
+            .matching(MergeMaxByAggregations::hasMultipleMaxBy);
+
+    private static boolean hasMultipleMaxBy(AggregationNode aggregationNode)
+    {
+        return aggregationNode.getAggregations().values().stream()
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Avoid relying on displayName and a hard-coded string to detect MAX_BY in the pattern predicate.

The pattern uses `agg.getCall().getDisplayName().equals(MAX_BY)` while `apply()` uses `functionResolution.isMaxByFunction(...)`. This mismatch can cause the pattern to fire differently than the logic in `apply()` (e.g., with case, qualification, or future renames). Please align the predicate with the function-resolution-based approach (e.g., use the function handle rather than the display name) so the match condition and core logic stay in sync.

Suggested implementation:

```java
    private final FunctionResolution functionResolution;
    private final Pattern<AggregationNode> pattern;

    private boolean hasMultipleMaxBy(AggregationNode aggregationNode)
    {
        return aggregationNode.getAggregations().values().stream()
                .filter(agg -> functionResolution.isMaxByFunction(agg.getCall().getFunctionHandle()))
                .count() > 1;
    }

    public MergeMaxByAggregations(FunctionAndTypeManager functionAndTypeManager)
    {
        requireNonNull(functionAndTypeManager, "functionAndTypeManager is null");
        this.functionResolution = new FunctionResolution(functionAndTypeManager.getFunctionAndTypeResolver());
        this.pattern = aggregation()
                .matching(this::hasMultipleMaxBy);
    }

```

There is likely a `getPattern()` method later in the class that currently returns the static `PATTERN` field. That method should be updated to return the new instance field instead:

1. Change `return PATTERN;` to `return pattern;` in `getPattern()`.
2. Remove the now-unused static `PATTERN` field declaration if it appears elsewhere in the file.
</issue_to_address>

### Comment 2
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:255-257` </location>
<code_context>
+    private static class MaxByKey
+    {
+        private final RowExpression comparisonKey;
+        private final Optional<RowExpression> filter;
+        private final Optional<VariableReferenceExpression> mask;
+        private final boolean isDistinct;
+
+        public MaxByKey(Aggregation aggregation)
</code_context>

<issue_to_address>
**issue (bug_risk):** Consider including ORDER BY in MaxByKey to avoid merging aggregations with different ordering.

MaxByKey currently keys aggregations by comparisonKey, filter, mask, and distinct, but not `Aggregation.getOrderBy()`. The merged aggregation then uses `firstAggregation.getOrderBy()`. If two MAX_BY aggregations share those fields but differ in ORDER BY, they will be merged and all will effectively use the first aggregation’s ORDER BY, changing semantics. If this can occur, either add orderBy to MaxByKey’s key fields or explicitly reject non-empty/differing ORDER BY to avoid incorrect rewrites.
</issue_to_address>

### Comment 3
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/MergeMaxByAggregations.java:104-106` </location>
<code_context>
+    @Override
+    public Result apply(AggregationNode node, Captures captures, Context context)
+    {
+        Map<VariableReferenceExpression, Aggregation> maxByAggregations = node.getAggregations().entrySet().stream()
+                .filter(entry -> isMaxByAggregationWithTwoArgs(entry.getValue()))
+                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
+
+        if (maxByAggregations.size() < 2) {
</code_context>

<issue_to_address>
**suggestion:** Use a map implementation that preserves insertion order explicitly instead of relying on Collectors.toMap defaults.

This stream currently collects into a default `HashMap` via `Collectors.toMap`, so the iteration order is unspecified. Since `groupedByKey` is later a `LinkedHashMap` to keep grouping order, consider collecting directly into a `LinkedHashMap` (e.g., `Collectors.toMap(..., LinkedHashMap::new, (a, b) -> a)`) or using `ImmutableMap.toImmutableMap` to ensure deterministic ordering and plan stability.

```suggestion
        Map<VariableReferenceExpression, Aggregation> maxByAggregations = node.getAggregations().entrySet().stream()
                .filter(entry -> isMaxByAggregationWithTwoArgs(entry.getValue()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,
                        LinkedHashMap::new));
```
</issue_to_address>

### Comment 4
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestMergeMaxByAggregations.java:42-51` </location>
<code_context>
+                                .withAlias("max_v3", dereferenceInRange("max_by", 0, 2)));
+    }
+
+    @Test
+    public void testDoesNotFireWithSingleMaxBy()
+    {
+        tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
+                .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
+                .on(p -> p.aggregation(af -> {
+                    p.variable("v1", BIGINT);
+                    p.variable("k", BIGINT);
+                    af.globalGrouping()
+                            .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(v1, k)"))
+                            .source(p.values(p.variable("v1"), p.variable("k")));
+                }))
+                .doesNotFire();
+    }
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add a positive-plan test for merging DISTINCT MAX_BY aggregations

You already have a negative test (`testDoesNotMergeMaxByWithDistinct`) for mixed DISTINCT/non-DISTINCT `MAX_BY`, which is good. Since the rule also groups by `isDistinct` in `MaxByKey`, please add a positive test where both `MAX_BY` aggregations are DISTINCT, share the same comparison key and filter, and are verified via `PlanMatchPattern` to merge into a single DISTINCT `max_by(ROW(...), key)` aggregation. This will help catch regressions where the DISTINCT flag is mishandled in the all-DISTINCT case.

Suggested implementation:

```java
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.aggregation;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.column;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.globalAggregation;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.node;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.project;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.rowConstructor;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.singleGroupingSet;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.values;

```

```java
     @Test
     public void testDoesNotFireWithSingleMaxBy()
     {
         tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
                 .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
                 .on(p -> p.aggregation(af -> {
                     p.variable("v1", BIGINT);
                     p.variable("k", BIGINT);
                     af.globalGrouping()
                             .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(v1, k)"))
                             .source(p.values(p.variable("v1"), p.variable("k")));
                 }))
                 .doesNotFire();
     }

     @Test
     public void testMergesDistinctMaxByWithSameKeyAndFilter()
     {
         tester().assertThat(new MergeMaxByAggregations(getMetadata().getFunctionAndTypeManager()))
                 .setSystemProperty(MERGE_MAX_BY_AGGREGATIONS, "true")
                 .on(p -> p.aggregation(af -> {
                     p.variable("v1", BIGINT);
                     p.variable("v2", BIGINT);
                     p.variable("k", BIGINT);
                     p.variable("f", BOOLEAN);

                     af.globalGrouping()
                             // both DISTINCT, same key and same filter
                             .addAggregation(p.variable("max_v1"), p.rowExpression("max_by(DISTINCT v1, k) FILTER (WHERE f)"))
                             .addAggregation(p.variable("max_v2"), p.rowExpression("max_by(DISTINCT v2, k) FILTER (WHERE f)"))
                             .source(p.values(p.variable("v1"), p.variable("v2"), p.variable("k"), p.variable("f")));
                 }))
                 .matches(
                         // Expect a single DISTINCT max_by over a ROW(v1, v2) and key k
                         project(
                                 node(AggregationNode.class,
                                         globalAggregation(
                                                 singleGroupingSet(),
                                                 ImmutableMap.of(
                                                         Optional.empty(),
                                                         aggregation(
                                                                 "max_by",
                                                                 ImmutableList.of(
                                                                         rowConstructor(column("v1"), column("v2")),
                                                                         column("k")))
                                                                 .distinct()))))));
     }


```

Depending on the exact PlanMatchPattern APIs and existing conventions in this test class, you might need to:
1. Adjust the `singleGroupingSet()` call in the matcher to explicitly specify the grouping key symbol (for example, `singleGroupingSet("k")` or `singleGroupingSet(column("k"))`) to match how other tests in this file describe grouping sets.
2. Confirm that `Optional.empty()` is an acceptable key in the `ImmutableMap` used by `globalAggregation`. If other tests key aggregations by symbol name (e.g., `Optional.of("max")`), align with that convention and use the appropriate alias.
3. If other tests use a different style for plan matching (for example, they use `aggregation(...)` directly inside `node(AggregationNode.class, ...)` without `globalAggregation(...)`), rewrite the `.matches(...)` section to mirror that pattern while still asserting:
   - there is exactly one `max_by` aggregation,
   - its first argument is `ROW(v1, v2)`,
   - its second argument is `k`,
   - and it is `.distinct()`.
These tweaks will make the new positive DISTINCT test consistent with the rest of the file and ensure it remains robust against minor plan-shape differences.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +255 to +257
private final Optional<RowExpression> filter;
private final Optional<VariableReferenceExpression> mask;
private final boolean isDistinct;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Consider including ORDER BY in MaxByKey to avoid merging aggregations with different ordering.

MaxByKey currently keys aggregations by comparisonKey, filter, mask, and distinct, but not Aggregation.getOrderBy(). The merged aggregation then uses firstAggregation.getOrderBy(). If two MAX_BY aggregations share those fields but differ in ORDER BY, they will be merged and all will effectively use the first aggregation’s ORDER BY, changing semantics. If this can occur, either add orderBy to MaxByKey’s key fields or explicitly reject non-empty/differing ORDER BY to avoid incorrect rewrites.

@kaikalur kaikalur left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add MIN_BY in the same PR. They work the same way. And if they have the 3rd arg, make sure to merge only those with the same 3rd arg.

Enable optimization to combine multiple :func:`!approx_distinct` function calls on expressions
of the same type into a single aggregation using ``set_agg`` with array operations (``array_constructor``, ``array_transpose``).

``optimizer.merge_max_by_aggregations``

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also work the same way for min_by aggs. Change the option name to: ...max_by_and_min_by_...

steveburnett
steveburnett previously approved these changes Jan 8, 2026

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local doc build, looks good. Thank you!

…mbine multiple max_by with same second argument, controlled by session parameter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not need this. Once you check the name, just the function is different. Everything else is same. So just reuse the same class and call it MinByAndMaxBy. Look at the other optimization rewrite_min_max_by_to_top_n for ideas on simplifying this code.

feilong-liu pushed a commit that referenced this pull request Mar 27, 2026
…comparison key (#27417)

## Description
Adds a new `MergeMinMaxByAggregations` optimizer rule that merges
multiple `max_by` or `min_by` aggregations sharing the same comparison
key into a single aggregation with a `ROW` argument and `DEREFERENCE`
projections.

This is a takeover of #26873, addressing all reviewer feedback from
@kaikalur and Sourcery.

### Key Changes
- Unified `MergeMinMaxByAggregations` class handling both `max_by` and
`min_by` (per kaikalur review)
- Session property renamed to `merge_max_by_and_min_by_aggregations`
(per kaikalur review)
- `AggregationKey` includes `orderBy` to prevent merging aggregations
with different ordering (per Sourcery review)
- Instance-based pattern using `FunctionResolution` instead of string
comparison (per Sourcery review)
- `LinkedHashMap` for deterministic plan ordering (per Sourcery review)
- 17 unit tests + 10 e2e integration tests

### Example
```sql
-- Before optimization:
SELECT max_by(v1, k), max_by(v2, k), max_by(v3, k) FROM t

-- After optimization (internal rewrite):
SELECT merged[0], merged[1], merged[2]
FROM (SELECT max_by(ROW(v1, v2, v3), k) AS merged FROM t)
```

## Motivation and Context
When a query contains multiple `max_by`/`min_by` aggregations with the
same comparison key, each independently scans and compares values. By
merging into a single `max_by(ROW(...), key)` call:
1. **Reduced computation**: One comparison per row instead of N
2. **Better memory efficiency**: Single aggregation state instead of N
separate states
3. **Simpler plan**: Fewer aggregation function calls

## Impact
- New session property `merge_max_by_and_min_by_aggregations` (boolean,
default `false`)
- No breaking changes — optimization is disabled by default

## Test Plan
- 17 unit tests in `TestMergeMinMaxByAggregations` covering positive
rewrites (same key, GROUP BY, mixed aggs, multiple groups, both
max_by+min_by, matching filters, different value types, DISTINCT) and
negative cases (single agg, different keys, disabled, 3-arg, no
max_by/min_by, different filters, mixed distinct)
- 10 e2e integration tests in
`AbstractTestQueries#testMergeMinMaxByAggregations` comparing results
with optimization enabled vs disabled
- All 523 TestLocalQueries pass

## Contributor checklist
- [x] Complies with contributing guide
- [x] PR description addresses the issue accurately
- [x] Documented new session property
- [x] Adequate tests added
- [x] CI passed locally

## Release Notes
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery

Introduce an optimizer rule to merge compatible max_by and min_by
aggregations sharing the same comparison key into a single aggregation
and wire it behind a configurable session property.

New Features:
- Add MergeMinMaxByAggregations rule to rewrite multiple max_by or
min_by aggregations with the same comparison key into a single ROW-based
aggregation.
- Introduce the merge_max_by_and_min_by_aggregations session property
and corresponding features configuration flag.

Documentation:
- Document the new merge_max_by_and_min_by_aggregations session property
in the admin session properties documentation.

Tests:
- Add rule unit tests covering positive and negative cases for merging
max_by/min_by aggregations, including filters, GROUP BY, DISTINCT, and
mixed types.
- Add integration tests ensuring queries produce identical results with
the merge_max_by_and_min_by_aggregations optimization enabled or
disabled.
@kaikalur

Copy link
Copy Markdown
Contributor

Closing as abandoned — superseded by #27417 which was merged.

@kaikalur kaikalur closed this Apr 24, 2026
msmygit pushed a commit to msmygit/presto that referenced this pull request Jun 3, 2026
…comparison key (prestodb#27417)

## Description
Adds a new `MergeMinMaxByAggregations` optimizer rule that merges
multiple `max_by` or `min_by` aggregations sharing the same comparison
key into a single aggregation with a `ROW` argument and `DEREFERENCE`
projections.

This is a takeover of prestodb#26873, addressing all reviewer feedback from
@kaikalur and Sourcery.

### Key Changes
- Unified `MergeMinMaxByAggregations` class handling both `max_by` and
`min_by` (per kaikalur review)
- Session property renamed to `merge_max_by_and_min_by_aggregations`
(per kaikalur review)
- `AggregationKey` includes `orderBy` to prevent merging aggregations
with different ordering (per Sourcery review)
- Instance-based pattern using `FunctionResolution` instead of string
comparison (per Sourcery review)
- `LinkedHashMap` for deterministic plan ordering (per Sourcery review)
- 17 unit tests + 10 e2e integration tests

### Example
```sql
-- Before optimization:
SELECT max_by(v1, k), max_by(v2, k), max_by(v3, k) FROM t

-- After optimization (internal rewrite):
SELECT merged[0], merged[1], merged[2]
FROM (SELECT max_by(ROW(v1, v2, v3), k) AS merged FROM t)
```

## Motivation and Context
When a query contains multiple `max_by`/`min_by` aggregations with the
same comparison key, each independently scans and compares values. By
merging into a single `max_by(ROW(...), key)` call:
1. **Reduced computation**: One comparison per row instead of N
2. **Better memory efficiency**: Single aggregation state instead of N
separate states
3. **Simpler plan**: Fewer aggregation function calls

## Impact
- New session property `merge_max_by_and_min_by_aggregations` (boolean,
default `false`)
- No breaking changes — optimization is disabled by default

## Test Plan
- 17 unit tests in `TestMergeMinMaxByAggregations` covering positive
rewrites (same key, GROUP BY, mixed aggs, multiple groups, both
max_by+min_by, matching filters, different value types, DISTINCT) and
negative cases (single agg, different keys, disabled, 3-arg, no
max_by/min_by, different filters, mixed distinct)
- 10 e2e integration tests in
`AbstractTestQueries#testMergeMinMaxByAggregations` comparing results
with optimization enabled vs disabled
- All 523 TestLocalQueries pass

## Contributor checklist
- [x] Complies with contributing guide
- [x] PR description addresses the issue accurately
- [x] Documented new session property
- [x] Adequate tests added
- [x] CI passed locally

## Release Notes
```
== NO RELEASE NOTE ==
```

## Summary by Sourcery

Introduce an optimizer rule to merge compatible max_by and min_by
aggregations sharing the same comparison key into a single aggregation
and wire it behind a configurable session property.

New Features:
- Add MergeMinMaxByAggregations rule to rewrite multiple max_by or
min_by aggregations with the same comparison key into a single ROW-based
aggregation.
- Introduce the merge_max_by_and_min_by_aggregations session property
and corresponding features configuration flag.

Documentation:
- Document the new merge_max_by_and_min_by_aggregations session property
in the admin session properties documentation.

Tests:
- Add rule unit tests covering positive and negative cases for merging
max_by/min_by aggregations, including filters, GROUP BY, DISTINCT, and
mixed types.
- Add integration tests ensuring queries produce identical results with
the merge_max_by_and_min_by_aggregations optimization enabled or
disabled.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants