Skip to content

Support expressions with window functions#37848

Merged
novikd merged 15 commits intomasterfrom
window-function-expression
Jun 20, 2022
Merged

Support expressions with window functions#37848
novikd merged 15 commits intomasterfrom
window-function-expression

Conversation

@novikd
Copy link
Copy Markdown
Member

@novikd novikd commented Jun 4, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support expressions with window functions. Closes #19857

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-improvement Pull request with some product improvements label Jun 4, 2022
@novikd
Copy link
Copy Markdown
Member Author

novikd commented Jun 10, 2022

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 10, 2022

update

✅ Branch has been successfully updated

@novikd novikd marked this pull request as ready for review June 16, 2022 13:42
@vdimir vdimir self-assigned this Jun 17, 2022
template <typename Visitor, typename T = ASTPtr>
struct InDepthNodeVisitorWithChildInfo : Visitor
{
using ChildInfo = typename Visitor::ChildInfo;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need special visitor for this, can't we use regular one and store the info in the data?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want explicitly show that:

  1. Node processing result depends on the results of children nodes.
  2. Children are processed completely independently. Data represents global visitor values.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, GetAggregatesVisitor could be implemented in mode efficient way using this approach.
Instead of traversing query AST for several times in getAggregates and getWindowFunctions functions, we can collect functions and check correctness on the fly if we know independent results for child nodes.

public:
using Visitor = ConstInDepthNodeVisitor<ActionsMatcher, true>;

enum class WindowDependancyState
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just bool depends_on_window ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not affecting performance in any way, but IMHO makes code more readable. This variable describes the current state of the visitor, so enum represents it more straightforwardly than boolean.

data.window_function_in_subtree = true;
return;
}
else if (node.compute_after_window_functions)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this flags subtree_contains_window_call, compute_after_window_functions, window_function_in_subtree, window_dependancy_state are required just to change the order of visiting the nodes, are they? It looks like this PR's trickiest part, but it just does visit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the most important part of this PR. Let's consider some general expression with a window function:

f(g(arg) OVER window_name)

Evaluation of this expression must be split to 3 separate parts:

  1. Evaluate all arguments of window functions (arg in the example)
  2. Evaluate window functions. It's important to notice, that any window function can't have another window function as an argument. (In the example it's g(arg) OVER window_name)
  3. Evaluate expressions with window functions (in the example we will computef call value). We can do this way because we are sure that f is not:
    • Aggregate function (aggregation is executed before window functions, so I can't have a window function as an argument);
    • Window function.

In other words, here we have some kind of topological sort.

ExpressionActionsChain::Step & step = chain.lastStep(columns_after_window);
for (const auto & expression : syntax->expressions_with_window_function)
{
getRootActionsForWindowFunctions(expression->clone(), true, step.actions());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, generally, most of the work is done here. Have I got right that it's the part that drives the calculation of expressions for window functions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here we create actions for expression evaluation.


bool is_window_function = false;

bool compute_after_window_functions = false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AST Looks like not the best place for this flag, but we already have is_window_function here, so it seems now there's no easy way to get rid of it. And we need actually to determine which expressions are related to the window in WindowExpressionsCollectorChildInfo that also operates with AST, though

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, it's completely okay to cache some useful data in AST. Instead of that we usually recompute the same values several times.

The only part that should be considered is that we don't want to make AST nodes too big. In this case, it's not a problem, because I just added a simple boolean flag (in the opposite situation we could store a pointer to a struct with cached values).

@novikd
Copy link
Copy Markdown
Member Author

novikd commented Jun 19, 2022

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 19, 2022

update

✅ Branch has been successfully updated

@novikd
Copy link
Copy Markdown
Member Author

novikd commented Jun 20, 2022

Stateless tests (release, s3 storage, actions) — fail: 2, passed: 4148, skipped: 53

Unrelated.

@vpanfilov
Copy link
Copy Markdown

@novikd, thanks for working on this functionality.
I tested this feature on unstable version 22.7.1.710. Simple expressions like:

SELECT sum(number) OVER() + 1 AS result FROM numbers(10);

SELECT arraySum(groupArray(number) OVER()) AS result FROM numbers(10);

SELECT sum(number) OVER() + avg(number) OVER() AS result FROM numbers(10);

now works.

But it is not working for array operations with lambda expressions:

SELECT arrayMap(x -> x + 1, groupArray(number) OVER()) AS result FROM numbers(10);

Code: 183. DB::Exception: Unexpected lambda expression: While processing arrayMap(x -> (x + 1), groupArray(number) OVER ()) AS result. (UNEXPECTED_EXPRESSION) (version 22.7.1.710 (official build))

And not working in case of nested window functions:

SELECT sum(avg(number) OVER()) OVER() AS result FROM numbers(10);

Code: 184. DB::Exception: Window function avg(number) OVER () is found inside another window function in query: While processing avg(number) OVER (). (ILLEGAL_AGGREGATION) (version 22.7.1.710 (official build))

@UnamedRus
Copy link
Copy Markdown
Contributor

And not working in case of nested window functions:

It's possible to have nested WINDOW functions?

@novikd
Copy link
Copy Markdown
Member Author

novikd commented Jun 27, 2022

@vpanfilov Thank you for testing the feature. I'll take a look at higher-order functions.

And not working in case of nested window functions:

SQL standard defines nested window functions as a completely different entity. The query you wrote is not supposed to work. If you want, you can use GROUP BY to fix this query.

Nested window functions specified in the SQL standard require specific <row_marker> usage. As far as I know, it's not been supported in ClickHouse yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SUM(col) / SUM(SUM(col)) OVER (PARTITION BY col2) aggregate function over WINDOW is not supported.

5 participants