Skip to content

Refactor ARRAY JOIN#13611

Merged
KochetovNicolai merged 8 commits intomasterfrom
array-join-processor
Aug 20, 2020
Merged

Refactor ARRAY JOIN#13611
KochetovNicolai merged 8 commits intomasterfrom
array-join-processor

Conversation

@KochetovNicolai
Copy link
Copy Markdown
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 11, 2020
@azat
Copy link
Copy Markdown
Member

azat commented Aug 13, 2020

@KochetovNicolai I know about some issues with ARRAY JOIN/arrayJoin, in particular:

  • SELECT *, a_ FROM numbers(1) ARRAY JOIN [[], []] AS a_ GROUP BY number right now it fails with NOT_FOUND_COLUMN_IN_BLOCK (later then it should), but should fail with NOT_AN_AGGREGATE
    The following works as expected:
    • SELECT *, any(a_) FROM numbers(1) ARRAY JOIN [[], []] AS a_ GROUP BY number
    • SELECT *, a_ FROM numbers(1) ARRAY JOIN [[], []] AS a_ GROUP BY number, a_
  • SELECT arrayJoin([[], []]), * FROM numbers(1) GROUP BY number - looks like this should fail with NOT_AN_AGGREGATE

AFAICS you are touching related bits, so will this PR address such things or this is just "code" refactoring?

@KochetovNicolai
Copy link
Copy Markdown
Member Author

@azat it is mainly just "code" refactoring, but I will check if those issues may be fixed easily.

@azat
Copy link
Copy Markdown
Member

azat commented Aug 13, 2020

Thanks (than I will wait your PR to avoid conflicts)

@KochetovNicolai
Copy link
Copy Markdown
Member Author

@azat
I've just checked your queries.
SELECT *, a_ FROM numbers(1) ARRAY JOIN [[], []] AS a_ GROUP BY number fails at the moment we try to calculate header for select. I didn't change this part, and this bug remains. It would be easier to fix in separate pr.
SELECT arrayJoin([[], []]), * FROM numbers(1) GROUP BY number probably shouldn't fail cause arrayJoin may be calculated over aggregation result as well (compare with SELECT now(), * FROM numbers(1) GROUP BY number).

@KochetovNicolai KochetovNicolai marked this pull request as ready for review August 14, 2020 18:25
@azat
Copy link
Copy Markdown
Member

azat commented Aug 14, 2020

SELECT *, a_ FROM numbers(1) ARRAY JOIN [[], []] AS a_ GROUP BY number fails at the moment we try to calculate header for select. I didn't change this part, and this bug remains. It would be easier to fix in separate pr.

Ok, will take a look after this PR will be merged.

SELECT arrayJoin([[], []]), * FROM numbers(1) GROUP BY number probably shouldn't fail cause arrayJoin may be calculated over aggregation result as well (compare with SELECT now(), * FROM numbers(1) GROUP BY number).

Ok, then something should be done with SELECT arrayJoin([[], []]), * FROM numbers(1) GROUP BY number WITH TOTALS which now produces two rows in TOTALS block

case ARRAY_JOIN:
{
array_join->prepare(sample_block);
ColumnWithTypeAndName current = sample_block.getByName(source_name);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to have one-line function here. Why do we need such a complex logic inside of switch?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think one-line functions are more readable.
However, it would be reasonable rewrite the whole function then. I would rather do it next time.

Also, I don't think this logic is complex, especially comparing with previous many-column implementation.

case ARRAY_JOIN:
{
array_join->execute(block, dry_run);
auto source = block.getByName(source_name);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same: why not to move all this logic in separate function?

}

bool depends_on_array_join = false;
for (auto & column : action.getNeededColumns())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cycle is candidate for function extraction: action.dependsOnArrayJoin()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My argument against doing it:
When I see function dependsOnArrayJoin I have no idea what this function actually does. I will need to switch my attention and read it anyway. It's better to read this 3 lines of code here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it is not clear why action should know something about arrayJoin.
As for me, it looks like inserting unnecessary business-logic into simple interface.

/// Each alias has separate dependencies, so we split this action into two parts.
NamesWithAliases split_aliases;
NamesWithAliases depend_aliases;
for (const auto & pair : action.projection)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we extract this cycle into a function action.extractArrayJoinDependentColumns() ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look pretty much reasonable to me.
There is no copy-paste, this code is very context-dependent, and comment to such function will be longer then it's implementation.


struct Step
{
std::variant<ExpressionActionsLink, ArrayJoinLink> link;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step, ArrayJoinLink and ExpressionActionsLink have mostly the same interface. Wouldn't it be better to make a base class instead of variant?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had the same idea.
But I didn't like doing virtual calls even thought if we always know exact type.
So, I did a static interface.
Well, maybe class is more readable.

@4ertus2 4ertus2 self-requested a review August 18, 2020 20:19
@KochetovNicolai
Copy link
Copy Markdown
Member Author

01444_create_table_drop_database_race - was broken in master, fixed
build - broken in master

@KochetovNicolai
Copy link
Copy Markdown
Member Author

Not waiting for perftests.

@KochetovNicolai KochetovNicolai merged commit 7c0fcb2 into master Aug 20, 2020
@KochetovNicolai KochetovNicolai deleted the array-join-processor branch August 20, 2020 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants