Skip to content

Conversation

@JingDas
Copy link
Contributor

@JingDas JingDas commented Sep 28, 2023

Proposed changes

Infer the column name when create view if the column is expression

Further comments

expr column name infer strategy as following:

expr example column name(before) Inferred column name(if position is 2)
function dayofyear() dayofyear() __dayofyear_1
cast cast(1 as bigint) CAST(1 AS BIGINT) __cast_1
anylyticExpr min() min() __min_1
predicate 1 in (1,2,3,4) 1 IN (1, 2, 3, 4) __in_predicate_1
literal 1 or 'string_var_name' 1 or 'string_var_name' __literal_1
arithmeticExpr & ... & ... __arithmetic_expr_1
identifier a or b a or b a or b
case CASE WHEN remark = 's' THEN 1 ELSE 2 END CASE WHEN remark = 's' THEN 1 ELSE 2 END __case_1
window min(timestamp) OVER (...) min(timestamp) OVER(...) __min_1

SQL for example:

CREATE VIEW v1 AS 
SELECT 
  error_code,
  1, 
  'string', 
  now(), 
  dayofyear(op_time), 
  cast (source AS BIGINT), 
  min(`timestamp`) OVER (
    ORDER BY 
      op_time DESC ROWS BETWEEN UNBOUNDED PRECEDING
      AND 1 FOLLOWING
  ), 
  1 > 2,
  2 + 3,
  1 IN (1, 2, 3, 4), 
  remark LIKE '%like', 
  CASE WHEN remark = 's' THEN 1 ELSE 2 END,
  TRUE | FALSE 
FROM 
  db_test.table_test1

the output column name is as following:

error_code
__literal_1
__literal_2
__now_3
__dayofyear_4
__cast_expr_5
__min_6
__binary_predicate_7
__arithmetic_expr_8
__in_predicate_9
__like_predicate_10
__case_expr_11
__arithmetic_expr_12

@JingDas
Copy link
Contributor Author

JingDas commented Sep 28, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.37 seconds
stream load tsv: 565 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162305723 Bytes

@JingDas
Copy link
Contributor Author

JingDas commented Sep 28, 2023

run buildall

@JingDas
Copy link
Contributor Author

JingDas commented Sep 28, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.43 seconds
stream load tsv: 565 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162418654 Bytes


@Override
protected String getExprName() {
return Expression.normalizeColumnName(getFnCall().getExprName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should in org.apache.doris.catalog.Column or under org.apache.doris.common package

Copy link
Contributor Author

@JingDas JingDas Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rethink this method, I think the actual mean of the method is that normalize the name of expression or expr. and
it should be independent to column domain or other. So I think should rename it to normalizeName to match the actual meaning for normalize the expr(expression name). WDYT?

// alias or is not slotRef
protected String getExprName() {
if (StringUtils.isEmpty(this.exprName)) {
return Expression.DEFAULT_COLUMN_NAME;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* Need the position of selectListItem to generate column label
*/
public String toColumnLabel(int position) {
Preconditions.checkState(!isStar());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add check msg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have fixed.

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.33 seconds
stream load tsv: 564 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162234420 Bytes

@JingDas
Copy link
Contributor Author

JingDas commented Sep 28, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.16 seconds
stream load tsv: 564 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162283499 Bytes

@JingDas
Copy link
Contributor Author

JingDas commented Oct 7, 2023

run buildall

@morrySnow
Copy link
Contributor

please replace picture with md's table in PR desc for search and git commit msg friendly

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.94 seconds
stream load tsv: 562 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162214913 Bytes

@JingDas JingDas requested a review from morrySnow October 7, 2023 08:48
@JingDas
Copy link
Contributor Author

JingDas commented Oct 7, 2023

run buildall

@JingDas
Copy link
Contributor Author

JingDas commented Oct 7, 2023

run buildall

@JingDas
Copy link
Contributor Author

JingDas commented Oct 7, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.54 seconds
stream load tsv: 572 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162567345 Bytes

* Abstract class for all Expression in Nereids.
*/
public abstract class Expression extends AbstractTreeNode<Expression> implements ExpressionTrait {
public static final String DEFAULT_EXPRESSION_NAME = "__expression";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is different from Expr.DEFAULT_EXPR_NAME?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expression is for the new optimizer, Expr is for the old representation. Maybe we need to distinguish them by different names.

}

@Override
protected String getExpressionName() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems same as Expression.getExpressionName()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is same to Expression.getExpressionName(), it can be removed. I will fix it.


@Override
protected String getExpressionName() {
return "literal";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as Expression.getExpressionName()?

Copy link
Contributor Author

@JingDas JingDas Oct 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Litreal has some child classes such as FloatLiteral, StringLiteral and so on, I think they can be literal uniformly.
So I overwrite the getExpressionName with return "literal".

@JingDas
Copy link
Contributor Author

JingDas commented Oct 8, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.59 seconds
stream load tsv: 559 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.6 seconds inserted 10000000 Rows, about 349K ops/s
storage size: 17162357738 Bytes

@JingDas
Copy link
Contributor Author

JingDas commented Oct 8, 2023

run buildall

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.64 seconds
stream load tsv: 554 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162370860 Bytes

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 8, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 8, 2023

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 8, 2023

PR approved by anyone and no changes requested.

@morrySnow morrySnow merged commit 263631e into apache:master Oct 9, 2023
morrySnow pushed a commit that referenced this pull request Oct 16, 2023
…izer (#25317)

Disable infer expr column name when query on old optimizer.
This bug is be brought in #24990

if your query SQL is
select id, name, sum(target) FROM db_test.table_test2 group by id, name;
the result column name when query is as following:
|id|name |sum(cast(target as DOUBLE))|

when you create view as following:
CREATE VIEW v1 as select id, name, sum(target) FROM db_test.table_test2 group by id, name;
then query the view v1, the result is as following:
|id|name |__sum_2|
morningman pushed a commit that referenced this pull request Oct 25, 2023
…select into outfile` (#25854)

This pr do two things:
1. Infer the column name if the column is expression in `select into outfile`. The rule for column name generation can be refered in pr: #24990 
2. fix bug that it will core dump if the `_schema` fails to build in the open phase in vorc_transformer.cpp


TODO:
1. Support infer the column name if the column is expression in `select into outfile` in new optimizer(Nereids).
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
…umn is expression (apache#24990)

## Proposed changes

Infer the column name when create view if the column is expression

## Further comments
expr column name infer strategy as following:
|      expr       |                example                    |           column name(before)             | Inferred column name(if position is 2)  |
|  -------------  | ---------------------------------------   | ------------------------------            | --------------------------------------  |
| function        | dayofyear()                               | dayofyear()                               | __dayofyear_1                           |
| cast            | cast(1 as bigint)                         | CAST(1 AS BIGINT)                         | __cast_1                                |
| anylyticExpr    | min()                                     | min()                                     | __min_1                                 |
| predicate       | 1 in (1,2,3,4)                            | 1 IN (1, 2, 3, 4)                         | __in_predicate_1                        |
| literal         | 1 or 'string_var_name'                    | 1 or 'string_var_name'                    | __literal_1                             |
| arithmeticExpr  | &                                         | ... & ...                                 | __arithmetic_expr_1                     |
| identifier      | a or b                                    | a or b                                    | a or b                                  |
| case            | CASE WHEN remark = 's' THEN 1 ELSE 2 END  | CASE WHEN remark = 's' THEN 1 ELSE 2 END  | __case_1                                |
| window          | min(timestamp) OVER (...)                 | min(timestamp) OVER(...)                  | __min_1                                 |


SQL for example:
```sql
CREATE VIEW v1 AS 
SELECT 
  error_code,
  1, 
  'string', 
  now(), 
  dayofyear(op_time), 
  cast (source AS BIGINT), 
  min(`timestamp`) OVER (
    ORDER BY 
      op_time DESC ROWS BETWEEN UNBOUNDED PRECEDING
      AND 1 FOLLOWING
  ), 
  1 > 2,
  2 + 3,
  1 IN (1, 2, 3, 4), 
  remark LIKE '%like', 
  CASE WHEN remark = 's' THEN 1 ELSE 2 END,
  TRUE | FALSE 
FROM 
  db_test.table_test1
```

the output column name is as following:
```
error_code
__literal_1
__literal_2
__now_3
__dayofyear_4
__cast_expr_5
__min_6
__binary_predicate_7
__arithmetic_expr_8
__in_predicate_9
__like_predicate_10
__case_expr_11
__arithmetic_expr_12
```
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
…izer (apache#25317)

Disable infer expr column name when query on old optimizer.
This bug is be brought in apache#24990

if your query SQL is
select id, name, sum(target) FROM db_test.table_test2 group by id, name;
the result column name when query is as following:
|id|name |sum(cast(target as DOUBLE))|

when you create view as following:
CREATE VIEW v1 as select id, name, sum(target) FROM db_test.table_test2 group by id, name;
then query the view v1, the result is as following:
|id|name |__sum_2|
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
…select into outfile` (apache#25854)

This pr do two things:
1. Infer the column name if the column is expression in `select into outfile`. The rule for column name generation can be refered in pr: apache#24990 
2. fix bug that it will core dump if the `_schema` fails to build in the open phase in vorc_transformer.cpp


TODO:
1. Support infer the column name if the column is expression in `select into outfile` in new optimizer(Nereids).
morrySnow pushed a commit that referenced this pull request Nov 8, 2023
Infer name if it is an expression and doesn't alias artificially when create or select stmt in nereids.
The infer name strategy is the same as #24990
seawinde added a commit to seawinde/doris that referenced this pull request Nov 13, 2023
…e#26055)

Infer name if it is an expression and doesn't alias artificially when create or select stmt in nereids.
The infer name strategy is the same as apache#24990
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
…izer (apache#25317)

Disable infer expr column name when query on old optimizer.
This bug is be brought in apache#24990

if your query SQL is
select id, name, sum(target) FROM db_test.table_test2 group by id, name;
the result column name when query is as following:
|id|name |sum(cast(target as DOUBLE))|

when you create view as following:
CREATE VIEW v1 as select id, name, sum(target) FROM db_test.table_test2 group by id, name;
then query the view v1, the result is as following:
|id|name |__sum_2|
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
…select into outfile` (apache#25854)

This pr do two things:
1. Infer the column name if the column is expression in `select into outfile`. The rule for column name generation can be refered in pr: apache#24990 
2. fix bug that it will core dump if the `_schema` fails to build in the open phase in vorc_transformer.cpp


TODO:
1. Support infer the column name if the column is expression in `select into outfile` in new optimizer(Nereids).
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
…e#26055)

Infer name if it is an expression and doesn't alias artificially when create or select stmt in nereids.
The infer name strategy is the same as apache#24990
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants