Enable Partition Discovery for Broker Load #1569
Conversation
…#1517) Doris supports deploying multiple BEs on one host. So when allocating BEs for the replicas of a tablet, we should select different hosts. But there was a bug in the tablet scheduler where the same host could be selected for one tablet. This patch fixes that problem. The places related to this problem are:
1. Create Table: there is no bug in the Create Table process.
2. Tablet Scheduler: fixed BE selection for REPLICA_MISSING and REPLICA_RELOCATING, and fixed tablet balancing.
3. Colocate Table Balancer: fixed BE selection when repairing the colocate backend sequence. Not fixed in colocate group balancing; that is left to colocate repairing.
4. Tablet Report: a tablet report may add a replica to the catalog, but the host is not checked there; the Tablet Scheduler will fix it.
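The host-aware replica placement described above can be sketched as follows. This is a minimal illustration of the technique, not the actual Doris FE code; the function and field names are hypothetical:

```python
def select_backends(candidates, replica_num):
    """Pick `replica_num` backends for one tablet, skipping any host
    that already holds a replica of the same tablet.

    candidates: list of (backend_id, host) tuples, e.g. two BEs
    deployed on the same host share the same `host` value.
    """
    chosen = []
    used_hosts = set()
    for backend_id, host in candidates:
        if host in used_hosts:
            # This host already holds a replica of this tablet; without
            # this check, the bug described above could place two
            # replicas of one tablet on the same machine.
            continue
        chosen.append(backend_id)
        used_hosts.add(host)
        if len(chosen) == replica_num:
            return chosen
    raise RuntimeError(
        "not enough distinct hosts for %d replicas" % replica_num)
```

With two BEs on host `h1`, only one of them is eligible for a given tablet; the remaining replicas must land on other hosts.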
Firstly, thanks for your improvement. I have some questions about this PR.
@imay
The TabletQuorumFailedException is thrown in commitTxn when the number of successful replicas of a tablet is less than the quorum replica number. Hadoop load does not handle this exception because the push task will be retried later. Broker load, insert, stream load, and mini load catch this exception and abort the transaction afterwards.
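The quorum condition described here can be sketched as follows, assuming a simple majority quorum (more than half of the replicas must succeed). This is an illustration only; the real check lives in the FE transaction code and these names are hypothetical:

```python
def quorum_num(replication_num):
    # Majority quorum: a strict majority of replicas must succeed.
    return replication_num // 2 + 1

def check_commit(success_replicas, replication_num):
    """Raise if too few replicas of a tablet succeeded, mirroring the
    TabletQuorumFailedException path described above (illustrative)."""
    if success_replicas < quorum_num(replication_num):
        raise RuntimeError(
            "quorum failed: %d of %d replicas succeeded, need %d"
            % (success_replicas, replication_num,
               quorum_num(replication_num)))
```

For the common 3-replica case, 2 successful replicas are enough to commit; a single successful replica aborts the transaction (except for Hadoop load, which retries the push task).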
Add Partition Discovery as a feature of Broker Load. For the usage of this feature, see the diffs of the related docs.
1. Do not specify a base path (BASE_PATH) for Partition Discovery:
LOAD LABEL example_db.label10
(
DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir")
Importing by directly specifying a directory like this conflicts with our previous default behavior. If a user makes a mistake, the whole directory might be imported.
To avoid this, instead of DATA INFILE, something like DATA INDIR could be used to indicate that a directory is being imported.
INTO TABLE `my_table`
FORMAT AS "csv"
BASE_PATH AS "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
(k1, k2, k3, utc_date, city)
The columns here usually refer to the column names contained in the file; adding utc_date and city may confuse users.
I suggest not declaring partition column names in this place.
DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26")
INTO TABLE `my_table`
FORMAT AS "csv"
BASE_PATH AS "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
Could we add a partition column declaration, so that users clearly know what they are doing?
@yuanlihan
@imay
@imay I have added an issue about this: Enable Partition Discovery for Broker Load
When creating a table with the OLAP engine, users can specify multiple partition columns, e.g.:
PARTITION BY RANGE(`date`, `id`)
(
PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
PARTITION `p201703_all` VALUES LESS THAN ("2017-04-01")
)
Note that loading via a Hadoop cluster does not support tables with multiple partition columns.
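The multi-column range bounds in the example above are compared tuple-wise (lexicographically): a row belongs to the first partition whose upper bound is strictly greater than the row's partition key. The following sketch illustrates this lookup; it is not Doris code, and the treatment of the shorter bound for `p201703_all` (trailing columns unbounded below) is an assumption:

```python
# Partitions from the CREATE TABLE example, as (name, exclusive upper
# bound) pairs. The second column is modeled as an int, matching the
# numeric comparison a real engine would use for an int column.
partitions = [
    ("p201701_1000", ("2017-02-01", 1000)),
    ("p201702_2000", ("2017-03-01", 2000)),
    # Shorter bound: only the first column is constrained; trailing
    # columns are assumed unbounded below (illustrative assumption).
    ("p201703_all",  ("2017-04-01",)),
]

def find_partition(key):
    """Return the first partition whose upper bound exceeds `key`,
    comparing only as many leading columns as the bound specifies."""
    for name, upper in partitions:
        if key[:len(upper)] < upper:
            return name
    raise ValueError("no partition covers key %r" % (key,))
```

For instance, `("2017-02-15", 1500)` is not below `("2017-02-01", 1000)` (its date is later), but is below `("2017-03-01", 2000)`, so it lands in `p201702_2000`.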
We create a new segment format for BetaRowset. The new format merges the data file and the index file into one file. We also create a new format for the short key index. In the original code, the index was stored in a RowCursor-like format, which is inefficient to compare. Now we encode multiple columns into a binary value, and we ensure that this binary sorts the same way as the key columns.
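The idea of encoding multiple key columns into one byte string whose memcmp order matches the column order ("memcomparable" encoding) can be sketched as follows. This illustrates the general technique only, not Doris's actual short key index encoding:

```python
import struct

def encode_key(int_val, str_val):
    """Encode an (int32, string) key so that plain byte comparison of
    the results matches comparison of the original column values."""
    # Flip the sign bit so signed 32-bit ints compare correctly when
    # their big-endian bytes are compared as unsigned.
    int_part = struct.pack(">I", (int_val + 2**31) % 2**32)
    # Terminate the string with 0x00 so a shorter string sorts before
    # any longer string sharing its prefix (assumes the data itself
    # contains no NUL bytes).
    return int_part + str_val.encode("utf-8") + b"\x00"
```

Because the encoded keys compare correctly with a single memcmp, the index no longer needs to decode and compare column by column, which is the efficiency gain described above.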
Organize the help documents (integrate the help docs and the documentation).
commit f5d6201
Author: morrySnow <101034200+morrySnow@users.noreply.github.com>
Date: Wed Mar 22 10:53:15 2023 +0800
[fix](planner) should always execute projection plan (apache#17885)
1. should always execute the projection plan, whatever the statement is
2. should always execute the projection plan, since we only have the vectorized engine now

commit ad61c84
Author: mch_ucchi <41606806+sohardforaname@users.noreply.github.com>
Date: Mon Mar 27 17:50:52 2023 +0800
[fix](planner) fix conjunct planned on exchange node (apache#18042)
SQL like:
select k5, k6, SUM(k3) AS k3
from (
    select k5, date_format(k6, '%Y-%m-%d') as k6, count(distinct k3) as k3
    from t group by k5, k6
) AS temp
where 1=1
group by k5, k6;
would throw an exception because conjuncts were planned on the exchange node, and the exchange node cannot handle conjuncts. Now we skip the exchange node when planning conjuncts, which fixes the bug. Note: the bug occurs iff the conjunct is always true, like 1=1 above.
Currently, we support loading data from Parquet files, but we can neither parse partition columns from the path of a Parquet file nor recursively list all files under the input base path.
This patch can discover and infer partitioning information under the input base path, as Spark does. It recursively lists all files under the base path and parses partition columns relative to the base path when needed.
This patch parses partition columns in BrokerScanNode.java and saves the parsing result for each file path as a property of TBrokerRangeDesc, so that the parquet_reader in BE can read the value of a specified partition column.
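Hive-style partition parsing of this kind (splitting the path below the base path into `key=value` directory segments) can be sketched as follows. This is an illustrative helper, not the actual BrokerScanNode.java logic; the function name is hypothetical:

```python
def parse_partition_columns(file_path, base_path):
    """Extract partition columns from the part of `file_path` below
    `base_path`, assuming Hive-style `key=value` directory names,
    e.g. .../dir/city=beijing/utc_date=2019-06-26/part-0.parquet."""
    rel = file_path[len(base_path):].strip("/")
    columns = {}
    # The last segment is the file name; every key=value directory
    # segment above it contributes one partition column.
    for segment in rel.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            columns[key] = value
    return columns
```

Each file's parsed columns could then be attached to its scan range (as the patch does via TBrokerRangeDesc), letting the reader fill in partition column values that do not exist inside the Parquet file itself.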