
Conversation

@yuanlihan
Contributor

Currently, we support loading data from Parquet files, but we cannot parse partition columns from the file paths and cannot recursively list all files under the input base path.

This patch can discover and infer partitioning information under the input base path, as in Spark. It recursively lists all files under the base path and parses partition columns based on the base path when needed.

This patch parses partition columns in BrokerScanNode.java and saves the parsing result for each file path as a property of TBrokerRangeDesc, so that the parquet_reader in the BE can read the value of a specified partition column.
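The idea, roughly, is to treat every key=value segment of a file path below the base path as a partition column. Here is a minimal Java sketch of that parsing step; class and method names are hypothetical, not the actual BrokerScanNode.java code:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PartitionPathParser {

        // Extract key=value partition columns from the part of filePath below basePath.
        // e.g. basePath = "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir" and
        //      filePath = basePath + "/city=beijing/utc_date=2019-06-26/part-0.parquet"
        //      yields {city=beijing, utc_date=2019-06-26}.
        public static Map<String, String> parsePartitionColumns(String basePath, String filePath) {
            Map<String, String> columns = new LinkedHashMap<>();
            if (!filePath.startsWith(basePath)) {
                return columns; // the file is not under the base path; nothing to infer
            }
            String relative = filePath.substring(basePath.length());
            for (String segment : relative.split("/")) {
                int eq = segment.indexOf('=');
                if (eq > 0) {
                    columns.put(segment.substring(0, eq), segment.substring(eq + 1));
                }
            }
            return columns;
        }
    }

Each parsed map would then be attached to the corresponding TBrokerRangeDesc so that the BE reader can fill in the partition column values.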

yuanlihan and others added 2 commits August 1, 2019 00:58
…#1517)

Doris supports deploying multiple BEs on one host. So when allocating BEs for the replicas of
a tablet, we should select different hosts. But there is a bug in the tablet scheduler
where the same host may be selected for one tablet. This patch fixes this problem.

There are some places related to this problem:

1. Create Table
    There is no bug in the Create Table process.

2. Tablet Scheduler
    Fixed when selecting BEs for REPLICA_MISSING and REPLICA_RELOCATING.
    Fixed when balancing tablets.

3. Colocate Table Balancer
    Fixed when selecting BEs for repairing the colocate backend sequence.
    Not fixed in colocate group balance; left to colocate repairing.

4. Tablet report
    Tablet report may add a replica to the catalog. I did not check the host here;
    the Tablet Scheduler will fix it.
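
As a rough illustration of the host-deduplication rule (the names below are hypothetical, not the actual tablet scheduler code), BE selection has to skip candidates whose host already holds a replica of the tablet:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class HostAwareBeSelector {

        static class Backend {
            final long id;
            final String host;
            Backend(long id, String host) { this.id = id; this.host = host; }
        }

        // Pick up to replicaNum backends such that no two share the same host.
        static List<Backend> selectBackends(List<Backend> candidates, int replicaNum) {
            Set<String> usedHosts = new HashSet<>();
            List<Backend> chosen = new ArrayList<>();
            for (Backend be : candidates) {
                if (chosen.size() == replicaNum) {
                    break;
                }
                // add() returns false if this host was already used, so that BE is skipped.
                if (usedHosts.add(be.host)) {
                    chosen.add(be);
                }
            }
            return chosen;
        }
    }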
@imay
Contributor

imay commented Aug 1, 2019

@yuanlihan

Firstly, thanks for your improvement. I have some questions about this PR.

  1. Should we add an option to switch this feature off? It fills partition column values automatically, which users should be aware of.

  2. Can this patch work with other file formats, like CSV? And how can users use this feature?

  3. How does this feature work with other load features, like user-specified columns and SET operations? For example, what if the partition column name differs between the Doris table and Spark?

@yuanlihan
Contributor Author

@yuanlihan

Firstly, thanks for your improvement. I have some questions about this PR.

  1. Should we add an option to switch this feature off? It fills partition column values automatically, which users should be aware of.
  2. Can this patch work with other file formats, like CSV? And how can users use this feature?
  3. How does this feature work with other load features, like user-specified columns and SET operations? For example, what if the partition column name differs between the Doris table and Spark?

@imay
Thanks for your review.

  1. In my opinion, it is quite intuitive to enable partition discovery, as well as recursive file listing, by default. We extract partition columns only if they are defined/needed in the specified table.
  2. We can support partition discovery for other file formats (including Text/CSV/JSON/ORC/Parquet), as in Spark. Maybe users will prefer having this feature enabled by default, since it rarely conflicts with their previous usage.
  3. I will look into this further and try to solve it.

imay and others added 4 commits August 3, 2019 22:49
The TabletQuorumFailedException is thrown in commitTxn when the number of successful replicas of a tablet is less than the quorum replica num.
Hadoop load does not handle this exception because the push task will retry later.
Broker load, insert, stream load, and mini load will catch this exception and abort the txn afterwards.
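
A hedged sketch of the two behaviors described above, with illustrative names rather than the actual transaction-manager code:

    public class QuorumCommitSketch {

        static class TabletQuorumFailedException extends Exception {
            TabletQuorumFailedException(String msg) { super(msg); }
        }

        // commitTxn side: fail if the successful replicas of a tablet are below quorum.
        static void checkQuorum(long tabletId, int successReplicaNum, int replicationNum)
                throws TabletQuorumFailedException {
            int quorum = replicationNum / 2 + 1;
            if (successReplicaNum < quorum) {
                throw new TabletQuorumFailedException(
                        "tablet " + tabletId + ": " + successReplicaNum + " success < quorum " + quorum);
            }
        }

        // Broker/insert/stream/mini load side: catch the exception and abort the txn.
        // (Hadoop load would instead leave the txn alone and let the push task retry.)
        static void commitOrAbort(long txnId, long tabletId, int successReplicaNum, int replicationNum) {
            try {
                checkQuorum(tabletId, successReplicaNum, replicationNum);
            } catch (TabletQuorumFailedException e) {
                abortTransaction(txnId, e.getMessage());
            }
        }

        static void abortTransaction(long txnId, String reason) {
            System.out.println("abort txn " + txnId + ": " + reason);
        }
    }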
@yuanlihan yuanlihan changed the title Enable Partition Discovery When Loading Data from Parquet File Enable Partition Discovery for Broker Load Aug 4, 2019
@yuanlihan
Contributor Author

Add Partition Discovery as a feature of Broker Load.

For the usage of this feature, see the diffs of the related docs.

1. Without specifying a base path (BASE_PATH) for Partition Discovery
LOAD LABEL example_db.label10
(
DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir")
Contributor

Allowing a directory to be specified directly for import like this conflicts with our previous default behavior. If a user makes a mistake, the whole directory may get imported.
To avoid this, instead of DATA INFILE, we could use something like DATA INDIR to specify importing a directory.

INTO TABLE `my_table`
FORMAT AS "csv"
BASE_PATH AS "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
(k1, k2, k3, utc_date, city)
Contributor

The columns here usually represent the column names contained in the files; adding utc_date and city may confuse users.
I suggest not declaring partition column names in this place.

DATA INFILE("hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26")
INTO TABLE `my_table`
FORMAT AS "csv"
BASE_PATH AS "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
Contributor

Could we add a partition column declaration, so that users clearly know what they are doing?

@imay
Contributor

imay commented Aug 5, 2019

Add Partition Discovery as a feature of Broker Load.

For the usage of this feature, see the diffs of the related docs.

@yuanlihan
Because this patch will change the load interface, I think it's better to discuss the new interface before implementation.

@yuanlihan
Contributor Author

Add Partition Discovery as a feature of Broker Load.
For the usage of this feature, see the diffs of the related docs.

@yuanlihan
Because this patch will change the load interface, I think it's better to discuss the new interface before implementation.

@imay
OK. I will create an issue to discuss the interface design of this feature before updating additional diffs.

@yuanlihan
Contributor Author

@imay I have added an issue about this: Enable Partition Discovery for Broker Load

morningman and others added 15 commits August 5, 2019 16:16
When creating a table with the OLAP engine, users can specify multiple partition columns.
eg:

PARTITION BY RANGE(`date`, `id`)
(
    PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
    PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
    PARTITION `p201703_all`  VALUES LESS THAN ("2017-04-01")
)

Notice that loading via a Hadoop cluster does not support tables with multiple partition columns.

If there is a redundant replica on a BE whose version is missing,
the tablet report logic cannot drop it correctly.
We create a new segment format for BetaRowset. The new format merges the
data file and the index file into one file. We also create a new format
for the short key index. In the original code, the index is stored in a
RowCursor-like format, which is not efficient to compare. Now we encode
multiple columns into a binary, and we ensure that this binary sorts the
same as the key columns.
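
A hedged sketch of the order-preserving encoding idea in Java (the actual BE implementation is in C++ and handles more types; the names here are illustrative):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class ShortKeyEncoderSketch {

        // Encode an int so that unsigned byte-wise comparison matches signed ordering:
        // flip the sign bit and write big-endian.
        static void encodeInt(ByteArrayOutputStream out, int v) {
            int u = v ^ 0x80000000;
            out.write((u >>> 24) & 0xFF);
            out.write((u >>> 16) & 0xFF);
            out.write((u >>> 8) & 0xFF);
            out.write(u & 0xFF);
        }

        // Append the string bytes plus a 0x00 terminator so shorter prefixes sort first.
        // (A real encoder must also escape embedded 0x00 bytes; omitted for brevity.)
        static void encodeString(ByteArrayOutputStream out, String s) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            out.write(0x00);
        }

        // Concatenating the per-column encodings yields one binary whose byte order
        // agrees with ordering by (k1, k2).
        static byte[] encodeKey(int k1, String k2) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            encodeInt(out, k1);
            encodeString(out, k2);
            return out.toByteArray();
        }
    }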
Help document collation (integration of help and documentation documents)
@yuanlihan yuanlihan closed this Aug 11, 2019
@yuanlihan yuanlihan deleted the partition_discovery_when_loading_parquet_file branch March 16, 2022 04:21
swjtu-zhanglei pushed a commit to swjtu-zhanglei/incubator-doris that referenced this pull request Jul 25, 2023
commit f5d6201
Author: morrySnow <101034200+morrySnow@users.noreply.github.com>
Date:   Wed Mar 22 10:53:15 2023 +0800

    [fix](planner) should always execute projection plan (apache#17885)

    1. Should always execute the projection plan, whatever the statement is.
    2. Should always execute the projection plan, since we only have the vectorized engine now.

commit ad61c84
Author: mch_ucchi <41606806+sohardforaname@users.noreply.github.com>
Date:   Mon Mar 27 17:50:52 2023 +0800

    [fix](planner) fix conjunct planned on exchange node (apache#18042)

    sql like:
    select k5, k6, SUM(k3) AS k3
    from (
        select
            k5,
            date_format(k6, '%Y-%m-%d') as k6,
            count(distinct k3) as k3
        from t
        group by k5, k6
    ) AS temp where 1=1
    group by k5, k6;

    will throw an exception since conjuncts are planned on the exchange node. Because the exchange node cannot handle conjuncts, we now skip exchange nodes when planning conjuncts, which fixes the bug.
    Note: the bug occurs only if the conjunct is always true, like 1=1 above.
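
A hedged sketch of the idea behind that fix (illustrative names, not the actual planner code): when attaching a conjunct, walk past exchange nodes and place it on the first node that can evaluate it.

    import java.util.ArrayList;
    import java.util.List;

    public class ConjunctPlacementSketch {

        static class PlanNode {
            final String type;          // e.g. "EXCHANGE", "AGGREGATION", "SCAN"
            final PlanNode child;
            final List<String> conjuncts = new ArrayList<>();
            PlanNode(String type, PlanNode child) { this.type = type; this.child = child; }
        }

        // Attach the conjunct to the first non-exchange node on the way down,
        // because exchange nodes cannot evaluate conjuncts.
        static void addConjunct(PlanNode node, String conjunct) {
            PlanNode target = node;
            while (target != null && "EXCHANGE".equals(target.type)) {
                target = target.child;
            }
            if (target != null) {
                target.conjuncts.add(conjunct);
            }
        }
    }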