Skip to content

[GLUTEN-6705] [CORE] [Part 1] Avoid adding c2r for ColumnarWriteFilesExec, since it neither output Columnar batch data nor InternalRow#6745

Merged
baibaichen merged 1 commit intoapache:mainfrom
baibaichen:feature/no_c2r_for_col_write
Aug 8, 2024
Merged

Conversation

@baibaichen
Copy link
Copy Markdown
Contributor

@baibaichen baibaichen commented Aug 7, 2024

What changes were proposed in this pull request?

This PR avoids c2r and r2c transition by returning Convention which ouputs both VanillaRow and BatchType for ColumnarWriteFilesExec.

  /**
   * ColumnarWriteFilesExec neither output Row nor columnar data. We output both row and columnar to
   * avoid c2r and r2c transitions. Please note, [[GlutenPlan]] already implement batchType()
   */
  sealed trait ExecuteWriteCompatible extends KnownChildrenConventions with KnownRowType {
   // ...
    override def requiredChildrenConventions(): Seq[ConventionReq] = {
      List(ConventionReq.backendBatch)
    }

    override def rowType(): RowType = {
      RowType.VanillaRow
    }
  }

Why need

ColumnarWriteFilesExec neither outputs Columnar batch data nor InternalRow, and hence we can't add c2r or r2c on top of it, otherwise it would be am internal error.

  1. In the normal code path, DataWritingCommandExec is on top of ColumnarWriteFilesExec and DataWritingCommandExec's Convention is VanillaRow which avoid c2r transition.
  2. In ClickHouse Backend, we are preparing to support native delta write in Spark 3.5 and delta 3.2, WriteFile is directly added on top of delta logical plan and hence the root SparkPlan is ColumnarWriteFilesExec which casue internal error without this change.

Other refactor

  1. Using org.apache.spark.sql.catalyst.util.sideBySide to ouptput changed plan, which is same as spark RuleExecutor.
  2. Improving some readability, case _ if !plan.supportsColumnar => case _ if plan.supportsColumnar
  3. Adding comments for CHBatch.fromRow

(Fixes: #6705)

How was this patch tested?

Existed UTs

@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Aug 7, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 7, 2024

#6705

@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 7, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen requested a review from zhztheplayer August 7, 2024 09:51
@baibaichen baibaichen changed the title [GLUTEN-6705] [CORE] [Part 2] Avoid adding c2r for ColumnarWriteFilesExec, since it neither output Columnar batch data nor InternalRow [GLUTEN-6705] [CORE] [Part 1] Avoid adding c2r for ColumnarWriteFilesExec, since it neither output Columnar batch data nor InternalRow Aug 7, 2024
@philo-he
Copy link
Copy Markdown
Member

philo-he commented Aug 7, 2024

cc @zhztheplayer

@FelixYBW
Copy link
Copy Markdown
Contributor

FelixYBW commented Aug 7, 2024

@JkSelf

Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minors. Thanks.

Comment on lines +95 to +100
case _: ColumnarWriteFilesExec.ExecuteWriteCompatible =>
// ColumnarWriteFilesExec neither output Row nor columnar data.
// We output both row and columnar to avoid c2r and r2c transitions.
Convention.of(
Convention.RowType.VanillaRow,
BackendsApiManager.getSparkPlanExecApiInstance.batchType)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically this list is for overriding conventions of vanilla Spark's operators. Would you use interface KnowBatchType / KnownRowType instead?

Comment on lines +161 to +164
case _: ColumnarWriteFilesExec.ExecuteWriteCompatible =>
ConventionReq.of(
ConventionReq.RowType.Any,
ConventionReq.BatchType.Is(BackendsApiManager.getSparkPlanExecApiInstance.batchType))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same. Use interface KnownChildrenConventions instead.

ConventionReq.of(ConventionReq.RowType.Any, ConventionReq.BatchType.Is(toBatchType)))
}

private def convert(plan: SparkPlan, req: ConventionReq): SparkPlan = {
Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to rename convert which sounds too general. How about enforceReq?

Avoid adding c2r for ColumnarWriteFilesExec, since it neither output Columnar batch data nor InternalRow
@baibaichen baibaichen force-pushed the feature/no_c2r_for_col_write branch from 5930d83 to 7bffb86 Compare August 8, 2024 04:13
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 8, 2024

Run Gluten Clickhouse CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH] Basic Support Write Paruet in Delta

4 participants