[GLUTEN-11088][VL] Fix GlutenParquetIOSuite compatibility issues for Spark 4.0 #11281

Merged
baibaichen merged 7 commits into apache:main from baibaichen:feature/GlutenParquetIOSuite
Dec 17, 2025

@baibaichen
Contributor

@baibaichen baibaichen commented Dec 11, 2025

What changes were proposed in this pull request?

This PR fixes compatibility issues in GlutenParquetIOSuite for Spark 4.0 by addressing the following Spark 4.0 shim layer changes:

  1. Respect mapreduce.output.basename configuration: Updated SparkWriteFilesCommitProtocol to honor the mapreduce.output.basename configuration when generating output file names, aligning with SPARK-49991.

  2. Proper error handling in write operations: Replaced direct exception throwing with GlutenFileFormatWriter.throwWriteError to use Spark's standardized error handling mechanism (QueryExecutionErrors.taskFailedWhileWritingRowsError), aligning with SPARK-45844.

  3. Code quality improvements:

    • Added explicit type annotations to sparkStageId, sparkPartitionId, and sparkAttemptNumber for better type safety
    • Changed fileNames initialization from null to underscore idiom (_) for cleaner Scala style
    • Migrated from deprecated scala.collection.JavaConverters to scala.jdk.CollectionConverters
    • Simplified TextScan instantiation by removing redundant new keyword (applies to Scala 3/case class patterns)

  4. Test coverage: Re-enabled 3 previously excluded tests in VeloxTestSettings:

    • SPARK-49991: Respect 'mapreduce.output.basename' to generate file names
    • SPARK-6330 regression test
    • SPARK-7837 Do not close output writer twice when commitTask() fails
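
The basename handling described in item 1 can be illustrated with a small, self-contained sketch. This is not the code in SparkWriteFilesCommitProtocol: the object `FileNameSketch`, the function `outputFileName`, and the plain `Map` standing in for a Hadoop `Configuration` are all hypothetical. The only detail taken from the description above is that the file name prefix is read from `mapreduce.output.basename`; the `part` fallback and the `basename-split-jobId` layout are assumptions modeled on Spark's familiar `part-00000-<jobId>` file names.

```scala
object FileNameSketch {
  // Configuration key named in the PR description.
  val BasenameKey = "mapreduce.output.basename"

  // Hypothetical helper: builds an output file name from a config map.
  // `conf` stands in for a Hadoop Configuration; `split` is the task's
  // partition index; `jobId` and `suffix` are supplied by the caller.
  def outputFileName(
      conf: Map[String, String],
      split: Int,
      jobId: String,
      suffix: String): String = {
    // Fall back to "part" when no custom basename is configured (assumed default).
    val basename = conf.getOrElse(BasenameKey, "part")
    // Zero-pad the split index to five digits, as in part-00003-....
    f"$basename-$split%05d-$jobId$suffix"
  }
}
```

With an empty map, `FileNameSketch.outputFileName(Map.empty, 3, "abc", ".parquet")` yields `part-00003-abc.parquet`; with `mapreduce.output.basename` set to `custom`, the same call with split `0` yields `custom-00000-abc.parquet`.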

Why are the changes needed?

The Spark 4.0 upgrade introduced breaking changes in the shim layer:

  • The file naming convention now supports custom basename configuration through mapreduce.output.basename
  • Error handling APIs were refactored to use centralized error builders
  • The previous direct exception throwing approach is incompatible with Spark 4.0's error handling framework
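
The shift from throwing exceptions directly to routing them through a single error builder can be sketched as follows. Everything in this block is a hypothetical stand-in: `TaskWriteFailedException`, `WriteErrors.throwWriteError`, and `WriteErrors.writeRows` are illustrative names, and the real change delegates to Spark's QueryExecutionErrors.taskFailedWhileWritingRowsError, whose signature is not reproduced here. The sketch only shows the pattern of wrapping the underlying failure as the cause of one standardized exception type.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for the exception a centralized error builder produces.
final class TaskWriteFailedException(path: String, cause: Throwable)
  extends RuntimeException(s"Task failed while writing rows to $path", cause)

object WriteErrors {
  // Single place where write failures are turned into the standardized error.
  def throwWriteError(path: String, cause: Throwable): Nothing =
    throw new TaskWriteFailedException(path, cause)

  // Runs a write body and funnels any non-fatal failure through throwWriteError,
  // so callers always see the same exception type with the original cause attached.
  def writeRows(path: String)(body: => Unit): Unit =
    try body
    catch {
      case NonFatal(e) => throwWriteError(path, e)
    }
}
```

A test that expects a specific exception type then only needs to match on the wrapper and inspect `getCause` for the original failure.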

Without these changes, GlutenParquetIOSuite tests fail due to:

  1. Incorrect file name generation (missing basename support)
  2. Incompatible exception types when write operations fail
  3. Deprecated Scala collection conversion APIs
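
For the third point, the migration away from the deprecated converters typically amounts to changing one import. The sketch below assumes Scala 2.13 or later; the object `ConvertersSketch` and its two helper methods are hypothetical and exist only to show `asScala` and `asJava` working under the newer import.

```scala
import java.util.{List => JList}
// Replaces the deprecated `import scala.collection.JavaConverters._`.
import scala.jdk.CollectionConverters._

object ConvertersSketch {
  // Java list to an immutable Scala Seq.
  def toScalaNames(javaNames: JList[String]): Seq[String] =
    javaNames.asScala.toSeq

  // Scala Seq to a Java list view.
  def toJavaNames(names: Seq[String]): JList[String] =
    names.asJava
}
```

The extension methods keep the same names as before, so call sites generally stay unchanged once the import is swapped.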

How was this patch tested?

  • Re-enabled and verified all 3 previously excluded tests pass successfully
  • Existing GlutenParquetIOSuite tests continue to pass
  • Validated file naming with custom mapreduce.output.basename configurations
  • Confirmed error handling produces correct exception types and messages

Related Issue

Addresses #11088 (Track on Spark-4.0 failed unit tests)

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Dec 11, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen force-pushed the feature/GlutenParquetIOSuite branch from 59fb290 to 5d2b367 Compare December 13, 2025 06:18
@zhouyuan
Member

@baibaichen
It looks like we should disable the ANSI test below, as Gluten-Velox does not support it yet:

2025-12-13T07:06:56.4659042Z - Throw exceptions on inserting out-of-range int value with ANSI casting policy *** FAILED ***
2025-12-13T07:06:56.4660344Z   Expected exception org.apache.spark.SparkArithmeticException to be thrown, but org.apache.spark.SparkException was thrown (InsertSuite.scala:775)
2025-12-13T07:06:56.4840500Z 07:06:56.483 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project[QueryId=47908], due to: 
2025-12-13T07:06:56.4842112Z  - Validation failed with exception from: ProjectExecTransformer, reason: CheckOverflowInTableInsert is used in ANSI mode, but Gluten does not support ANSI mode.
2025-12-13T07:06:56.4843981Z 

@baibaichen baibaichen force-pushed the feature/GlutenParquetIOSuite branch from 5d2b367 to 4674f9d Compare December 15, 2025 07:59
@baibaichen
Contributor Author

Thanks @zhouyuan. This isn't related to ANSI mode; the issue was introduced by my fix.

@baibaichen baibaichen force-pushed the feature/GlutenParquetIOSuite branch from 4674f9d to 45e4618 Compare December 16, 2025 01:55
@baibaichen baibaichen merged commit ee153ed into apache:main Dec 17, 2025
106 of 107 checks passed
@baibaichen baibaichen deleted the feature/GlutenParquetIOSuite branch December 17, 2025 06:55