[VL] date_format returns wrong results #5524
Description
Backend
VL (Velox)
Bug description
Velox evaluates date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM') to '12345-01', whereas vanilla Spark evaluates the same expression to '+12345-01'. This is a problem because unix_timestamp in vanilla Spark only accepts '+12345-01'. If date_format is executed in Velox and its result is later passed to unix_timestamp in vanilla Spark, the query fails.
// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")
// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
// +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>
// Use collect() instead of show(), as show() makes the function run in vanilla Spark in Spark 3.5 due to the inserted ToPrettyString function.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])
spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
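The mismatch can also be reproduced outside Spark with plain java.time (a minimal sketch; the literals mirror the Velox and vanilla Spark outputs above, and Spark's internal formatter adds its own wrapping, but the strictness comes from java.time):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
import scala.util.Try

val fmt = DateTimeFormatter.ofPattern("yyyy-MM")

// Formatting: java.time (and hence vanilla Spark) prefixes a '+' for
// five-digit years; the Velox result above lacks it.
val formatted = fmt.format(LocalDate.of(12345, 1, 1))
// formatted == "+12345-01"

// The signed form round-trips through the same pattern...
val year = fmt.parse(formatted).get(ChronoField.YEAR)
// year == 12345

// ...but the unsigned form produced by Velox does not parse, which is
// what surfaces as the SparkUpgradeException in the repro above.
val unsignedParses = Try(fmt.parse("12345-01")).isSuccess
// unsignedParses == false
```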
// ...
Spark uses java.time.format.DateTimeFormatter for date_format:
import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter
DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
// "+12345"
OpenJDK 1.8.0_402, 11.0.22, and 21.0.2 all behave the same. The sign is not documented for the class in general, but for some of the predefined formatter constants the Javadoc notes that years outside the range 0000-9999 are prefixed with a positive or negative symbol.
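The '+' comes from the sign style that the pattern letters map to: per the DateTimeFormatter pattern documentation, four or more 'y' letters are equivalent to appendValue(YEAR_OF_ERA, width, 19, SignStyle.EXCEEDS_PAD), and EXCEEDS_PAD emits a sign once the value needs more digits than the pad width. A sketch spelling that out with the builder:

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatterBuilder, SignStyle}
import java.time.temporal.ChronoField

// Explicit equivalent of the 'yyyy' pattern letters: pad width 4,
// up to 19 digits, sign printed only when the pad width is exceeded.
val explicit = new DateTimeFormatterBuilder()
  .appendValue(ChronoField.YEAR_OF_ERA, 4, 19, SignStyle.EXCEEDS_PAD)
  .toFormatter

explicit.format(LocalDate.of(2024, 1, 1))   // "2024": fits in the pad width, no sign
explicit.format(LocalDate.of(12345, 1, 1))  // "+12345": sign added once width 4 is exceeded
```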
Five-digit years should be extremely rare in real-world applications, but this is breaking Delta unit tests.
The issue occurs with Spark 3.4.2 and 3.5.1; older versions were not checked.
Spark version
None
Spark configurations
spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g
System information
Velox System Info v0.0.2
Commit: 45dc46a
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
No response