[Go][Parquet] Arrow DATE64 type is coerced into Parquet TIMESTAMP[ms] logical type instead of DATE (32-bit) #39456
Description
Describe the bug, including details regarding any error messages, version, and platform.
Per the spec, the Parquet DATE logical type must annotate an int32 representing days since the UNIX epoch. The Arrow DATE64 type (milliseconds since the UNIX epoch) has no direct analog in Parquet, so it must be coerced into a compatible representation when writing Arrow data to Parquet.
The prevailing convention is to coerce DATE64 to int32 days since the UNIX epoch (Parquet DATE logical type) [e.g. C++, Rust]. The behavior for an int64 value that does not fall on a date boundary (i.e. is not divisible by 86400000) is not defined. Some implementations validate this condition, while others truncate to the date that the physical value falls within.
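To make the two behaviors concrete, here is a minimal sketch of the arithmetic in plain Go. It does not use the Arrow or Parquet Go APIs; the function names `date64ToDate32Truncate` and `date64ToDate32Strict` are hypothetical and exist only for illustration, and the choice to floor pre-epoch values (rather than round toward zero) is an assumption about what "truncate to the date" should mean for negative values.

```go
package main

import "fmt"

// millisPerDay is the number of milliseconds in one UTC day.
const millisPerDay int64 = 86_400_000

// date64ToDate32Truncate (hypothetical helper) maps milliseconds since the
// UNIX epoch to days since the UNIX epoch, keeping the date that the value
// falls within. The int32 conversion assumes the day count fits in 32 bits.
func date64ToDate32Truncate(ms int64) int32 {
	days := ms / millisPerDay
	// Go integer division rounds toward zero, so adjust pre-epoch values
	// that are not on a day boundary to get floor semantics.
	if ms%millisPerDay < 0 {
		days--
	}
	return int32(days)
}

// date64ToDate32Strict (hypothetical helper) rejects any value that is not
// an exact multiple of one day instead of silently truncating it.
func date64ToDate32Strict(ms int64) (int32, error) {
	if ms%millisPerDay != 0 {
		return 0, fmt.Errorf("date64 value %d is not on a day boundary", ms)
	}
	return int32(ms / millisPerDay), nil
}
```

With these helpers, a value such as 86400001 truncates to day 1 but is rejected by the strict variant, which is exactly the case where existing implementations disagree.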
The current Go implementation diverges from the approach followed by these languages, coercing instead to a UTC-normalized TIMESTAMP[ms]. This may lead to surprising behavior in cross-language use cases, and it alters the original semantics of the type (at least for non-Arrow consumers that don't handle store_schema). Aligning Go with the convention currently followed by the other implementations would likely increase overall compatibility in the ecosystem.
See also: https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419
Component(s)
Go, Parquet