[Go][Parquet] Timestamp conversion from Arrow to Parquet does not match expected timezone semantics #39466
Description
Describe the bug, including details regarding any error messages, version, and platform.
Arrow documentation indicates that Timestamps are always measured from the UNIX epoch, and the presence of timezone information is used to determine whether the value has "instant" semantics or "wall clock" semantics. Timestamps with any timezone have "instant" semantics and the epoch is always in UTC. The timezone field may be used by applications to display a localized time on read.
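To illustrate the "instant" semantics described above, here is a minimal sketch using only Go's standard `time` package (not the Arrow library). The helper `render` and its offset parameter are hypothetical names introduced for this example. The stored epoch value never changes; the zone only affects how it is displayed.

```go
package main

import (
	"fmt"
	"time"
)

// render is a hypothetical helper: it formats an epoch value (seconds since
// the UNIX epoch, in UTC) for display in a zone with the given fixed offset.
// The zone only changes the rendering, never the stored value.
func render(epochSeconds int64, offsetSeconds int) string {
	loc := time.FixedZone("", offsetSeconds)
	return time.Unix(epochSeconds, 0).In(loc).Format(time.RFC3339)
}

func main() {
	const epoch int64 = 0 // the same physical value in both cases

	fmt.Println(render(epoch, 0))       // displayed in UTC
	fmt.Println(render(epoch, -5*3600)) // displayed five hours behind UTC
}
```

Both lines describe the same instant; only the localized display differs, which is the role the Arrow timezone field is documented to play on read.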
Parquet also has a notion of "instant" vs "local" semantics which is specified with isAdjustedToUTC=true or isAdjustedToUTC=false, respectively (source). Parquet does not store timezone information, expecting physical representations to already be in UTC when using "instant" semantics.
This means that Arrow timestamps in any timezone are already "instants" physically represented in UTC and would map to Parquet "instants" directly with isAdjustedToUTC=true. Arrow timestamps without a timezone have "local" semantics and should map to Parquet with isAdjustedToUTC=false.
Consequently, isAdjustedToUTC in Parquet should be set based solely on whether an Arrow timezone is present, which is how other implementations handle the conversion (C++, Rust). The current Go implementation instead produces Parquet "instant" timestamps for Arrow timestamps that either have no timezone or are specifically in UTC. This does not align with the documentation or with the other implementations.
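The difference can be sketched as two predicates over the Arrow timezone string. This is an illustration of the behavior described in this report, not the actual Go implementation; the function names `expectedIsAdjustedToUTC` and `currentGoIsAdjustedToUTC` are hypothetical, and the timezone is modeled as a plain string where empty means "no timezone".

```go
package main

import "fmt"

// expectedIsAdjustedToUTC models the mapping described above: any Arrow
// timezone means "instant" semantics, no timezone means "local" semantics.
// Hypothetical helper, not part of the Arrow or Parquet Go packages.
func expectedIsAdjustedToUTC(arrowTimeZone string) bool {
	return arrowTimeZone != ""
}

// currentGoIsAdjustedToUTC models the behavior reported in this issue:
// "instant" only when the timezone is absent or is specifically UTC.
// Hypothetical helper written from the description, not the real code.
func currentGoIsAdjustedToUTC(arrowTimeZone string) bool {
	return arrowTimeZone == "" || arrowTimeZone == "UTC"
}

func main() {
	for _, tz := range []string{"", "UTC", "America/New_York"} {
		fmt.Printf("tz=%q expected=%t reported=%t\n",
			tz, expectedIsAdjustedToUTC(tz), currentGoIsAdjustedToUTC(tz))
	}
}
```

Under these assumptions the two predicates agree only for "UTC". They disagree both for timestamps without a timezone and for timestamps in any non-UTC timezone.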
Component(s)
Go, Parquet