GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics#39467
GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics#39467zeroshade merged 4 commits intoapache:mainfrom
Conversation
|
|
go/parquet/pqarrow/schema.go
Outdated
There was a problem hiding this comment.
shouldn't we also be checking if the timezone is already explicitly "UTC" / "utc" ?
There was a problem hiding this comment.
I think that scenario should have the same logic (isAdjustedToUTC=true) as any non-empty timezone, so in that case it should already be covered by this.
|
@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling) |
Thanks @zeroshade, just rebased |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 858574d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…nt/Local Semantics (apache#39467) ### Rationale for this change Closes: apache#39466 ### What changes are included in this PR? - Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet. - Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules. - Refactor Timestamp serialization methods to reduce duplicated code ### Are these changes tested? Yes, - Logical type mapping in existing test updated. - New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled. - New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps. ### Are there any user-facing changes? Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++. * Closes: apache#39466 Authored-by: Joel Lubinitsky <joel@cherre.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Rationale for this change
Closes: #39466
What changes are included in this PR?
isAdjustedToUTC=trueon conversion to Parquet.Are these changes tested?
Yes,
Are there any user-facing changes?
Yes, users of
pqarrow.FileWriterwill produce Parquet files in which theTIMESTAMPtype is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format document. Rust already implements the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.