Skip to content

GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics#39467

Merged
zeroshade merged 4 commits intoapache:mainfrom
joellubi:gh-39466
Jan 18, 2024
Merged

GH-39466: [Go][Parquet] Align Arrow and Parquet Timestamp Instant/Local Semantics#39467
zeroshade merged 4 commits intoapache:mainfrom
joellubi:gh-39466

Conversation

@joellubi
Copy link
Copy Markdown
Member

@joellubi joellubi commented Jan 5, 2024

Rationale for this change

Closes: #39466

What changes are included in this PR?

  • Update logic for determining whether an Arrow Timestamp should have isAdjustedToUTC=true on conversion to Parquet.
  • Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format backward-compatibilty rules.
  • Refactor Timestamp serialization methods to reduce duplicated code

Are these changes tested?

Yes,

  • Logical type mapping in existing test updated.
  • New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
  • New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

Are there any user-facing changes?

Yes, users of pqarrow.FileWriter will produce Parquet files in which the TIMESTAMP type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format document. Rust already implements the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2024

⚠️ GitHub issue #39466 has been automatically assigned in GitHub to PR creator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also be checking if the timezone is already explicitly "UTC" / "utc" ?

Copy link
Copy Markdown
Member Author

@joellubi joellubi Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that scenario should have the same logic (isAdjustedToUTC=true) as any non-empty timezone, so in that case it should already be covered by this.

@zeroshade
Copy link
Copy Markdown
Member

@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling)

@joellubi
Copy link
Copy Markdown
Member Author

@joellubi This needs to get rebased since the other PR was merged (for the DATE type handling)

Thanks @zeroshade, just rebased

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 10, 2024
@zeroshade zeroshade merged commit 858574d into apache:main Jan 18, 2024
@zeroshade zeroshade removed the awaiting committer review Awaiting committer review label Jan 18, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 858574d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…nt/Local Semantics (apache#39467)

### Rationale for this change

Closes: apache#39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with Parquet Format [backward-compatibilty](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC IFF the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and apache#39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: apache#39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Timestamp conversion from Arrow to Parquet does not match expected timezone semantics

2 participants