Skip to content

GH-39456: [Go][Parquet] Arrow DATE64 Type Coerced to Parquet DATE Logical Type#39460

Merged
zeroshade merged 2 commits intoapache:mainfrom
joellubi:gh-39456
Jan 9, 2024
Merged

GH-39456: [Go][Parquet] Arrow DATE64 Type Coerced to Parquet DATE Logical Type#39460
zeroshade merged 2 commits intoapache:mainfrom
joellubi:gh-39456

Conversation

@joellubi
Copy link
Copy Markdown
Member

@joellubi joellubi commented Jan 5, 2024

Rationale for this change

Closes: #39456

What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

Are these changes tested?

Yes,

  • Update expected schema mapping in existing test
  • Tests asserting new behavior
    • Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
    • Arrow DATE64 not aligned to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

Are there any user-facing changes?

Yes, users of pqarrow.FileWriter will produce Parquet files containing DATE logical type instead of TIMESTAMP[ms] when writing Arrow data containing DATE64 field(s). The proposed implementation truncates int64 values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

@joellubi joellubi requested a review from zeroshade as a code owner January 5, 2024 00:54
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 5, 2024

⚠️ GitHub issue #39456 has been automatically assigned in GitHub to PR creator.

@zeroshade zeroshade added Breaking Change Includes a breaking change to the API Component: Parquet labels Jan 9, 2024
@zeroshade zeroshade merged commit eade938 into apache:main Jan 9, 2024
@zeroshade zeroshade removed the awaiting review Awaiting review label Jan 9, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit eade938.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…TE Logical Type (apache#39460)

### Rationale for this change

Closes: apache#39456 

### What changes are included in this PR?

Update physical and logical type mapping from Arrow to Parquet for DATE64 type

### Are these changes tested?

Yes,
- Update expected schema mapping in existing test
- Tests asserting new behavior
  - Arrow DATE64 will roundtrip -> Parquet -> Arrow as DATE32
  - Arrow DATE64 _not aligned_ to exact date boundary will truncate to milliseconds at boundary of greatest full day on Parquet roundtrip

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files containing `DATE` logical type instead of `TIMESTAMP[ms]` when writing Arrow data containing DATE64 field(s). The proposed implementation truncates `int64` values to be divisible by 86400000 rather than validating that this is already the case, as some implementations do. I am happy to add this validation if it would be preferred, but the truncating behavior will likely break fewer existing users.

I'm not sure whether this is technically considered a breaking change to a public API and if/how it should be communicated. Any direction regarding this would be appreciated.

* Closes: apache#39456

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Breaking Change Includes a breaking change to the API Component: Go Component: Parquet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Arrow DATE64 type is coerced into Parquet TIMESTAMP[ms] logical type instead of DATE (32-bit)

2 participants