Conversation

@phillipleblanc
Contributor

@phillipleblanc phillipleblanc commented Feb 4, 2025

Which issue does this PR close?

Rationale for this change

Support for casting large dates from string to Date32.

What changes are included in this PR?

Extend the parse_date method, which is used in the impl Parser for Date32Type, to handle dates which are prefixed with + or -. If the date is not prefixed with + or -, the existing logic is used unmodified.

This code isn't as optimized as the code that handles the more common date formats, but since these extended dates are relatively rare in practice, I don't think it matters much.
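
To illustrate the behavior described above, here is a minimal, std-only sketch of parsing a sign-prefixed date into a Date32 value (days since 1970-01-01). It is not the code from this PR: the names `parse_signed_date`, `parse_digits`, `days_in_month`, `is_leap`, and `days_from_civil` are hypothetical, and the sketch assumes the proleptic Gregorian calendar with astronomical year numbering (year 0 exists), which is consistent with the expected values in this PR's tests.

```rust
/// Days between 1970-01-01 and a proleptic Gregorian date
/// (astronomical year numbering, so year 0 exists).
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Treat March as the first month so the leap day falls at the end of the year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400); // 400-year cycle index, floors for negative years
    let yoe = y.rem_euclid(400); // year within the cycle, 0..=399
    let shifted_month = i64::from(if month > 2 { month - 3 } else { month + 9 });
    let doy = (153 * shifted_month + 2) / 5 + i64::from(day) - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

fn is_leap(year: i64) -> bool {
    year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)
}

/// Returns 0 for an invalid month, so any day is rejected by the caller.
fn days_in_month(year: i64, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if is_leap(year) => 29,
        2 => 28,
        _ => 0,
    }
}

/// Accepts only non-empty runs of ASCII digits.
fn parse_digits(field: &str) -> Option<u32> {
    if field.is_empty() || !field.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    field.parse().ok()
}

/// Hypothetical helper: parses "+YYYYY-MM-DD" or "-YYYY-MM-DD" into days
/// since the Unix epoch. Returns None when there is no sign prefix, so a
/// caller could fall back to its existing parsing logic in that case.
fn parse_signed_date(s: &str) -> Option<i32> {
    let (negative, rest) = if let Some(r) = s.strip_prefix('+') {
        (false, r)
    } else if let Some(r) = s.strip_prefix('-') {
        (true, r)
    } else {
        return None;
    };
    let mut parts = rest.splitn(3, '-');
    let magnitude = i64::from(parse_digits(parts.next()?)?);
    let month = parse_digits(parts.next()?)?;
    let day = parse_digits(parts.next()?)?;
    let year = if negative { -magnitude } else { magnitude };
    if day == 0 || day > days_in_month(year, month) {
        return None;
    }
    // Date32 stores an i32, so dates outside that range are rejected.
    i32::try_from(days_from_civil(year, month, day)).ok()
}
```

Under these assumptions, `parse_signed_date("+10999-12-31")` and `parse_signed_date("-0010-02-28")` yield the same day counts that the tests in this PR assert, and an unprefixed string such as `"2020-09-08"` returns `None` so that the unmodified fast path can handle it.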

Are there any user-facing changes?

Aside from the desired fix, no.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 4, 2025
Contributor

@alamb alamb left a comment

Thank you @phillipleblanc -- this looks reasonable to me. I think the PR only needs a few more tests and then we can merge.

I am sure we could make parsing dates like this faster but we can do that type of optimization as a follow up.

I am also running the cast benchmarks to make sure this doesn't accidentally introduce a regression, and will post the results to this PR.

@alamb
Contributor

alamb commented Feb 6, 2025

++ critcmp main phillip_250205-handle-large-dates
group         main                                   phillip_250205-handle-large-dates
-----         ----                                   ---------------------------------
2020-09-08    1.00     21.5±0.05ns        ? ?/sec    1.04     22.2±0.05ns        ? ?/sec
2020-09-8     1.01     19.0±0.08ns        ? ?/sec    1.00     18.8±0.03ns        ? ?/sec
2020-9-08     1.00     18.6±0.04ns        ? ?/sec    1.04     19.4±0.15ns        ? ?/sec
2020-9-8      1.00     17.4±0.02ns        ? ?/sec    1.01     17.5±0.02ns        ? ?/sec

Seems ok to me

phillipleblanc and others added 17 commits February 10, 2025 12:22
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…e in decimal conversion (apache#7070)

* fix <= check for scale in decimal conversion

* Update arrow-cast/src/cast/mod.rs

name change

Co-authored-by: Arttu <Blizzara@users.noreply.github.com>

* remove incorrect comment

---------

Co-authored-by: Arttu <Blizzara@users.noreply.github.com>
* Add another decimal cast edge test case

Before 1019f5b this test would fail, as
the cast produced 1. 0 is an edge case worth explicitly testing for.

* typo/fmt

Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>

---------

Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
…adata (apache#7052)

* Support both 0x01 and 0x02 as type for list of booleans

* Also support 0 for false inside boolean collections

* Use hex notation in tests
…pache#6751)

* Fix LocalFileSystem with range request that ends beyond end of file

* fix windows

* add comment

* Seek error

* fix seek check

* remove windows flag

* Get file length from file metadata
…ache#7027)

* Introduce UnsafeFlag to manage disabling validation

* fix docs
…che#7028)

* Rename `ArrayReader` to `RecordBatchDecoder`

* Remove alias for `self`
* Minor: Update release schedule

* realism
* Refactor some decimal-related code and tests in preparation for adding Decimal32 and Decimal64 support

* Fixed symbol

* Apply PR feedback

* Fixed format problem

* Fixed logical merge conflicts

* PR feedback
…coder` (apache#7029)

* Move `create_primitive_array` into RecordBatchReader

* Move `create_list-array` into RecordBatchReader

* Move `create_dictionay_array` into RecordBatchReader
* Print Parquet BasicTypeInfo id when present

* Improve print_schema documentation

* tiny cleanup
…he#7019)

* Initial change from Daniel.

* Upgrade unit test to be more generic.

* Add comments on why we have filter

* Cleanup unit tests.

* Update object_store/src/local.rs

Co-authored-by: Adam Reeve <adreeve@gmail.com>

* Add changes suggested by Adam.

* Cleanup match error.

* Apply formatting changes suggested by cargo +stable fmt --all.

* Apply cosmetic changes suggested by clippy.

* Upgrade test_path_with_offset to create temporary directory + files for testing rather than pointing to existing dir.

---------

Co-authored-by: Adam Reeve <adreeve@gmail.com>
…s` (apache#7065)

* fix: first none in `ListArray` panics in `cast_with_options`

* simplify

* fix

* Update arrow-cast/src/cast/list.rs

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
* Add benchmarks for Arrow IPC writer

* Add benchmarks for Arrow IPC writer

* reuse target buffer

* rename, etc

* Add compression type

* update

---------

Co-authored-by: Andy Grove <agrove@apache.org>
…pache#7089)

* Minor: Clarify documentaiton on NullBufferBuilder::allocated_size

* add note about why allocations are 64 bytes
@github-actions github-actions bot added parquet Changes to the parquet crate object-store labels Feb 10, 2025
@phillipleblanc
Contributor Author

Thanks @alamb for the review. I've pushed up fixes for your comments.

@github-actions github-actions bot removed the parquet Changes to the parquet crate label Feb 10, 2025
Contributor

@alamb alamb left a comment

Looks good to me -- thanks @phillipleblanc and @sgrebnov

assert_eq!(3298139, c.value(0)); // 10999-12-31
assert_eq!(-723122, c.value(1)); // -0010-02-28
assert_eq!(-715817, c.value(2)); // 0010-02-28
assert_eq!(c.value(3), c.value(4)); // Expect 0000-01-01 and -0000-01-01 to be parsed the same
Contributor

👍

@alamb alamb merged commit 2bce568 into apache:main Feb 12, 2025
26 checks passed
@alamb
Contributor

alamb commented Feb 12, 2025

Thanks again @phillipleblanc

ryzhyk pushed a commit to feldera/feldera that referenced this pull request Feb 25, 2025
Upgrade to the latest delta-rs main branch, which has a workaround for
this: apache/arrow-rs#7074

This triggered apache/datafusion#14862 and
while we're waiting for the [fix](apache/datafusion#14862)
to land and make it into the next datafusion release, I had to add a [patch] section
to Cargo.toml to use the fixed-up version of datafusion 45.0.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
github-merge-queue bot pushed a commit to feldera/feldera that referenced this pull request Feb 25, 2025
Upgrade to the latest delta-rs main branch, which has a workaround for
this: apache/arrow-rs#7074

This triggered apache/datafusion#14862 and
while we're waiting for the [fix](apache/datafusion#14862)
to land and make it into the next datafusion release, I had to add a [patch] section
to Cargo.toml to use the fixed-up version of datafusion 45.0.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit to feldera/feldera that referenced this pull request Feb 26, 2025
Upgrade to the latest delta-rs main branch, which has a workaround for
this: apache/arrow-rs#7074

This triggered apache/datafusion#14862 and
while we're waiting for the [fix](apache/datafusion#14862)
to land and make it into the next datafusion release, I had to add a [patch] section
to Cargo.toml to use the fixed-up version of datafusion 45.0.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
@phillipleblanc phillipleblanc deleted the phillip/250205-handle-large-dates branch December 6, 2025 13:53
Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support casting strings to Date32 that contain large dates