GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation#39112
Conversation
jorisvandenbossche
left a comment
There was a problem hiding this comment.
(already posting whatever I have right now)
The PartitionSet and ParquetPartitions classes can also be removed?
There a few helper methods like _get_filesystem_and_path and _mkdir_if_not_exists that are no longer used and can be removed as well
|
It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit. @jorisvandenbossche after the last review I have updated the marks in the tests, added I have also removed |
|
The HDFS failures seem to be related: https://github.com/ursacomputing/crossbow/actions/runs/7288206604/job/19860362598#step:6:9473 |
|
Yeah, missed some metadata there. Thanks, will correct! |
|
@jorisvandenbossche if I am understanding correctly |
|
@github-actions crossbow submit -g python-*-hdfs |
|
|
@github-actions crossbow submit hdfs |
|
I am not sure if raising for the |
|
Revision: 77b4ecb Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57
|
|
Still need to look into the failures with |
|
Ah, I guess the issue is the same it only fails in |
|
@github-actions crossbow submit hdfs |
|
Revision: 481a85c Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813
|
|
Thanks a lot @AlenkaF for the work here! |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
The However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists): |
…on-based implementation (apache#39112) ### Rationale for this change Legacy ParquetDataset has been deprecated for a while now, see apache#31529. This PR is removing the legacy implementation from the code. ### What changes are included in this PR? The PR is removing: - `ParquetDatasetPiece ` - `ParquetManifest` - `_ParquetDatasetMetadata ` - `ParquetDataset` The PR is renaming `_ParquetDatasetV2` to `ParquetDataset` which was removed. It is also updating the docstrings. The PR is updating: - `read_table` - `write_to_dataset` The PR is updating all the tests to not use `use_legacy_dataset` keyword or legacy parametrisation. ### Are these changes tested? Yes. ### Are there any user-facing changes? Deprecated code is removed. * Closes: apache#31303
…able in case dataset is not available (#46550) ### Rationale for this change Since we merged: - #39112 The fallback case were dataset is not available is not being tested on parquet.read_table (I did validate myself with a breakpoint). ### What changes are included in this PR? Adds a basic test that checks the exercises that code path and tests the read_table method from `ParquetFile` instead of `ParquetDataset` in case dataset is not available. ### Are these changes tested? Yes via CI. ### Are there any user-facing changes? No * GitHub Issue: #46373 Authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
Rationale for this change
Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR is removing the legacy implementation from the code.
What changes are included in this PR?
The PR is removing:
ParquetDatasetPieceParquetManifest_ParquetDatasetMetadataParquetDatasetThe PR is renaming
_ParquetDatasetV2toParquetDatasetwhich was removed. It is also updating the docstrings.The PR is updating:
read_tablewrite_to_datasetThe PR is updating all the tests to not use
use_legacy_datasetkeyword or legacy parametrisation.Are these changes tested?
Yes.
Are there any user-facing changes?
Deprecated code is removed.