
[Data] Deprecate read_parquet_bulk #48691

Merged
bveeramani merged 1 commit into master from deprecate-read-parquet-bulk
Nov 12, 2024

Conversation

@bveeramani
Member

@bveeramani bveeramani commented Nov 11, 2024

Why are these changes needed?

Users (including Ray Data developers!) are often confused about how to choose between read_parquet and read_parquet_bulk. To avoid confusion, this PR deprecates read_parquet_bulk.
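
For readers unfamiliar with how a deprecation like this typically surfaces to callers, here is a minimal, library-free sketch using Python's standard `warnings` module. It is not the code from this PR and not Ray's implementation: `deprecated`, `read_files`, and `read_files_bulk` are hypothetical stand-ins, and the assumption is simply that the deprecated function keeps working while warning callers toward its replacement.

```python
import functools
import warnings


def deprecated(replacement: str):
    """Mark a function as deprecated and point callers at `replacement`.

    Hypothetical helper for illustration only.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated; use `{replacement}` instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


def read_files(paths):
    # Hypothetical stand-in for the preferred reader.
    return list(paths)


@deprecated(replacement="read_files")
def read_files_bulk(paths):
    # Hypothetical stand-in for the deprecated reader.
    # It still works, but delegates to the preferred function.
    return read_files(paths)
```

Under this pattern existing callers keep running, and they can surface the warning in their own test suites with `warnings.simplefilter("error", DeprecationWarning)` to find call sites that need migrating.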

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Contributor

@omatthew98 omatthew98 left a comment


🧹

@bveeramani bveeramani enabled auto-merge (squash) November 12, 2024 18:33
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Nov 12, 2024
@bveeramani bveeramani merged commit 7bbab39 into master Nov 12, 2024
@bveeramani bveeramani deleted the deprecate-read-parquet-bulk branch November 12, 2024 19:23
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
@JackGammack
Contributor

Is there more information about why read_parquet_bulk was deprecated? Or recommendations for how to read many thousands of parquet files that already have the same schema without the huge overhead/startup time of read_parquet? This startup time can take 30+ minutes with a very large number of files.

Using `FastFileMetadataProvider` works for some other datasources, but fails with `read_parquet`:
`AttributeError: 'FastFileMetadataProvider' object has no attribute 'prefetch_file_metadata'`

@alexeykudinkin
Contributor

Hey, @JackGammack! We're currently looking into addressing some of the long-standing issues with reading Parquet


Labels

community-backlog, go (add ONLY when ready to merge, run all tests)


5 participants