[Data] Set default file_extensions for read_parquet#56481

Merged
bveeramani merged 10 commits into master from parquet-file-extnesion
Oct 28, 2025

@bveeramani bveeramani commented Sep 12, 2025

Why are these changes needed?

http://github.com/ray-project/ray/pull/50092 warned that we'd be changing the default file_extensions for Parquet from None to ["parquet"]. This was the motivation:

People often have non-Parquet files in their datasets (e.g., _SUCCESS or stale files). However, the default for file_extensions is None, so read_parquet tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like ["parquet"]. This PR adds a warning for that change.

This PR follows up and actually changes the default.
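
As a rough illustration of the effect, the sketch below filters a directory listing the way the new default would. The helper `filter_by_extension` and the example paths are hypothetical, and the case-insensitive suffix match is an assumption about the matching rule, not a copy of Ray's implementation.

```python
from typing import List, Optional


def filter_by_extension(
    paths: List[str], file_extensions: Optional[List[str]]
) -> List[str]:
    """Hypothetical helper: keep paths that end with one of the extensions.

    Assumption: extensions are given without a leading dot and are matched
    case-insensitively against the end of the path.
    """
    if file_extensions is None:
        # Old default: no filtering, every file is passed to the reader.
        return list(paths)
    suffixes = tuple(f".{ext.lower()}" for ext in file_extensions)
    return [p for p in paths if p.lower().endswith(suffixes)]


# Made-up listing with a marker file next to the data files.
listing = [
    "table/part-0000.parquet",
    "table/part-0001.parquet",
    "table/_SUCCESS",
]

old_default = filter_by_extension(listing, None)
new_default = filter_by_extension(listing, ["parquet"])
```

With the old default the `_SUCCESS` marker stays in the list and would be read as Parquet; with the new default it is dropped. Callers who rely on the old behavior can presumably still pass `file_extensions=None` to `read_parquet` explicitly.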

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Sets read_parquet default file_extensions to ['parquet'], updates Parquet datasource accordingly, and adjusts tests/file names to use .parquet.

  • Data API
    • read_parquet: default file_extensions now ['parquet'] (was None).
    • ParquetDatasource:
      • Introduces _FILE_EXTENSIONS = ['parquet'] and uses it as default.
      • Removes future-warning logic for impending default change.
  • Tests
    • Update paths and fixtures to use .parquet extensions (e.g., test_include_paths, null-first-file case).
    • Remove warning test for invalid file extensions.
    • Adjust training dataset filenames from *.parquet.snappy to *.snappy.parquet, including smoke-test paths.
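
The filename change in the last bullet suggests that extensions are matched against the end of the file name, so a file ending in `.parquet.snappy` would no longer be picked up by default. A minimal sketch of that rename, assuming suffix matching; the helper `move_parquet_suffix_last` is hypothetical and not part of this PR:

```python
def move_parquet_suffix_last(name: str) -> str:
    """Hypothetical helper: turn 'x.parquet.snappy' into 'x.snappy.parquet'.

    Names that do not have '.parquet' as their second-to-last suffix are
    returned unchanged.
    """
    head, sep, codec = name.rpartition(".")
    if sep and head.endswith(".parquet"):
        # Drop '.parquet' from the middle and append it after the codec.
        return f"{head[: -len('.parquet')]}.{codec}.parquet"
    return name


# Made-up file names for illustration.
renamed = move_parquet_suffix_last("train-000.parquet.snappy")
unchanged = move_parquet_suffix_last("train-000.snappy.parquet")
```

An alternative to renaming would be to pass a matching `file_extensions` value explicitly; the tests in this PR take the rename route.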

Written by Cursor Bugbot for commit a774c4e.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner September 12, 2025 03:05

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly updates the default file_extensions for read_parquet from None to ["parquet"], making the previously warned-about change effective. The implementation is clean, removing the now-obsolete FutureWarning logic, which simplifies the codebase. The changes are straightforward and align perfectly with the stated goal. I have no further comments.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Sep 12, 2025
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner September 15, 2025 00:28
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 30, 2025
@bveeramani bveeramani enabled auto-merge (squash) September 30, 2025 12:12
@github-actions github-actions bot added go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Sep 30, 2025
@github-actions github-actions bot disabled auto-merge October 7, 2025 17:40
cursor[bot]

This comment was marked as outdated.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

@omatthew98 omatthew98 left a comment

🧹

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) October 27, 2025 06:39
@github-actions github-actions bot disabled auto-merge October 27, 2025 23:27
@bveeramani bveeramani merged commit c63eaa7 into master Oct 28, 2025
6 checks passed
@bveeramani bveeramani deleted the parquet-file-extnesion branch October 28, 2025 00:18
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…56481)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…56481)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…56481)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…56481)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Labels

  • data: Ray Data-related issues
  • go: add ONLY when ready to merge, run all tests
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.
