[Data] Allow file extensions starting with '.' #58339

Merged: bveeramani merged 1 commit into ray-project:master from CowKeyMan:feature/dot_file_extensions on Nov 5, 2025

CowKeyMan (Contributor) commented Oct 31, 2025

Users often intuitively provide file extensions with a leading '.'. This PR handles that case by stripping the leading '.' when it is present.

For example, when using ray.data.read_parquet, the file_extensions parameter needs to be something like ['parquet']. However, some users may reasonably expect ['.parquet'] to work as well.

This commit allows users to switch from:

train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['parquet'],
)

to

train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['.parquet'],  # Now reads the files instead of silently reading nothing
)
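
To illustrate why the dotted form could silently match nothing, here is a minimal sketch. It assumes the matcher prepends a '.' to each extension and compares path suffixes; `matches_before` and `matches_after` are hypothetical names used only for illustration, not Ray's actual functions.

```python
from typing import List


def matches_before(path: str, extensions: List[str]) -> bool:
    # Assumed pre-change behavior: a dot is always prepended, so ".parquet"
    # turns into "..parquet" and no ordinary file name ends with it.
    return any(path.endswith(f".{ext}") for ext in extensions)


def matches_after(path: str, extensions: List[str]) -> bool:
    # Strip a single leading dot first (the normalization this PR describes),
    # then apply the same suffix check.
    normalized = [ext[1:] if ext.startswith(".") else ext for ext in extensions]
    return matches_before(path, normalized)


print(matches_before("data/part-0.parquet", ["parquet"]))   # True
print(matches_before("data/part-0.parquet", [".parquet"]))  # False
print(matches_after("data/part-0.parquet", [".parquet"]))   # True
```

Under that assumption, both spellings behave the same after normalization, which is the outcome the description asks for.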

CowKeyMan force-pushed the feature/dot_file_extensions branch 4 times, most recently from 2e8cac5 to 653177a, on October 31, 2025 10:45
CowKeyMan marked this pull request as ready for review October 31, 2025 10:46
CowKeyMan requested a review from a team as a code owner October 31, 2025 10:46
ray-gardener bot added the core, data, and community-contribution labels Oct 31, 2025
edoakes removed the core label Oct 31, 2025
edoakes (Collaborator) commented Oct 31, 2025

@bveeramani PTAL

bveeramani (Member) left a comment

Hey @CowKeyMan, would you mind elaborating on the motivation for this change in the PR description?

CowKeyMan force-pushed the feature/dot_file_extensions branch from 653177a to 53df1b5 on October 31, 2025 20:34
CowKeyMan (Contributor Author) commented

I added an example in the commit description.

CowKeyMan force-pushed the feature/dot_file_extensions branch from 53df1b5 to 20e009f on October 31, 2025 22:22
bveeramani (Member) left a comment

Could you add a test?

Comment on lines +270 to +272
if file_extensions is not None:
    file_extensions = [x[1:] if x.startswith(".") else x for x in file_extensions]

Member

Maybe move to _has_file_extension so it's colocated with the relevant code?

Member

Also, add a comment explaining why we have this logic?

CowKeyMan (Contributor Author) commented Nov 3, 2025

You are right, the code makes more sense in _has_file_extension, so I have moved it there.

I also added another example to this method. Are these tested automatically with doctest? I am also not sure which file the test should go in.
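
On the doctest question, the thread does not say whether the project's CI runs docstring examples for this module. As a sketch of how such examples can be checked by hand, the snippet below uses a hypothetical stand-in named `has_file_extension` (not Ray's actual helper; the suffix matching is an assumption) and runs its docstring examples with the standard library's `doctest` module.

```python
import doctest
from typing import List, Optional


def has_file_extension(path: str, extensions: Optional[List[str]]) -> bool:
    """Return True if `path` ends with one of `extensions`.

    Hypothetical stand-in for the helper discussed in this thread.

    >>> has_file_extension("foo.csv", ["csv"])
    True
    >>> has_file_extension("foo.csv", [".csv"])
    True
    >>> has_file_extension("foo.csv", ["json"])
    False
    >>> has_file_extension("foo.csv", None)
    True
    """
    if extensions is None:
        return True
    # Users may write ".csv" or "csv"; strip one leading dot so both match.
    normalized = [ext[1:] if ext.startswith(".") else ext for ext in extensions]
    return any(path.endswith(f".{ext}") for ext in normalized)


if __name__ == "__main__":
    # Runs every docstring example in this module and prints a summary.
    print(doctest.testmod())
```

Saving this as a file and running `python -m doctest <file>` performs the same check without the `__main__` block.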

Contributor

@CowKeyMan, I think the test for _has_file_extension would go here - python/ray/data/tests/test_path_util.py
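
A test along these lines could look like the following sketch. The helper here is a hypothetical stand-in defined inline so the example runs on its own; the real test would import the project's helper instead, and the case list is an assumption about which inputs are worth covering.

```python
from typing import List, Optional, Tuple


def has_file_extension(path: str, extensions: Optional[List[str]]) -> bool:
    # Hypothetical stand-in: strip one leading dot, then compare suffixes.
    if extensions is None:
        return True
    normalized = [ext[1:] if ext.startswith(".") else ext for ext in extensions]
    return any(path.endswith(f".{ext}") for ext in normalized)


# (path, extensions, expected result)
CASES: List[Tuple[str, Optional[List[str]], bool]] = [
    ("foo.parquet", ["parquet"], True),    # undotted form keeps working
    ("foo.parquet", [".parquet"], True),   # dotted form is now accepted
    ("foo.csv", [".parquet"], False),      # a different extension still fails
    ("fooparquet", [".parquet"], False),   # the suffix must follow a dot
    ("foo.csv", None, True),               # no filter accepts everything
]


def test_has_file_extension_accepts_leading_dot() -> None:
    for path, extensions, expected in CASES:
        assert has_file_extension(path, extensions) == expected, (path, extensions)


test_has_file_extension_accepts_leading_dot()
```

With pytest, the loop could be replaced by `pytest.mark.parametrize` so each case is reported separately.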

CowKeyMan force-pushed the feature/dot_file_extensions branch 4 times, most recently from 030fefa to ddeeb87, on November 3, 2025 19:43
Signed-off-by: Daniel Cauchi <dancauchi1@gmail.com>

It is sometimes intuitive for users to provide their extensions with '.'
at the start. This PR takes care of that and removes the '.' when it is
provided.
CowKeyMan force-pushed the feature/dot_file_extensions branch from aebd6b0 to 3746cea on November 4, 2025 16:13
CowKeyMan (Contributor Author) commented

Test added, code moved, and I adjusted the comment as well

bveeramani changed the title from "Allow file extensions starting with '.'" to "[Data] Allow file extensions starting with '.'" Nov 5, 2025
bveeramani enabled auto-merge (squash) November 5, 2025 05:57
github-actions bot added the go label (add ONLY when ready to merge, run all tests) Nov 5, 2025
bveeramani merged commit f6bb8b8 into ray-project:master Nov 5, 2025
8 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <peterxcli@gmail.com>
Labels: community-contribution, data, go
