[Data] Fix Parquet datasource path column support #60046

Merged

alexeykudinkin merged 2 commits into ray-project:master

Jan 16, 2026

Contributor

daiping8 commented Jan 12, 2026

Description

This PR fixes an issue where the `include_paths=True` parameter in the Parquet datasource did not add the 'path' column to the dataset schema.

Previously, reading Parquet files with `include_paths=True` left the path column out of the schema, so operations like `ds.groupby("path").count()` failed with a "column not found" error.

The fix involves:

1. Passing the `include_paths` parameter to the `_derive_schema` method in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to the schema when `include_paths=True` and the path field doesn't already exist
3. Ensuring that when column projection is used together with `include_paths=True`, the path column is also included in the projected columns list
4. Adding comprehensive tests to verify the functionality works both in basic scenarios and when combined with column projection

This ensures that when include_paths=True is specified, the path column is properly included in the dataset schema, allowing operations like groupby on the path column to work as expected.
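The schema rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: the schema is modeled as a list of `(name, type)` pairs rather than an Arrow schema, and `derive_schema` is a hypothetical helper, not the actual `_derive_schema` method in Ray.

```python
from typing import List, Optional, Tuple

# A schema is modeled as an ordered list of (name, type) pairs.
# This is a stand-in for a real Arrow schema, not Ray's data structure.
Schema = List[Tuple[str, str]]


def derive_schema(
    file_schema: Schema,
    include_paths: bool,
    projected_columns: Optional[List[str]] = None,
) -> Schema:
    """Hypothetical helper: build the schema a reader would advertise."""
    schema = list(file_schema)

    # Append a string-typed 'path' field only if the file has no such field.
    if include_paths and all(name != "path" for name, _ in schema):
        schema.append(("path", "string"))

    if projected_columns is not None:
        wanted = list(projected_columns)
        # Keep 'path' in the projection when include_paths is requested.
        if include_paths and "path" not in wanted:
            wanted.append("path")
        schema = [(name, typ) for name, typ in schema if name in wanted]

    return schema


file_schema = [("id", "int64"), ("value", "double")]
full = derive_schema(file_schema, include_paths=True)
projected = derive_schema(file_schema, include_paths=True, projected_columns=["id"])
print(full)       # [('id', 'int64'), ('value', 'double'), ('path', 'string')]
print(projected)  # [('id', 'int64'), ('path', 'string')]
```

In both calls the 'path' field is part of the advertised schema, which is the property a later `groupby("path")` would depend on.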

Related issues

Closes #60027

Additional information

gemini-code-assist bot reviewed

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an issue where the path column was not correctly added to the schema when include_paths=True for Parquet datasources. The changes correctly identify the need to modify _derive_schema and add a test case. However, I've found a logical flaw in the implementation where applying column projection can incorrectly remove the path column after it has been added. I've suggested a fix to reorder the operations to ensure the path column is preserved when a projection is active. With this change, the fix should be correct and robust.
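The ordering problem raised in this review can be illustrated with a small sketch. The function names below are hypothetical and the column lists stand in for a real schema. The sketch only shows why projecting after appending the path column can drop it, and how reordering avoids that.

```python
from typing import List, Optional


def columns_append_then_project(
    file_columns: List[str], include_paths: bool, projection: Optional[List[str]]
) -> List[str]:
    """Flawed order: 'path' is appended first, then filtered out by projection."""
    cols = list(file_columns)
    if include_paths and "path" not in cols:
        cols.append("path")
    if projection is not None:
        cols = [c for c in cols if c in projection]
    return cols


def columns_project_then_append(
    file_columns: List[str], include_paths: bool, projection: Optional[List[str]]
) -> List[str]:
    """Reordered: project first, then append 'path' so it survives."""
    cols = list(file_columns)
    if projection is not None:
        cols = [c for c in cols if c in projection]
    if include_paths and "path" not in cols:
        cols.append("path")
    return cols


flawed = columns_append_then_project(["a", "b"], True, ["a"])
reordered = columns_project_then_append(["a", "b"], True, ["a"])
print(flawed)     # ['a']
print(reordered)  # ['a', 'path']
```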


          feat(parquet): Fix Parquet datasource path column support; automatically add a string-typed 'path' column to the schema when include_paths is True and the path field does not exist

a616562

Change-Id: I6419b371a8bf2451c326db550ec3b685ebbe248e
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

daiping8 force-pushed the path branch from aa0d3f2 to a616562

January 12, 2026 11:50

daiping8 mentioned this pull request

[Data] include_paths=True does not add "path" column key to the (lazy) schema and fails in groupby("path") #60027

Closed

daiping8 marked this pull request as ready for review

January 12, 2026 12:36

daiping8 requested a review from a team as a code owner

January 12, 2026 12:36

daiping8 changed the title ~~[Data] Fix Parquet datasource path column support~~ [WIP][Data] Fix Parquet datasource path column support

cursor bot reviewed

python/ray/data/_internal/datasource/parquet_datasource.py (resolved)

ray-gardener bot added data community-contribution labels

bveeramani reviewed

Member

bveeramani left a comment

Makes sense

python/ray/data/_internal/datasource/parquet_datasource.py (resolved)

python/ray/data/_internal/datasource/parquet_datasource.py (outdated, resolved)

Member

bveeramani commented Jan 12, 2026

@goutamvenkat-anyscale could you review? I think you have context on the code

cursor bot reviewed

python/ray/data/_internal/datasource/parquet_datasource.py (outdated, resolved)

daiping8 commented

python/ray/data/_internal/datasource/parquet_datasource.py

Comment on lines +890 to 896

+    if partition_col_values:
+        table = _add_partitions_to_table(partition_col_values, table)
     if include_path:
         table = ArrowBlockAccessor.for_block(table).fill_column(
             "path", fragment.path
         )

Contributor Author

daiping8 Jan 13, 2026

@bveeramani Yes. I adjusted _read_batches_from to first add the partition column, then add the path column, making it consistent with the schema.
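The ordering described in this comment can be sketched as follows. This is a simplified illustration, not the code in `_read_batches_from`: a table is modeled as an insertion-ordered dict of column lists, and `add_partitions`, `fill_path`, and `read_fragment` are hypothetical helpers. The assumption here is that the advertised schema lists file columns, then partition columns, then 'path'.

```python
from typing import Dict, List

# A table is modeled as an insertion-ordered mapping of column name to values.
Table = Dict[str, List[object]]


def _num_rows(table: Table) -> int:
    return len(next(iter(table.values()))) if table else 0


def add_partitions(table: Table, partition_values: Dict[str, object]) -> Table:
    """Hypothetical helper: add one constant column per partition key."""
    out = dict(table)
    rows = _num_rows(table)
    for name, value in partition_values.items():
        out[name] = [value] * rows
    return out


def fill_path(table: Table, path: str) -> Table:
    """Hypothetical helper: add a constant 'path' column."""
    out = dict(table)
    out["path"] = [path] * _num_rows(table)
    return out


def read_fragment(
    table: Table, partition_values: Dict[str, object], include_paths: bool, path: str
) -> Table:
    # Partition columns are added first, then the path column,
    # so the column order follows the assumed schema order.
    if partition_values:
        table = add_partitions(table, partition_values)
    if include_paths:
        table = fill_path(table, path)
    return table


result = read_fragment(
    {"id": [1, 2]}, {"year": 2024}, True, "data/year=2024/part-0.parquet"
)
print(list(result))  # ['id', 'year', 'path']
```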

daiping8 force-pushed the path branch from 5cb14c1 to dd34ca5

January 13, 2026 02:37

cursor bot reviewed

python/ray/data/_internal/datasource/parquet_datasource.py (resolved)


          feat(parquet)

347d0a9

Change-Id: I939fb4c6f56982a8cd8c936b5cae56d516be6399
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

daiping8 force-pushed the path branch from dd34ca5 to 347d0a9

January 13, 2026 03:17

daiping8 changed the title ~~[WIP][Data] Fix Parquet datasource path column support~~ [Data] Fix Parquet datasource path column support

daiping8 requested a review from bveeramani

January 13, 2026 06:08

bveeramani reviewed

Member

bveeramani left a comment

Looks reasonable to me, but will defer to @goutamvenkat-anyscale because I think he touched this code most recently.

The handling of schemas and include paths here seems brittle, but I think that's an architectural problem out of the scope of this PR

alexeykudinkin approved these changes

alexeykudinkin added the go label

alexeykudinkin enabled auto-merge (squash)

January 16, 2026 06:05

alexeykudinkin merged commit a5581c8 into ray-project:master

8 checks passed

limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request


          [Data] Fix Parquet datasource path column support (ray-project#60046)

6fcb07b

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request


          [Data] Fix Parquet datasource path column support (ray-project#60046)

4c152ce

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request


          [Data] Fix Parquet datasource path column support (ray-project#60046)

696512b

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

peterxcli pushed a commit to peterxcli/ray that referenced this pull request


          [Data] Fix Parquet datasource path column support (ray-project#60046)

db4db1a

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: peterxcli <peterxcli@gmail.com>

peterxcli pushed a commit to peterxcli/ray that referenced this pull request


          [Data] Fix Parquet datasource path column support (ray-project#60046)

279bb87

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: peterxcli <peterxcli@gmail.com>

Labels

community-contribution data go