[Data] Fix Parquet datasource path column support #60046

Merged
alexeykudinkin merged 2 commits into ray-project:master from daiping8:path
Jan 16, 2026
@daiping8 daiping8 commented Jan 12, 2026

Description

This PR fixes an issue where the include_paths=True parameter in the Parquet datasource did not add the 'path' column to the dataset schema.

Previously, when reading Parquet files with include_paths=True, the path column was not being included in the schema, causing operations like ds.groupby("path").count() to fail with a "column not found" error.

The fix involves:

  1. Passing the include_paths parameter to the _derive_schema method in the ParquetDatasource
  2. Adding logic to automatically append a string-typed 'path' column to the schema when include_paths=True and the path field doesn't already exist
  3. Ensuring that when column projection is used together with include_paths=True, the path column is also included in the projected columns list
  4. Adding comprehensive tests to verify the functionality works both in basic scenarios and when combined with column projection

This ensures that when include_paths=True is specified, the path column is properly included in the dataset schema, allowing operations like groupby on the path column to work as expected.
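
As a rough illustration of steps 2 and 3 above, here is a minimal sketch that models a schema as a plain list of (name, type) pairs. The function `derive_schema` and the `Field` alias are hypothetical stand-ins written for this example; they are not the actual `_derive_schema` method or the Arrow schema type used by ParquetDatasource.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a schema field: (column name, type name).
Field = Tuple[str, str]


def derive_schema(
    file_fields: List[Field],
    projected_columns: Optional[List[str]],
    include_paths: bool,
) -> List[Field]:
    """Simplified model of the schema handling described in the PR text."""
    fields = list(file_fields)

    # Step 2: append a string-typed "path" field unless one already exists.
    if include_paths and all(name != "path" for name, _ in fields):
        fields.append(("path", "string"))

    # Step 3: when a projection is active, make sure "path" survives it.
    if projected_columns is not None:
        wanted = list(projected_columns)
        if include_paths and "path" not in wanted:
            wanted.append("path")
        by_name = dict(fields)
        fields = [(name, by_name[name]) for name in wanted if name in by_name]

    return fields


file_fields = [("id", "int64"), ("value", "double")]
print(derive_schema(file_fields, ["id"], include_paths=True))
```

With a projection of only `id`, the sketch still reports a `path` field, which is the property that lets a later groupby on `path` resolve the column.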

Related issues

Closes #60027

Additional information

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an issue where the path column was not correctly added to the schema when include_paths=True for Parquet datasources. The changes correctly identify the need to modify _derive_schema and add a test case. However, I've found a logical flaw in the implementation where applying column projection can incorrectly remove the path column after it has been added. I've suggested a fix to reorder the operations to ensure the path column is preserved when a projection is active. With this change, the fix should be correct and robust.
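
The review does not quote the affected code, so the following is only a sketch of the kind of ordering problem it describes, using plain lists of column names. Both function names are hypothetical and exist only for this illustration.

```python
from typing import List


def append_then_project(columns: List[str], projection: List[str]) -> List[str]:
    # Flawed order: "path" is appended first, then the projection filters it
    # out again because the user never listed it.
    with_path = columns if "path" in columns else columns + ["path"]
    return [c for c in with_path if c in projection]


def project_then_append(columns: List[str], projection: List[str]) -> List[str]:
    # Reordered: apply the projection first, then append "path" so it is kept.
    projected = [c for c in columns if c in projection]
    return projected if "path" in projected else projected + ["path"]


columns = ["id", "value"]
projection = ["id"]
print(append_then_project(columns, projection))
print(project_then_append(columns, projection))
```

The first function loses the `path` column under projection, while the second keeps it, which matches the reordering the review suggests.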

…lly add a string-typed 'path' column to the schema when include_paths is True and the path field does not exist

Change-Id: I6419b371a8bf2451c326db550ec3b685ebbe248e
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 marked this pull request as ready for review January 12, 2026 12:36
@daiping8 daiping8 requested a review from a team as a code owner January 12, 2026 12:36
@daiping8 daiping8 changed the title [Data] Fix Parquet datasource path column support [WIP][Data] Fix Parquet datasource path column support Jan 12, 2026
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Jan 12, 2026
@bveeramani bveeramani left a comment

Makes sense

@bveeramani
@goutamvenkat-anyscale could you review? I think you have context on the code

Comment on lines +890 to 896
if partition_col_values:
    table = _add_partitions_to_table(partition_col_values, table)

if include_path:
    table = ArrowBlockAccessor.for_block(table).fill_column(
        "path", fragment.path
    )
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bveeramani Yes. I adjusted _read_batches_from to first add the partition column, then add the path column, making it consistent with the schema.
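
To make the ordering point concrete, here is a small sketch that models a table as an insertion-ordered dict of column lists. The helpers `add_partitions`, `fill_path`, and `finalize_batch` are hypothetical simplifications for this example, not the real `_add_partitions_to_table` or `ArrowBlockAccessor.fill_column`.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a table: column name -> list of values.
# Python dicts preserve insertion order, which models column order here.
Table = Dict[str, List[Any]]


def add_partitions(table: Table, partition_values: Dict[str, Any]) -> Table:
    num_rows = len(next(iter(table.values()))) if table else 0
    out = dict(table)
    for name, value in partition_values.items():
        out[name] = [value] * num_rows
    return out


def fill_path(table: Table, path: str) -> Table:
    num_rows = len(next(iter(table.values()))) if table else 0
    out = dict(table)
    out["path"] = [path] * num_rows
    return out


def finalize_batch(
    table: Table, partition_values: Dict[str, Any], path: str, include_path: bool
) -> Table:
    # Partition columns first, then "path", so the batch's column order
    # lines up with a schema that appends "path" last.
    if partition_values:
        table = add_partitions(table, partition_values)
    if include_path:
        table = fill_path(table, path)
    return table


batch = finalize_batch({"id": [1, 2]}, {"year": 2024}, "data/year=2024/a.parquet", True)
print(list(batch))
```

In this model the resulting column order is data columns, then partition columns, then `path`.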

Change-Id: I939fb4c6f56982a8cd8c936b5cae56d516be6399
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 changed the title [WIP][Data] Fix Parquet datasource path column support [Data] Fix Parquet datasource path column support Jan 13, 2026
@daiping8 daiping8 requested a review from bveeramani January 13, 2026 06:08
@bveeramani bveeramani left a comment

Looks reasonable to me, but will defer to @goutamvenkat-anyscale because I think he touched this code most recently.

The handling of schemas and include paths here seems brittle, but I think that's an architectural problem out of the scope of this PR

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Jan 16, 2026
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) January 16, 2026 06:05
@alexeykudinkin alexeykudinkin merged commit a5581c8 into ray-project:master Jan 16, 2026
8 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Labels

community-contribution: Contributed by the community
data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

[Data] include_paths=True does not add "path" column key to the (lazy) schema and fails in groupby("path")

3 participants