[Data] Fix Parquet datasource path column support#60046
[Data] Fix Parquet datasource path column support#60046alexeykudinkin merged 2 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request aims to fix an issue where the path column was not correctly added to the schema when include_paths=True for Parquet datasources. The changes correctly identify the need to modify _derive_schema and add a test case. However, I've found a logical flaw in the implementation where applying column projection can incorrectly remove the path column after it has been added. I've suggested a fix to reorder the operations to ensure the path column is preserved when a projection is active. With this change, the fix should be correct and robust.
…lly add a string-typed 'path' column to the schema when include_paths is True and the path field does not exist Change-Id: I6419b371a8bf2451c326db550ec3b685ebbe248e Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
|
@goutamvenkat-anyscale could you review? I think you have context on the code |
| if partition_col_values: | ||
| table = _add_partitions_to_table(partition_col_values, table) | ||
|
|
||
| if include_path: | ||
| table = ArrowBlockAccessor.for_block(table).fill_column( | ||
| "path", fragment.path | ||
| ) |
There was a problem hiding this comment.
@bveeramani Yes. I adjusted _read_batches_from to first add the partition column, then add the path column, making it consistent with the schema.
Change-Id: I939fb4c6f56982a8cd8c936b5cae56d516be6399 Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
bveeramani
left a comment
There was a problem hiding this comment.
Looks reasonable to me, but will defer to @goutamvenkat-anyscale because I think he touched this code most recently.
The handling of schemas and include paths here seems brittle, but I think that's an architectural problem out of the scope of this PR
## Description
This pr fixes an issue where the `include_paths=True` parameter in
Parquet datasource was not correctly adding the 'path' column to the
dataset schema.
Previously, when reading Parquet files with `include_paths=True`, the
path column was not being included in the schema, causing operations
like `ds.groupby("path").count()` to fail with a "column not found"
error.
The fix involves:
1. Passing the `include_paths` parameter to the `_derive_schema` method
in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to
the schema when `include_paths=True` and the path field doesn't already
exist
3. Ensuring that when column projection is used together with
`include_paths=True`, the path column is also included in the projected
columns list
4. Adding comprehensive tests to verify the functionality works both in
basic scenarios and when combined with column projection
This ensures that when `include_paths=True` is specified, the path
column is properly included in the dataset schema, allowing operations
like groupby on the path column to work as expected.
## Related issues
Closes ray-project#60027
## Additional information
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
## Description
This pr fixes an issue where the `include_paths=True` parameter in
Parquet datasource was not correctly adding the 'path' column to the
dataset schema.
Previously, when reading Parquet files with `include_paths=True`, the
path column was not being included in the schema, causing operations
like `ds.groupby("path").count()` to fail with a "column not found"
error.
The fix involves:
1. Passing the `include_paths` parameter to the `_derive_schema` method
in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to
the schema when `include_paths=True` and the path field doesn't already
exist
3. Ensuring that when column projection is used together with
`include_paths=True`, the path column is also included in the projected
columns list
4. Adding comprehensive tests to verify the functionality works both in
basic scenarios and when combined with column projection
This ensures that when `include_paths=True` is specified, the path
column is properly included in the dataset schema, allowing operations
like groupby on the path column to work as expected.
## Related issues
Closes ray-project#60027
## Additional information
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
## Description
This pr fixes an issue where the `include_paths=True` parameter in
Parquet datasource was not correctly adding the 'path' column to the
dataset schema.
Previously, when reading Parquet files with `include_paths=True`, the
path column was not being included in the schema, causing operations
like `ds.groupby("path").count()` to fail with a "column not found"
error.
The fix involves:
1. Passing the `include_paths` parameter to the `_derive_schema` method
in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to
the schema when `include_paths=True` and the path field doesn't already
exist
3. Ensuring that when column projection is used together with
`include_paths=True`, the path column is also included in the projected
columns list
4. Adding comprehensive tests to verify the functionality works both in
basic scenarios and when combined with column projection
This ensures that when `include_paths=True` is specified, the path
column is properly included in the dataset schema, allowing operations
like groupby on the path column to work as expected.
## Related issues
Closes ray-project#60027
## Additional information
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
## Description
This pr fixes an issue where the `include_paths=True` parameter in
Parquet datasource was not correctly adding the 'path' column to the
dataset schema.
Previously, when reading Parquet files with `include_paths=True`, the
path column was not being included in the schema, causing operations
like `ds.groupby("path").count()` to fail with a "column not found"
error.
The fix involves:
1. Passing the `include_paths` parameter to the `_derive_schema` method
in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to
the schema when `include_paths=True` and the path field doesn't already
exist
3. Ensuring that when column projection is used together with
`include_paths=True`, the path column is also included in the projected
columns list
4. Adding comprehensive tests to verify the functionality works both in
basic scenarios and when combined with column projection
This ensures that when `include_paths=True` is specified, the path
column is properly included in the dataset schema, allowing operations
like groupby on the path column to work as expected.
## Related issues
Closes ray-project#60027
## Additional information
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: peterxcli <peterxcli@gmail.com>
## Description
This pr fixes an issue where the `include_paths=True` parameter in
Parquet datasource was not correctly adding the 'path' column to the
dataset schema.
Previously, when reading Parquet files with `include_paths=True`, the
path column was not being included in the schema, causing operations
like `ds.groupby("path").count()` to fail with a "column not found"
error.
The fix involves:
1. Passing the `include_paths` parameter to the `_derive_schema` method
in the ParquetDatasource
2. Adding logic to automatically append a string-typed 'path' column to
the schema when `include_paths=True` and the path field doesn't already
exist
3. Ensuring that when column projection is used together with
`include_paths=True`, the path column is also included in the projected
columns list
4. Adding comprehensive tests to verify the functionality works both in
basic scenarios and when combined with column projection
This ensures that when `include_paths=True` is specified, the path
column is properly included in the dataset schema, allowing operations
like groupby on the path column to work as expected.
## Related issues
Closes ray-project#60027
## Additional information
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
This pr fixes an issue where the
include_paths=Trueparameter in Parquet datasource was not correctly adding the 'path' column to the dataset schema.Previously, when reading Parquet files with
include_paths=True, the path column was not being included in the schema, causing operations likeds.groupby("path").count()to fail with a "column not found" error.The fix involves:
include_pathsparameter to the_derive_schemamethod in the ParquetDatasourceinclude_paths=Trueand the path field doesn't already existinclude_paths=True, the path column is also included in the projected columns listThis ensures that when
include_paths=Trueis specified, the path column is properly included in the dataset schema, allowing operations like groupby on the path column to work as expected.Related issues
Closes #60027
Additional information