
[Data] include_paths=True does not add "path" column key to the (lazy) schema and fails in groupby("path") #60027

@wingkitlee0

Description

What happened + What you expected to happen

  • include_paths=True does not add the "path" column to the (lazy) schema
  • as a result, groupby("path") fails the schema validation check

For example, ds = ray.data.read_parquet("data/", include_paths=True), gives

In [24]: ds
Out[24]: Dataset(num_rows=?, schema={id: int64, data: double, uuid: string})

without the expected path column.

Then, if we run

ds.groupby("path").count().take_all()

It fails in SortKey.validate_schema(self, schema):

    for column in self._columns:
        if column not in schema_names_set:
            raise ValueError(
                f"You specified the column '{column}', but there's no such "
                "column in the dataset. The dataset has columns: "
                f"{schema.names}"
            )

ValueError: You specified the column 'path', but there's no such column in the dataset. The dataset has columns: ['id', 'data', 'uuid']
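
To make the failure mode concrete, here is a minimal pure-Python sketch of the check shown in the traceback. `validate_columns` and the two schema lists are hypothetical stand-ins, not Ray's actual code. The assumption is that the lazy schema omits "path" while the executed blocks contain it.

```python
def validate_columns(columns, schema_names):
    """Hypothetical stand-in for the membership check in the traceback."""
    schema_names_set = set(schema_names)
    for column in columns:
        if column not in schema_names_set:
            raise ValueError(
                f"You specified the column '{column}', but there's no such "
                "column in the dataset. The dataset has columns: "
                f"{schema_names}"
            )


# Column names reported by the lazy schema in this issue.
lazy_schema = ["id", "data", "uuid"]
# Assumption: the columns actually present once the read has executed.
materialized_schema = lazy_schema + ["path"]

# Passes: "path" is present in the executed schema.
validate_columns(["path"], materialized_schema)

# Fails: the lazy schema never learned about "path".
try:
    validate_columns(["path"], lazy_schema)
except ValueError as err:
    print(err)
```

The check itself behaves as intended. The mismatch comes from the schema it is given, which suggests the fix belongs where the lazy schema is inferred rather than in the validation.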

For debugging purposes, the groupby works if either:

  • the column check in SortKey.validate_schema is disabled
  • or the dataset is materialized first with materialize()

Versions / Dependencies

master

Reproduction script

Any dataset:

import ray

ds = ray.data.read_parquet("data/", include_paths=True)
ds.groupby("path").count()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working; but isn't
  • community-backlog
  • data: Ray Data-related issues
  • good-first-issue: Great starter issue for someone just starting to contribute to Ray
  • stability
