
[Data] read_parquet triggers a serialization error with filesystem=HfFileSystem #59029

@owenowenisme


What happened + What you expected to happen

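A serialization error is raised because of this huggingface_hub issue: huggingface/huggingface_hub#3576. The traceback shows deepcopy of _repo_and_revision_exists_cache failing while it reconstructs an HfHubHTTPError.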

> python test_batch.py
...
    "_repo_and_revision_exists_cache": deepcopy(self._repo_and_revision_exists_cache),
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/youchenglin/miniconda3/envs/myenv/lib/python3.10/copy.py", line 265, in _reconstruct
    y = func(*args)
TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'
...
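
The failure mode in the traceback can be reproduced without Ray or huggingface_hub. The sketch below is an illustration under stated assumptions, not the library's code: DemoHTTPError, CopyableHTTPError, _rebuild, and the cache layout are hypothetical stand-ins. It shows that copy.deepcopy rebuilds an exception by calling its class with only the positional args, so a required keyword-only argument such as response is lost.

```python
import copy


class DemoHTTPError(Exception):
    """Hypothetical stand-in for an exception with a required keyword-only argument."""

    def __init__(self, message, *, response):
        super().__init__(message)  # only `message` ends up in self.args
        self.response = response


def _rebuild(cls, message, response):
    # Helper that passes `response` by keyword when the copy is reconstructed.
    return cls(message, response=response)


class CopyableHTTPError(DemoHTTPError):
    """Hypothetical variant that tells copy/pickle how to rebuild itself."""

    def __reduce__(self):
        return (_rebuild, (type(self), self.args[0], self.response))


def try_deepcopy(obj):
    """Return the TypeError message if deepcopy fails, else None."""
    try:
        copy.deepcopy(obj)
    except TypeError as exc:
        return str(exc)
    return None


# Assumed layout: a dict whose tuple values hold an exception instance.
broken_cache = {("datasets/demo", None): (False, DemoHTTPError("not found", response=object()))}
fixed_cache = {("datasets/demo", None): (False, CopyableHTTPError("not found", response=object()))}

error_message = try_deepcopy(broken_cache)
print("broken:", error_message)
print("fixed:", try_deepcopy(fixed_cache))
```

Overriding __reduce__ is only one way to make such an exception copyable. Whether the upstream issue is resolved this way is not stated here.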

Versions / Dependencies

  • ray: master branch
  • huggingface_hub: 1.1.5

Reproduction script

import ray
from huggingface_hub import HfFileSystem

hf_fs = HfFileSystem(token="YOUR_HF_TOKEN")
dataset_url = "hf://datasets/rotten_tomatoes"

# Passing the HfFileSystem instance explicitly is what triggers the error.
ds = ray.data.read_parquet(
    dataset_url,
    file_extensions=["parquet"],
    filesystem=hf_fs,
)
ds.count()

Issue Severity

High: It blocks me from completing my task.

Labels: bug, data
