_add_retries_to_file_obj_read_method makes file_obj invalid for pyarrow #7936

@li-yi-dong

Description

Describe the bug

I'm trying to use load_dataset to build a dataset that streams Parquet data from HDFS, like this:

ds = load_dataset(
    "parquet",
    data_files={
        "train": "hdfs://xxx/train*.parquet",
        "test": "hdfs://xxx/test*.parquet"
    },
    streaming=True,
)

I encountered an error:

[Image: screenshot of the error]

In file src/datasets/packaged_modules/parquet/parquet.py,

with open(file, "rb") as f:
    self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))

The open here is replaced with xopen from src/datasets/utils/file_utils.py.

In the function _add_retries_to_file_obj_read_method, when the read attribute is read-only, the original file object is replaced by a bare io.RawIOBase(). Even though the code tries to proxy all other attributes back to the original file object, the result is still unusable for pyarrow.

try:
    file_obj.read = read_with_retries
except AttributeError:  # read-only attribute
    orig_file_obj = file_obj
    file_obj = io.RawIOBase()
    file_obj.read = read_with_retries
    file_obj.__getattr__ = lambda _, attr: getattr(orig_file_obj, attr)
return file_obj

For example, readable() returns True on the original file_obj but False on the new file_obj.
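
The mismatch can be reproduced without HDFS. The sketch below is a minimal stand-in, not the library's code: SlottedReader is a hypothetical class whose read attribute cannot be reassigned, and wrap_read copies only the fallback branch quoted above, with the retry logic left out. Two things appear to go wrong. io.RawIOBase already defines readable() and returns False, so the lookup never falls through to a proxy. And Python looks up __getattr__ on the type rather than the instance, so the lambda assigned on the instance is not consulted at all.

```python
import io


class SlottedReader:
    """Hypothetical stand-in for a file object whose `read` cannot be reassigned."""

    __slots__ = ("_buf", "source")  # no instance __dict__, so methods are read-only

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.source = "hdfs-like"

    def read(self, size: int = -1) -> bytes:
        return self._buf.read(size)

    def readable(self) -> bool:
        return True


def wrap_read(file_obj):
    """Mirror the fallback branch quoted above (retry logic omitted)."""
    orig_read = file_obj.read

    def read_with_retries(*args, **kwargs):
        return orig_read(*args, **kwargs)

    try:
        file_obj.read = read_with_retries
    except AttributeError:  # read-only attribute
        orig_file_obj = file_obj
        file_obj = io.RawIOBase()
        file_obj.read = read_with_retries
        # Stored on the instance, so attribute lookup never calls it.
        file_obj.__getattr__ = lambda _, attr: getattr(orig_file_obj, attr)
    return file_obj


original = SlottedReader(b"abc")
wrapped = wrap_read(original)

print(original.readable())         # True
print(wrapped.readable())          # False: answered by io.RawIOBase itself
print(hasattr(wrapped, "source"))  # False: the instance-level __getattr__ is ignored
print(wrapped.read())              # b'abc': only read() reaches the original
```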

Steps to reproduce the bug

from datasets.utils.file_utils import xopen

f = xopen('hdfs://xxxx.parquet', 'rb')
print(f.readable())  # False, although the underlying HDFS file is readable

Expected behavior

Not sure
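
One possible direction, offered as a sketch rather than a tested patch: wrap the original object in a small proxy class that defines __getattr__ on the class, so everything except read is answered by the original file object. ReadProxy is a hypothetical name, io.BytesIO stands in for the HDFS file, and whether pyarrow accepts such a proxy has not been verified here.

```python
import io


class ReadProxy:
    """Hypothetical proxy: overrides `read`, delegates everything else."""

    def __init__(self, file_obj, read):
        self._file_obj = file_obj
        self.read = read  # e.g. a read function with retries

    def __getattr__(self, name):
        # Called only when normal lookup fails; guard against recursion
        # if `_file_obj` itself is not set yet.
        if name == "_file_obj":
            raise AttributeError(name)
        return getattr(self._file_obj, name)

    # Special methods are looked up on the type, so they need explicit forwarding.
    def __enter__(self):
        self._file_obj.__enter__()
        return self

    def __exit__(self, *exc_info):
        return self._file_obj.__exit__(*exc_info)

    def __iter__(self):
        return iter(self._file_obj)


raw = io.BytesIO(b"abc")
proxy = ReadProxy(raw, raw.read)

print(proxy.readable())  # True, delegated to the BytesIO object
print(proxy.read())      # b'abc'
```

Because the proxy does not inherit from io.RawIOBase, it carries no default readable(), seekable(), or closed of its own that could shadow the original object's answers.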

Environment info

Datasets 4.4.2
