TarArchiveReader is not functioning with HTTPReader or GDriveReader #42

@NivekT

This issue was discovered as part of #40. The TarArchiveReader implementation is likely wrong:

  1. An error is raised when TarArchiveReader is used immediately after HTTPReader, because the HTTP stream does not support the seek operation:
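from torchdata.datapipes.iter import HttpReader, IterableWrapper

file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
http_reader_dp = HttpReader(IterableWrapper([file_url]))
tar_dp = http_reader_dp.read_from_tar()
# Iterating opens the HTTP stream and parses the archive; the error is raised here.
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")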

It raises an error like this:

Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek

Currently, you can work around this by downloading the file in advance (or by caching it with OnDiskCacheHolderIterDataPipe). In both cases the stream handed to TarArchiveReader is a seekable local file, and it works as intended.
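
For context on the seek failure: tarfile's default read mode calls tell() and seek() on the file object, while its streaming modes (for example "r|gz") only call read(). The following is a minimal standard-library sketch, not TarArchiveReader's code. NonSeekableStream, make_tar_gz, and read_members_streaming are hypothetical names, with NonSeekableStream standing in for an HTTP body. Whether the datapipe should use a streaming mode is an open design question.

```python
import io
import tarfile


def make_tar_gz(files):
    """Build a small .tar.gz archive in memory from a {name: bytes} dict."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable, but not seekable."""

    def __init__(self, payload):
        super().__init__()
        self._source = io.BytesIO(payload)

    def readable(self):
        return True

    def readinto(self, target):
        chunk = self._source.read(len(target))
        target[: len(chunk)] = chunk
        return len(chunk)


def read_members_streaming(stream):
    """Read every regular file from a gzip-compressed tar in streaming mode."""
    contents = {}
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents


archive = make_tar_gz({"a.txt": b"hello", "b.txt": b"world"})

# Default mode probes the stream position, which a non-seekable stream rejects.
try:
    tarfile.open(fileobj=NonSeekableStream(archive), mode="r:*")
    default_mode_error = None
except io.UnsupportedOperation as error:
    default_mode_error = error

# Streaming mode reads strictly forward, so the same kind of stream works.
members = read_members_streaming(NonSeekableStream(archive))
```

Streaming mode has a cost: members must be consumed in archive order, and each member's data has to be read before moving on to the next one.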

  2. TarArchiveReader also doesn't work with GDriveReader because of the return type:
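from torchdata.datapipes.iter import IterableWrapper, OnlineReader
amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")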

This is because validate_pathname_binary_tuple requires a BufferedIOBase, while the reader yields a urllib3 HTTPResponse. Perhaps it should accept an HTTP response as well?

https://github.com/pytorch/data/blob/85d8bbe235cd58f270c17367a5577de107b0095f/torchdata/datapipes/utils/common.py#L66-L76

test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
    for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
    validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
    raise TypeError(
E   TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
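
One way to sidestep both failures without changing the validator is to copy the response body into an io.BytesIO, which is a BufferedIOBase subclass and supports seek. The sketch below uses only the standard library. to_seekable_buffer and FakeResponse are hypothetical names, and the isinstance check only mirrors what the error message suggests validate_pathname_binary_tuple tests. The whole archive is held in memory, so this only suits files that fit in RAM.

```python
import io
import shutil
import tarfile


def to_seekable_buffer(stream, chunk_size=64 * 1024):
    """Copy any object with a read(size) method into a seekable BytesIO."""
    buffer = io.BytesIO()
    shutil.copyfileobj(stream, buffer, chunk_size)
    buffer.seek(0)  # rewind so the consumer starts at the first byte
    return buffer


class FakeResponse:
    """Hypothetical stand-in for an HTTP response: only read() is available."""

    def __init__(self, payload):
        self._source = io.BytesIO(payload)

    def read(self, size=-1):
        return self._source.read(size)


# Build a tiny uncompressed archive to play the role of the downloaded file.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    payload = b"5,great\n"
    info = tarfile.TarInfo(name="review.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

response = FakeResponse(raw.getvalue())
is_buffered_before = isinstance(response, io.BufferedIOBase)

buffered = to_seekable_buffer(response)
is_buffered_after = isinstance(buffered, io.BufferedIOBase)

# The default, seek-based mode now works because BytesIO supports tell/seek.
with tarfile.open(fileobj=buffered, mode="r:*") as tar:
    names = tar.getnames()
```

This is effectively the "download in advance" workaround done in memory rather than on disk.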

cc @VitalyFedyunin @ejguan
