TarArchiveReader is not functioning with HTTPReader or GDriveReader

This issue was discovered as part of #40. The `TarArchiveReader` implementation appears to have two problems:

1. An error is raised when `TarArchiveReader` is used immediately after `HttpReader`, because the HTTP stream does not support the `seek` operation:

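```
from torchdata.datapipes.iter import HttpReader, IterableWrapper

file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
# The stream produced by HttpReader is an HTTP response, which is not seekable
http_reader_dp = HttpReader(IterableWrapper([file_url]))
tar_dp = http_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")
```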
It raises an error similar to the following:
```
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
```
Currently, you can work around this by downloading the file in advance (or by caching it with `OnDiskCacheHolderIterDataPipe`). In those cases, `TarArchiveReader` works as intended.
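To illustrate why the stream type matters, here is a minimal standard-library sketch (it does not use torchdata). `NonSeekableStream`, `make_tar_gz`, `read_buffered`, and `read_streaming` are hypothetical names made up for this example, and `NonSeekableStream` only imitates an HTTP body. It shows two possible directions, not a proposed patch: buffering the stream in memory before opening it, or opening the archive in `tarfile`'s forward-only stream mode (`"r|*"`).

```python
import io
import tarfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable but not seekable."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        return self._inner.readinto(buffer)


def make_tar_gz(files: dict) -> bytes:
    """Build a small .tar.gz archive in memory for the demonstration."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return out.getvalue()


def read_buffered(stream) -> dict:
    """Option A: copy the whole stream into memory, then open it seekably."""
    seekable = io.BytesIO(stream.read())
    with tarfile.open(fileobj=seekable, mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def read_streaming(stream) -> dict:
    """Option B: open in stream mode, which only reads forward."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        # Each member must be read in order, before advancing to the next one.
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


FILES = {"a.txt": b"hello", "b.txt": b"world"}
ARCHIVE = make_tar_gz(FILES)

if __name__ == "__main__":
    try:
        # Random-access mode calls fileobj.tell(), which this stream rejects.
        tarfile.open(fileobj=NonSeekableStream(ARCHIVE), mode="r:*")
    except io.UnsupportedOperation as exc:
        print("random-access mode failed:", exc)
    print(read_buffered(NonSeekableStream(ARCHIVE)))
    print(read_streaming(NonSeekableStream(ARCHIVE)))
```

Buffering costs memory proportional to the archive size, while stream mode requires that members be consumed strictly in order. Which trade-off suits `TarArchiveReader` is an open question.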

2. `TarArchiveReader` also doesn't work with `GDriveReader`, because of the type of stream it returns:

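```
from torchdata.datapipes.iter import IterableWrapper, OnlineReader
amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()
for fname, stream in tar_dp:  # the TypeError is raised on iteration
    print(f"{fname}: {stream.read()}")
```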
This is because `validate_pathname_binary_tuple` requires the stream to be a `BufferedIOBase`, whereas the reader yields a `urllib3.response.HTTPResponse`. Perhaps it should accept HTTP responses as well?

https://github.com/pytorch/data/blob/85d8bbe235cd58f270c17367a5577de107b0095f/torchdata/datapipes/utils/common.py#L66-L76
```
test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
    for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
    validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
    raise TypeError(
E   TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
```
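One way the check could be relaxed is sketched below. This is not the library's code: `validate_pathname_stream_tuple` and `FakeResponse` are hypothetical names, and the sketch assumes that the HTTP response object derives from `io.IOBase`, which should be verified against the installed urllib3 version.

```python
import io


def validate_pathname_stream_tuple(data) -> None:
    """Hypothetical relaxed check: accept any io.IOBase object with read()."""
    if not isinstance(data, tuple) or len(data) != 2:
        raise TypeError(f"expected a (pathname, stream) tuple, but got {data!r}")
    pathname, stream = data
    if not isinstance(pathname, str):
        raise TypeError(f"pathname should be str, but got {type(pathname)}")
    has_read = callable(getattr(stream, "read", None))
    if not (isinstance(stream, io.IOBase) and has_read):
        raise TypeError(
            f"stream should be a readable io.IOBase object, but got {type(stream)}"
        )


class FakeResponse(io.IOBase):
    """Hypothetical stand-in for an HTTP response; not a BufferedIOBase."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(payload)

    def read(self, size: int = -1) -> bytes:
        return self._inner.read(size)


if __name__ == "__main__":
    # Both a buffered stream and the response-like object pass the check.
    validate_pathname_stream_tuple(("a.tar", io.BytesIO(b"")))
    validate_pathname_stream_tuple(("a.tar", FakeResponse(b"data")))
    try:
        validate_pathname_stream_tuple(("a.tar", "not a stream"))
    except TypeError as exc:
        print("rejected:", exc)
```

Relaxing the type check alone would likely only move the failure: a response stream that passes validation is still not seekable, so the `seek` error from item 1 would presumably appear next.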

cc @VitalyFedyunin @ejguan

