Closed
Description
This issue was discovered as part of #40. The TarArchiveReader implementation is likely wrong:
- An error is raised when we attempt to use TarArchiveReader immediately after HttpReader, because the HTTP stream does not support the seek operation:
from torchdata.datapipes.iter import HttpReader, IterableWrapper

file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
http_reader_dp = HttpReader(IterableWrapper([file_url]))
tar_dp = http_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")
It returns an error that looks something like this:
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
Currently, you can work around this by downloading the file in advance (or caching it with OnDiskCacheHolderIterDataPipe). In those cases, TarArchiveReader works as intended.
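To illustrate why a seekable copy helps, here is a minimal standard-library sketch that does not use torchdata at all. NonSeekableStream, build_tar_gz, fails_without_seek and read_after_buffering are hypothetical names made up for this example; the stream class only imitates an HTTP body that can be read but not seeked. It assumes the archive is small enough to hold in memory.

```python
import io
import tarfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable, but not seekable."""

    def __init__(self, payload: bytes) -> None:
        self._buffer = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        # RawIOBase.read() and readall() are built on top of readinto().
        return self._buffer.readinto(b)


def build_tar_gz() -> bytes:
    """Create a tiny .tar.gz archive in memory to use as test data."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def fails_without_seek(payload: bytes) -> bool:
    """Return True if opening the non-seekable stream with mode 'r:*' fails."""
    try:
        tarfile.open(fileobj=NonSeekableStream(payload), mode="r:*")
    except io.UnsupportedOperation:
        # tarfile calls fileobj.tell() while probing the compression type.
        return True
    return False


def read_after_buffering(payload: bytes) -> dict:
    """Copy the whole stream into a seekable BytesIO, then read the archive."""
    seekable = io.BytesIO(NonSeekableStream(payload).read())
    with tarfile.open(fileobj=seekable, mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


if __name__ == "__main__":
    archive = build_tar_gz()
    print("direct open fails:", fails_without_seek(archive))
    print("after buffering:", read_after_buffering(archive))
```

Buffering trades memory for seekability, which is the same idea as downloading or caching the file first; for large archives an on-disk copy is likely the better choice.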
- TarArchiveReader also doesn't work with GDriveReader because of the return type:
amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()
This is because validate_pathname_binary_tuple requires a BufferedIOBase. Perhaps it should accept an HTTP response as well?
test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
raise TypeError(
E TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
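One possible direction, sketched below, is to relax the type check from io.BufferedIOBase to io.IOBase. This is not torchdata's actual validator: validate_pathname_stream_tuple and FakeHTTPResponse are hypothetical names for illustration, and the sketch assumes the response object derives from io.IOBase without being a BufferedIOBase.

```python
import io
from typing import Any, Tuple, Type


class FakeHTTPResponse(io.IOBase):
    """Hypothetical stand-in for an HTTP response: an IOBase, not a BufferedIOBase."""

    def __init__(self, payload: bytes) -> None:
        self._payload = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def read(self, size: int = -1) -> bytes:
        return self._payload.read(size)


def validate_pathname_stream_tuple(data: Tuple[Any, Any], stream_type: Type) -> None:
    """Check that data is a (pathname, stream) pair whose stream is a stream_type.

    Hypothetical helper: the stream_type parameter exists only so the strict
    and relaxed checks can be compared side by side.
    """
    if not isinstance(data, tuple) or len(data) != 2:
        raise TypeError("expected a (pathname, stream) tuple")
    pathname, stream = data
    if not isinstance(pathname, str):
        raise TypeError(f"pathname should be str, but got {type(pathname)}")
    if not isinstance(stream, stream_type):
        raise TypeError(
            f"stream should be based on {stream_type.__name__}, but got {type(stream)}"
        )


if __name__ == "__main__":
    item = ("archive.tar", FakeHTTPResponse(b"payload"))

    # Strict check, as described in the issue: rejects the response object.
    try:
        validate_pathname_stream_tuple(item, io.BufferedIOBase)
    except TypeError as exc:
        print("strict check rejected it:", exc)

    # Relaxed check: accepts any IOBase, including the response stand-in.
    validate_pathname_stream_tuple(item, io.IOBase)
    print("relaxed check accepted it")
```

Relaxing the check only removes the TypeError. A response stream that cannot seek would still hit the seek problem described in the first item unless it is buffered or cached first.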