Closed
Description
🐛 Describe the bug
Hi Team,
I am hitting an issue similar to #42.
When reading a large tar file (around 1 GB) from Google Drive with GDriveReader, I get an io.UnsupportedOperation: seek exception.
The test I ran:
from torchdata.datapipes.iter import IterableWrapper, GDriveReader
gdrive_file_url = "https://drive.google.com/uc?export=download&id=1boo_Jg0rwfKhylp0956E1h6SFY4FmzcP"
gdrive_reader_dp = GDriveReader(IterableWrapper([gdrive_file_url]))
dp = gdrive_reader_dp.load_from_tar()
for f, file in dp:
    print(f)
Error:
(venv) abhgaikwad@abhgaikwad-Precision-3561:~/Documents/demo$ python gdrive_loader.py
/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py:72: UserWarning: Unable to extract files from corrupted tarfile stream coco-train2014-seg-000001.tar due to: seek, abort!
warnings.warn(f"Unable to extract files from corrupted tarfile stream {pathname} due to: {e}, abort!")
Traceback (most recent call last):
File "/usr/lib/python3.8/tarfile.py", line 1674, in gzopen
t = cls.taropen(name, mode, fileobj, **kwargs)
File "/usr/lib/python3.8/tarfile.py", line 1651, in taropen
return cls(name, mode, fileobj, **kwargs)
File "/usr/lib/python3.8/tarfile.py", line 1514, in __init__
self.firstmember = self.next()
File "/usr/lib/python3.8/tarfile.py", line 2318, in next
tarinfo = self.tarinfo.fromtarfile(self)
File "/usr/lib/python3.8/tarfile.py", line 1104, in fromtarfile
buf = tarfile.fileobj.read(BLOCKSIZE)
File "/usr/lib/python3.8/gzip.py", line 292, in read
return self._buffer.read(size)
File "/usr/lib/python3.8/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.8/gzip.py", line 479, in read
if not self._read_gzip_header():
File "/usr/lib/python3.8/gzip.py", line 427, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'29')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.8/tarfile.py", line 1603, in open
return func(name, "r", fileobj, **kwargs)
File "/usr/lib/python3.8/tarfile.py", line 1678, in gzopen
raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "gdrive_loader.py", line 10, in <module>
for f, file in dp:
File "/home/abhgaikwad/venv/lib/python3.8/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 150, in wrap_generator
response = gen.send(None)
File "/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py", line 73, in __iter__
raise e
File "/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py", line 61, in __iter__
tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
File "/usr/lib/python3.8/tarfile.py", line 1606, in open
fileobj.seek(saved_pos)
io.UnsupportedOperation: seek
This exception is raised by __iter__ of TarArchiveLoaderIterDataPipe.
Note: The issue only occurs with large files; small files (<100 MB) work fine with the same code.
The Google Drive file: https://drive.google.com/file/d/1boo_Jg0rwfKhylp0956E1h6SFY4FmzcP/view?usp=sharing (I can download it manually, extract it, and view the files in this tar.)
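For reference, the last frame of the traceback (fileobj.seek(saved_pos) inside tarfile.open) can be reproduced without Google Drive. The sketch below is only an illustration of that mechanism, not a diagnosis of why large GDrive downloads differ from small ones. NonSeekableStream, make_tar_bytes, list_members_streaming and list_members_buffered are hypothetical names made up for this example; the stream class is an assumed stand-in for an HTTP response body that can be read but not rewound.

```python
import io
import tarfile
from typing import List


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a network stream: readable, not seekable."""

    def __init__(self, payload: bytes) -> None:
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        return self._inner.readinto(b)

    def tell(self) -> int:
        # Reports the read position, but the stream cannot go back to it.
        return self._inner.tell()

    def seek(self, offset, whence=0):
        raise io.UnsupportedOperation("seek")


def make_tar_bytes() -> bytes:
    """Build a tiny uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members_streaming(stream) -> List[str]:
    """Read sequentially with a stream mode ("r|*"), which never seeks."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        return [member.name for member in tar]


def list_members_buffered(stream) -> List[str]:
    """Copy the whole stream into memory first, then open it seekably."""
    with tarfile.open(fileobj=io.BytesIO(stream.read()), mode="r") as tar:
        return tar.getnames()


if __name__ == "__main__":
    payload = make_tar_bytes()
    try:
        # Mode "r" probes compression formats and rewinds after each miss.
        tarfile.open(fileobj=NonSeekableStream(payload), mode="r")
    except io.UnsupportedOperation as exc:
        print("random-access mode failed:", exc)
    print(list_members_streaming(NonSeekableStream(payload)))
    print(list_members_buffered(NonSeekableStream(payload)))
```

With mode "r", tarfile tries the gzip opener first, and when that fails it calls seek on the file object to rewind, which matches the BadGzipFile followed by seek sequence in the traceback. The stream modes ("r|", "r|*") only read forward, and buffering into io.BytesIO trades memory for a seekable object. Whether either approach fits this GDrive case through load_from_tar is untested here.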
Versions
Environment Details:
PyTorch version: 1.13.0.dev20220712+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-1044-oem-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] torch==1.13.0.dev20220712+cpu
[pip3] torchdata==0.5.0a0+5dadbca