TarArchiveReader is not functioning with GDriveReader for BIG files (~1GB) #650

@gaikwadabhishek

🐛 Describe the bug

Hi Team,

I am hitting an issue similar to #42.

When reading a large tar file (around 1 GB) through GDriveReader, I get an io.UnsupportedOperation: seek exception.

The test I ran:


from torchdata.datapipes.iter import IterableWrapper, GDriveReader

gdrive_file_url = "https://drive.google.com/uc?export=download&id=1boo_Jg0rwfKhylp0956E1h6SFY4FmzcP"
gdrive_reader_dp = GDriveReader(IterableWrapper([gdrive_file_url]))

dp = gdrive_reader_dp.load_from_tar() 

for f, file in dp: 
    print(f)

Error:

(venv) abhgaikwad@abhgaikwad-Precision-3561:~/Documents/demo$ python gdrive_loader.py 
/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py:72: UserWarning: Unable to extract files from corrupted tarfile stream coco-train2014-seg-000001.tar due to: seek, abort!
  warnings.warn(f"Unable to extract files from corrupted tarfile stream {pathname} due to: {e}, abort!")
Traceback (most recent call last):
  File "/usr/lib/python3.8/tarfile.py", line 1674, in gzopen
    t = cls.taropen(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1651, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1514, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.8/tarfile.py", line 2318, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.8/tarfile.py", line 1104, in fromtarfile
    buf = tarfile.fileobj.read(BLOCKSIZE)
  File "/usr/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/usr/lib/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'29')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/tarfile.py", line 1603, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1678, in gzopen
    raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gdrive_loader.py", line 10, in <module>
    for f, file in dp:
  File "/home/abhgaikwad/venv/lib/python3.8/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 150, in wrap_generator
    response = gen.send(None)
  File "/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py", line 73, in __iter__
    raise e
  File "/home/abhgaikwad/Documents/Projects/torch/data/torchdata/datapipes/iter/util/tararchiveloader.py", line 61, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/usr/lib/python3.8/tarfile.py", line 1606, in open
    fileobj.seek(saved_pos)
io.UnsupportedOperation: seek
This exception is thrown by __iter__ of TarArchiveLoaderIterDataPipe()

Note: This issue occurs only with large files; small files (<100 MB) work fine with the same code.
The Google Drive file: https://drive.google.com/file/d/1boo_Jg0rwfKhylp0956E1h6SFY4FmzcP/view?usp=sharing (I can download it manually, extract it, and view the files in this tar.)
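
For reference, the traceback shows `tarfile.open` probing the stream as gzip, failing, and then calling `fileobj.seek(saved_pos)` to rewind before trying the next format. Below is a minimal standalone sketch of that behavior, not a fix for torchdata itself. `NonSeekableStream` is a hypothetical stand-in for the HTTP response body (it supports `read()` and `tell()` but not `seek()`), and the in-memory archive is made up for illustration. It also shows that tarfile's stream mode `"r|"` reads the same data without seeking. Whether passing a stream mode through `load_from_tar` resolves the GDrive case is an assumption I have not verified.

```python
import io
import tarfile


class NonSeekableStream:
    """Hypothetical stand-in for an HTTP body: read() and tell() work, seek() does not."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._buf.read(size)

    def tell(self) -> int:
        return self._buf.tell()

    def seekable(self) -> bool:
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")


def make_tar_bytes() -> bytes:
    """Build a small uncompressed tar archive in memory (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def default_mode_fails(data: bytes) -> bool:
    """Return True if opening with mode "r:*" raises io.UnsupportedOperation."""
    try:
        # "r:*" tries compressed formats first and seeks back after each failed attempt.
        tarfile.open(fileobj=NonSeekableStream(data), mode="r:*")
    except io.UnsupportedOperation:
        return True
    return False


def read_in_stream_mode(data: bytes):
    """Read members sequentially with stream mode "r|", which never seeks."""
    with tarfile.open(fileobj=NonSeekableStream(data), mode="r|") as tar:
        # In stream mode each member must be read before advancing to the next one.
        return [(member.name, tar.extractfile(member).read()) for member in tar]


if __name__ == "__main__":
    data = make_tar_bytes()
    print(default_mode_fails(data))
    print(read_in_stream_mode(data))
```

Stream mode only allows one forward pass over the archive, so each member has to be consumed as it is yielded.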

Versions

Environment Details:

PyTorch version: 1.13.0.dev20220712+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-1044-oem-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] torch==1.13.0.dev20220712+cpu
[pip3] torchdata==0.5.0a0+5dadbca
