Unable to open a file in GCS #36993

@Fokko

Describe the bug, including details regarding any error messages, version, and platform.

I'm writing integration tests against a local GCS instance using fake-gcs-server. Writing a file and fetching its file info both succeed, but opening the file for reading fails:

➜  python git:(fd-gcs) ✗ ipython
Python 3.9.17 (main, Jun 20 2023, 18:00:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pyarrow.fs import GcsFileSystem
   ...: from datetime import datetime
   ...: 
   ...: fs = GcsFileSystem(
   ...:   access_token='anon',
   ...:   credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
   ...:   scheme='http',
   ...:   endpoint_override='0.0.0.0:4443'
   ...: )

In [2]: location = 'warehouse/vo.txt'
   ...: 
   ...: with fs.open_output_stream(location) as f:
   ...:   print(f.write(b"foo"))
3

In [3]: print(fs.get_file_info(location))
<FileInfo for 'warehouse/vo.txt': type=FileType.File, size=3>

In [4]: with fs.open_input_file(location) as f:
   ...:   print(f.read())
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 with fs.open_input_file(location) as f:
      2   print(f.read())

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/_fs.pyx:763, in pyarrow._fs.FileSystem.open_input_file()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:113, in pyarrow.lib.check_status()

FileNotFoundError: [Errno 2] google::cloud::Status(NOT_FOUND: Permanent error in Read(): ). Detail: [errno 2] No such file or directory

The failure can be reproduced with:

from pyarrow.fs import GcsFileSystem
from datetime import datetime

fs = GcsFileSystem(
  access_token='anon',
  credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
  scheme='http',
  endpoint_override='0.0.0.0:4443'
)

location = 'warehouse/vo.txt'

with fs.open_output_stream(location) as f:
  print(f.write(b"foo"))

print(fs.get_file_info(location))

with fs.open_input_file(location) as f:
  print(f.read())

Server-side log of the calls made by PyArrow in the failing case:

time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o?prefix=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt%2F&pageToken= HTTP/1.1\" 200 27"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 335"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"PUT /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt&upload_id=43a8ec7bc33a15592b750fc916790750 HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"

The last call is the one returning the 404, and its path seems to be missing the /storage/v1/b/ prefix.
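To make the odd request easier to spot in a longer log, the lines above can be filtered for failed requests whose path does not start with one of the prefixes seen on the successful calls. This is only a minimal sketch: the function name `unprefixed_failures` and the regular expression are illustrative assumptions based on the log format shown here, not part of PyArrow or fake-gcs-server.

```python
import re

# Pulls method, path, and status out of the quoted request in an access-log line.
# The optional backslash accounts for the escaped quotes inside the msg="..." field.
REQUEST_RE = re.compile(
    r'\\?"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+\\?" (?P<status>\d{3})'
)

# Path prefixes seen on the successful calls in the logs above.
JSON_API_PREFIXES = ("/storage/v1/b/", "/upload/storage/v1/b/")


def unprefixed_failures(log_lines):
    """Return (method, path, status) for failed requests outside the known prefixes."""
    hits = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not an access-log line
        status = int(match["status"])
        path = match["path"]
        if status >= 400 and not path.startswith(JSON_API_PREFIXES):
            hits.append((match["method"], path, status))
    return hits
```

Run over the PyArrow log above, this should single out only the final `GET /warehouse/...` request; the earlier 404 on `/storage/v1/b/warehouse/o/...` is skipped because it carries the expected prefix.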

The calls made by the equivalent code using GCSSpec:

time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o?delimiter=/&prefix=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt/ HTTP/1.1\" 200 27"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable HTTP/1.1\" 200 335"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt&upload_id=2b6f8d48acf8dd87cc86d1e51bd3120e HTTP/1.1\" 200 570"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 200 570"

This only seems to happen when endpoint_override is set.
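One way to cross-check that the object itself is readable, independent of PyArrow, is to request it through the same /storage/v1/b/ path that the successful calls use. The sketch below assumes the emulator honors the standard JSON API `alt=media` download parameter, which has not been verified here; the helper names `json_api_media_url` and `read_via_json_api` are hypothetical, and the endpoint and bucket values are the ones from the reproduction script.

```python
from urllib.parse import quote
from urllib.request import urlopen


def json_api_media_url(endpoint, bucket, object_name, scheme="http"):
    """Build a JSON API download URL for an object.

    The object name is percent-encoded as a single path segment,
    so any "/" inside it becomes "%2F".
    """
    encoded = quote(object_name, safe="")
    return f"{scheme}://{endpoint}/storage/v1/b/{bucket}/o/{encoded}?alt=media"


def read_via_json_api(endpoint, bucket, object_name):
    """Fetch the object bytes straight from the server, bypassing PyArrow.

    Assumption: the server accepts unauthenticated requests, as a local
    emulator typically does.
    """
    with urlopen(json_api_media_url(endpoint, bucket, object_name)) as response:
        return response.read()
```

Calling `read_via_json_api("0.0.0.0:4443", "warehouse", "vo.txt")` after the write step would show whether the bytes can be fetched through the prefixed path; if they can, the problem is narrowed to the URL that the read call builds when endpoint_override is set.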

Component(s)

Python
