
Conversation

jqin61 commented Dec 15, 2023

Hi! We noticed we are running into exceptions caused by AWS timeouts when using `pyiceberg.io.pyarrow.PyArrowFileIO`:

```
OSError: When reading information for key 'table/metadata/b7f0bbf9-594a-47a8-aeca-961243c3a893-m0.avro' in bucket 'iceberg-warehouse': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
```

This happens because `connect_timeout` currently defaults to 1 second (see https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow-fs-s3filesystem) and is not configurable via the properties passed to the filesystem. This pull request makes it configurable.
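As a preview of the intended usage, here is a minimal illustrative sketch of passing the new property through `load_catalog`; the catalog name, URI, and timeout value are placeholders, mirroring the `s3.request-timeout` example quoted later in this thread:

```python
from pyiceberg.catalog import load_catalog

# Placeholder catalog name and URI; the point here is the s3.connect-timeout property.
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.connect-timeout": 20.0,  # seconds; overrides the 1-second default
    }
)
```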

jqin61 marked this pull request as draft December 15, 2023 18:28
jqin61 marked this pull request as ready for review December 15, 2023 18:32
HonahX commented Dec 16, 2023

Thanks @jqin61, this looks great! Shall we update the docs as well?
https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md?plain=1#L67-L76
That way people can discover this new configuration more easily.

Comment on lines +131 to +132

```python
if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
    config_kwargs["connect_timeout"] = connect_timeout
```
```python
if proxy_uri := self.properties.get(S3_PROXY_URI):
    client_kwargs["proxy_options"] = proxy_uri

if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
    client_kwargs["connect_timeout"] = connect_timeout  # closing line inferred from the parallel s3fs branch above
```

Fokko left a comment

This looks good, @jqin61. Thanks for adding this.

One minor thing I noticed is that s3fs is looking for an int, while pyarrow expects a double. Can you check the type for s3fs, and if it is not an int, emit a warning and cast it to an int (because we lose precision)?
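For illustration only, the suggested check might look like the sketch below; `_coerce_connect_timeout` is a hypothetical helper name, and as the next comment shows, the cast turned out to be unnecessary:

```python
import warnings
from typing import Union

def _coerce_connect_timeout(value: Union[int, float]) -> int:
    # Hypothetical helper: if the configured timeout is not an int,
    # warn about the precision loss and truncate it for s3fs.
    if not isinstance(value, int):
        warnings.warn(f"Casting s3.connect-timeout {value} to int; precision is lost")
        value = int(value)
    return value
```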

jqin61 commented Dec 18, 2023

> This looks good, @jqin61. Thanks for adding this.
>
> One minor thing I noticed is that s3fs is looking for an int, while pyarrow expects a double. Can you check the type for s3fs, and if it is not an int, emit a warning and cast it to an int (because we lose precision)?

Hi @Fokko, thank you for the review and comment! I traced the functions underneath the s3fs filesystem initialization call, and it seems the input can be a float or an int, per this doc.
I also ran a quick test to verify:

```python
>>> from s3fs import S3FileSystem
>>> S3FileSystem(config_kwargs=dict(connect_timeout=12.2)).config_kwargs
{'connect_timeout': 12.2}
```

So it looks like we do not need the warning. Thank you!

jqin61 commented Dec 18, 2023

Sorry for the confusion caused by changing my GitHub account, and for the inconvenience this created in approving workflows.

sungwy commented Dec 18, 2023

Should we add this to the mkdocs? https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md?plain=1#L76

This will add the config property to the PyIceberg configuration docs: https://py.iceberg.apache.org/configuration/#s3

Fokko commented Dec 18, 2023

@jqin61 Thanks for verifying! That's awesome; the fewer warnings, the better 👍 I agree with the comments by @HonahX and @syun64 that we should add the property to the docs; we want to make sure that people can find this functionality.

```
| s3.signer          | bearer                   | Configure the signature version of the FileIO.       |
| s3.region          | us-west-2                | Sets the region of the bucket                        |
| s3.proxy-uri       | http://my.proxy.com:8080 | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60                       | Configure socket connection timeout, in seconds.     |
```

Since it is a decimal, should we suggest this in the docs as well?

Suggested change:

```diff
-| s3.connect-timeout | 60   | Configure socket connection timeout, in seconds. |
+| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
```

HonahX left a comment

LGTM, Thanks!

jqin61 commented Dec 19, 2023

Hi @Fokko! It looks like I do not have permission to squash and merge the pull request. Could you merge it for me? Thank you so much!

Fokko merged commit bf68107 into apache:main Dec 19, 2023
Fokko commented Dec 19, 2023

Thanks for adding this @jqin61 and @HonahX for the quick review 🙌

sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Jan 13, 2024
* add timeout config

* wrap into a constant

* property format consistency

* linting

* update doc

* make config sample more informative

---------

Co-authored-by: adrianqin <adrianqin2019@gmail.com>
Fokko pushed a commit that referenced this pull request Jan 28, 2025
Similarly to #218, we see occasional timeout errors when writing data to
S3-compatible object storage:

```
When uploading part for key 'drivestats/data/date_month=2014-08/00000-0-9c7baab5-af18-4558-ae10-1678aa90b6a5.parquet' in bucket 'drivestats-iceberg': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 28, Timeout was reached
```

[I don't believe the issue is specific to the fact that I'm using
[Backblaze B2](https://www.backblaze.com/cloud-storage) rather than
Amazon S3 - I saw references to similar error messages with the latter
as I was researching this issue.]

The issue happens when the underlying `PUT` operation takes longer than
the request timeout, which is [set to a default of 3 seconds in the AWS
C++
SDK](https://github.com/aws/aws-sdk-cpp/blob/c9eaae91b9eaa77f304a12cd4b15ec5af3e8a726/src/aws-cpp-sdk-core/source/client/ClientConfiguration.cpp#L184)
used by Arrow via PyArrow.
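At the PyArrow level, both timeouts are ordinary keyword arguments on `pyarrow.fs.S3FileSystem`; the values below are illustrative only:

```python
from pyarrow.fs import S3FileSystem

# Illustrative values; both parameters are in seconds.
fs = S3FileSystem(
    connect_timeout=20.0,  # socket connection timeout (made configurable in #218)
    request_timeout=5.0,   # per-request timeout, ~3 s by default in the AWS C++ SDK
)
```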

The changes in this PR allow configuration of `s3.request-timeout` when working directly or indirectly with `pyiceberg.io.pyarrow.PyArrowFileIO`, just as #218 allowed configuration of `s3.connect-timeout`.

For example, when creating a catalog:

```python
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)
```