Make connect_timeout configurable in IO #218
Conversation
Thanks @jqin61. This looks great! Shall we update the doc as well?
```python
if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
    config_kwargs["connect_timeout"] = connect_timeout
```
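For context, a minimal sketch of where this setting ends up, assuming `config_kwargs` is forwarded by s3fs to botocore's `Config` (the standalone usage below is hypothetical, not part of this PR):

```python
from s3fs import S3FileSystem

# config_kwargs is passed through to botocore.client.Config,
# where connect_timeout is given in seconds.
fs = S3FileSystem(config_kwargs={"connect_timeout": 60.0})
```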
```python
if proxy_uri := self.properties.get(S3_PROXY_URI):
    client_kwargs["proxy_options"] = proxy_uri

if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
    client_kwargs["connect_timeout"] = connect_timeout
```
This looks good, @jqin61. Thanks for adding this.
One minor thing I noticed: s3fs is looking for an int, while pyarrow expects a double. For s3fs, can you check the type and, if it is not an int, emit a warning and cast it to an int (because we lose precision)?
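A minimal sketch of the requested guard (hypothetical; as the reply below notes, it turned out to be unnecessary). The stand-in values for `S3_CONNECT_TIMEOUT`, `properties`, and `config_kwargs` are assumptions made so the snippet is self-contained:

```python
import warnings

# Stand-ins for context that PyIceberg's FileIO would normally provide.
S3_CONNECT_TIMEOUT = "s3.connect-timeout"
properties = {"s3.connect-timeout": 60.0}
config_kwargs = {}

if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
    if not isinstance(connect_timeout, int):
        # Warn before the lossy cast that was being suggested for s3fs.
        warnings.warn(f"Casting connect_timeout {connect_timeout} to int; precision may be lost")
        connect_timeout = int(connect_timeout)
    config_kwargs["connect_timeout"] = connect_timeout
```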
Hi @Fokko, thank you for the review and comment! I traced the functions underneath the s3fs filesystem initialization call, and according to this doc the input can be either a float or an int. So it looks like we do not need the warning.
Sorry for the confusion caused by changing my GitHub account, and for the inconvenience in approving workflows.
Should we add this to the mkdocs? https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md?plain=1#L76 This will add the config property to the PyIceberg Configuration docs: https://py.iceberg.apache.org/configuration/#s3
mkdocs/docs/configuration.md
| s3.signer          | bearer                   | Configure the signature version of the FileIO.       |
| s3.region          | us-west-2                | Sets the region of the bucket                         |
| s3.proxy-uri       | http://my.proxy.com:8080 | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60                       | Configure socket connection timeout, in seconds.     |
Since it is a decimal, should we suggest this in the docs as well:
```diff
- | s3.connect-timeout | 60   | Configure socket connection timeout, in seconds. |
+ | s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
```
HonahX left a comment:
LGTM, Thanks!
Hi @Fokko! It looks like I do not have permission to squash and merge the pull request. Could you merge it for me? Thank you so much!
* add timeout config
* wrap into a constant
* property format consistency
* linting
* update doc
* make config sample more informative

---------

Co-authored-by: adrianqin <adrianqin2019@gmail.com>
Similarly to #218, we see occasional timeout errors when writing data to S3-compatible object storage:

```
When uploading part for key 'drivestats/data/date_month=2014-08/00000-0-9c7baab5-af18-4558-ae10-1678aa90b6a5.parquet' in bucket 'drivestats-iceberg': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 28, Timeout was reached
```

(I don't believe the issue is specific to the fact that I'm using [Backblaze B2](https://www.backblaze.com/cloud-storage) rather than Amazon S3; I saw references to similar error messages with the latter while researching this issue.)

The issue happens when the underlying `PUT` operation takes longer than the request timeout, which is [set to a default of 3 seconds in the AWS C++ SDK](https://github.com/aws/aws-sdk-cpp/blob/c9eaae91b9eaa77f304a12cd4b15ec5af3e8a726/src/aws-cpp-sdk-core/source/client/ClientConfiguration.cpp#L184) used by Arrow via PyArrow.

The changes in this PR allow configuration of `s3.request-timeout` when working directly or indirectly with `pyiceberg.io.pyarrow.PyArrowFileIO`, just as #218 allowed configuration of `s3.connect-timeout`. For example, when creating a catalog:

```python
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)
```
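A minimal sketch of how the new property could be wired into the PyArrow filesystem construction, assuming it mirrors the `connect_timeout` handling shown earlier in this thread (the `S3_REQUEST_TIMEOUT` constant name and the surrounding dicts are assumptions, not this PR's exact diff):

```python
# Stand-ins for context that PyIceberg's PyArrowFileIO would normally provide.
S3_REQUEST_TIMEOUT = "s3.request-timeout"
properties = {"s3.request-timeout": 5.0}
client_kwargs = {}  # forwarded to pyarrow.fs.S3FileSystem(**client_kwargs)

if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
    # pyarrow.fs.S3FileSystem takes request_timeout as a double, in seconds.
    client_kwargs["request_timeout"] = float(request_timeout)
```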
Hi! We noticed we are running into exceptions thrown due to AWS timeouts when using `pyiceberg.io.pyarrow.PyArrowFileIO`. This is because the connect timeout currently defaults to 1 second (according to https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow-fs-s3filesystem) and is not configurable from the input properties passed to the filesystem. This pull request makes it configurable.
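For illustration, a minimal sketch of setting the new property when loading a catalog (the catalog name and URI below are placeholder values):

```python
from pyiceberg.catalog import load_catalog

# Placeholder catalog name and URI; s3.connect-timeout is the
# property this PR introduces, given in seconds.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://127.0.0.1:8181",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.connect-timeout": 60.0,
    },
)
```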