[C++] S3FileSystem always initiates multipart uploads, regardless of input size #40557
Description
Running the following snippet shows that open_output_stream() initiates a multipart upload immediately, before anything is written.
This is quite unexpected: I would expect the buffer_size argument to ensure that a multipart upload is not initiated until at least 1,000 bytes have been written. The problem with the current behaviour is that writing a single byte results in three requests to S3: one to create the multipart upload, one to upload the 1-byte part, and one to complete the multipart upload.
This is very inefficient when writing a small file to S3, where a single PutObject request (without multipart uploading) would suffice; see the sketch at the end of this report. Using background_writes=False and fs.copy_files(...) with a local small file of known size also results in a multipart upload (a reproduction is sketched after the main snippet below).
While this behaviour keeps the implementation simple, it is surprising and I couldn't find it described in the documentation anywhere.
import time
from pyarrow import fs
fs.initialize_s3(fs.S3LogLevel.Debug)
sfs = fs.S3FileSystem()
with sfs.open_output_stream("a_bucket/test", buffer_size=1000):
    time.sleep(10)
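As noted above, the behaviour is not limited to streams of unknown length. Here is a minimal sketch of the copy_files reproduction, assuming a bucket named a_bucket exists and /tmp/small.txt is a small local file (both names are placeholders):

from pyarrow import fs

fs.initialize_s3(fs.S3LogLevel.Debug)
sfs = fs.S3FileSystem(background_writes=False)
local = fs.LocalFileSystem()
# The source file's size is known before any bytes are sent, yet the
# debug log still shows a multipart upload being created, filled, and
# completed rather than a single PutObject.
fs.copy_files("/tmp/small.txt", "a_bucket/small.txt",
              source_filesystem=local,
              destination_filesystem=sfs)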
Component(s)

Python 3.10
Platform: macOS and Linux
Version: 15.0.0
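For comparison, here is a minimal sketch of the request pattern I would expect, written directly against boto3 rather than pyarrow (the 5 MiB threshold mirrors S3's minimum part size; the function name and bucket/key names are illustrative, not an existing API): a single PutObject for small payloads, with multipart used only beyond the threshold.

import boto3

PART_SIZE = 5 * 1024 * 1024  # S3's minimum multipart part size

def put_small_or_multipart(bucket: str, key: str, data: bytes) -> None:
    s3 = boto3.client("s3")
    if len(data) <= PART_SIZE:
        # One request instead of three for small payloads.
        s3.put_object(Bucket=bucket, Key=key, Body=data)
        return
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for number, offset in enumerate(range(0, len(data), PART_SIZE), start=1):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            PartNumber=number, Body=data[offset:offset + PART_SIZE],
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": number})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )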