[C++] S3Filesystem always initiates multipart uploads, regardless of input size #40557

@orf

Description

Describe the bug, including details regarding any error messages, version, and platform.

Running the following snippet shows that open_output_stream() initiates a multipart upload immediately, before anything is written.

This is quite unexpected: I would expect the buffer_size argument to ensure that a multipart upload is not initiated until at least 1,000 bytes are written. The issue with the current behaviour is that writing a single byte results in three requests to S3: one to create the multipart upload, one to upload the 1-byte part, and one to complete the multipart upload.

This is very inefficient if you are writing a small file to S3, where a single PutObject request (without multipart uploading) would suffice. Using background_writes=False and fs.copy_files(...) with a local, "known-sized" small file also results in a multipart upload.

While this behaviour keeps the implementation simple, it is surprising and I couldn't find it described in the documentation anywhere.

```python
import time

from pyarrow import fs

# Enable S3 debug logging so the requests are visible on stderr.
fs.initialize_s3(fs.S3LogLevel.Debug)

sfs = fs.S3FileSystem()
# The CreateMultipartUpload request appears immediately on open,
# before a single byte has been written.
with sfs.open_output_stream("a_bucket/test", buffer_size=1000):
    time.sleep(10)
```
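For illustration, here is a hypothetical sketch of the size-aware behaviour I would expect, not anything in pyarrow's API: the writer buffers up to the threshold and only falls back to a multipart upload once the buffer overflows. The plan_upload helper and the request names are illustrative stand-ins for the S3 operations involved.

```python
# Hypothetical sketch (not pyarrow code): decide the upload strategy
# from the payload size, analogous to the buffer_size argument above.
MULTIPART_THRESHOLD = 1000

def plan_upload(payload: bytes, threshold: int = MULTIPART_THRESHOLD) -> list[str]:
    """Return the S3 requests a size-aware writer could issue."""
    if len(payload) <= threshold:
        # Small object: a single PutObject request suffices.
        return ["PutObject"]
    # Large object: split into parts no bigger than the threshold.
    n_parts = -(-len(payload) // threshold)  # ceiling division
    return (
        ["CreateMultipartUpload"]
        + [f"UploadPart({i + 1})" for i in range(n_parts)]
        + ["CompleteMultipartUpload"]
    )

print(plan_upload(b"x"))         # 1-byte object: one request
print(plan_upload(b"x" * 2500))  # 2,500 bytes: multipart with 3 parts
```

Under this scheme the 1-byte write in the snippet above would cost one request instead of three.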

Component(s)

Python 3.10

Platform: MacOS and Linux

Version: 15.0.0
