Implement option to write to a custom buffer during S3 writes #547

Merged
mpenkov merged 10 commits into develop from diskbuffer on Feb 15, 2021
Conversation

@mpenkov (Collaborator) commented Oct 6, 2020

Currently, we buffer in-memory. This works well in most cases, but can
be a problem when performing many uploads in parallel. Each upload
needs to buffer at least 5MB of data (that is the minimum part size for
AWS S3). When performing hundreds or thousands of uploads, memory can
become a bottleneck.

This PR relieves that bottleneck by buffering to temporary files
instead. Each part gets saved to a separate temporary file and discarded
after upload. This reduces memory usage at the cost of increased IO with
the local disk.
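The trade-off described above can be sketched as follows. This is an illustrative toy, not the PR's actual code: `io.BytesIO` and `tempfile.TemporaryFile` stand in for the in-memory and on-disk part buffers, and both expose the same file-like interface, which is what makes swapping them possible.

```python
import io
import tempfile

MIN_PART_SIZE = 5 * 1024 * 1024  # AWS S3 minimum part size for multipart uploads

def make_memory_buffer():
    # Current behavior: each part is accumulated in RAM.
    return io.BytesIO()

def make_disk_buffer():
    # Proposed alternative: accumulate each part in a temporary file,
    # deleted automatically when closed (i.e. after the part is uploaded).
    return tempfile.TemporaryFile()

# Both buffers support the same read/write/seek interface, so an
# uploader can use either one interchangeably.
for factory in (make_memory_buffer, make_disk_buffer):
    buf = factory()
    buf.write(b'part data')
    buf.seek(0)
    assert buf.read() == b'part data'
    buf.close()
```

The disk-backed buffer keeps memory usage bounded regardless of how many uploads are in flight, at the cost of extra local disk IO.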

@piskvorky (Owner) left a comment:

What's the motivation? Who needs thousands of parallel uploads and why?

Seems like a strange obfuscation so -1 on including, unless clearly motivated.

Comment thread smart_open/s3.py Outdated
Comment thread help.txt

File: help.txt
- /home/misha/git/smart_open/smart_open/__init__.py
+ /Users/misha/git/smart_open/smart_open/__init__.py
Contributor:
osx migration, again :)

Collaborator Author:
Migration between home and office :)

Commit: re-use the same buffer in the same writer instance
@mpenkov (Collaborator, Author) commented Oct 7, 2020

> What's the motivation? Who needs thousands of parallel uploads and why?

I'm partitioning a large (100GB, hundreds of millions of lines) dataset in preparation for some data processing. I need each partition to be relatively small, so I'm using a large number of partitions (hundreds/thousands). If I have all the partitions open for writing simultaneously, the partitioning is very simple:

for record in dataset:
    partition_index = calculate_partition(record)
    partitions[partition_index].write(record)

> Seems like a strange obfuscation so -1 on including, unless clearly motivated.

I updated the implementation. It's not really an obfuscation anymore: the custom buffer is now an optional parameter. If callers don't pass it in, the implementation behaves identically to the current version on PyPI.
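For illustration, here is a runnable version of the partitioning loop from the comment above. `calculate_partition`, `NUM_PARTITIONS`, and the in-memory partition streams are hypothetical stand-ins; in real use each partition would be an S3 writer opened via smart_open, optionally with the custom buffer this PR adds so that hundreds of open writers don't each hold 5MB in RAM.

```python
import io

NUM_PARTITIONS = 4  # hypothetical; the comment above uses hundreds/thousands

def calculate_partition(record):
    # Hypothetical partitioning function: hash each record into a bucket.
    return hash(record) % NUM_PARTITIONS

# Stand-ins for the open S3 writers; all partitions are open simultaneously,
# which is exactly the situation where per-writer memory use matters.
partitions = [io.BytesIO() for _ in range(NUM_PARTITIONS)]

dataset = [b'alpha\n', b'beta\n', b'gamma\n', b'delta\n']
for record in dataset:
    partition_index = calculate_partition(record)
    partitions[partition_index].write(record)

# Every record lands in exactly one partition.
total = sum(len(p.getvalue()) for p in partitions)
assert total == sum(len(r) for r in dataset)
```

The appeal of this pattern is its simplicity: one pass over the dataset, no sorting or grouping step, at the cost of keeping every output writer open at once.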

@mpenkov changed the title from "Implement option to buffer to disk for S3 writes" to "Implement option write to a custom buffer S3 writes" on Oct 7, 2020
@mpenkov changed the title from "Implement option write to a custom buffer S3 writes" to "Implement option to write to a custom buffer during S3 writes" on Oct 7, 2020
Comment thread smart_open/s3.py
@mpenkov merged commit 648d198 into develop on Feb 15, 2021
@mpenkov deleted the diskbuffer branch on Feb 15, 2021 at 06:41
Comment thread howto.md