Skip to content

[s3] Optimize forward seeks within buffered data to avoid redundant GET#892

Merged
ddelange merged 1 commit intodevelopfrom
optimize-seek
Oct 11, 2025
Merged

[s3] Optimize forward seeks within buffered data to avoid redundant GET#892
ddelange merged 1 commit intodevelopfrom
optimize-seek

Conversation

@ddelange
Copy link
Collaborator

@ddelange ddelange commented Oct 11, 2025

Please pick a concise, informative and complete title for your PR.

The title is important because it will appear in our change log.

Motivation

Please explain the motivation behind this PR.

If you're fixing a bug, link to the issue using a supported keyword like "Fixes #{issue_number}".

If you're adding a new feature, then consider opening a ticket and discussing it with the maintainers before you actually do the hard work.

Ref #712, fix #622, fix #742

Before this change, any seek operation would close the current HTTP connection and open a new GET request to S3, even when seeking forward by a small amount within already-buffered data.

This commit adds an optimization to Reader.seek() that checks if a forward seek can be satisfied from the existing buffer. If the target position falls within the currently buffered range, the implementation simply advances the buffer position without making a new S3 request.

Backward seeks and forward seeks beyond the buffer still require new GET requests as before.

Tests

If you're fixing a bug, consider test-driven development:

  1. Create a unit test that demonstrates the bug. The test should fail.
  2. Implement your bug fix.
  3. The test you created should now pass.

If you're implementing a new feature, include unit tests for it.

Make sure all existing unit tests pass.
You can run them locally using:

pytest tests

If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

Work in progress

If you're still working on your PR, mark the PR as draft PR.

We'll skip reviewing it for the time being.

Once it's ready, mark the PR as "ready for review", and ping one of the maintainers (e.g. mpenkov).

Checklist

Before you mark the PR as "ready for review", please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Run python update_helptext.py in case there are API changes
  • Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.

Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

@ddelange ddelange force-pushed the optimize-seek branch 4 times, most recently from 6c5e136 to 0aec1d7 Compare October 11, 2025 19:57
@ddelange ddelange marked this pull request as ready for review October 11, 2025 20:04
@ddelange ddelange changed the title Optimize forward seeks within buffered data to avoid redundant GET [s3] Optimize forward seeks within buffered data to avoid redundant GET Oct 11, 2025
@ddelange ddelange merged commit 4b07d25 into develop Oct 11, 2025
6 checks passed
ddelange added a commit that referenced this pull request Oct 11, 2025
…o chunked_s3

* 'develop' of https://github.com/piskvorky/smart_open:
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
@piskvorky piskvorky deleted the optimize-seek branch October 12, 2025 05:16
ddelange added a commit that referenced this pull request Oct 12, 2025
…o fix-ssh

* 'develop' of https://github.com/piskvorky/smart_open: (66 commits)
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
  Simplify CI, use uv (#890)
  [s3] Improve handling of InvalidRange and seek on empty file (#889)
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
  Remove travis leftover (#881)
  Disambiguate URI examples in README.rst (#879)
  Update CHANGELOG.md
  Add .xz and increase performance of compression module (#875)
  Bump pypa/gh-action-pypi-publish in /.github/workflows (#878)
  Bump actions/checkout from 4 to 5 in the github-actions group (#877)
  Fix release.sh for the final merge back into develop (#872)
  Update CHANGELOG.md
  Drop 3.7 support in pyproject.toml (#871)
  Fix CI badge (#869)
  Bump softprops/action-gh-release in the github-actions group (#867)
  Fix release.sh merge message and final merge (#868)
  Update CHANGELOG.md
  ...
ddelange added a commit that referenced this pull request Oct 20, 2025
* develop:
  Update CHANGELOG.md
  Use compression.zstd (PEP-784) (#895)
  Drop python 3.8, add python 3.14 (#896)
  [s3] Add range_chunk_size param to read using multiple GET requests (#887)
  Run tests in parallel (#893)
  Optimize forward seeks within buffered data to avoid redundant GET (#892)
  Add macos to CI (#891)
  Simplify CI, use uv (#890)
  [s3] Improve handling of InvalidRange and seek on empty file (#889)
  Protect against hanging tests (#888)
  Bump the github-actions group with 2 updates (#886)
  build: fix invalid `fallback_version` when builing with `uv` (#884)
  Remove travis leftover (#881)
  Disambiguate URI examples in README.rst (#879)
@github-actions
Copy link

Released v7.4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

slow .zip uncompression between s3 Random seeking from S3 is very slow

1 participant