Skip to content

Improve S3 Retry Strategy for backup and restore#84421

Merged
jkartseva merged 11 commits intoClickHouse:masterfrom
jkartseva:s3-retry-strategy
Aug 9, 2025
Merged

Improve S3 Retry Strategy for backup and restore#84421
jkartseva merged 11 commits intoClickHouse:masterfrom
jkartseva:s3-retry-strategy

Conversation

@jkartseva
Copy link
Copy Markdown
Member

@jkartseva jkartseva commented Jul 25, 2025

https://github.com/ClickHouse/clickhouse-private/issues/19914

  • Introduce jitter
  • Fix bug where the first retry is attempted without delay
  • Introduce configurable parameters for the S3 retry backoff strategy to tune initial delay, max delay, and jitter factor
  • Introduce a throttler to S3 mocks to inject a "Slow Down" error

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Introduce backup_restore_s3_retry_initial_backoff_ms, backup_restore_s3_retry_max_backoff_ms, backup_restore_s3_retry_jitter_factor to configure the S3 retry backoff strategy used during backup and restore operations.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@jkartseva jkartseva added the can be tested Allows running workflows for external contributors label Jul 25, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jul 25, 2025

Workflow [PR], commit [75396b0]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Jul 25, 2025
@jkartseva jkartseva marked this pull request as draft July 25, 2025 04:48
@jkartseva jkartseva marked this pull request as ready for review July 25, 2025 06:24
@pufit pufit self-assigned this Jul 25, 2025
Copy link
Copy Markdown
Member

@pufit pufit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything besides the failing test looks good.

assert response == "OK", response


class Throttler:
Copy link
Copy Markdown
Member

@CheSema CheSema Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, you are using broken_s3 mock :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it helped emulate the 'Slow Down' error.

@jkartseva jkartseva force-pushed the s3-retry-strategy branch from 730cd32 to cc7db5f Compare July 29, 2025 02:45
@serxa serxa requested a review from Copilot July 31, 2025 11:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the S3 retry strategy for backup and restore operations by introducing jitter, fixing a bug where the first retry had no delay, and adding configurable parameters for the retry backoff strategy.

  • Introduces jitter to the S3 retry strategy to avoid thundering herd problems
  • Fixes a bug where the first retry attempt had no delay
  • Adds configurable parameters for initial delay, max delay, and jitter factor in the S3 retry backoff strategy

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/IO/S3/PocoHTTPClient.h Defines new RetryStrategy struct with configurable parameters
src/IO/S3/PocoHTTPClient.cpp Adds validation for retry strategy parameters
src/IO/S3/Client.h Updates RetryStrategy class to use new configuration
src/IO/S3/Client.cpp Implements improved retry delay calculation with jitter
src/Core/Settings.cpp Adds new settings for retry configuration
src/Core/SettingsChangesHistory.cpp Records new settings in version history
src/Backups/BackupIO_S3.cpp Uses new retry strategy configuration for backups
tests/integration/test_backup_restore_s3/test.py Updates test structure and adds throttling test
tests/integration/helpers/s3_mocks/broken_s3.py Implements S3 throttling mock functionality

"enable_s3_requests_logging": "1",
}

node = cluster.instances["node"]
Copy link

Copilot AI Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable node is assigned twice in this function (lines 936 and 946). The second assignment is redundant and should be removed.

Suggested change
node = cluster.instances["node"]

Copilot uses AI. Check for mistakes.
@jkartseva
Copy link
Copy Markdown
Member Author

@pufit @serxa Please take a look at #84854, which slows down all threads after a retryable error. I'd like to merge it before this PR.

@jkartseva jkartseva enabled auto-merge August 8, 2025 22:10
@jkartseva jkartseva disabled auto-merge August 8, 2025 22:18
@jkartseva jkartseva enabled auto-merge August 8, 2025 22:20
@jkartseva jkartseva added this pull request to the merge queue Aug 9, 2025
Merged via the queue into ClickHouse:master with commit 6c2c666 Aug 9, 2025
244 checks passed
@jkartseva jkartseva deleted the s3-retry-strategy branch August 9, 2025 01:25
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 9, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 11, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 12, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 13, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 14, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 15, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 16, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants