[segment replication] Add cluster setting for retry timeout of publish checkpoint tx action #17749

Merged
ashking94 merged 12 commits into opensearch-project:main from guojialiang92:dev/PublishCheckpointAction_use_never_give_up_retry_strategy
Apr 15, 2025

Conversation

@guojialiang92
Contributor

@guojialiang92 guojialiang92 commented Apr 1, 2025

Description

Added a test. Currently, if the primary shard fails to publish a checkpoint, the replica shard can no longer synchronize with the primary shard.
TransportReplicationAction now supports specifying a retryTimeout.
PublishCheckpointAction now uses the never-give-up retry strategy.
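
To illustrate the idea behind a configurable retry timeout, here is a minimal, self-contained sketch. It is not the code from this PR: `RetrySketch`, `retryUntilTimeout`, `NEVER_GIVE_UP_MILLIS`, and the fixed backoff are hypothetical names and choices made only for illustration, and the sketch assumes a "never give up" strategy can be modelled as a timeout so large that it effectively never expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; these names are not OpenSearch APIs. */
final class RetrySketch {

    /** Sentinel timeout that stands for "never give up" in this sketch. */
    static final long NEVER_GIVE_UP_MILLIS = Long.MAX_VALUE;

    private RetrySketch() {}

    /**
     * Calls {@code attempt} until it returns true or {@code retryTimeoutMillis}
     * has elapsed. Returns true on success, false on timeout or interruption.
     */
    static boolean retryUntilTimeout(BooleanSupplier attempt, long retryTimeoutMillis, long backoffMillis) {
        final long start = System.nanoTime();
        // toNanos saturates at Long.MAX_VALUE, so the sentinel effectively never expires.
        final long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(retryTimeoutMillis);
        while (true) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            // Compare elapsed time instead of an absolute deadline to avoid overflow.
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In this sketch, a caller that must not drop a publish would pass `NEVER_GIVE_UP_MILLIS`, while other callers would pass a finite timeout and handle the `false` result.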

Related Issues

Resolves 17595

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Apr 1, 2025

❌ Gradle check result for 1edc0ca: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…eckpointAction use the never give up strategy.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
guojialiang92 force-pushed the dev/PublishCheckpointAction_use_never_give_up_retry_strategy branch from 1edc0ca to e49aa81 April 1, 2025 11:09
@github-actions
Contributor

github-actions bot commented Apr 1, 2025

✅ Gradle check result for e49aa81: SUCCESS

Member

@ashking94 ashking94 left a comment

lgtm

@github-actions
Contributor

❌ Gradle check result for b744f4b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
guojialiang92 force-pushed the dev/PublishCheckpointAction_use_never_give_up_retry_strategy branch from b744f4b to 68a5e9d April 14, 2025 16:16
@github-actions
Contributor

❌ Gradle check result for 68a5e9d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…Action_use_never_give_up_retry_strategy

# Conflicts:
#	CHANGELOG.md
@github-actions
Contributor

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94
Member

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Restarted the PR build.

@github-actions
Contributor

❌ Gradle check result for 3eb976e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for a3a23a7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

✅ Gradle check result for e7b926a: SUCCESS

@ashking94 ashking94 merged commit c44d230 into opensearch-project:main Apr 15, 2025
31 checks passed
Harsh-87 pushed a commit to Harsh-87/OpenSearch that referenced this pull request May 7, 2025
…h checkpoint tx action (opensearch-project#17749)

* TransportReplicationAction support specifying retryTimeout, PublishCheckpointAction use the never give up strategy.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* support  PublishCheckpointAction PUBLISH_CHECK_POINT_RETRY_TIMEOUT to override the default retry timeout

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* add TransportReplicationAction.getRetryTimeoutSetting

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* add entry to CHANGELOG.md

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* rewrite the PR title

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* modify changelog entry

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* add comments

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

* update

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

---------

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Signed-off-by: Harsh Kothari <techarsh@amazon.com>

Labels

bug: Something isn't working
Indexing:Replication: Issues and PRs related to core replication framework, e.g. segrep

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] segment replication stops when publish checkpoint fails

2 participants