
[segment replication] Add async publish checkpoint task #17619

Closed
guojialiang92 wants to merge 12 commits into opensearch-project:main from guojialiang92:dev/add_async_publish_checkpoint_task


@guojialiang92
Contributor

Description

Added a test that reproduces the current behavior: if the primary shard fails to publish a checkpoint, the replica shard can no longer synchronize with the primary shard.
Added an asynchronous task: when the primary shard detects that a replica has been behind for longer than a certain time threshold, it triggers a publish checkpoint. This ensures that the test above passes.
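
The idea behind the asynchronous task can be sketched as follows. This is a minimal, hypothetical illustration and not the code in this PR: the class `StaleReplicaCheckpointTask`, its methods, and the use of a plain version number in place of a replication checkpoint are all assumptions made for the example. It records when each replica was first seen behind the primary and re-publishes once that lag reaches the threshold.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a lag-triggered re-publish check.
 * All names here are illustrative; none are OpenSearch APIs.
 */
class StaleReplicaCheckpointTask {
    private final long lagThresholdMillis;        // how long a replica may stay behind
    private final LongSupplier clockMillis;       // injectable clock, for testability
    private final LongConsumer publishCheckpoint; // receives the primary's current version
    private final Map<String, Long> replicaVersion = new HashMap<>();
    private final Map<String, Long> behindSinceMillis = new HashMap<>();
    private long primaryVersion;

    StaleReplicaCheckpointTask(long lagThresholdMillis, LongSupplier clockMillis, LongConsumer publishCheckpoint) {
        this.lagThresholdMillis = lagThresholdMillis;
        this.clockMillis = clockMillis;
        this.publishCheckpoint = publishCheckpoint;
    }

    synchronized void setPrimaryVersion(long version) {
        primaryVersion = version;
    }

    synchronized void setReplicaVersion(String replicaId, long version) {
        replicaVersion.put(replicaId, version);
    }

    /** One pass of the periodic check; returns true if a checkpoint was re-published. */
    synchronized boolean runOnce() {
        long now = clockMillis.getAsLong();
        boolean stale = false;
        for (Map.Entry<String, Long> entry : replicaVersion.entrySet()) {
            if (entry.getValue() >= primaryVersion) {
                // Replica has caught up: forget when it was last seen behind.
                behindSinceMillis.remove(entry.getKey());
                continue;
            }
            // Remember the first time this replica was observed behind.
            long since = behindSinceMillis.computeIfAbsent(entry.getKey(), k -> now);
            if (now - since >= lagThresholdMillis) {
                stale = true;
            }
        }
        if (stale) {
            // Re-publish on every pass until the stale replica catches up.
            publishCheckpoint.accept(primaryVersion);
        }
        return stale;
    }
}
```

In a running node, something like `ScheduledExecutorService.scheduleWithFixedDelay` could call `runOnce()` periodically. The injected clock is only there so the threshold logic can be exercised without sleeping.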

Related Issues

Resolves #17595

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions github-actions bot added the bug (Something isn't working) and Indexing:Replication (Issues and PRs related to core replication framework eg segrep) labels on Mar 18, 2025
@guojialiang92 guojialiang92 changed the title from "Dev/add async publish checkpoint task" to "[segment replication] Add async publish checkpoint task" on Mar 18, 2025
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 23c1b87 to 4394239 on March 22, 2025 12:57
@github-actions
Contributor

❌ Gradle check result for 4394239: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 4394239 to 9b5a236 on March 24, 2025 02:10
@github-actions
Contributor

❕ Gradle check result for 9b5a236: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@github-actions
Contributor

❌ Gradle check result for 546787d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from 546787d to 5e09825 on March 24, 2025 09:53
@github-actions
Contributor

❕ Gradle check result for 5e09825: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@github-actions
Contributor

❌ Gradle check result for af8670a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/add_async_publish_checkpoint_task branch from af8670a to e21129f on March 25, 2025 02:39
@github-actions
Contributor

❌ Gradle check result for e21129f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

❕ Gradle check result for c01232b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

❌ Gradle check result for 89574de:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 90f7337:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❕ Gradle check result for 77c1d59: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


Labels

  • bug (Something isn't working)
  • Indexing:Replication (Issues and PRs related to core replication framework eg segrep)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] segment replication stops when publish checkpoint fails

1 participant