
[merged segment warmer] Introduce merged segment checkpoint retention to avoid listAll in computeReferencedSegmentsCheckpoint #18890

Open
guojialiang92 wants to merge 30 commits into opensearch-project:main from guojialiang92:dev/optimize-redundant-merged-segment-cleanup

Conversation

@guojialiang92
Contributor

@guojialiang92 guojialiang92 commented Aug 1, 2025

Description

This PR is follow-up work to [18720]. Based on the discussion with @ashking94, I made some optimizations. The main purpose is to avoid the listAll operation in IndexShard#computeReferencedSegmentsCheckpoint.

Main Changes

  • Introduced IndexShard#primaryMergedSegmentCheckpoints for the primary shard to track merged segment checkpoints that have been published for pre-warm. A merged segment checkpoint is added to IndexShard#primaryMergedSegmentCheckpoints before publish_merged_segment.
  • To avoid memory leaks in IndexShard#primaryMergedSegmentCheckpoints, introduced the setting IndexSettings#INDEX_MERGED_SEGMENT_CHECKPOINT_RETENTION_TIME (default 15 minutes) so that expired checkpoints are cleaned up after publish_referenced_segments.
  • Renamed IndexShard#pendingMergedSegmentCheckpoints to IndexShard#replicaMergedSegmentCheckpoints, which the replica shard uses to record pre-copied merged segment checkpoints that have not yet been refreshed.
  • Added MergedSegmentWarmerIT#testPrimaryMergedSegmentCheckpointRetentionTimeout, which constructs a redundant merged segment checkpoint in the primary shard and verifies that it is deleted once the retention time expires.
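As a rough illustration of the retention mechanism above, here is a minimal, hypothetical sketch. The class and method names are invented for illustration only; the real implementation lives in IndexShard and tracks <segmentName, createdTimeStamp> tuples in a concurrent set.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of primary-side checkpoint retention; names are
// invented, not the actual IndexShard API.
class PrimaryMergedSegmentTracker {
    private final Map<String, Long> checkpoints = new ConcurrentHashMap<>();
    private final long retentionMillis;

    PrimaryMergedSegmentTracker(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Record the checkpoint before publish_merged_segment.
    void addBeforePublish(String segmentName, long nowMillis) {
        checkpoints.put(segmentName, nowMillis);
    }

    // Prune expired entries after publish_referenced_segments so the
    // set cannot grow without bound.
    void removeExpired(long nowMillis) {
        checkpoints.entrySet().removeIf(e -> nowMillis - e.getValue() > retentionMillis);
    }

    Set<String> trackedSegmentNames() {
        return checkpoints.keySet();
    }
}
```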

Concurrent Control

After a relatively long period of verification in the production environment, we have also fixed some issues. To enhance the robustness of the new logic, I introduced concurrency control between merged segment cleanup and segment replication.

We replaced the higher-overhead listAll operation with primaryMergedSegmentCheckpoints plus the refreshed segments recorded in memory. However, entries in primaryMergedSegmentCheckpoints are deleted based on the expiration time. Consider a merged segment checkpoint in primaryMergedSegmentCheckpoints that has not been refreshed but has already expired: it will not be included in the ReferencedSegmentsCheckpoint, yet once it is refreshed, segment replication will reference it. Since the merged segment cleanup task and the segment replication task run concurrently, a merged segment referenced by segment replication could be incorrectly deleted.

This occurs when the merged segment warmer takes longer than the expiration time. Although increasing the expiration time reduces the probability of occurrence, a more robust design is needed.

Through synchronized access, we ensure that a merged segment referenced by the current segment replication task will never be deleted by the merged segment cleanup task. If the merged segment is deleted by the cleanup task before the segment replication task starts, the files will simply be sent to the replica shard again through segment replication.
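The synchronization described above can be sketched as follows. This is a hypothetical simplification: the real code synchronizes on cleanupReplicaMergedSegmentMutex inside IndexShard and compares metadata-map keys, while the names below are invented for illustration.

```java
import java.util.Set;

// Hypothetical simplification: cleanup and segment replication coordinate
// through one mutex, and a merged segment is deleted only if none of its
// files appear in the in-flight replication checkpoint.
class ReplicaCleanupGate {
    private final Object mutex = new Object();
    private Set<String> currentReplicationFiles; // null when no replication is in flight

    void onReplicationStart(Set<String> files) {
        synchronized (mutex) {
            currentReplicationFiles = files;
        }
    }

    void onReplicationDone() {
        synchronized (mutex) {
            currentReplicationFiles = null;
        }
    }

    // Returns true (and runs deleteFiles) only when no in-flight
    // replication references any of the merged segment's files.
    boolean tryDelete(Set<String> mergedSegmentFiles, Runnable deleteFiles) {
        synchronized (mutex) {
            boolean referenced = currentReplicationFiles != null
                && mergedSegmentFiles.stream().anyMatch(currentReplicationFiles::contains);
            if (referenced) {
                return false; // replication still needs these files
            }
            deleteFiles.run();
            return true;
        }
    }
}
```

If the cleanup wins the race and deletes first, the subsequent replication simply re-copies the files, which matches the fallback behavior described above.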

Summary

The correctness of the task that cleans up redundant merged segments in the replica shard depends on obtaining an accurate ReferencedSegmentsCheckpoint from the primary shard. A ReferencedSegmentsCheckpoint consists of two parts: primaryMergedSegmentCheckpoints and the refreshed segments.

After the replica node receives a ReferencedSegmentsCheckpoint, it iterates through replicaMergedSegmentCheckpoints. If the ReferencedSegmentsCheckpoint does not contain a MergedSegmentCheckpoint and is ahead of it, that MergedSegmentCheckpoint will be deleted.
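The deletion rule above can be expressed as a small predicate. This is illustrative only: the "version" comparison stands in for the "ReferencedSegmentsCheckpoint is ahead of MergedSegmentCheckpoint" check, and the real classes compare checkpoint objects rather than longs.

```java
import java.util.Set;

// Illustrative predicate for the replica-side deletion rule; names and the
// version comparison are simplified stand-ins for the real checkpoint types.
class RedundantMergedSegmentRule {
    static boolean shouldDelete(Set<String> referencedSegmentNames,
                                long referencedVersion,
                                String mergedSegmentName,
                                long mergedVersion) {
        boolean notReferenced = !referencedSegmentNames.contains(mergedSegmentName);
        boolean referencedIsAhead = referencedVersion > mergedVersion;
        return notReferenced && referencedIsAhead;
    }
}
```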

For a clearer explanation, I will elaborate on different scenarios.

  1. The merged segment checkpoint in primaryMergedSegmentCheckpoints has not yet expired. ReferencedSegmentsCheckpoint will contain it, and the merged segment in the replicas will not be deleted.
  2. The merged segment checkpoint has been refreshed and has already expired. ReferencedSegmentsCheckpoint will still contain it (via the refreshed segments), and the merged segment in the replicas will not be deleted.
  3. The merged segment checkpoint has not been refreshed and has already expired. ReferencedSegmentsCheckpoint will not contain it; the concurrency control described above prevents incorrectly deleting a merged segment referenced by an in-flight segment replication.
  4. The merged segment checkpoint has already been removed because the segment was merged again, and it has expired. ReferencedSegmentsCheckpoint will not contain it, and the replica shard can safely delete the redundant merged segment.

Related Issues

Resolves #[18845]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • New Features

    • Configurable merged-segment checkpoint retention time (default 15 minutes) exposed as a cluster setting.
  • Improvements

    • Distinct primary vs. replica merged-segment tracking with expiration/pruning for primary checkpoints.
    • Improved cleanup and synchronization of merged-segment metadata and files; replica state cleared on shard close/promotion.
    • Index shard now tracks current segment replication checkpoint for coordination.
  • Tests

    • Added/renamed tests covering retention timeout and primary–replica file synchronization.


Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 0c06302: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 253c15f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Aug 2, 2025

❌ Gradle check result for 1c90ffc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

❌ Gradle check result for 0764133: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/optimize-redundant-merged-segment-cleanup branch from 0764133 to f2e0d8a on December 12, 2025 at 16:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

399-409: Checkpoint-tracking state split looks reasonable; consider documenting thread-safety expectations.

The primary/replica split plus a dedicated mutex for “replica cleanup vs in-flight replication” is a good direction. One small suggestion: add a brief comment clarifying which operations must synchronize on cleanupReplicaMergedSegmentMutex (reads/writes of currentSegmentReplicationCheckpoint, and only that), since replicaMergedSegmentCheckpoints is concurrently mutated outside the lock (by design).


1814-1830: Replica refreshed-segment cleanup: LGTM; tweak log message wording.

The refreshedSegmentNames set + removeIf is clear and keeps the set in sync after refresh. The log still says “pending merged segments” even though the structure is now replicaMergedSegmentCheckpoints; consider renaming the message for easier debugging.


1904-1909: Retention cleanup implementation is fine once time units are confirmed.

Using removeIf on the concurrent set is straightforward and keeps the “no listAll” goal intact. After unit verification (see earlier comment), consider renaming locals to reflect the actual unit (e.g., retentionNanos, nowEpochNanos, createdEpochNanos) to avoid future regressions.


1911-1963: Public API shape looks safe; consider minimizing surface area if truly “tests only”.

  • Deprecated getter returning Collections.emptySet() is the safe choice.
  • Collections.unmodifiableSet(...) prevents external mutation (good), but these getters are still public. If they’re strictly test-only, consider @VisibleForTesting + package-private (where feasible) to reduce API surface.

Also: clearReplicaMergedSegmentState() correctly synchronizes (fixes visibility/order).


2071-2085: Referenced segments union logic is clear; minor micro-optimization available.

Logic is correct: include both refreshed segments and “published-for-prewarm” segment names. Minor: segmentInfosGatedCloseable.get() is invoked multiple times; caching to a local SegmentInfos si = ... would reduce repeated calls and reads better.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0764133 and f2e0d8a.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: Analyze (java)
🔇 Additional comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

1871-1902: Deletion gate with currentSegmentReplicationCheckpoint is good; verify metadata-map intersection is sufficient.

The synchronized “don’t delete if any file is referenced by the current replication checkpoint” is the right shape for preventing races with concurrent segment-replication tasks. Please sanity-check that:

  • ReplicationCheckpoint#getMetadataMap() keys overlap the same filenames as MergedSegmentCheckpoint#getMetadataMap() (i.e., the intersection test is meaningful), and
  • all in-flight references are captured by currentSegmentReplicationCheckpoint at the time deletion can happen.

If either assumption breaks, this could delete files still needed by an active replication.


2055-2062: finally cleanup on publish: good resiliency.

Ensuring removeExpiredPrimaryMergedSegmentCheckpoints() runs even on publish failures matches the “avoid leaks” objective and avoids coupling cleanup to successful publish.


2088-2098: Rollback on publish failure: good.

Adding to primaryMergedSegmentCheckpoints before publishing and removing on failure prevents stale entries from influencing referenced-segment computation for the full retention window.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for f2e0d8a: SUCCESS

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (2)

1814-1830: Minor: log message still says “pending merged segments” after rename.
The message at Lines 1825-1826 reads like the old terminology; consider updating it to “replica merged segment checkpoints” (or similar) to match the new model.


1871-1902: Prevent TOCTOU deletion: re-check membership under mutex before deleting files.
pendingDeleteCheckpoints is built outside cleanupReplicaMergedSegmentMutex, so a checkpoint can be removed/updated concurrently (e.g., by refresh/finalize paths) before you call store.deleteQuiet(...). Re-checking replicaMergedSegmentCheckpoints.contains(...) under the mutex makes the deletion decision robust against concurrent state changes.

Proposed diff:

 synchronized (cleanupReplicaMergedSegmentMutex) {
     for (MergedSegmentCheckpoint mergedSegmentCheckpoint : pendingDeleteCheckpoints) {
+        if (replicaMergedSegmentCheckpoints.contains(mergedSegmentCheckpoint) == false) {
+            continue;
+        }
         if ((currentSegmentReplicationCheckpoint == null)
             || mergedSegmentCheckpoint.getMetadataMap()
                 .keySet()
                 .stream()
                 .noneMatch(currentSegmentReplicationCheckpoint.getMetadataMap()::containsKey)) {
             logger.trace("replica shard remove redundant pending merge segment {}", mergedSegmentCheckpoint.getSegmentName());
             store.deleteQuiet(mergedSegmentCheckpoint.getMetadataMap().keySet().toArray(new String[0]));
             replicaMergedSegmentCheckpoints.remove(mergedSegmentCheckpoint);
         } else {
             logger.trace(
                 "replica shard can not remove redundant pending merge segment {}",
                 mergedSegmentCheckpoint.getSegmentName()
             );
         }
     }
 }
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

99-101: Fix/verify retention timestamp units (currently looks inconsistent: “nowNanos” vs DateUtils.toLong(Instant.now())).
Right now removeExpiredPrimaryMergedSegmentCheckpoints() subtracts nowNanos - createdTimeStamp and compares against retentionTimeInNanos, but the correctness hinges entirely on DateUtils.toLong(Instant) and MergedSegmentCheckpoint#getCreatedTimeStamp() having the same unit and time source. This was previously raised and still seems unresolved.

Suggested direction once units are confirmed:

  • If timestamps are epoch-millis: compare using retentionTime.getMillis() and rename vars to *Millis.
  • If timestamps are epoch-nanos: ensure both are epoch-nanos (and rename to *EpochNanos).
  • If timestamps are System.nanoTime(): do not mix with Instant.now(); use the same monotonic source for both.

Verification script to pin this down:

#!/bin/bash
set -euo pipefail

# What unit does DateUtils.toLong(Instant) return?
rg -n --hidden --no-mmap 'class DateUtils\b|long\s+toLong\s*\(\s*Instant' -S .

# What unit/source is MergedSegmentCheckpoint createdTimeStamp?
rg -n --hidden --no-mmap 'class MergedSegmentCheckpoint\b|getCreatedTimeStamp\s*\(|createdTimeStamp' -S server/src/main/java

# Where is createdTimeStamp assigned (Instant.now/System.currentTimeMillis/System.nanoTime)?
rg -n --hidden --no-mmap 'getCreatedTimeStamp\s*\(|createdTimeStamp\s*=|Instant\.now|System\.currentTimeMillis|System\.nanoTime' -S server/src/main/java

Also applies to: 229-231, 1904-1909, 1923-1925

🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

399-409: Make nullability/guardrails explicit for replica-tracking state.
Consider marking currentSegmentReplicationCheckpoint as @Nullable (it’s explicitly set to null) to make the contract clearer at call sites and during future refactors.


1919-1949: Nice: primary/replica checkpoint APIs are clear; consider returning snapshots for test getters.
Collections.unmodifiableSet(primaryMergedSegmentCheckpoints) still exposes a live concurrent view (fine), but returning an immutable snapshot can reduce test flakiness due to concurrent mutation.


2071-2085: Behavior change looks correct: referenced segments now include published merged-segment names.
This avoids listAll while still keeping referenced names conservative. Minor nit: you can avoid intermediate Collectors.toSet() by adding via forEach(segmentNames::add) to reduce allocations.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f2e0d8a and e78533c.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

1904-1909: Retention cleanup logic is well-placed, but depends on the timestamp-unit fix.
Calling this from publishReferencedSegments() (finally) is a good lifecycle hook; just ensure the time arithmetic is correct per the units/source verification.


1911-1917: Good BWC behavior: deprecated getter returns empty set (not null).
This keeps deprecated API usage safe for any lingering callers.


1951-1963: Mutex usage is consistent for replica state resets/updates.
setCurrentSegmentReplicationCheckpoint() and clearReplicaMergedSegmentState() correctly synchronize on the same mutex used by deletion logic.


2055-2062: Good: retention cleanup runs even if publish fails (finally).
This matches the “avoid memory leaks” goal and preserves exception behavior.


2088-2098: Good rollback on publish failure to avoid stale primary checkpoint state.
This prevents retention-based “ghost” references after publisher exceptions.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for e78533c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 8a513f5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/optimize-redundant-merged-segment-cleanup branch from 8a513f5 to 7ff4838 on December 15, 2025 at 05:53
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

1905-1910: Clarify/lock in timestamp units for retention math

removeExpiredPrimaryMergedSegmentCheckpoints() assumes that:

  • nowNanos = DateUtils.toLong(Instant.now()), and
  • c.v2() (from primaryMergedSegmentCheckpoints)

are in the same time unit suitable for Duration.ofNanos(nowNanos - c.v2()).

The comment above the field mentions createdTimeStampInNanos, which suggests these longs represent nanoseconds, but that contract currently lives only in comments and in the implementation of MergedSegmentCheckpoint#getCreatedTimeStamp() / DateUtils.toLong(Instant).

To prevent subtle bugs if either side ever changes representation, I’d suggest:

  • Renaming the variable/tuple component to something like createdTimeNanos and nowNanosSinceEpoch (or similar) to make units explicit, and/or
  • Adding a short Javadoc on getCreatedTimeStamp() explicitly stating its unit, and referencing that here.

This is mainly about future maintainability; behavior looks consistent with the current comments and tests.

You can double‑check the unit consistency with a quick grep:

#!/bin/bash
set -euo pipefail

# Inspect how createdTimeStamp is produced and documented
rg -n "class MergedSegmentCheckpoint\b" -S .
rg -n "getCreatedTimeStamp\s*\(" -S .
rg -n "DateUtils\.toLong\s*\(\s*Instant" -S .

# Sanity-check that tests cover expiration behavior
rg -n "MergedSegmentCheckpointRetention" -S .
🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

1873-1903: Replica redundant-merge cleanup logic is reasonable; small micro‑cleanup possible

The two‑phase approach (select candidates from replicaMergedSegmentCheckpoints, then re‑check against currentSegmentReplicationCheckpoint under cleanupReplicaMergedSegmentMutex before store.deleteQuiet(...)) correctly avoids races with in‑flight segment replication and ensures visibility of currentSegmentReplicationCheckpoint. The use of a concurrent set means iteration and removals are safe.

If you ever want to shave a bit of overhead, you could cache mergedSegmentCheckpoint.getMetadataMap().keySet() into a local Set<String> per checkpoint so you don’t rebuild the key set and array multiple times, but this is a micro‑optimization and not required.


1912-1950: Deprecated pending-merge API behavior and new checkpoint helpers are safe; consider tightening test API

Turning addPendingMergeSegmentCheckpoint into a no‑op and getPendingMergedSegmentCheckpoints into Collections.emptySet() is a safe BWC strategy for any lingering callers, avoiding NPEs while steering usage to the new primary/replica‑specific APIs.

The new helpers (addPrimaryMergedSegmentCheckpoint, removePrimaryMergedSegmentCheckpoint, addReplicaMergedSegmentCheckpoint, and their getters) are straightforward and correctly use unmodifiable views to avoid external mutation.

If these getters are only used in tests, you might optionally narrow their visibility (e.g., package‑private + @VisibleForTesting) to avoid growing the public surface area of IndexShard, and, if you ever want a stable snapshot instead of a live view, wrap in new HashSet<>(...) before unmodifiableSet(...). Not required, but would tighten the API.


2073-2085: Including primary merged checkpoints in referenced-segment computation looks correct; minor allocation nit

Building segmentNames from:

  • the live SegmentInfos snapshot, plus
  • primaryMergedSegmentCheckpoints.stream().map(Tuple::v1)

correctly ensures that pre‑warmed (but not yet fully refreshed) merged segments remain “referenced” from the primary’s point of view. That matches the retention and pre‑warm design described in the PR.

Very minor: you could avoid a couple of intermediate Collectors.toSet() allocations by streaming directly into segmentNames (e.g., infos.asList().forEach(...); primaryMergedSegmentCheckpoints.forEach(...);), but that’s purely cosmetic.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8a513f5 and 7ff4838.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

400-410: Primary/replica checkpoint state separation and locking look sound

The split between primaryMergedSegmentCheckpoints and replicaMergedSegmentCheckpoints, plus the dedicated cleanupReplicaMergedSegmentMutex for currentSegmentReplicationCheckpoint, gives a clear concurrency story: primary-only state relies on a concurrent set, while replica cleanup coordinates via a shared mutex. I don’t see correctness issues with this layout.


1815-1831: Replica checkpoint pruning on finalizeReplication is correct

Building refreshedSegmentNames from infos and then doing replicaMergedSegmentCheckpoints.removeIf(...) is a straightforward way to drop any pre‑copied merged checkpoints that have now been refreshed. This keeps replica state bounded without introducing extra locking, and is safe given the concurrent set semantics.


1952-1963: Replica merged-segment state clearing now correctly participates in the mutex protocol

Wrapping both setCurrentSegmentReplicationCheckpoint(...) and clearReplicaMergedSegmentState() in synchronized (cleanupReplicaMergedSegmentMutex) brings these writes into the same critical section as the deletion logic in cleanupRedundantPendingMergeSegment. This removes the earlier data‑race/visibility risk on currentSegmentReplicationCheckpoint and ensures consistent behavior under concurrency.


2056-2062: Always running primary checkpoint retention after referenced-segment publish is a good robustness improvement

Wrapping referencedSegmentsPublisher.publish(...) in a try with removeExpiredPrimaryMergedSegmentCheckpoints() in finally guarantees retention cleanup runs even if publish throws, which aligns with the goal of avoiding memory leaks from stale primary checkpoints. Since the cleanup method is pure in‑memory logic and cannot throw IOException, this doesn’t alter the exception surface of publishReferencedSegments().


2089-2098: Merged-segment publish now correctly tracks primary checkpoints and rolls back on failure

Computing the MergedSegmentCheckpoint, adding it to primaryMergedSegmentCheckpoints before publishing, and then removing it in the catch path if mergedSegmentPublisher.publish(...) throws prevents stale primary entries from lingering on failed publishes. This keeps the in‑memory checkpoint set consistent while still allowing retention cleanup to prune old, successfully‑published entries.

@github-actions
Contributor

❕ Gradle check result for 7ff4838: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

1905-1911: Retention logic for primary merged-segment checkpoints

Assuming MergedSegmentCheckpoint#getCreatedTimeStamp() and DateUtils.toLong(Instant.now()) are both in nanoseconds since the epoch (as the comment states), nowNanosSinceEpoch - c.v2() with Duration.ofNanos(...).toMillis() is a correct way to compare against a millisecond retention value. If those units ever change, this is the key place that would need updating.
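Assuming both timestamps really are nanoseconds since the epoch, the comparison reduces to something like this hypothetical helper (the class and method names are invented; only the Duration.ofNanos arithmetic mirrors the code under review):

```java
import java.time.Duration;

// Hypothetical helper mirroring the retention comparison: both timestamps
// are assumed to be nanoseconds since the epoch, and the retention window
// is expressed in milliseconds.
class RetentionMath {
    static boolean isExpired(long nowEpochNanos, long createdEpochNanos, long retentionMillis) {
        return Duration.ofNanos(nowEpochNanos - createdEpochNanos).toMillis() > retentionMillis;
    }
}
```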

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0bf1c09 and a11f367.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (7)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (7)

100-111: New time imports and primary/replica checkpoint fields look consistent

Use of DateUtils, Duration, and Instant for retention plus Sets.newConcurrentHashSet() and a dedicated cleanupReplicaMergedSegmentMutex give you thread-safe containers and a clear separation between primary and replica merged-segment state. No issues spotted here.

Also applies to: 230-231, 400-409
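The state split described above can be sketched as follows. This is not the PR's code: `ConcurrentHashMap.newKeySet()` stands in for OpenSearch's `Sets.newConcurrentHashSet()`, and a record stands in for the checkpoint tuple type.

```java
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MergedSegmentState {
    // Stand-in for the <segmentName, createdTimeStamp> tuple tracked per checkpoint.
    record PrimaryCheckpoint(String segmentName, long createdNanosSinceEpoch) {}

    // Thread-safe containers; clear separation of primary vs replica state.
    final Set<PrimaryCheckpoint> primaryMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();
    final Set<String> replicaMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();
    final Object cleanupReplicaMergedSegmentMutex = new Object();

    void trackPrimaryCheckpoint(String segmentName) {
        Instant now = Instant.now();
        // Epoch nanoseconds, mirroring DateUtils.toLong(Instant.now()) in the PR.
        long nanos = now.getEpochSecond() * 1_000_000_000L + now.getNano();
        primaryMergedSegmentCheckpoints.add(new PrimaryCheckpoint(segmentName, nanos));
    }
}
```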


1815-1831: Batch removal of refreshed replica merged segments is safe and efficient

Collecting refreshedSegmentNames into a HashSet and then using replicaMergedSegmentCheckpoints.removeIf(...) is both clear and safe with the concurrent set; it avoids per-segment mutation while iterating infos. Behavior matches the intent of dropping pending merged segments once they’re visible after replication.
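The batch-removal pattern looks roughly like this sketch; the names are illustrative stand-ins for the PR's fields:

```java
import java.util.Set;

public class BatchRemoval {
    // Collect refreshed names first, then drop matching pending checkpoints in
    // one removeIf pass. removeIf is safe on a ConcurrentHashMap-backed key set;
    // nothing mutates the set while the SegmentInfos are being iterated.
    static Set<String> dropRefreshed(Set<String> pendingCheckpoints, Set<String> refreshedSegmentNames) {
        pendingCheckpoints.removeIf(refreshedSegmentNames::contains);
        return pendingCheckpoints;
    }
}
```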


1874-1902: Replica cleanup now correctly guards against races with in‑flight segment replication

The two-phase approach (build pendingDeleteCheckpoints first, then re-check under cleanupReplicaMergedSegmentMutex against currentSegmentReplicationCheckpoint and only delete when there is no metadata overlap) properly prevents deleting files required by an ongoing segment replication. The locking discipline is consistent across setCurrentSegmentReplicationCheckpoint, clearReplicaMergedSegmentState, and this method.
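The second phase of that approach can be sketched as below. All names are illustrative; the real method checks metadata overlap against `currentSegmentReplicationCheckpoint` and deletes files rather than collecting names.

```java
import java.util.HashSet;
import java.util.Set;

public class TwoPhaseCleanup {
    private final Object cleanupReplicaMergedSegmentMutex = new Object();
    private volatile Set<String> currentReplicationFiles = Set.of();

    // Phase two: re-check under the mutex and only delete when the pending set
    // does not intersect the files an in-flight replication still needs.
    void cleanup(Set<String> pendingDeleteCheckpoints, Set<String> deleted) {
        synchronized (cleanupReplicaMergedSegmentMutex) {
            Set<String> overlap = new HashSet<>(pendingDeleteCheckpoints);
            overlap.retainAll(currentReplicationFiles);
            if (overlap.isEmpty()) {
                deleted.addAll(pendingDeleteCheckpoints); // stand-in for file deletion
            } // otherwise skip: an ongoing segment replication requires these files
        }
    }

    void setCurrentReplication(Set<String> files) {
        synchronized (cleanupReplicaMergedSegmentMutex) { // same lock as cleanup
            currentReplicationFiles = files;
        }
    }
}
```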


1913-1940: Deprecated pending-merged API and primary checkpoint helpers are safe

Turning addPendingMergeSegmentCheckpoint into a no-op and getPendingMergedSegmentCheckpoints into Collections.emptySet() is a safe deprecation path that avoids NPEs for any lingering callers. The new add/remove/getPrimaryMergedSegmentCheckpoint* helpers encapsulate the <segmentName, createdTimeStamp> tuples cleanly and expose an unmodifiable view for tests.
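The deprecation shape is roughly the following; signatures are simplified stand-ins for the `IndexShard` methods, not the actual ones.

```java
import java.util.Collections;
import java.util.Set;

public class PendingMergedApi {
    @Deprecated
    public void addPendingMergeSegmentCheckpoint(String segmentName) {
        // intentionally a no-op; the primary tracking set is used instead
    }

    @Deprecated
    public Set<String> getPendingMergedSegmentCheckpoints() {
        return Collections.emptySet(); // never null, so lingering callers avoid NPEs
    }
}
```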


1942-1964: Replica merged-segment state helpers and mutex usage look correct

addReplicaMergedSegmentCheckpoint writes into a concurrent set, and getReplicaMergedSegmentCheckpoints() returns an unmodifiable view for tests. setCurrentSegmentReplicationCheckpoint and clearReplicaMergedSegmentState now both synchronize on cleanupReplicaMergedSegmentMutex, matching the mutex used in cleanupRedundantPendingMergeSegment and fixing the earlier data-race/visibility concern.


2059-2063: Referenced-segments publishing now accounts for pre‑warmed primary segments and always runs retention

Wrapping referencedSegmentsPublisher.publish(this, computeReferencedSegmentsCheckpoint()) in try/finally ensures removeExpiredPrimaryMergedSegmentCheckpoints() runs even if publish throws, which is important for avoiding leaks. Adding both on-disk segment names (from SegmentInfos) and names from primaryMergedSegmentCheckpoints into segmentNames correctly informs replicas about still-referenced pre‑warmed segments.

Also applies to: 2075-2078
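The control flow being praised here reduces to a try/finally sketch; `Publisher` and the `removeExpired` runnable are hypothetical stand-ins for `referencedSegmentsPublisher.publish(...)` and `removeExpiredPrimaryMergedSegmentCheckpoints()`:

```java
public class PublishWithRetention {
    interface Publisher {
        void publish() throws Exception;
    }

    static boolean publishAndClean(Publisher publisher, Runnable removeExpired) {
        try {
            publisher.publish();
            return true;
        } catch (Exception e) {
            return false; // publish failed, but cleanup below still runs
        } finally {
            removeExpired.run(); // always executes, preventing checkpoint leaks
        }
    }
}
```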


2092-2099: Rollback on failed merged-segment publish prevents stale primary checkpoints

Computing the MergedSegmentCheckpoint, adding it to primaryMergedSegmentCheckpoints, and then rolling it back in the catch block if mergedSegmentPublisher.publish(...) fails avoids leaving a stale primary checkpoint that could incorrectly influence computeReferencedSegmentsCheckpoint() until retention cleanup. This behavior aligns well with the new tracking model.
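The rollback pattern, sketched with illustrative stand-ins for the PR's field and publisher: the checkpoint is tracked before publish and removed again if publish throws, so a failed publish cannot leave a stale entry.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PublishRollback {
    final Set<String> primaryMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();

    void publishMergedSegment(String segmentName, Runnable publish) {
        primaryMergedSegmentCheckpoints.add(segmentName); // track before publish
        try {
            publish.run();
        } catch (RuntimeException e) {
            primaryMergedSegmentCheckpoints.remove(segmentName); // roll back tracking
        }
    }
}
```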

@github-actions
Contributor

✅ Gradle check result for a11f367: SUCCESS

@guojialiang92
Contributor Author

Hi @ashking94,
I have implemented improvements based on all the suggestions you provided. Over the past few months we have also fixed a concurrency control issue in the production environment, which I have added to the PR description.

Hope you can continue to help review this PR, thank you. :)

@guojialiang92
Contributor Author

@ashking94
Just to bump this up.
