[merged segment warmer] Introduce merged segment checkpoint retention to avoid listAll in computeReferencedSegmentsCheckpoint#18890
Conversation
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
❌ Gradle check result for 0c06302: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 253c15f: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Force-pushed the …t-merged-segment-cleanup branch

❌ Gradle check result for 1c90ffc: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 0764133: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>

Force-pushed 0764133 to f2e0d8a (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)
399-409: Checkpoint-tracking state split looks reasonable; consider documenting thread-safety expectations.

The primary/replica split plus a dedicated mutex for "replica cleanup vs in-flight replication" is a good direction. One small suggestion: add a brief comment clarifying which operations must synchronize on `cleanupReplicaMergedSegmentMutex` (reads/writes of `currentSegmentReplicationCheckpoint`, and only that), since `replicaMergedSegmentCheckpoints` is concurrently mutated outside the lock (by design).
1814-1830: Replica refreshed-segment cleanup: LGTM; tweak log message wording.

The `refreshedSegmentNames` set + `removeIf` is clear and keeps the set in sync after refresh. The log still says "pending merged segments" even though the structure is now `replicaMergedSegmentCheckpoints`; consider renaming the message for easier debugging.
1904-1909: Retention cleanup implementation is fine once time units are confirmed.

Using `removeIf` on the concurrent set is straightforward and keeps the "no listAll" goal intact. After unit verification (see earlier comment), consider renaming locals to reflect the actual unit (e.g., `retentionNanos`, `nowEpochNanos`, `createdEpochNanos`) to avoid future regressions.
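As a hedged illustration of that renaming suggestion, here is a minimal standalone sketch of a `removeIf` sweep whose names carry the unit explicitly. The class, map, and method names are hypothetical — this is not the PR's actual API:

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RetentionSketch {
    // segmentName -> createdEpochNanos; the *EpochNanos/*Nanos suffixes document the unit
    final Map<String, Long> primaryMergedSegmentCheckpoints = new ConcurrentHashMap<>();
    private final long retentionNanos;

    RetentionSketch(Duration retention) {
        this.retentionNanos = retention.toNanos();
    }

    void add(String segmentName, long createdEpochNanos) {
        primaryMergedSegmentCheckpoints.put(segmentName, createdEpochNanos);
    }

    // removeIf-style sweep: drop entries older than the retention window
    int removeExpired(long nowEpochNanos) {
        int before = primaryMergedSegmentCheckpoints.size();
        primaryMergedSegmentCheckpoints.entrySet()
            .removeIf(e -> nowEpochNanos - e.getValue() > retentionNanos);
        return before - primaryMergedSegmentCheckpoints.size();
    }

    public static void main(String[] args) {
        RetentionSketch s = new RetentionSketch(Duration.ofMinutes(15));
        long now = 2_000_000_000_000_000_000L; // an arbitrary epoch-nanos "now"
        s.add("_0", now - Duration.ofMinutes(20).toNanos()); // past the window: expired
        s.add("_1", now - Duration.ofMinutes(5).toNanos());  // still retained
        System.out.println(s.removeExpired(now)); // prints 1
    }
}
```

With explicit `*Nanos` names, a future change to millisecond timestamps would be visibly wrong at the call site rather than silently off by six orders of magnitude.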
1911-1963: Public API shape looks safe; consider minimizing surface area if truly "tests only".

- Deprecated getter returning `Collections.emptySet()` is the safe choice.
- `Collections.unmodifiableSet(...)` prevents external mutation (good), but these getters are still public. If they're strictly test-only, consider `@VisibleForTesting` + package-private (where feasible) to reduce API surface.

Also: `clearReplicaMergedSegmentState()` correctly synchronizes (fixes visibility/order).
2071-2085: Referenced segments union logic is clear; minor micro-optimization available.

Logic is correct: include both refreshed segments and "published-for-prewarm" segment names. Minor: `segmentInfosGatedCloseable.get()` is invoked multiple times; caching to a local `SegmentInfos si = ...` would reduce repeated calls and read better.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: Analyze (java)
🔇 Additional comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)
1871-1902: Deletion gate with `currentSegmentReplicationCheckpoint` is good; verify metadata-map intersection is sufficient.

The synchronized "don't delete if any file is referenced by the current replication checkpoint" is the right shape for preventing races with concurrent segment-replication tasks. Please sanity-check that:

- `ReplicationCheckpoint#getMetadataMap()` keys overlap the same filenames as `MergedSegmentCheckpoint#getMetadataMap()` (i.e., the intersection test is meaningful), and
- all in-flight references are captured by `currentSegmentReplicationCheckpoint` at the time deletion can happen.

If either assumption breaks, this could delete files still needed by an active replication.
2055-2062: `finally` cleanup on publish: good resiliency.

Ensuring `removeExpiredPrimaryMergedSegmentCheckpoints()` runs even on publish failures matches the "avoid leaks" objective and avoids coupling cleanup to successful publish.
2088-2098: Rollback on publish failure: good.

Adding to `primaryMergedSegmentCheckpoints` before publishing and removing on failure prevents stale entries from influencing referenced-segment computation for the full retention window.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (2)
1814-1830: Minor: log message still says "pending merged segments" after rename.

The message at Lines 1825-1826 reads like the old terminology; consider updating it to "replica merged segment checkpoints" (or similar) to match the new model.
1871-1902: Prevent TOCTOU deletion: re-check membership under mutex before deleting files.

`pendingDeleteCheckpoints` is built outside `cleanupReplicaMergedSegmentMutex`, so a checkpoint can be removed/updated concurrently (e.g., by refresh/finalize paths) before you call `store.deleteQuiet(...)`. Re-checking `replicaMergedSegmentCheckpoints.contains(...)` under the mutex makes the deletion decision robust against concurrent state changes.

Proposed diff:

```diff
 synchronized (cleanupReplicaMergedSegmentMutex) {
     for (MergedSegmentCheckpoint mergedSegmentCheckpoint : pendingDeleteCheckpoints) {
+        if (replicaMergedSegmentCheckpoints.contains(mergedSegmentCheckpoint) == false) {
+            continue;
+        }
         if ((currentSegmentReplicationCheckpoint == null)
             || mergedSegmentCheckpoint.getMetadataMap()
                 .keySet()
                 .stream()
                 .noneMatch(currentSegmentReplicationCheckpoint.getMetadataMap()::containsKey)) {
             logger.trace("replica shard remove redundant pending merge segment {}", mergedSegmentCheckpoint.getSegmentName());
             store.deleteQuiet(mergedSegmentCheckpoint.getMetadataMap().keySet().toArray(new String[0]));
             replicaMergedSegmentCheckpoints.remove(mergedSegmentCheckpoint);
         } else {
             logger.trace(
                 "replica shard can not remove redundant pending merge segment {}",
                 mergedSegmentCheckpoint.getSegmentName()
             );
         }
     }
 }
```
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
99-101: Fix/verify retention timestamp units (currently looks inconsistent: "nowNanos" vs `DateUtils.toLong(Instant.now())`).

Right now `removeExpiredPrimaryMergedSegmentCheckpoints()` subtracts `nowNanos - createdTimeStamp` and compares against `retentionTimeInNanos`, but the correctness hinges entirely on `DateUtils.toLong(Instant)` and `MergedSegmentCheckpoint#getCreatedTimeStamp()` having the same unit and time source. This was previously raised and still seems unresolved.

Suggested direction once units are confirmed:

- If timestamps are epoch-millis: compare using `retentionTime.getMillis()` and rename vars to `*Millis`.
- If timestamps are epoch-nanos: ensure both are epoch-nanos (and rename to `*EpochNanos`).
- If timestamps are `System.nanoTime()`: do not mix with `Instant.now()`; use the same monotonic source for both.

Verification script to pin this down:

```bash
#!/bin/bash
set -euo pipefail
# What unit does DateUtils.toLong(Instant) return?
rg -n --hidden --no-mmap 'class DateUtils\b|long\s+toLong\s*\(\s*Instant' -S .
# What unit/source is MergedSegmentCheckpoint createdTimeStamp?
rg -n --hidden --no-mmap 'class MergedSegmentCheckpoint\b|getCreatedTimeStamp\s*\(|createdTimeStamp' -S server/src/main/java
# Where is createdTimeStamp assigned (Instant.now/System.currentTimeMillis/System.nanoTime)?
rg -n --hidden --no-mmap 'getCreatedTimeStamp\s*\(|createdTimeStamp\s*=|Instant\.now|System\.currentTimeMillis|System\.nanoTime' -S server/src/main/java
```

Also applies to: 229-231, 1904-1909, 1923-1925
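To make the unit hazard concrete, here is a self-contained sketch of keeping both sides of the subtraction in one explicit unit. `toEpochNanos` is an illustrative helper, not an OpenSearch API; the point is that a single conversion function, never mixed with `System.nanoTime()`, makes the retention arithmetic trivially auditable:

```java
import java.time.Duration;
import java.time.Instant;

public class EpochNanos {
    // One consistent definition of "epoch nanos", used on both sides of the
    // subtraction. Mixing this with System.nanoTime() would be wrong, because
    // nanoTime() has an arbitrary origin rather than the Unix epoch.
    static long toEpochNanos(Instant instant) {
        return instant.getEpochSecond() * 1_000_000_000L + instant.getNano();
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-01T00:00:00Z");
        Instant now = created.plus(Duration.ofMinutes(16));
        // Both operands come from the same conversion, so the difference is well-defined
        long elapsedMillis = Duration.ofNanos(toEpochNanos(now) - toEpochNanos(created)).toMillis();
        // 16 minutes elapsed vs. a 15-minute retention window
        System.out.println(elapsedMillis > Duration.ofMinutes(15).toMillis()); // prints true
    }
}
```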
🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)
399-409: Make nullability/guardrails explicit for replica-tracking state.

Consider marking `currentSegmentReplicationCheckpoint` as `@Nullable` (it's explicitly set to null) to make the contract clearer at call sites and during future refactors.
1919-1949: Nice: primary/replica checkpoint APIs are clear; consider returning snapshots for test getters.

`Collections.unmodifiableSet(primaryMergedSegmentCheckpoints)` still exposes a live concurrent view (fine), but returning an immutable snapshot can reduce test flakiness due to concurrent mutation.
2071-2085: Behavior change looks correct: referenced segments now include published merged-segment names.

This avoids `listAll` while still keeping referenced names conservative. Minor nit: you can avoid intermediate `Collectors.toSet()` by adding via `forEach(segmentNames::add)` to reduce allocations.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)
1904-1909: Retention cleanup logic is well-placed, but depends on the timestamp-unit fix.

Calling this from `publishReferencedSegments()` (finally) is a good lifecycle hook; just ensure the time arithmetic is correct per the units/source verification.
1911-1917: Good BWC behavior: deprecated getter returns empty set (not null).

This keeps deprecated API usage safe for any lingering callers.
1951-1963: Mutex usage is consistent for replica state resets/updates.

`setCurrentSegmentReplicationCheckpoint()` and `clearReplicaMergedSegmentState()` correctly synchronize on the same mutex used by deletion logic.
2055-2062: Good: retention cleanup runs even if publish fails (finally).

This matches the "avoid memory leaks" goal and preserves exception behavior.
2088-2098: Good rollback on publish failure to avoid stale primary checkpoint state.

This prevents retention-based "ghost" references after publisher exceptions.
❌ Gradle check result for e78533c: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 8a513f5: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Force-pushed 8a513f5 to 7ff4838 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
1905-1910: Clarify/lock in timestamp units for retention math

`removeExpiredPrimaryMergedSegmentCheckpoints()` assumes that:

- `nowNanos = DateUtils.toLong(Instant.now())`, and
- `c.v2()` (from `primaryMergedSegmentCheckpoints`)

are in the same time unit suitable for `Duration.ofNanos(nowNanos - c.v2())`.

The comment above the field mentions `createdTimeStampInNanos`, which suggests these longs represent nanoseconds, but that contract currently lives only in comments and in the implementation of `MergedSegmentCheckpoint#getCreatedTimeStamp()` / `DateUtils.toLong(Instant)`.

To prevent subtle bugs if either side ever changes representation, I'd suggest:

- Renaming the variable/tuple component to something like `createdTimeNanos` and `nowNanosSinceEpoch` (or similar) to make units explicit, and/or
- Adding a short Javadoc on `getCreatedTimeStamp()` explicitly stating its unit, and referencing that here.

This is mainly about future maintainability; behavior looks consistent with the current comments and tests.

You can double-check the unit consistency with a quick grep:

```bash
#!/bin/bash
set -euo pipefail
# Inspect how createdTimeStamp is produced and documented
rg -n "class MergedSegmentCheckpoint\b" -S .
rg -n "getCreatedTimeStamp\s*\(" -S .
rg -n "DateUtils\.toLong\s*\(\s*Instant" -S .
# Sanity-check that tests cover expiration behavior
rg -n "MergedSegmentCheckpointRetention" -S .
```
🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)
1873-1903: Replica redundant-merge cleanup logic is reasonable; small micro‑cleanup possibleThe two‑phase approach (select candidates from
replicaMergedSegmentCheckpoints, then re‑check againstcurrentSegmentReplicationCheckpointundercleanupReplicaMergedSegmentMutexbeforestore.deleteQuiet(...)) correctly avoids races with in‑flight segment replication and ensures visibility ofcurrentSegmentReplicationCheckpoint. The use of a concurrent set means iteration and removals are safe.If you ever want to shave a bit of overhead, you could cache
mergedSegmentCheckpoint.getMetadataMap().keySet()into a localSet<String>per checkpoint so you don’t rebuild the key set and array multiple times, but this is a micro‑optimization and not required.
1912-1950: Deprecated pending-merge API behavior and new checkpoint helpers are safe; consider tightening test API

Turning `addPendingMergeSegmentCheckpoint` into a no-op and `getPendingMergedSegmentCheckpoints` into `Collections.emptySet()` is a safe BWC strategy for any lingering callers, avoiding NPEs while steering usage to the new primary/replica-specific APIs.

The new helpers (`addPrimaryMergedSegmentCheckpoint`, `removePrimaryMergedSegmentCheckpoint`, `addReplicaMergedSegmentCheckpoint`, and their getters) are straightforward and correctly use unmodifiable views to avoid external mutation.

If these getters are only used in tests, you might optionally narrow their visibility (e.g., package-private + `@VisibleForTesting`) to avoid growing the public surface area of `IndexShard`, and, if you ever want a stable snapshot instead of a live view, wrap in `new HashSet<>(...)` before `unmodifiableSet(...)`. Not required, but would tighten the API.
2073-2085: Including primary merged checkpoints in referenced-segment computation looks correct; minor allocation nit

Building `segmentNames` from:

- the live `SegmentInfos` snapshot, plus
- `primaryMergedSegmentCheckpoints.stream().map(Tuple::v1)`

correctly ensures that pre-warmed (but not yet fully refreshed) merged segments remain "referenced" from the primary's point of view. That matches the retention and pre-warm design described in the PR.

Very minor: you could avoid a couple of intermediate `Collectors.toSet()` allocations by streaming directly into `segmentNames` (e.g., `infos.asList().forEach(...); primaryMergedSegmentCheckpoints.forEach(...);`), but that's purely cosmetic.
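The allocation nit can be sketched in isolation. This is a simplified model with plain strings rather than `SegmentInfos`, and the method and parameter names are hypothetical:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnionSketch {
    // Stream both sources straight into one target set instead of building
    // intermediate sets with Collectors.toSet() and unioning them afterwards.
    static Set<String> union(List<String> refreshedSegmentNames, List<String> prewarmedSegmentNames) {
        Set<String> segmentNames = new HashSet<>();
        refreshedSegmentNames.forEach(segmentNames::add); // from the live segment snapshot
        prewarmedSegmentNames.forEach(segmentNames::add); // from primary merged checkpoints
        return segmentNames;
    }

    public static void main(String[] args) {
        Set<String> names = union(List.of("_0", "_1"), List.of("_1", "_2"));
        System.out.println(names.size()); // prints 3
    }
}
```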
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)
400-410: Primary/replica checkpoint state separation and locking look sound

The split between `primaryMergedSegmentCheckpoints` and `replicaMergedSegmentCheckpoints`, plus the dedicated `cleanupReplicaMergedSegmentMutex` for `currentSegmentReplicationCheckpoint`, gives a clear concurrency story: primary-only state relies on a concurrent set, while replica cleanup coordinates via a shared mutex. I don't see correctness issues with this layout.
1815-1831: Replica checkpoint pruning on finalizeReplication is correct

Building `refreshedSegmentNames` from `infos` and then doing `replicaMergedSegmentCheckpoints.removeIf(...)` is a straightforward way to drop any pre-copied merged checkpoints that have now been refreshed. This keeps replica state bounded without introducing extra locking, and is safe given the concurrent set semantics.
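A minimal model of that pruning pattern, simplified to plain segment-name strings rather than the PR's actual checkpoint types:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ReplicaPruneSketch {
    // pre-copied merged segments awaiting refresh (simplified to names)
    final Set<String> replicaMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();

    // After a refresh makes segments visible, drop any pre-copied checkpoint
    // whose segment is now refreshed; removeIf on the concurrent set needs no
    // extra locking because its iterators are weakly consistent.
    void pruneRefreshed(Set<String> refreshedSegmentNames) {
        replicaMergedSegmentCheckpoints.removeIf(refreshedSegmentNames::contains);
    }

    public static void main(String[] args) {
        ReplicaPruneSketch sketch = new ReplicaPruneSketch();
        sketch.replicaMergedSegmentCheckpoints.add("_2");
        sketch.replicaMergedSegmentCheckpoints.add("_3");
        sketch.pruneRefreshed(Set.of("_2")); // "_2" became visible after refresh
        System.out.println(sketch.replicaMergedSegmentCheckpoints); // prints [_3]
    }
}
```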
1952-1963: Replica merged-segment state clearing now correctly participates in the mutex protocol

Wrapping both `setCurrentSegmentReplicationCheckpoint(...)` and `clearReplicaMergedSegmentState()` in `synchronized (cleanupReplicaMergedSegmentMutex)` brings these writes into the same critical section as the deletion logic in `cleanupRedundantPendingMergeSegment`. This removes the earlier data-race/visibility risk on `currentSegmentReplicationCheckpoint` and ensures consistent behavior under concurrency.
2056-2062: Always running primary checkpoint retention after referenced-segment publish is a good robustness improvement

Wrapping `referencedSegmentsPublisher.publish(...)` in a `try` with `removeExpiredPrimaryMergedSegmentCheckpoints()` in `finally` guarantees retention cleanup runs even if publish throws, which aligns with the goal of avoiding memory leaks from stale primary checkpoints. Since the cleanup method is pure in-memory logic and cannot throw `IOException`, this doesn't alter the exception surface of `publishReferencedSegments()`.
2089-2098: Merged-segment publish now correctly tracks primary checkpoints and rolls back on failure

Computing the `MergedSegmentCheckpoint`, adding it to `primaryMergedSegmentCheckpoints` before publishing, and then removing it in the `catch` path if `mergedSegmentPublisher.publish(...)` throws prevents stale primary entries from lingering on failed publishes. This keeps the in-memory checkpoint set consistent while still allowing retention cleanup to prune old, successfully-published entries.
❕ Gradle check result for 7ff4838: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
1905-1911: Retention logic for primary merged-segment checkpoints

Assuming `MergedSegmentCheckpoint#getCreatedTimeStamp()` and `DateUtils.toLong(Instant.now())` are both in nanoseconds since the epoch (as the comment states), `nowNanosSinceEpoch - c.v2()` with `Duration.ofNanos(...).toMillis()` is a correct way to compare against a millisecond retention value. If those units ever change, this is the key place that would need updating.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
Sets (57-185)
🔇 Additional comments (7)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (7)
100-111: New time imports and primary/replica checkpoint fields look consistent

Use of `DateUtils`, `Duration`, and `Instant` for retention plus `Sets.newConcurrentHashSet()` and a dedicated `cleanupReplicaMergedSegmentMutex` give you thread-safe containers and a clear separation between primary and replica merged-segment state. No issues spotted here.

Also applies to: 230-231, 400-409
1815-1831: Batch removal of refreshed replica merged segments is safe and efficient

Collecting `refreshedSegmentNames` into a `HashSet` and then using `replicaMergedSegmentCheckpoints.removeIf(...)` is both clear and safe with the concurrent set; it avoids per-segment mutation while iterating `infos`. Behavior matches the intent of dropping pending merged segments once they're visible after replication.
1874-1902: Replica cleanup now correctly guards against races with in-flight segment replication

The two-phase approach (build `pendingDeleteCheckpoints` first, then re-check under `cleanupReplicaMergedSegmentMutex` against `currentSegmentReplicationCheckpoint` and only delete when there is no metadata overlap) properly prevents deleting files required by an ongoing segment replication. The locking discipline is consistent across `setCurrentSegmentReplicationCheckpoint`, `clearReplicaMergedSegmentState`, and this method.
1913-1940: Deprecated pending-merged API and primary checkpoint helpers are safe

Turning `addPendingMergeSegmentCheckpoint` into a no-op and `getPendingMergedSegmentCheckpoints` into `Collections.emptySet()` is a safe deprecation path that avoids NPEs for any lingering callers. The new `add/remove/getPrimaryMergedSegmentCheckpoint*` helpers encapsulate the `<segmentName, createdTimeStamp>` tuples cleanly and expose an unmodifiable view for tests.
1942-1964: Replica merged-segment state helpers and mutex usage look correct

`addReplicaMergedSegmentCheckpoint` writes into a concurrent set, and `getReplicaMergedSegmentCheckpoints()` returns an unmodifiable view for tests. `setCurrentSegmentReplicationCheckpoint` and `clearReplicaMergedSegmentState` now both synchronize on `cleanupReplicaMergedSegmentMutex`, matching the mutex used in `cleanupRedundantPendingMergeSegment` and fixing the earlier data-race/visibility concern.
2059-2063: Referenced-segments publishing now accounts for pre-warmed primary segments and always runs retention

Wrapping `referencedSegmentsPublisher.publish(this, computeReferencedSegmentsCheckpoint())` in `try`/`finally` ensures `removeExpiredPrimaryMergedSegmentCheckpoints()` runs even if publish throws, which is important for avoiding leaks. Adding both on-disk segment names (from `SegmentInfos`) and names from `primaryMergedSegmentCheckpoints` into `segmentNames` correctly informs replicas about still-referenced pre-warmed segments.

Also applies to: 2075-2078
2092-2099: Rollback on failed merged-segment publish prevents stale primary checkpoints

Computing the `MergedSegmentCheckpoint`, adding it to `primaryMergedSegmentCheckpoints`, and then rolling it back in the `catch` block if `mergedSegmentPublisher.publish(...)` fails avoids leaving a stale primary checkpoint that could incorrectly influence `computeReferencedSegmentsCheckpoint()` until retention cleanup. This behavior aligns well with the new tracking model.
|
Hi @ashking94, hope you can continue to help review this PR, thank you. :)
|
@ashking94 |
Description
This PR is follow-up work to [18720]. Based on the discussion with @ashking94, I made some optimizations. The main purpose is to avoid the `listAll` operation in `IndexShard#computeReferencedSegmentsCheckpoint`.

Main Changes

- Introduced `IndexShard#primaryMergedSegmentCheckpoints` for the primary shard to track merged segment checkpoints that have been published for pre-warm. We add a merged segment checkpoint to `IndexShard#primaryMergedSegmentCheckpoints` before `publish_merged_segment`.
- To bound `IndexShard#primaryMergedSegmentCheckpoints`, introduced a setting `IndexSettings#INDEX_MERGED_SEGMENT_CHECKPOINT_RETENTION_TIME` (default 15 minutes) to clean up expired checkpoints after `publish_referenced_segments`.
- Renamed `IndexShard#pendingMergedSegmentCheckpoints` to `IndexShard#replicaMergedSegmentCheckpoints`, which the replica shard uses to record pre-copied merged segment checkpoints that have not yet been refreshed.
- Added `MergedSegmentWarmerIT#testPrimaryMergedSegmentCheckpointRetentionTimeout`, which constructs a case with a redundant merged segment checkpoint in the primary shard and deletes it based on the expiration time.

Concurrent Control
After a relatively long period of verification in the production environment, we have also fixed some issues. To enhance the robustness of the new logic, I introduced concurrency control between the merged segment cleanup and segment replication tasks.
We replaced the higher-overhead `listAll` operation with `primaryMergedSegmentCheckpoints` and the refreshed segments recorded in memory. However, entries in `primaryMergedSegmentCheckpoints` are deleted based on the expiration time. Consider the situation where a merged segment checkpoint in `primaryMergedSegmentCheckpoints` has not been refreshed but has already expired: it will not be included in `ReferencedSegmentsCheckpoint`. However, once it is refreshed, segment replication will reference it. Since the merged segment cleanup task and the segment replication task are executed concurrently, this may lead to the merged segment referenced by segment replication being incorrectly deleted.

This occurs when the merged segment warmer takes a long time and exceeds the expiration time. Although increasing the expiration time can reduce the probability of occurrence, we also need a more robust design.
Through synchronized control, we ensure that a merged segment referenced by the current segment replication task will never be deleted by the merged segment cleanup task. If the merged segment is deleted by the cleanup task before the segment replication task starts executing, those files will simply be sent to the replica shard again through segment replication.
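The guard described above can be sketched as a minimal, hypothetical model. The names (`currentReplicationFiles`, `tryDelete`) are illustrative, not the PR's actual API; the point is that both the replication path and the cleanup path take the same lock, so the file-reference check and the deletion are atomic with respect to each other:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanupGuard {
    private final Object mutex = new Object();
    private Set<String> currentReplicationFiles; // null when no replication is in flight
    final Set<String> deleted = ConcurrentHashMap.newKeySet();

    // Called by the segment replication path under the same mutex as cleanup,
    // so cleanup always sees the latest in-flight file set.
    void setCurrentReplicationFiles(Set<String> files) {
        synchronized (mutex) {
            currentReplicationFiles = files;
        }
    }

    // Returns true when the checkpoint's files were safe to delete.
    boolean tryDelete(Set<String> checkpointFiles) {
        synchronized (mutex) {
            boolean referenced = currentReplicationFiles != null
                && checkpointFiles.stream().anyMatch(currentReplicationFiles::contains);
            if (referenced) {
                return false; // an in-flight replication still needs these files
            }
            deleted.addAll(checkpointFiles); // stand-in for store.deleteQuiet(...)
            return true;
        }
    }

    public static void main(String[] args) {
        CleanupGuard guard = new CleanupGuard();
        guard.setCurrentReplicationFiles(Set.of("_1.cfs", "_1.si"));
        System.out.println(guard.tryDelete(Set.of("_1.cfs"))); // prints false: still referenced
        System.out.println(guard.tryDelete(Set.of("_0.cfs"))); // prints true: safe to delete
    }
}
```

If cleanup wins the race and deletes first, the files are not lost for good: as the text notes, the next segment replication round simply re-copies them to the replica.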
Summary
The correctness of the task that cleans up redundant merged segments in the replica shard depends on the accuracy of the primary shard's `ReferencedSegmentsCheckpoint`. `ReferencedSegmentsCheckpoint` consists of two parts: `primaryMergedSegmentCheckpoints` and the refreshed segments.

After the replica node receives a `ReferencedSegmentsCheckpoint`, it iterates through `replicaMergedSegmentCheckpoints`. If the `ReferencedSegmentsCheckpoint` does not contain a `MergedSegmentCheckpoint` and is ahead of it, then that `MergedSegmentCheckpoint` will be deleted.

For a clearer explanation, I will elaborate on the different scenarios.
1. The merged segment checkpoint in `primaryMergedSegmentCheckpoints` has not expired yet. `ReferencedSegmentsCheckpoint` will contain it, and the merged segment in the replicas will not be deleted.
2. The merged segment checkpoint in `primaryMergedSegmentCheckpoints` has been refreshed and has already expired. `ReferencedSegmentsCheckpoint` will contain it, and the merged segment in the replicas will not be deleted.
3. The merged segment checkpoint in `primaryMergedSegmentCheckpoints` has not been refreshed and has already expired. `ReferencedSegmentsCheckpoint` will not contain it. Through the above concurrency control, we avoid incorrectly deleting a merged segment that is referenced by segment replication.
4. The merged segment checkpoint in `primaryMergedSegmentCheckpoints` has already been deleted due to being merged again, and has already expired. `ReferencedSegmentsCheckpoint` will not contain it. The replica shard can safely delete the redundant merged segment.

Related Issues
Resolves #[18845]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
New Features
Improvements
Tests