Segment Replication - Fix NoSuchFileException errors caused when computing metadata snapshot on primary shards. #4422
…uting metadata snapshot on primary shards. (opensearch-project#4366)

* Segment Replication - Fix NoSuchFileException errors caused when computing metadata snapshot on primary shards.

  This change fixes the errors that occur when computing metadata snapshots on primary shards from the latest in-memory SegmentInfos. The error occurs when a segments_N file that is referenced by the in-memory infos is deleted as part of a concurrent commit. The segments themselves are incref'd by IndexWriter.incRefDeleter but the commit file (Segments_N) is not. This change resolves this by ignoring the segments_N file when computing metadata for CopyState and only sending incref'd segment files to replicas.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix spotless.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update StoreTests.testCleanupAndPreserveLatestCommitPoint to assert additional segments are deleted.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Rename snapshot to metadataMap in CheckpointInfoResponse.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Refactor segmentReplicationDiff method to compute off two maps instead of MetadataSnapshots.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix spotless.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Revert catchall in SegmentReplicationSourceService.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Revert log lvl change.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix SegmentReplicationTargetTests

  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Cleanup unused logger.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
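One of the commits above refactors segmentReplicationDiff to work on two maps rather than on MetadataSnapshots. The page does not show that method, so the following is only a minimal sketch of what a map-based diff could look like. The class name, the `Diff` holder, and the use of a plain checksum string as the map value are all assumptions made for illustration; the real code presumably maps file names to richer metadata objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetadataMapDiffSketch {

    /** Hypothetical result holder: file names grouped by how source and target compare. */
    static final class Diff {
        final List<String> identical = new ArrayList<>();
        final List<String> different = new ArrayList<>();
        final List<String> missing = new ArrayList<>();
    }

    /**
     * Compares two maps of file name to checksum.
     * A file is "missing" if the target lacks it, "different" if the checksums disagree,
     * and "identical" otherwise. Only files present in the source are considered.
     */
    static Diff diff(Map<String, String> source, Map<String, String> target) {
        Diff result = new Diff();
        // TreeMap gives a stable, sorted iteration order for the output lists.
        for (Map.Entry<String, String> entry : new TreeMap<>(source).entrySet()) {
            String targetChecksum = target.get(entry.getKey());
            if (targetChecksum == null) {
                result.missing.add(entry.getKey());
            } else if (targetChecksum.equals(entry.getValue())) {
                result.identical.add(entry.getKey());
            } else {
                result.different.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> primary = Map.of("_0.si", "a", "_1.si", "b", "_2.si", "c");
        Map<String, String> replica = Map.of("_0.si", "a", "_1.si", "x");
        Diff d = diff(primary, replica);
        System.out.println("identical=" + d.identical + " different=" + d.different + " missing=" + d.missing);
    }
}
```

Working on plain maps means the caller does not need a segments_N file on disk to build the inputs, which fits the intent described in the first commit.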
Gradle Check (Jenkins) Run Completed with:
The last run failed with the flaky test failure below. Refiring!
Gradle Check (Jenkins) Run Completed with:
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##              2.x    #4422    +/-   ##
==========================================
  Coverage     70.54%   70.55%
- Complexity    56942    57101   +159
==========================================
  Files          4572     4584    +12
  Lines        273816   274453   +637
  Branches      40152    40220    +68
==========================================
+ Hits         193170   193629   +459
- Misses        64455    64595   +140
- Partials      16191    16229    +38
final List<ShardSegments> replicaShardSegments = segmentListMap.get(false);
// if we don't have any segments yet, proceed.
final ShardSegments primaryShardSegments = primaryShardSegmentsList.stream().findFirst().get();
logger.debug("Primary Segments: {}", primaryShardSegments.getSegments());
Did you mean to leave this in?
Yeah, I think this can remain.
final Map<String, Segment> latestPrimarySegments = getLatestSegments(primaryShardSegments);
final Long latestPrimaryGen = latestPrimarySegments.values().stream().findFirst().map(Segment::getGeneration).get();
for (ShardSegments shardSegments : replicaShardSegments) {
logger.debug("Replica {} Segments: {}", shardSegments.getShardRouting(), shardSegments.getSegments());
return new MetadataSnapshot(segmentInfos, directory, logger);
}

/**
Nit: could add a line explaining why we're leaving out the segments_N files.
Thanks @Poojita-Raj for the comment. This change is needed to fix the NoSuchFileException.
The PR against main, #4366, contains more details about the issue and the fix.
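As the PR description puts it, the segment files are incref'd but the segments_N commit file is not, so a concurrent commit can delete it while metadata is being computed. A minimal sketch of the idea follows, assuming a hypothetical helper that drops the commit file from the list of referenced file names before metadata is read. The class and method names are invented for illustration, and the prefix is hard-coded here rather than taken from Lucene; in Lucene itself, `SegmentInfos.files(boolean includeSegmentsFile)` can return the file list without the commit file, though this page does not show whether the PR uses it.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

public class CopyStateFilesSketch {

    // Lucene names commit files "segments_N"; the prefix is hard-coded here as an assumption
    // so that the sketch has no Lucene dependency.
    static final String COMMIT_FILE_PREFIX = "segments_";

    /**
     * Hypothetical helper: returns the referenced file names without any segments_N file,
     * sorted for a predictable order. Only these files would then be read for metadata
     * and offered to replicas.
     */
    static List<String> withoutCommitFile(Collection<String> referencedFiles) {
        return referencedFiles.stream()
            .filter(name -> !name.startsWith(COMMIT_FILE_PREFIX))
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // File names are made up for the example.
        List<String> referenced = List.of("_0.cfe", "_0.cfs", "_0.si", "segments_3");
        System.out.println(withoutCommitFile(referenced));
    }
}
```

Because the commit file is never opened, its deletion by a concurrent commit can no longer surface as a NoSuchFileException during the metadata computation.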
Manual backport of #4366 to 2.x