
Add catalogSnapshotManager for dataformat aware engine #20982

Merged
mgodwan merged 4 commits into opensearch-project:main from alchemist51:catalog
Apr 2, 2026

Conversation

Contributor

@alchemist51 alchemist51 commented Mar 24, 2026

Description

This PR introduces the CatalogSnapshotManager needed for the dataformat-aware engine.

Related Issues


Closes #21067

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable. N/A
  • Public documentation issue/PR created, if applicable. N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 24, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0d62178)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add Writeable serialization support for Segment and WriterFileSet

Relevant files:

  • server/src/main/java/org/opensearch/index/engine/exec/WriterFileSet.java
  • server/src/main/java/org/opensearch/index/engine/exec/Segment.java
  • server/src/test/java/org/opensearch/index/engine/exec/WriterFileSetTests.java
  • server/src/test/java/org/opensearch/index/engine/exec/SegmentTests.java

Sub-PR theme: Introduce CatalogSnapshot abstraction and concrete implementations

Relevant files:

  • server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshot.java
  • server/src/main/java/org/opensearch/index/engine/exec/coord/DataformatAwareCatalogSnapshot.java
  • server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java
  • server/src/main/java/org/opensearch/index/engine/exec/coord/package-info.java
  • server/src/test/java/org/opensearch/index/engine/exec/coord/DataformatAwareCatalogSnapshotTests.java
  • server/src/test/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshotTests.java

Sub-PR theme: Integrate CatalogSnapshotManager into DataFormatAwareEngine

Relevant files:

  • server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java
  • server/src/main/java/org/opensearch/index/engine/DataFormatAwareEngine.java
  • server/src/test/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManagerTests.java
  • server/src/test/java/org/opensearch/index/engine/dataformat/DataFormatPluginTests.java
  • sandbox/plugins/analytics-engine/src/test/java/org/opensearch/analytics/exec/DefaultPlanExecutorTests.java

⚡ Recommended focus areas for review

Race Condition

In acquireSnapshot(), there is a potential race condition between reading latestCatalogSnapshot (volatile read) and calling tryIncRef(). If commitNewSnapshot() replaces the snapshot and calls decRefAndRemove() between the volatile read and tryIncRef(), the snapshot could reach zero refs and be closed before tryIncRef() succeeds. The current code throws an IllegalStateException in this case, but this could be a spurious failure for a valid concurrent acquire. A retry loop or synchronized block may be needed.

public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
    if (closed.get()) {
        throw new IllegalStateException("CatalogSnapshotManager is closed");
    }
    final CatalogSnapshot snapshot = latestCatalogSnapshot;
    if (snapshot.tryIncRef() == false) {
        throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
    }
    return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
}
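The retry-loop alternative mentioned above can be sketched in isolation. Snapshot, SnapshotHolder, and their fields here are hypothetical stand-ins for illustration, not the PR's actual classes:

```java
import java.util.concurrent.atomic.AtomicInteger;

class Snapshot {
    private final AtomicInteger refs = new AtomicInteger(1);
    final long generation;

    Snapshot(long generation) {
        this.generation = generation;
    }

    // CAS loop: succeed only while the count is still positive.
    boolean tryIncRef() {
        int cur;
        do {
            cur = refs.get();
            if (cur == 0) {
                return false; // already fully released
            }
        } while (!refs.compareAndSet(cur, cur + 1));
        return true;
    }

    void decRef() {
        refs.decrementAndGet();
    }
}

class SnapshotHolder {
    volatile Snapshot latest = new Snapshot(1);
    volatile boolean closed = false;

    // Retry instead of throwing when the snapshot we read was concurrently
    // replaced and released; only fail once the manager itself is closed.
    Snapshot acquire() {
        while (true) {
            if (closed) {
                throw new IllegalStateException("manager is closed");
            }
            Snapshot s = latest;
            if (s.tryIncRef()) {
                return s; // caller must decRef() when done
            }
            // s lost the race with a commit; loop and pin the new latest
        }
    }
}
```

With this shape, a concurrent commit that releases the snapshot between the volatile read and tryIncRef() causes a retry rather than a spurious IllegalStateException.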
Snapshot Leak

In commitNewSnapshot(), the new snapshot is never added to catalogSnapshotMap. The old snapshot is removed via decRefAndRemove, but nothing ever puts the new snapshot into the map, so the map only tracks the initial snapshot and any snapshots that are still referenced; newly committed snapshots are never tracked. If the map is intended to track all live snapshots, this is a bug.

public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
    assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";

    DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
        latestCatalogSnapshot.getId() + 1,
        latestCatalogSnapshot.getGeneration() + 1,
        latestCatalogSnapshot.getVersion(),
        refreshedSegments,
        latestCatalogSnapshot.getLastWriterGeneration() + 1,
        latestCatalogSnapshot.getUserData()
    );

    CatalogSnapshot oldSnapshot = latestCatalogSnapshot;
    latestCatalogSnapshot = newSnapshot;
    decRefAndRemove(oldSnapshot);
}
Null Directory

In the StreamInput constructor, SegmentInfos.readCommit is called with null as the directory argument. This may cause a NullPointerException or unexpected behavior depending on the Lucene version and how readCommit uses the directory. This should be validated or documented as intentional.

this.segmentInfos = SegmentInfos.readCommit(
    null,
    new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
    0L
);
Unchecked Exception Wrapping

The close() method of DataFormatAwareReader wraps IOException in a RuntimeException. Since GatedCloseable.close() can throw IOException, callers using try-with-resources may not expect a RuntimeException to be thrown. This could mask errors or cause unexpected behavior in production. Consider propagating IOException directly or logging and swallowing it with a clear rationale.

public void close() {
    try {
        snapshotRef.close();
    } catch (IOException e) {
        throw new RuntimeException("Failed to release catalog snapshot reference", e);
    }
}
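One of the alternatives suggested above, propagating IOException directly, could look roughly like this. DataFormatAwareReaderSketch and its snapshotRef field are illustrative stand-ins, not the PR's actual reader:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch: declare close() to throw IOException instead of wrapping it,
// so try-with-resources callers see the original checked exception.
class DataFormatAwareReaderSketch implements Closeable {
    private final Closeable snapshotRef; // stands in for the held GatedCloseable

    DataFormatAwareReaderSketch(Closeable snapshotRef) {
        this.snapshotRef = snapshotRef;
    }

    @Override
    public void close() throws IOException {
        snapshotRef.close(); // propagate IOException directly, no wrapping
    }
}
```

Since java.io.Closeable.close() already declares IOException, this keeps the failure visible to callers without changing the try-with-resources pattern.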
Visibility Concern

The ref-counting methods (decRef, tryIncRef, refCount) are package-private, but the comment says they are only accessible within exec.coord. However, test classes in DataformatAwareCatalogSnapshotTests call these methods directly, which works because the tests are in the same package. This is intentional but should be clearly documented, and care should be taken that no external code bypasses the CatalogSnapshotManager for ref management.

// Package-private ref counting — only accessible within exec.coord (i.e., CatalogSnapshotManager)

/**
 * Decrements the reference count. Returns {@code true} if the count reached zero
 * and {@link #closeInternal()} was invoked.
 */
boolean decRef() {
    return refCounter.decRef();
}

/**
 * Tries to increment the reference count. Returns {@code false} if the snapshot is already closed.
 */
boolean tryIncRef() {
    return refCounter.tryIncRef();
}

/**
 * Returns the current reference count.
 */
int refCount() {
    return refCounter.refCount();
}
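The decRef-closes-at-zero contract documented above can be illustrated with a self-contained sketch; RefCountedResource and closedOnce are illustrative names, not the OpenSearch implementation:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class RefCountedResource {
    private final AtomicInteger refs = new AtomicInteger(1);
    final AtomicBoolean closedOnce = new AtomicBoolean(false);

    // Returns true exactly when the count reaches zero and cleanup ran.
    boolean decRef() {
        int remaining = refs.decrementAndGet();
        if (remaining == 0) {
            closeInternal();
            return true;
        }
        return false;
    }

    // Fails once the resource has already been fully released.
    boolean tryIncRef() {
        int cur;
        do {
            cur = refs.get();
            if (cur == 0) {
                return false;
            }
        } while (!refs.compareAndSet(cur, cur + 1));
        return true;
    }

    int refCount() {
        return refs.get();
    }

    private void closeInternal() {
        // release underlying resources exactly once
        closedOnce.set(true);
    }
}
```

After the count hits zero, tryIncRef() must keep failing, which is exactly the property acquireSnapshot() relies on.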

@github-actions
Contributor

github-actions bot commented Mar 24, 2026

PR Code Suggestions ✨

Latest suggestions up to 0d62178

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
New snapshot not added to tracking map

The new snapshot is not added to catalogSnapshotMap before the old one is removed.
This means the new snapshot is never tracked in the map, so decRefAndRemove will
never be able to clean it up from the map when its ref count reaches zero. Add
catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot) before replacing
latestCatalogSnapshot.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [93-108]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
+    catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot);
     CatalogSnapshot oldSnapshot = latestCatalogSnapshot;
     latestCatalogSnapshot = newSnapshot;
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 8


Why: The newSnapshot is never added to catalogSnapshotMap, so decRefAndRemove can never clean it up when its ref count reaches zero. This is a real bug that would cause the map to only ever contain the initial snapshot and never track subsequent ones, breaking the cleanup logic.

Medium
Null directory passed to SegmentInfos deserialization

SegmentInfos.readCommit is called with a null directory, which may cause a
NullPointerException when Lucene tries to resolve segment file paths during
deserialization. A proper Directory instance should be provided, or an alternative
deserialization method that does not require a directory should be used.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

-public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
+// Provide a valid Directory instance instead of null, or use a directory-independent deserialization approach.
+// Example (if a directory is available via constructor parameter):
+public SegmentInfosCatalogSnapshot(StreamInput in, Directory directory) throws IOException {
     super(in);
     byte[] segmentInfosBytes = in.readByteArray();
     this.segmentInfos = SegmentInfos.readCommit(
-        null,
+        directory,
         new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
         0L
     );
 }
Suggestion importance[1-10]: 6


Why: Passing null as the directory to SegmentInfos.readCommit is potentially problematic and could cause NullPointerException at runtime when Lucene tries to resolve segment paths. However, the improved code changes the constructor signature which may not be directly applicable without broader refactoring.

Low
General
Race condition between close check and snapshot acquisition

There is a TOCTOU race between checking closed.get() and reading
latestCatalogSnapshot. A concurrent close() call could set closed to true and decRef
the snapshot between the check and the tryIncRef, causing tryIncRef to fail and
throwing a misleading exception. The closed check should be re-verified after
tryIncRef fails, or the entire acquire should be synchronized/use a lock consistent
with commitNewSnapshot.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [75-84]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
     if (closed.get()) {
         throw new IllegalStateException("CatalogSnapshotManager is closed");
     }
     final CatalogSnapshot snapshot = latestCatalogSnapshot;
     if (snapshot.tryIncRef() == false) {
+        if (closed.get()) {
+            throw new IllegalStateException("CatalogSnapshotManager is closed");
+        }
         throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
     }
     return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
 }
Suggestion importance[1-10]: 5


Why: There is a valid TOCTOU race between the closed.get() check and tryIncRef, but the suggested fix only improves the error message rather than actually preventing the race. The improvement is minor since the behavior is still correct (tryIncRef will fail safely if the snapshot is closed).

Low
CatalogSnapshotManager not closed with engine

The close() method of DataFormatAwareEngine (visible in the unchanged context) does
not close or null out the catalogSnapshotManager. If the engine is closed while the
manager is still open, the manager's snapshot reference will be leaked. The close()
method should call catalogSnapshotManager.close() if it is non-null.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareEngine.java [133-134]

+@Override
 public void close() throws IOException {
+    if (catalogSnapshotManager != null) {
+        catalogSnapshotManager.close();
+    }
+    // ... rest of existing close logic
+}
Suggestion importance[1-10]: 5


Why: The close() method of DataFormatAwareEngine not closing the catalogSnapshotManager is a valid resource leak concern. However, the existing code in the diff is not shown in detail, and the suggestion references code outside the PR diff's visible changes, making it harder to verify the exact impact.

Low
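The close-ordering idea above (release the manager together with the engine, tolerating the null case from the no-arg constructor) can be sketched in isolation. EngineSketch and ManagerSketch are illustrative stand-ins, not the PR's actual classes:

```java
import java.io.Closeable;
import java.io.IOException;

class ManagerSketch implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
        closed = true; // stands in for releasing the manager's snapshot ref
    }
}

class EngineSketch implements Closeable {
    final ManagerSketch catalogSnapshotManager; // may be null via the no-arg path

    EngineSketch(ManagerSketch manager) {
        this.catalogSnapshotManager = manager;
    }

    @Override
    public void close() throws IOException {
        if (catalogSnapshotManager != null) {
            catalogSnapshotManager.close(); // release the manager with the engine
        }
        // ... rest of the engine's existing close logic would follow here
    }
}
```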

Previous suggestions

Suggestions up to commit 0d62178
Category | Suggestion | Impact
Possible issue
New snapshot not added to tracking map on commit

The new snapshot is not added to catalogSnapshotMap before the old one is removed.
This means the new snapshot is never tracked in the map, so decRefAndRemove will
never clean it up from the map (since it was never inserted), and the map will
always be empty after the first commit. The new snapshot should be added to the map
before decRefing the old one.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [93-108]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
     CatalogSnapshot oldSnapshot = latestCatalogSnapshot;
     latestCatalogSnapshot = newSnapshot;
+    catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot);
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 8


Why: The commitNewSnapshot method creates a new snapshot but never adds it to catalogSnapshotMap, meaning the map only ever contains the initial snapshot. When decRefAndRemove is called for the new snapshot later (e.g., on close()), it won't find it in the map. This is a real tracking bug that could cause memory leaks or incorrect cleanup behavior.

Medium
Hardcoded generation causes deserialization mismatch

SegmentInfos.readCommit is called with a hardcoded generation of 0L, but the actual
generation is already deserialized from the stream via super(in) and stored in
this.generation. Using 0L will cause a mismatch between the snapshot's generation
and the SegmentInfos generation, potentially breaking equality checks and snapshot
tracking. The deserialized generation should be passed instead.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

 public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
     super(in);
     byte[] segmentInfosBytes = in.readByteArray();
     this.segmentInfos = SegmentInfos.readCommit(
         null,
         new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
-        0L
+        this.generation
     );
 }
Suggestion importance[1-10]: 6


Why: Using hardcoded 0L as the generation in SegmentInfos.readCommit may cause a mismatch with the actual generation deserialized via super(in). However, the third parameter to readCommit is used for validation, and the actual impact depends on Lucene's internal behavior — it may or may not cause failures in practice.

Low
General
Race condition between close check and snapshot acquisition

There is a TOCTOU race between reading closed.get() and reading
latestCatalogSnapshot. A concurrent close() call could set closed to true and decRef
the snapshot after the check but before tryIncRef, leaving the snapshot in a closed
state. While tryIncRef would return false in that case, the error message says
"already closed" rather than indicating the manager was closed, which is misleading.
More importantly, the check should be done atomically or the tryIncRef failure
should also check closed to provide a clear error.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [75-84]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
     if (closed.get()) {
         throw new IllegalStateException("CatalogSnapshotManager is closed");
     }
     final CatalogSnapshot snapshot = latestCatalogSnapshot;
     if (snapshot.tryIncRef() == false) {
+        if (closed.get()) {
+            throw new IllegalStateException("CatalogSnapshotManager is closed");
+        }
         throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
     }
     return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
 }
Suggestion importance[1-10]: 4


Why: The TOCTOU race is real but partially mitigated by tryIncRef() returning false if the snapshot is already closed. The improved error message clarity is a minor improvement, but the core safety is already handled by the ref-counting mechanism.

Low
Redundant exception catch block in deserialization

The catch (IOException e) { throw e; } block is redundant since the outer catch
(Exception e) already re-wraps non-IOExceptions. However, more critically,
Base64.getDecoder().decode() throws IllegalArgumentException (not IOException) for
invalid Base64 input, which is correctly caught by the outer handler. The redundant
inner catch should be removed to simplify the code and avoid confusion.

server/src/main/java/org/opensearch/index/engine/exec/coord/DataformatAwareCatalogSnapshot.java [162-172]

 try {
     byte[] bytes = Base64.getDecoder().decode(serializedData);
     try (BytesStreamInput in = new BytesStreamInput(bytes)) {
         return new DataformatAwareCatalogSnapshot(in, directoryResolver);
     }
-} catch (IOException e) {
-    throw e;
 } catch (Exception e) {
+    if (e instanceof IOException) {
+        throw (IOException) e;
+    }
     throw new IOException("Failed to deserialize DataformatAwareCatalogSnapshot: " + e.getMessage(), e);
 }
Suggestion importance[1-10]: 3


Why: The redundant catch (IOException e) { throw e; } block is a minor code style issue. The existing code is functionally correct — IOException is re-thrown as-is and other exceptions are wrapped. The suggested refactoring doesn't meaningfully improve correctness or readability.

Low
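The decode-failure behavior noted in the suggestion above (java.util.Base64 signals invalid input with IllegalArgumentException, not IOException) is easy to confirm in isolation; Base64RoundTrip is a throwaway helper for illustration:

```java
import java.util.Base64;

class Base64RoundTrip {
    // Invalid Base64 surfaces as IllegalArgumentException, which is why the
    // outer catch (Exception) in the snapshot deserializer is what wraps it
    // into an IOException for callers.
    static byte[] decodeOrNull(String data) {
        try {
            return Base64.getDecoder().decode(data);
        } catch (IllegalArgumentException e) {
            return null; // invalid Base64 input
        }
    }

    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```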
Suggestions up to commit f8c7c70
Category | Suggestion | Impact
Possible issue
Fix race condition in snapshot acquisition

There is a TOCTOU race between reading latestCatalogSnapshot and calling tryIncRef.
If commitNewSnapshot runs between these two operations and the old snapshot's ref
count drops to zero, tryIncRef will fail even though the manager is not closed. The
method should retry acquiring the latest snapshot in a loop when tryIncRef fails,
rather than throwing an exception.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [75-84]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
     if (closed.get()) {
         throw new IllegalStateException("CatalogSnapshotManager is closed");
     }
-    final CatalogSnapshot snapshot = latestCatalogSnapshot;
-    if (snapshot.tryIncRef() == false) {
-        throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
+    while (true) {
+        final CatalogSnapshot snapshot = latestCatalogSnapshot;
+        if (snapshot.tryIncRef()) {
+            return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
+        }
+        // snapshot was concurrently replaced and closed; retry with the new latest
+        if (closed.get()) {
+            throw new IllegalStateException("CatalogSnapshotManager is closed");
+        }
     }
-    return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
 }
Suggestion importance[1-10]: 8


Why: There is a genuine TOCTOU race between reading latestCatalogSnapshot and calling tryIncRef. If commitNewSnapshot replaces the snapshot between these two operations and the old snapshot's ref count drops to zero, tryIncRef will fail and throw an IllegalStateException even though the manager is still open. The retry loop correctly handles this concurrent scenario.

Medium
Register new snapshot in map before removing old one

The new snapshot is not added to catalogSnapshotMap before the old one is removed,
leaving a window where the map contains no entry for the current snapshot. The new
snapshot should be registered in the map before decRefing the old one to ensure
consistent tracking.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [93-108]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
+    catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot);
     CatalogSnapshot oldSnapshot = latestCatalogSnapshot;
     latestCatalogSnapshot = newSnapshot;
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 7


Why: The catalogSnapshotMap is never populated with new snapshots after construction, so it only ever contains the initial snapshot. Adding catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot) before decRefing the old one ensures consistent tracking and prevents a window where the map has no entry for the current snapshot.

Medium
Null directory passed to SegmentInfos deserialization

SegmentInfos.readCommit is called with a null directory, which may cause a
NullPointerException when Lucene tries to resolve segment file paths during
deserialization. A proper Directory instance should be provided, or an alternative
deserialization method that does not require a directory should be used.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

-public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
-    super(in);
-    byte[] segmentInfosBytes = in.readByteArray();
-    this.segmentInfos = SegmentInfos.readCommit(
-        null,
-        new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
-        0L
-    );
-}
+// SegmentInfos.readCommit requires a non-null Directory for resolving segment files.
+// Ensure a valid Directory is passed here, or use an in-memory Directory implementation.
Suggestion importance[1-10]: 5


Why: Passing null as the directory to SegmentInfos.readCommit is a potential NullPointerException risk depending on the Lucene version and what segment files are referenced. However, the 'improved_code' only adds a comment rather than providing a concrete fix, making it less actionable.

Low
General
Document unsafe no-arg constructor to prevent misuse

The no-arg constructor leaves catalogSnapshotManager as null, and acquireReader()
will throw IllegalStateException if called before setCatalogSnapshotManager. This is
a footgun that can cause runtime failures. Consider removing this constructor or
adding a clear @Deprecated or @VisibleForTesting annotation to discourage its use in
production code paths.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareEngine.java [54-56]

+/** @deprecated Use {@link #DataFormatAwareEngine(Map, CatalogSnapshotManager)} instead. */
+@Deprecated
 public DataFormatAwareEngine(Map<DataFormat, EngineReaderManager<?>> readerManagers) {
     this.readerManagers = readerManagers;
 }
Suggestion importance[1-10]: 3

__

Why: Adding a @Deprecated annotation to the no-arg constructor is a minor documentation improvement that helps prevent misuse, but it doesn't fix any functional issue and has low impact on correctness.

Low
Suggestions up to commit 7bd7893
Category | Suggestion | Impact
Possible issue
Register new snapshot in map before removing old one

The new snapshot is not added to catalogSnapshotMap before the old one is removed.
If decRefAndRemove triggers closeInternal on the old snapshot and any cleanup logic
queries the map, the new snapshot won't be found. Add the new snapshot to the map
before decRefing the old one.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [93-108]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
     CatalogSnapshot oldSnapshot = latestCatalogSnapshot;
+    catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot);
     latestCatalogSnapshot = newSnapshot;
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 6


Why: The new snapshot is never added to catalogSnapshotMap, so the map only tracks the initial snapshot. This means the map-based tracking is incomplete, though the current decRefAndRemove logic only removes entries. The suggestion correctly identifies that newSnapshot should be registered before the old one is removed to maintain consistent map state.

Low
Fix race condition between close and acquire

There is a TOCTOU race between checking closed.get() and calling tryIncRef. After
the closed check passes, close() could be called concurrently, decRefing the
snapshot to zero before tryIncRef is called. The tryIncRef failure path handles
this, but the error message says "already closed" which is misleading — it should
also cover the case where the manager was closed concurrently. More importantly,
after close() sets closed=true and decRefs the snapshot, a concurrent
acquireSnapshot could still succeed via tryIncRef if the snapshot hasn't reached
zero yet, allowing acquisition on a closing manager. Consider re-checking closed
after a successful tryIncRef.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [75-84]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
     if (closed.get()) {
         throw new IllegalStateException("CatalogSnapshotManager is closed");
     }
     final CatalogSnapshot snapshot = latestCatalogSnapshot;
     if (snapshot.tryIncRef() == false) {
         throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
     }
+    // Re-check closed after tryIncRef to avoid acquiring on a closing manager
+    if (closed.get()) {
+        decRefAndRemove(snapshot);
+        throw new IllegalStateException("CatalogSnapshotManager is closed");
+    }
     return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
 }
Suggestion importance[1-10]: 5


Why: There is a valid TOCTOU race between the closed.get() check and tryIncRef, where a concurrent close() could allow acquisition on a closing manager. The fix of re-checking closed after tryIncRef is a reasonable mitigation, though the window is narrow and the existing tryIncRef failure path partially handles it.

Low
General
Null directory in deserialization may cause runtime failures

SegmentInfos.readCommit with a null directory will fail if the SegmentInfos
references any segment files that need to be resolved against a directory. This will
cause a NullPointerException at runtime when the deserialized snapshot is used. A
proper directory or a no-op directory implementation should be provided.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

 public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
     super(in);
     byte[] segmentInfosBytes = in.readByteArray();
+    // NOTE: Passing null directory is only safe if the SegmentInfos does not reference
+    // any segment files requiring directory resolution (e.g., empty or metadata-only infos).
     this.segmentInfos = SegmentInfos.readCommit(
         null,
         new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
         0L
     );
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion points out a potential issue with passing null as the directory to SegmentInfos.readCommit, but the 'improved_code' only adds a comment rather than fixing the issue. This makes it a documentation-only change that doesn't actually resolve the potential NullPointerException.

Low
Suggestions up to commit 06fffce
CategorySuggestion                                                                                                                                    Impact
Possible issue
Track new snapshot in map before replacing current

The new snapshot is not added to catalogSnapshotMap before the old one is decRef'd.
If the old snapshot's ref count reaches zero immediately, decRefAndRemove will try
to remove it, but the new snapshot was never tracked. Any concurrent acquireSnapshot
call after the volatile write of latestCatalogSnapshot but before the map insertion
could also result in an untracked snapshot. The new snapshot should be added to the
map before replacing latestCatalogSnapshot.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [93-108]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
+    catalogSnapshotMap.put(newSnapshot.getGeneration(), newSnapshot);
     DataformatAwareCatalogSnapshot oldSnapshot = latestCatalogSnapshot;
     latestCatalogSnapshot = newSnapshot;
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 7


Why: The new snapshot is never added to catalogSnapshotMap, which means it's untracked and decRefAndRemove will never find it to clean up. This is a real correctness issue — the map is used for lifecycle tracking but the new snapshot is always missing from it.

Medium
Use correct generation when deserializing SegmentInfos

SegmentInfos.readCommit is called with a hardcoded generation of 0L, but the actual
generation was already deserialized from the stream via super(in) and is available
as this.generation. Using 0L will cause a mismatch between the snapshot's generation
and the SegmentInfos generation, potentially breaking generation-based logic.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

 public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
     super(in);
     byte[] segmentInfosBytes = in.readByteArray();
     this.segmentInfos = SegmentInfos.readCommit(
         null,
         new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
-        0L
+        generation
     );
 }
Suggestion importance[1-10]: 6


Why: Using hardcoded 0L instead of generation when calling SegmentInfos.readCommit could cause a generation mismatch. The actual generation is already available from super(in), so this is a straightforward fix that prevents subtle bugs in generation-based logic.

Low
General
Fix race condition in snapshot acquisition check

There is a TOCTOU race: the manager could be closed between the closed.get() check
and the tryIncRef() call. If close() runs in between, tryIncRef() will return false
and an IllegalStateException is thrown, but the error message says "already closed"
rather than "manager is closed", which is misleading. More critically, if tryIncRef
succeeds on a snapshot that was already decRef'd to zero by close(), the snapshot
could be used after closeInternal() was called. The closed flag check should be
re-verified after a failed tryIncRef to provide a correct error message, and the
overall acquire logic should be made safe against this race.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [75-84]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
     if (closed.get()) {
         throw new IllegalStateException("CatalogSnapshotManager is closed");
     }
     final DataformatAwareCatalogSnapshot snapshot = latestCatalogSnapshot;
     if (snapshot.tryIncRef() == false) {
+        if (closed.get()) {
+            throw new IllegalStateException("CatalogSnapshotManager is closed");
+        }
         throw new IllegalStateException("CatalogSnapshot [gen=" + snapshot.getGeneration() + "] is already closed");
     }
     return new GatedCloseable<>(snapshot, () -> decRefAndRemove(snapshot));
 }
Suggestion importance[1-10]: 4


Why: The suggestion correctly identifies a TOCTOU race between the closed.get() check and tryIncRef(), but the improved code only improves the error message rather than fixing the actual race. The core issue (using a snapshot after closeInternal()) is not addressed by the proposed fix.

Low
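The re-check pattern this suggestion describes can be sketched in a self-contained form: after a failed `tryIncRef`, re-check the closed flag so callers get an accurate error, and a snapshot whose count already hit zero is never handed out. The names here are hypothetical; the PR's real classes differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class AcquireSketch {
    public final AtomicBoolean closed = new AtomicBoolean(false);
    public final AtomicInteger refs = new AtomicInteger(1);

    // Increment only if the count has not already dropped to zero.
    public boolean tryIncRef() {
        for (;;) {
            int n = refs.get();
            if (n == 0) {
                return false; // fully released; must not be resurrected
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    public void acquire() {
        if (closed.get()) {
            throw new IllegalStateException("manager is closed");
        }
        if (tryIncRef() == false) {
            // close() may have run between the check above and tryIncRef():
            // re-check so the error message reflects what actually happened.
            if (closed.get()) {
                throw new IllegalStateException("manager is closed");
            }
            throw new IllegalStateException("snapshot is already closed");
        }
    }

    public static void main(String[] args) {
        AcquireSketch sketch = new AcquireSketch();
        sketch.acquire();
        System.out.println(sketch.refs.get()); // 2: the manager's ref plus the caller's
    }
}
```

Note this only fixes the error message, as the review's own scoring observes; a full fix would also have to guarantee that `tryIncRef` cannot succeed after `close()` released the manager's reference.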
Suggestions up to commit 3537db0
Category | Suggestion | Impact
Possible issue
Use correct generation when deserializing SegmentInfos

SegmentInfos.readCommit is called with a hardcoded generation of 0L, but the actual
generation was already deserialized from the stream via super(in) and stored in
this.generation. Passing 0L instead of the actual generation may cause Lucene to
reject the commit or produce a snapshot with an incorrect generation, leading to
subtle bugs in segment tracking.

server/src/main/java/org/opensearch/index/engine/exec/coord/SegmentInfosCatalogSnapshot.java [62-70]

 public SegmentInfosCatalogSnapshot(StreamInput in) throws IOException {
     super(in);
     byte[] segmentInfosBytes = in.readByteArray();
     this.segmentInfos = SegmentInfos.readCommit(
         null,
         new BufferedChecksumIndexInput(new ByteArrayIndexInput("SegmentInfos", segmentInfosBytes)),
-        0L
+        generation
     );
 }
Suggestion importance[1-10]: 7


Why: Passing hardcoded 0L instead of the actual generation to SegmentInfos.readCommit could cause Lucene to reject the commit or produce incorrect behavior. Using this.generation (already deserialized via super(in)) is the correct fix and addresses a real potential bug.

Medium
Track new snapshot in map before replacing old one

The new snapshot is not added to catalogSnapshotMap before decRefAndRemove is called
on the old one. If the old snapshot's ref count reaches zero immediately, it gets
removed, but the new snapshot was never tracked. Any concurrent acquireSnapshot call
that races between the volatile write and the map update could result in a snapshot
that is live but not tracked, breaking the invariant that all live snapshots are in
the map.
Add the new snapshot to catalogSnapshotMap before replacing
latestCatalogSnapshot and before decRefing the old one.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [109-124]

 public synchronized void commitNewSnapshot(List<Segment> refreshedSegments) {
     assert closed.get() == false : "Cannot commit to a closed CatalogSnapshotManager";
 
     DataformatAwareCatalogSnapshot newSnapshot = new DataformatAwareCatalogSnapshot(
         latestCatalogSnapshot.getId() + 1,
         latestCatalogSnapshot.getGeneration() + 1,
         latestCatalogSnapshot.getVersion(),
         refreshedSegments,
         latestCatalogSnapshot.getLastWriterGeneration() + 1,
         latestCatalogSnapshot.getUserData()
     );
 
+    if (catalogSnapshotMap.putIfAbsent(newSnapshot.getGeneration(), newSnapshot) != null) {
+        throw new IllegalStateException(
+            "Duplicate snapshot generation [" + newSnapshot.getGeneration() + "] in catalog snapshot map"
+        );
+    }
     DataformatAwareCatalogSnapshot oldSnapshot = latestCatalogSnapshot;
     latestCatalogSnapshot = newSnapshot;
     decRefAndRemove(oldSnapshot);
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that the new snapshot is never added to catalogSnapshotMap, which breaks the invariant that all live snapshots are tracked. However, since commitNewSnapshot is synchronized and acquireSnapshot uses tryIncRef on the volatile latestCatalogSnapshot, the race condition risk is somewhat mitigated. Still, the map tracking invariant is a real concern worth fixing.

Low
Use conditional map removal to prevent race conditions

decRefAndRemove only accepts DataformatAwareCatalogSnapshot, but catalogSnapshotMap
is typed as Map<Long, CatalogSnapshot>. The acquireSnapshot method returns a
GatedCloseable whose close action calls decRefAndRemove(snapshot) where snapshot is
a DataformatAwareCatalogSnapshot. However, if the map ever holds a different
CatalogSnapshot subtype (e.g., after a future refactor), the cast would fail
silently or cause a compile error. More critically, the GatedCloseable close lambda
captures the DataformatAwareCatalogSnapshot directly, so if latestCatalogSnapshot is
replaced between acquire and close, the wrong generation could be removed from the
map. The catalogSnapshotMap.remove(gen) should use remove(gen, snapshot) to avoid
removing a newer snapshot with the same generation.

server/src/main/java/org/opensearch/index/engine/exec/coord/CatalogSnapshotManager.java [130-135]

 private void decRefAndRemove(DataformatAwareCatalogSnapshot snapshot) {
     final long gen = snapshot.getGeneration();
     if (snapshot.decRef()) {
-        catalogSnapshotMap.remove(gen);
+        catalogSnapshotMap.remove(gen, snapshot);
     }
 }
Suggestion importance[1-10]: 5


Why: Using catalogSnapshotMap.remove(gen, snapshot) instead of catalogSnapshotMap.remove(gen) is a valid defensive improvement to avoid accidentally removing a newer snapshot with the same generation. The improvement is minor but correct and improves correctness of the map cleanup logic.

Low
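The two-argument `remove(key, value)` this suggestion relies on is part of the standard `java.util.concurrent.ConcurrentMap` contract: the entry is removed only if the key is still mapped to the given value. A standalone demonstration of why that protects a newer snapshot that reuses the same generation key:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConditionalRemoveDemo {
    public static void main(String[] args) {
        ConcurrentMap<Long, String> byGeneration = new ConcurrentHashMap<>();
        byGeneration.put(5L, "old-snapshot");
        // A newer snapshot later reuses the same generation key.
        byGeneration.put(5L, "new-snapshot");

        // Two-arg remove only evicts the entry if the key still maps to the given
        // value, so a stale closer cannot knock out the newer snapshot.
        boolean removedStale = byGeneration.remove(5L, "old-snapshot"); // no-op: value changed
        boolean removedLive = byGeneration.remove(5L, "new-snapshot");  // removes the live entry

        System.out.println(removedStale + " " + removedLive); // prints "false true"
    }
}
```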
General
Enforce required dependency at construction time

The no-arg constructor and setCatalogSnapshotManager setter allow the engine to
exist in an uninitialized state where catalogSnapshotManager is null. While
acquireReader() guards against this, concurrent calls between
setCatalogSnapshotManager and acquireReader are not thread-safe since
catalogSnapshotManager is volatile but the null-check and read in acquireReader are
not atomic. Consider removing the no-arg constructor and setter to enforce
initialization at construction time, or at minimum document the threading
requirements clearly.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareEngine.java [54-60]

+// Remove the no-arg constructor and setter; require CatalogSnapshotManager at construction time.
+// If deferred initialization is truly needed, document that setCatalogSnapshotManager must
+// happen-before any acquireReader() call (e.g., under external synchronization).
 public DataFormatAwareEngine(Map<DataFormat, EngineReaderManager<?>> readerManagers) {
-    this.readerManagers = readerManagers;
+    throw new UnsupportedOperationException(
+        "CatalogSnapshotManager is required; use DataFormatAwareEngine(readerManagers, snapshotManager)"
+    );
 }
 
-public void setCatalogSnapshotManager(CatalogSnapshotManager catalogSnapshotManager) {
-    this.catalogSnapshotManager = catalogSnapshotManager;
-}
-
Suggestion importance[1-10]: 3


Why: The suggestion raises a valid design concern about the two-phase initialization pattern, but the improved_code replaces the no-arg constructor with an UnsupportedOperationException which would break existing callers. The suggestion is more of a design recommendation than a concrete fix, and the volatile field with null-check in acquireReader provides adequate safety for the documented use case.

Low
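The constructor-injection alternative the suggestion proposes can be sketched without breaking callers the way the improved_code would. All types and names here (`EngineSketch`, `SnapshotManager`) are placeholders, not the PR's real signatures.

```java
import java.util.Objects;

public class EngineSketch {
    public interface SnapshotManager { }

    private final SnapshotManager snapshotManager;

    public EngineSketch(SnapshotManager snapshotManager) {
        // Fail fast: the engine can never exist in a half-initialized state,
        // so no null-check is needed on the read path.
        this.snapshotManager = Objects.requireNonNull(snapshotManager, "snapshotManager is required");
    }

    public SnapshotManager snapshotManager() {
        return snapshotManager;
    }

    public static void main(String[] args) {
        EngineSketch engine = new EngineSketch(new SnapshotManager() { });
        System.out.println(engine.snapshotManager() != null); // true
    }
}
```

Making the field `final` also gives safe publication for free, which the volatile-plus-setter pattern only approximates.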

@github-actions
Contributor

Persistent review updated to latest commit 48e6d44

@github-actions
Contributor

❌ Gradle check result for 48e6d44: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 49f3ea4

@github-actions
Contributor

Persistent review updated to latest commit b0a01b1

@github-actions
Contributor

Persistent review updated to latest commit c915bcd

@github-actions
Contributor

✅ Gradle check result for c915bcd: SUCCESS

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 90.22989% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.29%. Comparing base (0f8cc13) to head (0d62178).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
...opensearch/index/engine/DataFormatAwareEngine.java 61.11% 6 Missing and 1 partial ⚠️
...ndex/engine/exec/coord/CatalogSnapshotManager.java 87.50% 1 Missing and 3 partials ⚠️
...ine/exec/coord/DataformatAwareCatalogSnapshot.java 93.44% 3 Missing and 1 partial ⚠️
...engine/exec/coord/SegmentInfosCatalogSnapshot.java 92.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20982      +/-   ##
============================================
+ Coverage     73.20%   73.29%   +0.09%     
- Complexity    72751    73130     +379     
============================================
  Files          5871     5921      +50     
  Lines        332688   333648     +960     
  Branches      48017    48110      +93     
============================================
+ Hits         243543   244551    +1008     
+ Misses        69625    69546      -79     
- Partials      19520    19551      +31     

☔ View full report in Codecov by Sentry.

@alchemist51 alchemist51 added the Indexing Indexing, Bulk Indexing and anything related to indexing label Mar 24, 2026
@github-actions
Contributor

Persistent review updated to latest commit f0f5940

@github-actions
Contributor

❌ Gradle check result for f0f5940: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit b33d545

@github-actions
Contributor

❌ Gradle check result for b33d545: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit d06c7ba

@github-actions
Contributor

❌ Gradle check result for d06c7ba: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit ae1660b

@github-actions
Contributor

❌ Gradle check result for ae1660b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@alchemist51 alchemist51 force-pushed the catalog branch 2 times, most recently from 76e3153 to 7539806 Compare March 30, 2026 11:45
@github-actions
Contributor

Persistent review updated to latest commit 7539806

@github-actions
Contributor

❌ Gradle check result for 7539806: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 2b0fddb

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

❌ Gradle check result for a0c39a0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 920c47e

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit aa1828b

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 4239439

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

❌ Gradle check result for 4239439: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 3537db0

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 06fffce

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 7bd7893

Signed-off-by: Arpit Bandejiya <abandeji@amazon.com>
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit f8c7c70

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

❕ Gradle check result for f8c7c70: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Persistent review updated to latest commit 0d62178

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

❌ Gradle check result for 0d62178: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@alchemist51 alchemist51 closed this Apr 2, 2026
@alchemist51 alchemist51 reopened this Apr 2, 2026
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Persistent review updated to latest commit 0d62178

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

✅ Gradle check result for 0d62178: SUCCESS

@mgodwan mgodwan merged commit 4b0dc3d into opensearch-project:main Apr 2, 2026
27 of 29 checks passed

Labels

Indexing Indexing, Bulk Indexing and anything related to indexing lucene


Development

Successfully merging this pull request may close these issues.

Add CatalogSnapshotManager for DataformatAwareEngine

3 participants