
Add current_application_duration_ms to cluster state download stats in node stats API #20922

Draft

Ayushiarya246 wants to merge 1 commit into opensearch-project:main from Ayushiarya246:main

Conversation

@Ayushiarya246

Description

This PR adds a new metric current_application_duration_ms to the cluster state download stats exposed via the Node Stats API (_nodes/stats/discovery).

Related Issues

Resolves #20527

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added Cluster Manager enhancement Enhancement or improvement to existing feature or request labels Mar 19, 2026
@github-actions
Contributor

github-actions bot commented Mar 19, 2026

PR Reviewer Guide 🔍

(Review updated until commit ec210dc)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add getCurrentApplicationDurationMs tracking to ClusterApplierService

Relevant files:

  • server/src/main/java/org/opensearch/cluster/service/ClusterApplier.java
  • server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java
  • server/src/test/java/org/opensearch/cluster/service/ClusterApplierServiceTests.java

Sub-PR theme: Expose current_application_duration_ms in cluster state download stats

Relevant files:

  • server/src/main/java/org/opensearch/gateway/remote/RemoteDownloadStats.java
  • server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
  • server/src/main/java/org/opensearch/node/Node.java
  • server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateServiceTests.java
  • server/src/test/java/org/opensearch/gateway/GatewayMetaStatePersistedStateTests.java
  • server/src/test/java/org/opensearch/cluster/coordination/PersistedStateStatsTests.java
  • server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteStatePublicationIT.java

⚡ Recommended focus areas for review

Race Condition

The applicationStartTimeNanos field is set to System.nanoTime() at the start of runTask, but there is a potential race condition: if getCurrentApplicationDurationMs() is called between the time applicationStartTimeNanos is set and when the task actually begins meaningful work, the duration could be slightly inflated. More critically, if applicationStartTimeNanos is 0 initially (default for volatile long), the check if (startNanos == 0) correctly handles the idle case, but there is no guarantee that the field is reset to 0 atomically before the next task starts, since the reset happens at the end of the task and the next task could start on the same thread immediately after. This is generally safe for a single-threaded applier, but should be validated.

this.applicationStartTimeNanos = System.nanoTime();
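For readers outside the PR, the timing pattern under review can be sketched as a small stand-alone class. This is illustrative only — `DurationTracker` and its method names are simplified stand-ins, not the PR's actual `ClusterApplierService` code:

```java
// Illustrative sketch of the single-writer timing pattern discussed above:
// a volatile long holds the start timestamp, with 0 as the "idle" sentinel.
class DurationTracker {
    private volatile long startNanos = 0L; // 0 => no application in progress

    void begin() { startNanos = System.nanoTime(); }

    void end() { startNanos = 0L; } // reset so readers observe "idle"

    long currentDurationMs() {
        long start = startNanos;           // single volatile read
        if (start == 0L) {
            return 0L;                     // idle case
        }
        // Clamp defensively; the guard documents intent even though
        // nanoTime() is monotonic relative to the captured start value.
        return Math.max(0L, (System.nanoTime() - start) / 1_000_000L);
    }
}
```

This shape is safe when `begin()`/`end()` run on a single applier thread while other threads poll `currentDurationMs()`, which matches the review's single-threaded-applier observation.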

Side Effect in Getter

The getFullDownloadStats() and getDiffDownloadStats() methods have side effects: they call setCurrentApplicationDurationMs() on the stats object before returning it. This is unexpected behavior for getter methods and could cause issues if these methods are called multiple times or in contexts where mutation is not expected (e.g., during serialization or logging). Consider separating the update logic from the retrieval logic.

public PersistedStateStats getFullDownloadStats() {
    ((RemoteDownloadStats) remoteStateStats.getRemoteFullDownloadStats())
        .setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
    return remoteStateStats.getRemoteFullDownloadStats();
}

public PersistedStateStats getDiffDownloadStats() {
    ((RemoteDownloadStats) remoteStateStats.getRemoteDiffDownloadStats())
        .setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
    return remoteStateStats.getRemoteDiffDownloadStats();
}

Weak Assertion

The test testCurrentApplicationDurationInDownloadStats asserts duration >= 0 during application, which is a very weak assertion. Since getCurrentApplicationDurationMs returns 0 when idle and a positive value when running, the test should assert duration > 0 to meaningfully verify that the metric is being tracked during an active application. The current assertion would pass even if the feature is broken and always returns 0.

Stale Value

The currentApplicationDurationMs field in RemoteDownloadStats is an AtomicLong that is set via setCurrentApplicationDurationMs() only when getFullDownloadStats() or getDiffDownloadStats() is called. If the stats are serialized or read without going through those getters (e.g., directly from remoteStateStats), the value could be stale or zero. The design couples stat freshness to getter invocation, which is fragile.

public void setCurrentApplicationDurationMs(long durationMs) {
    currentApplicationDurationMs.set(durationMs);
}
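One hedged sketch of the decoupling the review suggests (hypothetical names, not the PR's API): read the live value through a supplier in an explicit refresh step, so plain getters stay side-effect free:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Illustrative only: separates "refresh the snapshot" (a deliberate,
// side-effecting step) from "read the snapshot" (a pure getter).
class ApplicationDurationStat {
    private final AtomicLong snapshotMs = new AtomicLong();
    private final LongSupplier liveDurationMs; // e.g. the applier's live duration

    ApplicationDurationStat(LongSupplier liveDurationMs) {
        this.liveDurationMs = liveDurationMs;
    }

    // Called explicitly just before serialization or reporting.
    void refresh() { snapshotMs.set(liveDurationMs.getAsLong()); }

    // Pure read: no mutation, safe to call repeatedly.
    long valueMs() { return snapshotMs.get(); }
}
```

With this split, serialization or logging can read `valueMs()` any number of times without perturbing state, and staleness is visible at the single `refresh()` call site.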

@github-actions
Contributor

github-actions bot commented Mar 19, 2026

PR Code Suggestions ✨

Latest suggestions up to ec210dc

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Avoid redundant double-call when setting and returning stats

The getFullDownloadStats() and getDiffDownloadStats() methods call
getRemoteFullDownloadStats() and getRemoteDiffDownloadStats() twice each — once for
the cast/set and once for the return. This is redundant and could cause issues if
the underlying method ever returns a different instance. Assign the result to a
local variable before casting and returning.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java [2121-2130]

 public PersistedStateStats getFullDownloadStats() {
-    ((RemoteDownloadStats) remoteStateStats.getRemoteFullDownloadStats())
-        .setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
-    return remoteStateStats.getRemoteFullDownloadStats();
+    RemoteDownloadStats stats = (RemoteDownloadStats) remoteStateStats.getRemoteFullDownloadStats();
+    stats.setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
+    return stats;
 }
 
 public PersistedStateStats getDiffDownloadStats() {
-    ((RemoteDownloadStats) remoteStateStats.getRemoteDiffDownloadStats())
-        .setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
-    return remoteStateStats.getRemoteDiffDownloadStats();
+    RemoteDownloadStats stats = (RemoteDownloadStats) remoteStateStats.getRemoteDiffDownloadStats();
+    stats.setCurrentApplicationDurationMs(applicationDurationMsSupplier.getAsLong());
+    return stats;
 }
Suggestion importance[1-10]: 6


Why: The current code calls getRemoteFullDownloadStats() and getRemoteDiffDownloadStats() twice each, which is redundant and could be fragile if the method ever returns different instances. Using a local variable is cleaner and safer.

Low
Possible issue
Guard against race condition in duration calculation

There is a potential race condition: applicationStartTimeNanos is set to a non-zero
value before the task begins, but System.nanoTime() is called after reading it. If
applicationStartTimeNanos is reset to 0 between the read and the System.nanoTime()
call, the subtraction could yield a very large positive value (since startNanos is
non-zero but the task has already finished). A guard check after computing the
elapsed time would prevent returning a stale/incorrect duration.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [800-806]

 @Override
 public long getCurrentApplicationDurationMs() {
     long startNanos = this.applicationStartTimeNanos;
     if (startNanos == 0) {
         return 0;
     }
-    return TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    long elapsed = TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    // Re-check in case the task completed between the read and nanoTime call
+    if (this.applicationStartTimeNanos == 0) {
+        return 0;
+    }
+    return elapsed;
 }
Suggestion importance[1-10]: 5


Why: There is a theoretical TOCTOU race where applicationStartTimeNanos could be reset to 0 after the initial read but before System.nanoTime() is called, potentially returning a stale large value. The double-check pattern mitigates this, though the practical impact is low since the field is volatile and the window is tiny.

Low
Fix type mismatch in long assertion comparison

The assertEquals(0, ...) call compares an int literal 0 with a long value returned
by AtomicLong.get(). This can cause an AssertionError due to type mismatch in
JUnit's assertEquals(long, long) vs assertEquals(Object, Object). Use
assertEquals(0L, ...) to ensure the correct overload is used.

server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteStatePublicationIT.java [311]

-boolean foundMatchingStats = false;
-for (PersistedStateStats stats : persistenceStats) {
-    if (FULL_DOWNLOAD_STATS.equals(stats.getStatsName()) || DIFF_DOWNLOAD_STATS.equals(stats.getStatsName())) {
-        foundMatchingStats = true;
-        assertTrue(
-            "Expected extended field current_application_duration_ms in " + stats.getStatsName(),
-            stats.getExtendedFields().containsKey("current_application_duration_ms")
-        );
-        // Application is complete, so duration should be 0
-        assertEquals(0, stats.getExtendedFields().get("current_application_duration_ms").get());
-    }
-}
-assertTrue("Expected at least one FULL_DOWNLOAD_STATS or DIFF_DOWNLOAD_STATS entry in persistence stats", foundMatchingStats);
+assertEquals(0L, stats.getExtendedFields().get("current_application_duration_ms").get());
Suggestion importance[1-10]: 5


Why: Using 0 (int literal) with assertEquals when comparing against AtomicLong.get() (which returns long) can invoke the wrong overload and cause unexpected behavior. Using 0L ensures the correct assertEquals(long, long) overload is used.

Low

Previous suggestions

Suggestions up to commit 0e968dc
Category | Suggestion | Impact
Possible issue
Add null checks before accessing download stats

If getFullDownloadStats() or getDiffDownloadStats() returns null, calling
addToExtendedFields on it will throw a NullPointerException. Add null checks before
calling addToExtendedFields and before adding to the stats list to prevent potential
NPEs.

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java [958-963]

 PersistedStateStats fullDownloadStats = remoteClusterStateService.getFullDownloadStats();
 PersistedStateStats diffDownloadStats = remoteClusterStateService.getDiffDownloadStats();
-fullDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
-diffDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+if (fullDownloadStats != null) {
+    fullDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+    stats.add(fullDownloadStats);
+}
+if (diffDownloadStats != null) {
+    diffDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+    stats.add(diffDownloadStats);
+}
Suggestion importance[1-10]: 6


Why: If getFullDownloadStats() or getDiffDownloadStats() returns null, calling addToExtendedFields would throw a NullPointerException. The improved code also correctly restructures the stats.add calls to be inside the null checks, which matches the original code's intent more safely.

Low
General
Set start time closer to actual work start

The applicationStartTimeNanos field is set before the task starts executing, but if
runTask is called concurrently (or re-entered), the start time could be overwritten.
More critically, if applicationStartTimeNanos is set to a non-zero value and
getCurrentApplicationDurationMs is called between the assignment and the actual
start of work, it could return a slightly inflated duration. Consider using
System.nanoTime() only after confirming the previous value is 0 to avoid accidental
overwrites in edge cases.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [474-475]

+logger.debug("processing [{}]: execute", task.source);
 this.applicationStartTimeNanos = System.nanoTime();
-logger.debug("processing [{}]: execute", task.source);
Suggestion importance[1-10]: 3


Why: Moving applicationStartTimeNanos assignment after the debug log is a minor ordering improvement with negligible practical impact. The runTask method runs on a single-threaded executor, so concurrent re-entry is not a real concern here.

Low
Guard against negative elapsed time from race condition

There is a potential race condition: applicationStartTimeNanos could be reset to 0
between reading startNanos and calling System.nanoTime(), causing System.nanoTime()
- startNanos to return a very large positive value (since startNanos is non-zero but
the task is already done). The result could be a misleadingly large duration.
Consider clamping the result to 0 if it appears unreasonably large, or use an
AtomicLong for safer compare-and-read semantics.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [800-806]

 public long getCurrentApplicationDurationMs() {
     long startNanos = this.applicationStartTimeNanos;
     if (startNanos == 0) {
         return 0;
     }
-    return TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    long elapsed = TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    return elapsed < 0 ? 0 : elapsed;
 }
Suggestion importance[1-10]: 3


Why: The race condition described is theoretically possible with volatile reads, but System.nanoTime() always moves forward so the elapsed value would be large positive (not negative) in the described scenario. The suggested fix of clamping negative values doesn't address the actual concern of a spuriously large positive value, making the improvement only marginally useful.

Low
Suggestions up to commit 3c881d2
Category | Suggestion | Impact
General
Handle interrupt and strengthen assertion

The test silently swallows InterruptedException without restoring the interrupt
flag, which can mask threading issues. Additionally, the assertion only checks that
duration is non-negative, which is too weak to verify the feature works correctly.
Add a more meaningful assertion and properly handle the interrupt.

server/src/test/java/org/opensearch/cluster/service/ClusterApplierServiceTests.java [687-716]

 public void testGetCurrentApplicationDurationMsDuringApplication() throws Exception {
 
     CountDownLatch taskStarted = new CountDownLatch(1);
     CountDownLatch taskCanFinish = new CountDownLatch(1);
     CountDownLatch taskDone = new CountDownLatch(1);
 
     clusterApplierService.onNewClusterState("test", () -> {
         taskStarted.countDown();
         try {
             taskCanFinish.await();
-        } catch (InterruptedException e) {}
+        } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+        }
         return ClusterState.builder(clusterApplierService.state()).build();
     }, ...);
 
     taskStarted.await();  // Wait for task to start
     long duration = clusterApplierService.getCurrentApplicationDurationMs();
-    assertTrue("Duration should be >= 0 during application, got: " + duration, duration >= 0);
+    assertTrue("Duration should be > 0 during application, got: " + duration, duration > 0);
     taskCanFinish.countDown();  // Let task finish
     taskDone.await();  // Wait for task to fully complete to avoid interference with other tests
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies two issues: the swallowed InterruptedException should restore the interrupt flag, and the assertion duration >= 0 is weak. However, changing it to duration > 0 is problematic because the duration could legitimately be 0 or very small due to timing precision, making the test flaky. Restoring the interrupt flag is good practice, but the assertion improvement needs refinement.

Low
Move timer initialization closer to actual work

The applicationStartTimeNanos is set after the lifecycle check but before task
execution begins. If an exception occurs during state retrieval or logging, the
timer will remain active indefinitely. Move the timer initialization to immediately
before the actual cluster state application logic to ensure accurate tracking.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [469-476]

 private void runTask(UpdateTask task) {
     if (!lifecycle.started()) {
         logger.debug("processing [{}]: ignoring, cluster applier service not started", task.source);
         return;
     }
-    this.applicationStartTimeNanos = System.nanoTime();
     logger.debug("processing [{}]: execute", task.source);
     final ClusterState previousClusterState = state.get();
+    this.applicationStartTimeNanos = System.nanoTime();
Suggestion importance[1-10]: 4


Why: While the suggestion has merit in principle, moving applicationStartTimeNanos initialization after state.get() would actually make the timing less accurate by excluding the state retrieval time from the measurement. The current placement is reasonable for measuring total application duration. The concern about exceptions during logging is minor since the timer is reset on all completion paths.

Low
Suggestions up to commit f081389
Category | Suggestion | Impact
General
Restore interrupt flag and verify measurable duration

The test silently swallows InterruptedException without restoring the interrupt
flag, which can mask threading issues. Additionally, the assertion only checks that
duration is non-negative but doesn't verify it's actually measuring the application
time. Add Thread.currentThread().interrupt() in the catch block and add a small
sleep before checking duration to ensure measurable time has elapsed.

server/src/test/java/org/opensearch/cluster/service/ClusterApplierServiceTests.java [687-716]

 public void testGetCurrentApplicationDurationMsDuringApplication() throws Exception {
 
     CountDownLatch taskStarted = new CountDownLatch(1);
     CountDownLatch taskCanFinish = new CountDownLatch(1);
     CountDownLatch taskDone = new CountDownLatch(1);
 
     clusterApplierService.onNewClusterState("test", () -> {
         taskStarted.countDown();
         try {
             taskCanFinish.await();
-        } catch (InterruptedException e) {}
+        } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+        }
         return ClusterState.builder(clusterApplierService.state()).build();
     }, ...);
 
     taskStarted.await();  // Wait for task to start
+    Thread.sleep(10);  // Ensure measurable time has elapsed
     long duration = clusterApplierService.getCurrentApplicationDurationMs();
-    assertTrue("Duration should be >= 0 during application, got: " + duration, duration >= 0);
+    assertTrue("Duration should be > 0 during application, got: " + duration, duration > 0);
     taskCanFinish.countDown();  // Let task finish
     taskDone.await();  // Wait for task to fully complete to avoid interference with other tests
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that InterruptedException should restore the interrupt flag, which is a best practice for thread handling. However, the Thread.sleep(10) addition is a workaround rather than a robust solution, and the assertion change from >= 0 to > 0 could introduce flakiness in timing-sensitive tests. The interrupt flag restoration is valuable but the overall suggestion has moderate impact.

Low
Move timer initialization closer to execution

The applicationStartTimeNanos is set after the lifecycle check but before task
execution begins. If an exception occurs during task setup or if the task is
rejected, the timer will not be reset, leaving a stale timestamp. Move the timer
initialization to immediately before the actual cluster state application logic to
ensure accuracy.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [469-475]

 private void runTask(UpdateTask task) {
     if (!lifecycle.started()) {
         logger.debug("processing [{}]: ignoring, cluster applier service not started", task.source);
         return;
     }
+    logger.debug("processing [{}]: execute", task.source);
     this.applicationStartTimeNanos = System.nanoTime();
-    logger.debug("processing [{}]: execute", task.source);
+    final ClusterState previousClusterState = state.get();
Suggestion importance[1-10]: 3


Why: While the suggestion raises a valid concern about timer accuracy, moving the initialization after the debug log would only marginally improve precision and the current placement is reasonable since the lifecycle check ensures the service is started. The suggestion's impact is minimal and the existing code placement is acceptable for practical purposes.

Low
Suggestions up to commit dd9e6d6
Category | Suggestion | Impact
General
Fix interrupt handling and strengthen assertion

The test silently swallows InterruptedException in the lambda, which can mask
threading issues and make the test unreliable. Additionally, the assertion only
checks that duration is non-negative, which is too weak to verify the feature works
correctly. Add proper exception handling and assert that the duration is actually
greater than zero to confirm the timer is functioning.

server/src/test/java/org/opensearch/cluster/service/ClusterApplierServiceTests.java [687-707]

 public void testGetCurrentApplicationDurationMsDuringApplication() throws Exception {
 
     CountDownLatch taskStarted = new CountDownLatch(1);
     CountDownLatch taskCanFinish = new CountDownLatch(1);
     CountDownLatch taskDone = new CountDownLatch(1);
 
     clusterApplierService.onNewClusterState("test", () -> {
         taskStarted.countDown();
-        try { taskCanFinish.await(); } catch (InterruptedException e) {}
+        try { taskCanFinish.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
         return ClusterState.builder(clusterApplierService.state()).build();
     }, new ClusterApplier.ClusterApplyListener() {
         @Override public void onSuccess(String source) { taskDone.countDown(); }
         @Override public void onFailure(String source, Exception e) { taskDone.countDown(); }
     });
 
     taskStarted.await();  // Wait for task to start
     long duration = clusterApplierService.getCurrentApplicationDurationMs();
-    assertTrue("Duration should be >= 0 during application, got: " + duration, duration >= 0);
+    assertTrue("Duration should be > 0 during application, got: " + duration, duration > 0);
     taskCanFinish.countDown();  // Let task finish
     taskDone.await();  // Wait for task to fully complete to avoid interference with other tests
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies two issues: silent exception swallowing and a weak assertion. Restoring the interrupt flag via Thread.currentThread().interrupt() is good practice, and changing the assertion from >= 0 to > 0 provides better verification that the timer is actually functioning. However, the improvement is moderate since the original test would likely still pass in most scenarios.

Low
Move timer initialization closer to actual work

The applicationStartTimeNanos is set after the lifecycle check but before task
execution begins. If an exception occurs during task execution setup (before the
actual cluster state application), the timer will not be reset, causing subsequent
calls to getCurrentApplicationDurationMs() to return incorrect durations. Move the
timer initialization to immediately before the actual cluster state application
logic.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [471-477]

 private void runTask(UpdateTask task) {
     if (!lifecycle.started()) {
         logger.debug("processing [{}]: ignoring, cluster applier service not started", task.source);
         return;
     }
+    logger.debug("processing [{}]: execute", task.source);
+    final ClusterState previousClusterState = state.get();
     this.applicationStartTimeNanos = System.nanoTime();
-    logger.debug("processing [{}]: execute", task.source);
Suggestion importance[1-10]: 2


Why: While the suggestion identifies a potential timing concern, the current implementation is actually reasonable. The timer is set early to capture the full task processing time, including setup. Moving it later would only measure application logic, not the complete task lifecycle. The suggestion contradicts the apparent design intent without strong justification.

Low
Suggestions up to commit 4d82611
Category | Suggestion | Impact
Possible issue
Add null checks before accessing stats objects

If getFullDownloadStats() or getDiffDownloadStats() returns null, calling
addToExtendedFields on it will throw a NullPointerException. Add null checks before
calling addToExtendedFields to prevent this.

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java [958-963]

 PersistedStateStats fullDownloadStats = remoteClusterStateService.getFullDownloadStats();
 PersistedStateStats diffDownloadStats = remoteClusterStateService.getDiffDownloadStats();
-fullDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
-diffDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+if (fullDownloadStats != null) {
+    fullDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+    stats.add(fullDownloadStats);
+}
+if (diffDownloadStats != null) {
+    diffDownloadStats.addToExtendedFields("current_application_duration_ms", new AtomicLong(durationMs));
+    stats.add(diffDownloadStats);
+}
Suggestion importance[1-10]: 7


Why: If getFullDownloadStats() or getDiffDownloadStats() returns null, a NullPointerException would occur. The improved code also correctly restructures the stats.add() calls inside the null checks, preventing duplicate additions that exist in the original code.

Medium
Use atomic type for thread-safe field access

Using volatile long for applicationStartTimeNanos is not thread-safe for compound
check-then-act operations. Since getCurrentApplicationDurationMs() reads this field
from a different thread than the one writing it, and the value 0 is used as a
sentinel, consider using AtomicLong to ensure atomic reads and writes, preventing
potential visibility and race condition issues.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [133]

-private volatile long applicationStartTimeNanos;
+private final AtomicLong applicationStartTimeNanos = new AtomicLong(0);
Suggestion importance[1-10]: 5


Why: While volatile ensures visibility across threads, using AtomicLong would provide stronger atomicity guarantees. However, for this specific use case (single writer, multiple readers, with 0 as sentinel), volatile long is actually sufficient since Java guarantees atomic reads/writes for volatile long. The suggestion is valid but the improvement is marginal.

Low
General
Guard against negative elapsed time from race condition

There is a potential race condition: applicationStartTimeNanos could be reset to 0
between the read into startNanos and the System.nanoTime() call, resulting in a very
large (incorrect) duration value. After computing the elapsed time, validate that
the result is non-negative before returning it.

server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java [802-808]

 public long getCurrentApplicationDurationMs() {
     long startNanos = this.applicationStartTimeNanos;
     if (startNanos == 0) {
         return 0;
     }
-    return TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    long elapsed = TimeValue.nsecToMSec(System.nanoTime() - startNanos);
+    return elapsed < 0 ? 0 : elapsed;
 }
Suggestion importance[1-10]: 5


Why: There is a valid race condition where applicationStartTimeNanos could be reset to 0 after the null check but before System.nanoTime() is called, potentially yielding a large incorrect value. Adding a non-negative guard prevents returning a misleading duration value.

Low

@github-actions
Contributor

Persistent review updated to latest commit 18c1649

@github-actions
Contributor

❌ Gradle check result for 18c1649: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 4868dfc

@github-actions
Contributor

❌ Gradle check result for 4868dfc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 4d82611

@github-actions
Contributor

❌ Gradle check result for 4d82611: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit dd9e6d6

@github-actions
Contributor

❌ Gradle check result for dd9e6d6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit f081389

@github-actions
Contributor

❌ Gradle check result for f081389: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 3c881d2

@github-actions
Contributor

❌ Gradle check result for 3c881d2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit 0e968dc

@github-actions
Contributor

❌ Gradle check result for 0e968dc: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

stats.add(remoteClusterStateService.getFullDownloadStats());
stats.add(remoteClusterStateService.getDiffDownloadStats());
long durationMs = clusterApplier.getCurrentApplicationDurationMs();
PersistedStateStats fullDownloadStats = remoteClusterStateService.getFullDownloadStats();
Contributor


Can these stats be directly created inside the remote Cluster state Service like the other stats?

…n node stats API; add UT and IT for test coverage

Signed-off-by: Ayushi Arya <ayuaryak@amazon.com>
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Persistent review updated to latest commit ec210dc

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

❌ Gradle check result for ec210dc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Labels

Cluster Manager enhancement Enhancement or improvement to existing feature or request


Development

Successfully merging this pull request may close these issues.

[Feature Request] Add observability for ongoing cluster state update

2 participants