feat: [Presto-Reliability] Add observability for query state transitions in Presto (#26418)#26418
Merged
Conversation
Contributor
Reviewer's GuideThis PR introduces a QueryStateTransitionMonitor to record query state transition durations and detect anomalies, integrates monitoring hooks into QueryStateMachine, binds and exports the monitor metrics in the server module, and adds comprehensive unit tests for all monitoring scenarios. Sequence diagram for query state transition monitoring integrationsequenceDiagram
participant QSM as QueryStateMachine
participant Monitor as QueryStateTransitionMonitor
participant Tracker as QueryTransitionTracker
QSM->>Monitor: setStateTransitionMonitor(monitor)
Monitor->>QSM: registerQuery(queryId)
QSM->>QSM: addStateChangeListener(trackStateTransition)
QSM->>QSM: trackStateTransition(newState)
QSM->>Monitor: recordStateTransition(queryId, prevState, newState, durationMillis)
Monitor->>Tracker: recordTransition(prevState, newState, durationMillis)
alt Anomaly detected
Monitor->>Monitor: checkAndLogAnomaly(...)
end
alt Query done
Monitor->>Monitor: activeQueries.remove(queryId)
end
Class diagram for QueryStateTransitionMonitor and integration with QueryStateMachineclassDiagram
class QueryStateTransitionMonitor {
+registerQuery(queryId)
+recordStateTransition(queryId, fromState, toState, durationMillis)
+getTotalQueriesTracked()
+getActiveQueriesCount()
+getDispatchingTimeStats()
+getFinishingTimeStats()
+getPlanningTimeStats()
+getRunningTimeStats()
+getAnomalousDispatchingCount()
+getAnomalousFinishingCount()
+getAnomalousPlanningCount()
+getAnomalousRunningCount()
+getStatsSummary()
-activeQueries: Map<QueryId, QueryTransitionTracker>
-dispatchingTimeStats: TimeStat
-finishingTimeStats: TimeStat
-planningTimeStats: TimeStat
-runningTimeStats: TimeStat
-anomalousDispatchingCount: CounterStat
-anomalousFinishingCount: CounterStat
-anomalousPlanningCount: CounterStat
-anomalousRunningCount: CounterStat
-totalQueriesTracked: AtomicLong
}
class QueryTransitionTracker {
-queryId: QueryId
-lastState: QueryState
-lastTransitionTimeMillis: long
+recordTransition(fromState, toState, durationMillis)
+getLastState()
+getLastTransitionTimeMillis()
}
class QueryStateMachine {
-stateTransitionMonitor: AtomicReference<QueryStateTransitionMonitor>
-previousState: AtomicReference<QueryState>
-previousStateTimestamp: AtomicLong
+setStateTransitionMonitor(monitor)
+trackStateTransition(newState)
}
QueryStateTransitionMonitor "1" *-- "many" QueryTransitionTracker
QueryStateMachine "1" o-- "1" QueryStateTransitionMonitor
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/execution/QueryStateMachine.java:1013-1023` </location>
<code_context>
+
+ QueryState prevState = previousState.get();
+ long prevTimestamp = previousStateTimestamp.get();
+ long currentTimestamp = System.currentTimeMillis();
+ long durationMillis = currentTimestamp - prevTimestamp;
+
+ if (prevState != null) {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Using System.currentTimeMillis() for duration may be affected by system clock changes.
System.nanoTime() is preferred for measuring elapsed time, as it is not affected by system clock changes.
```suggestion
QueryState prevState = previousState.get();
long prevTimestampNanos = previousStateTimestamp.get();
long currentTimestampNanos = System.nanoTime();
long durationMillis = (currentTimestampNanos - prevTimestampNanos) / 1_000_000;
if (prevState != null) {
monitor.recordStateTransition(queryId, prevState, newState, durationMillis);
}
previousState.set(newState);
previousStateTimestamp.set(currentTimestampNanos);
```
</issue_to_address>
### Comment 2
<location> `presto-main-base/src/test/java/com/facebook/presto/execution/TestQueryStateTransitionMonitor.java:50-59` </location>
<code_context>
+ assertNotNull(monitor.getStatsSummary());
+ }
+
+ @Test
+ public void testConcurrentQueryRegistration()
+ {
+ int numQueries = 100;
+ for (int i = 0; i < numQueries; i++) {
+ QueryId queryId = new QueryId("concurrent_query_" + i);
+ monitor.registerQuery(queryId);
+ monitor.recordStateTransition(queryId, DISPATCHING, PLANNING, 1000L + i);
+ }
+
+ assertEquals(monitor.getTotalQueriesTracked(), numQueries);
+ }
+
+ @Test
+ public void testStateTransitionWithZeroDuration()
+ {
+ monitor.registerQuery(testQueryId);
</code_context>
<issue_to_address>
**suggestion (testing):** Edge case for negative duration is not tested.
Please add a test for negative duration values in state transitions to verify the monitor's behavior in these cases.
Suggested implementation:
```java
@Test
public void testStateTransitionWithNegativeDuration()
{
monitor.registerQuery(testQueryId);
// Attempt to record a state transition with a negative duration
monitor.recordStateTransition(testQueryId, DISPATCHING, PLANNING, -500L);
// Assert that the query is still tracked and no exception is thrown
assertEquals(monitor.getTotalQueriesTracked(), 1);
assertEquals(monitor.getActiveQueriesCount(), 1);
// Optionally, check stats summary if available
assertNotNull(monitor.getStatsSummary());
}
@Test
public void testStateTransitionWithZeroDuration()
{
monitor.registerQuery(testQueryId);
```
If the monitor is expected to handle negative durations in a specific way (e.g., clamp to zero, ignore, or log a warning), you may want to add more specific assertions based on the implementation of `recordStateTransition` and `getStatsSummary()`. Adjust the assertions as needed to match the expected behavior.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches. Re-running this action after a short time may resolve the issue. See the documentation for more information and troubleshooting advice. Scanned FilesNone |
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 4, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
270569f to
6df29ed
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 5, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
6df29ed to
b0f5dc2
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 5, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
b0f5dc2 to
0dd7d61
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 5, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
0dd7d61 to
73ecaf2
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 5, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
73ecaf2 to
d2e053f
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Nov 11, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
d2e053f to
a1a6650
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Dec 4, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
a1a6650 to
1d2cc6e
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Dec 4, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
1d2cc6e to
f7d9b13
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Dec 10, 2025
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
f7d9b13 to
d59ac3f
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 3, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
6a9fb63 to
59c49cf
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 3, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
59c49cf to
723cc3d
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 10, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
723cc3d to
137de1e
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 10, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
137de1e to
66bea34
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 10, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
66bea34 to
32e786f
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 10, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
32e786f to
0858971
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Feb 11, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
0858971 to
a0b8205
Compare
prashantgolash
approved these changes
Mar 13, 2026
feilong-liu
approved these changes
Mar 14, 2026
meta-codesync Bot
pushed a commit
that referenced
this pull request
Apr 5, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
a0b8205 to
567a6ba
Compare
meta-codesync Bot
pushed a commit
that referenced
this pull request
Apr 5, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
567a6ba to
e70a2b0
Compare
meta-codesync Bot
pushed a commit
that referenced
this pull request
Apr 5, 2026
… Presto (#26418) Summary: Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
e70a2b0 to
0dbf103
Compare
… Presto (#26418) Summary: Pull Request resolved: #26418 Adds comprehensive monitoring for query state transitions to detect and diagnose queries spending abnormally long time in DISPATCHING and FINISHING states. This addresses issues observed on DKL cluster where queries were getting stuck during dispatch and commit phases (ref: S561047). Differential Revision: D85353849
31 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Adds comprehensive monitoring for query state transitions to detect and diagnose
queries spending abnormally long time in DISPATCHING and FINISHING states. This
addresses issues observed on DKL cluster where queries were getting stuck during
dispatch and commit phases (ref: S561047).
Differential Revision: D85353849
Release Notes
Please follow release notes guidelines and fill in the release notes below.