Add query cancellation support via _tasks/_cancel API for PPL queries#5254
Add query cancellation support via _tasks/_cancel API for PPL queries#5254Swiddis merged 2 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
PR Reviewer Guide 🔍(Review updated until commit b2f0dc0)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to b2f0dc0 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit fd7642c
Suggestions up to commit 024a8ca
|
|
Hi @sunil9977, thanks for the changes! Please take a look at the CI failures |
| @@ -0,0 +1,44 @@ | |||
| /* | |||
There was a problem hiding this comment.
We can rename this file to PPLQueryTask
|
CI failure seems to be caused by spotless check, please run |
|
During e2e testing with the OSD cancel button, we noticed that Adding an interrupt check in // opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
@Override
public boolean moveNext() {
if (Thread.interrupted()) {
throw new NonFallbackCalciteException(
String.format("Query was cancelled after processing %d rows.", queryCount));
}
if (queryCount >= maxResponseSize) {
return false;
}
// ... rest of method
}This is outside the scope of this PR but would be a nice follow-up to make the cancel experience snappy. In our testing it brought cancel response time from ~50s down to <5s. |
| String prefix = (queryId != null) ? "PPL [queryId=" + queryId + "]: " : "PPL: "; | ||
|
|
||
| if (pplQuery != null && pplQuery.length() > 512) { | ||
| return prefix + pplQuery.substring(0,512) + "..."; |
There was a problem hiding this comment.
suggestion: I don't think this is worth truncating.
The only way to access this description is to directly check the task by ID. Seeing the full query might be useful debug context. Similar debug logic in OSD core exists and doesn't truncate.
| if (Thread.interrupted() || e.getCause() instanceof InterruptedException) { | ||
| if (callBack != null && callBack.isCancelled()) { | ||
| LOG.info("Query was cancelled"); | ||
| throw new OpenSearchException("Query was cancelled."); |
There was a problem hiding this comment.
issue: This exception will propagate as a 500, which is bad for people monitoring errors.
OpenSearchException is the generic top-level exception class, it defaults as internal server error when the status isn't set. You want to use one of its subclasses.
TaskCancelledException seems like a better choice.
That's on me lol, I thought when I implemented it that the CPU part of joins would be fast (the bulk of the time on most joins is waiting for batches), so it'd be overkill to check every iteration and we could just throw when we request new batches. Happy to take it up if there's a followup task. Do you have more precise repro steps? Since I haven't noticed this issue before. I'm also interested in if doing this on every row would have a negative effect on benchmarks. |
|
|
||
| @Override | ||
| public boolean shouldCancelChildrenOnCancellation() { | ||
| return false; |
There was a problem hiding this comment.
Should this be false? Query tasks tend to spawn background IO tasks, but I don't know if those count as children.
There was a problem hiding this comment.
Yes, updated it to true.
I was doing a |
|
Based on @Swiddis's observation that almost no core code overrides I think a simpler approach might work here: pass the |
|
Oh, interesting. If I'm understanding right they implemented their own thread interruption mechanism? I wonder why not use regular |
|
Ok, talked with the core folks, task cancellation is the better approach. Since opensearch uses a thread pool system, interrupting the thread has potential to poison the whole pool. It works for us because we're careful to uninterrupt the thread when handling interruption errors, but there might theoretically be a path that leaves a perpetually-interrupted thread (either currently or in the future). Cancellation also has the benefit of propagating across clusters better (which is more relevant to Core which is fanning out requests to many data nodes regularly). So the correct approach is change our thread interrupts to use the native task cancellation feature. If all that understanding is correct, it means we can also clean up the logic that un-interrupts the thread, but I think it's only like 2 lines anyway so not a massive win. |
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit b2f0dc0.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
|
Persistent review updated to latest commit fd7642c |
|
Thank you all for reviews. |
| LOG.warn( | ||
| "Query execution timed out after {}. Interrupting execution thread.", | ||
| timeout); | ||
| executionThread.interrupt(); |
There was a problem hiding this comment.
future PR: this interrupt needs to be changed to cancellation as well (and all code paths that depend on it)
|
@sunil9977 LGTM, one more chore: please add DCO signoffs to the commits as described in https://github.com/opensearch-project/sql/pull/5254/checks?check_run_id=68346380999 & https://github.com/opensearch-project/.github/blob/main/CONTRIBUTING.md#developer-certificate-of-origin. |
…gestions. Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
|
Persistent review updated to latest commit b2f0dc0 |
Enable cancellation of in-flight PPL queries from the Discover UI cancel button by leveraging the OpenSearch task framework. - Generate a client-side UUID (queryId) for each synchronous PPL query - Pass queryId through the full request pipeline to OpenSearch PPL API - Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks via _tasks API, matches by queryId in task description, and cancels all matching tasks - Wire the existing Discover abort flow to call the cancel route on abort - Cancel all matching tasks (data + histogram queries share the same queryId) Depends on opensearch-project/sql#5254 for backend task registration. Signed-off-by: Andrew Kim <andrewkcs@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Kai Huang <ahkcs@amazon.com>
Enable cancellation of in-flight PPL queries from the Discover UI cancel button by leveraging the OpenSearch task framework. - Generate a client-side UUID (queryId) for each synchronous PPL query - Pass queryId through the full request pipeline to OpenSearch PPL API - Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks via _tasks API, matches by queryId in task description, and cancels all matching tasks - Wire the existing Discover abort flow to call the cancel route on abort - Cancel all matching tasks (data + histogram queries share the same queryId) Depends on opensearch-project/sql#5254 for backend task registration. Signed-off-by: Kai Huang <ahkcs@amazon.com>
* [Query] Add PPL query cancellation support via Discover cancel button Enable cancellation of in-flight PPL queries from the Discover UI cancel button by leveraging the OpenSearch task framework. - Generate a client-side UUID (queryId) for each synchronous PPL query - Pass queryId through the full request pipeline to OpenSearch PPL API - Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks via _tasks API, matches by queryId in task description, and cancels all matching tasks - Wire the existing Discover abort flow to call the cancel route on abort - Cancel all matching tasks (data + histogram queries share the same queryId) Depends on opensearch-project/sql#5254 for backend task registration. Signed-off-by: Kai Huang <ahkcs@amazon.com> * Fire-and-forget PPL cancel request to avoid blocking UI Remove await from the cancel HTTP call so the AbortError propagates immediately. The cancel is a best-effort notification to the backend and should not block the frontend from moving on. Signed-off-by: Kai Huang <ahkcs@amazon.com> * Use exact PPL action name instead of wildcard in task filter Use 'cluster:admin/opensearch/ppl' instead of '*ppl*' to avoid matching unrelated tasks. The _tasks API does not support filtering by description or queryId, so client-side matching remains necessary. Signed-off-by: Kai Huang <ahkcs@amazon.com> * Revert action filter back to wildcard *ppl* The exact action name is less resilient to future changes. The wildcard is sufficient since we still match by queryId in the task description. Signed-off-by: Kai Huang <ahkcs@amazon.com> --------- Signed-off-by: Kai Huang <ahkcs@amazon.com>
* Init CLAUDE.md (#5259) Signed-off-by: Heng Qian <qianheng@amazon.com> * Add label to exempt specific PRs from stalled labeling (#5263) * Implement `reverse` performance optimization (#4775) Co-authored-by: Jialiang Liang <jiallian@amazon.com> * Add songkant-aws as maintainer (#5244) * Move some maintainers from active to Emeritus (#5260) * Move inactive current maintainers to Emeritus Signed-off-by: Lantao Jin <ltjin@amazon.com> * Remove affiliation column for emeritus maintainers Signed-off-by: Lantao Jin <ltjin@amazon.com> * formatted Signed-off-by: Lantao Jin <ltjin@amazon.com> * Fix formatting in MAINTAINERS.md Signed-off-by: Simeon Widdis <sawiddis@gmail.com> Signed-off-by: Simeon Widdis <sawiddis@gmail.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> Signed-off-by: Simeon Widdis <sawiddis@gmail.com> Co-authored-by: Simeon Widdis <sawiddis@gmail.com> * Add query cancellation support via _tasks/_cancel API for PPL queries (#5254) * Add query cancellation support via _tasks/_cancel API for PPL queries Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com> * Refactor PPL query cancellation to cooperative model and other PR suggestions. Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com> --------- Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com> * Add Calcite native SQL planning in UnifiedQueryPlanner (#5257) * feat(api): Add Calcite native SQL planning path in UnifiedQueryPlanner Add SQL support to the unified query API using Calcite's native parser pipeline (SqlParser → SqlValidator → SqlToRelConverter → RelNode), bypassing the ANTLR parser used by PPL. Changes: - UnifiedQueryPlanner: use PlanningStrategy to dispatch CalciteNativeStrategy vs CustomVisitorStrategy - CalciteNativeStrategy: Calcite Planner with try-with-resources for ANSI SQL - CustomVisitorStrategy: ANTLR-based path for PPL (and future SQL V2) - UnifiedQueryContext: SqlParser.Config with Casing.UNCHANGED to preserve lowercase OpenSearch index names Signed-off-by: Chen Dai <daichen@amazon.com> * test(api): Add SQL planner tests and refactor test base for multi-language support - Refactor UnifiedQueryTestBase with queryType() hook for subclass override - Add UnifiedSqlQueryPlannerTest covering SELECT, WHERE, GROUP BY, JOIN, ORDER BY, subquery, case sensitivity, namespaces, and error handling - Update UnifiedQueryContextTest to verify SQL context creation Signed-off-by: Chen Dai <daichen@amazon.com> * perf(benchmarks): Add SQL queries to UnifiedQueryBenchmark Add language (PPL/SQL) and queryPattern param dimensions for side-by-side comparison of equivalent queries across both languages. Remove separate UnifiedSqlQueryBenchmark in favor of unified class. Signed-off-by: Chen Dai <daichen@amazon.com> * docs(api): Update README to reflect SQL support in UnifiedQueryPlanner Signed-off-by: Chen Dai <daichen@amazon.com> * fix(api): Normalize trailing whitespace in assertPlan comparison RelOptUtil.toString() appends a trailing newline after the last plan node, which doesn't match Java text block expectations. Also add \r\n normalization for Windows CI compatibility, consistent with the existing pattern in core module tests. Signed-off-by: Chen Dai <daichen@amazon.com> --------- Signed-off-by: Chen Dai <daichen@amazon.com> * [Feature] Support graphLookup with literal value as its start (#5253) * [Feature] Support graphLookup as top-level PPL command (#5243) Add support for graphLookup as the first command in a PPL query with literal start values, instead of requiring piped input from source=. Syntax: graphLookup table start="value" edge=from-->to as output graphLookup table start=("v1", "v2") edge=from-->to as output Signed-off-by: Heng Qian <qianheng@amazon.com> * Spotless check Signed-off-by: Heng Qian <qianheng@amazon.com> * Ignore child pipe if using start value Signed-off-by: Heng Qian <qianheng@amazon.com> * Add graphLookup integration tests per PPL command checklist - Add explain plan tests in CalciteExplainIT with YAML assertions - Add v2-unsupported tests in NewAddedCommandsIT - Add CalcitePPLGraphLookupIT to CalciteNoPushdownIT suite - Skip graphLookup tests when pushdown is disabled (required by impl) - Add expected plan YAML files for piped and top-level graphLookup Signed-off-by: Heng Qian <qianheng@amazon.com> * Remove brace of start value list Signed-off-by: Heng Qian <qianheng@amazon.com> --------- Signed-off-by: Heng Qian <qianheng@amazon.com> * Apply docs website feedback to ppl functions (#5207) * apply doc website feedback to ppl functions Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> * take out comments Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> * fix json_append example Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> * fix json_append example Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> * fix links Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> --------- Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> Signed-off-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com> * feat(api): Add profiling support to unified query API (#5268) Add query profiling infrastructure that measures time spent in each query phase (analyze, optimize, execute, format). Profiling is opt-in via UnifiedQueryContext.builder().profiling(true) and uses thread-local context to avoid passing profiling state through every method. Key changes: - QueryProfiling/ProfileContext for thread-local profiling lifecycle - UnifiedQueryContext.measure() API for timing arbitrary phases - Auto-profiling in UnifiedQueryPlanner (analyze) and compiler (optimize) - UnifiedQueryTestBase shared test fixture for unified query tests - Comprehensive profiling tests with non-flaky >= 0 timing assertions Signed-off-by: Chen Dai <daichen@amazon.com> * Add UnifiedQueryParser with language-specific implementations (#5274) Extract parsing logic from UnifiedQueryPlanner into a UnifiedQueryParser interface with language-specific implementations: PPLQueryParser (returns UnresolvedPlan) and CalciteSqlQueryParser (returns SqlNode). UnifiedQueryContext owns the parser instance, created eagerly by the builder which has direct access to query type and future SQL config. Each implementation receives only its required dependencies: PPLQueryParser takes Settings, CalciteSqlQueryParser takes CalcitePlanContext. UnifiedQueryPlanner.CustomVisitorStrategy now obtains the parser from the context via the interface type. Signed-off-by: Chen Dai <daichen@amazon.com> * Fix flaky TPC-H Q1 test due to bugs in `MatcherUtils.closeTo()` (#5283) * Fix the flaky tpch Q1 Signed-off-by: Lantao Jin <ltjin@amazon.com> * Change to ULP-aware to handle floating-point precision differences Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Lantao Jin <ltjin@amazon.com> --------- Signed-off-by: Heng Qian <qianheng@amazon.com> Signed-off-by: Lantao Jin <ltjin@amazon.com> Signed-off-by: Simeon Widdis <sawiddis@gmail.com> Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com> Signed-off-by: Chen Dai <daichen@amazon.com> Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com> Signed-off-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com> Signed-off-by: Kai Huang <ahkcs@amazon.com> Co-authored-by: qianheng <qianheng@amazon.com> Co-authored-by: Simeon Widdis <sawiddis@gmail.com> Co-authored-by: Jialiang Liang <jiallian@amazon.com> Co-authored-by: Lantao Jin <ltjin@amazon.com> Co-authored-by: Sunil Ramchandra Pawar <pawar_sr@apple.com> Co-authored-by: Chen Dai <daichen@amazon.com> Co-authored-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com>
Fixes #4887