Skip to content

Add query cancellation support via _tasks/_cancel API for PPL queries#5254

Merged
Swiddis merged 2 commits intoopensearch-project:mainfrom
sunil9977:issue-4887
Mar 25, 2026
Merged

Add query cancellation support via _tasks/_cancel API for PPL queries#5254
Swiddis merged 2 commits intoopensearch-project:mainfrom
sunil9977:issue-4887

Conversation

@sunil9977
Copy link
Copy Markdown
Contributor

Fixes #4887

  1. PPL queries now register as CancellableTask, making them visible in GET /_tasks and cancellable via POST /_tasks/{task_id}/_cancel
  2. Supports optional queryId in request body for task identification

Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

PR Reviewer Guide 🔍

(Review updated until commit b2f0dc0)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add queryId support to PPL request/transport layer

Relevant files:

  • ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java
  • plugin/src/main/java/org/opensearch/sql/plugin/request/PPLQueryRequestFactory.java
  • plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryRequest.java

Sub-PR theme: Implement CancellableTask registration and cooperative cancellation for PPL queries

Relevant files:

  • plugin/src/main/java/org/opensearch/sql/plugin/transport/PPLQueryTask.java
  • plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryAction.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
  • plugin/src/test/java/org/opensearch/sql/plugin/transport/PPLQueryTaskTest.java

⚡ Recommended focus areas for review

ThreadLocal Leak

In submit(), cancellableTask.get() is called and then cancellableTask.remove() is called before scheduling. However, setCancellableTask(cancelTask) is called again inside the scheduled Runnable (line 81), and clearCancellableTask() is only called in the finally block. If the scheduled task throws an exception before reaching the finally block (e.g., during withCurrentContext wrapping), the ThreadLocal is never cleared on the worker thread, potentially leaking the CancellableTask reference for future tasks reusing that thread.

setCancellableTask(cancelTask);

try {
  task.run();
  timeoutTask.cancel();
  // Clear any leftover thread interrupts to keep the thread pool clean
  Thread.interrupted();
} catch (Exception e) {
  timeoutTask.cancel();

  // Special-case handling of timeout-related interruptions
  if (Thread.interrupted() || e.getCause() instanceof InterruptedException) {
    LOG.error("Query was interrupted due to timeout after {}", timeout);
    throw new OpenSearchTimeoutException(
        "Query execution timed out after " + timeout);
  }

  throw e;
} finally {
  clearCancellableTask();
Race Condition

setCancellableTask(pplQueryTask) is called in TransportPPLQueryAction.doExecute() on the transport thread, and then cancellableTask.get() is read in submit() — both on the same thread, which is fine. However, the static ThreadLocal is set in doExecute before submit is called. If submit is ever called from a different thread than doExecute (e.g., due to future refactoring or async dispatch), the ThreadLocal value will be null on the worker thread. The design is fragile because it relies on implicit same-thread execution ordering between doExecute and submit.

CancellableTask cancelTask = cancellableTask.get();
cancellableTask.remove();
schedule(nodeClient, queryPlan::execute, timeout, cancelTask);
Cancellation Check Timing

The cancellation check in moveNext() only fires when iterating rows. If a query is cancelled while waiting inside bgScanner.fetchNextBatch() (a blocking call), the cancellation won't be detected until the next call to moveNext() after the batch returns. For long-running or slow fetches, this means cancellation may be significantly delayed. This is a design limitation worth documenting, but could also be a real usability issue.

if (cancellableTask != null && cancellableTask.isCancelled()) {
  throw new TaskCancelledException("The task is cancelled.");
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

PR Code Suggestions ✨

Latest suggestions up to b2f0dc0

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Inject dependency instead of static coupling

OpenSearchIndexEnumerator is tightly coupled to OpenSearchQueryManager by directly
calling its static getCancellableTask(). This makes the class harder to test and
violates separation of concerns. The CancellableTask should be injected as a
constructor parameter instead.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java [86-88]

 this.bgScanner = new BackgroundSearchScanner(client, maxResultWindow, queryBucketSize);
 this.bgScanner.startScanning(request);
-this.cancellableTask = OpenSearchQueryManager.getCancellableTask();
+this.cancellableTask = cancellableTask; // injected via constructor parameter
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a design concern where OpenSearchIndexEnumerator is tightly coupled to OpenSearchQueryManager via a static call, making it harder to test in isolation. Injecting CancellableTask as a constructor parameter would improve testability and separation of concerns.

Low
Add early cancellation check before task execution

setCancellableTask(cancelTask) is called inside the worker thread, but
OpenSearchIndexEnumerator is constructed during task.run() and calls
getCancellableTask() in its constructor. Since setCancellableTask is called just
before task.run(), this ordering is correct only if no enumerator is constructed
before this point. However, if the task was already cancelled before execution
starts, the cancellation check in moveNext() will catch it, but there's no early
exit before task.run() begins. Consider checking cancelTask.isCancelled() before
starting execution.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [81-85]

 setCancellableTask(cancelTask);
+
+if (cancelTask != null && cancelTask.isCancelled()) {
+  clearCancellableTask();
+  return;
+}
 
 try {
   task.run();
   timeoutTask.cancel();
Suggestion importance[1-10]: 4

__

Why: Adding an early cancellation check before task.run() is a valid defensive improvement to avoid unnecessary work when a task is already cancelled. However, this is a minor optimization since the cancellation check in moveNext() would catch it shortly after anyway.

Low
Security
Truncate query in task description

The pplQuery field could be very long and is exposed verbatim in the task
description, which is visible via the _tasks API. This could leak sensitive query
content. Consider truncating pplQuery to a reasonable maximum length in the
description.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryRequest.java [166-170]

 @Override
 public String getDescription() {
   String prefix = (queryId != null) ? "PPL [queryId=" + queryId + "]: " : "PPL: ";
-  return prefix + pplQuery;
+  String truncated = pplQuery != null && pplQuery.length() > 1000
+      ? pplQuery.substring(0, 1000) + "..."
+      : pplQuery;
+  return prefix + truncated;
 }
Suggestion importance[1-10]: 4

__

Why: Truncating the query in the task description is a reasonable defensive measure to prevent very long queries from being exposed verbatim via the _tasks API. However, this is a minor concern as the _tasks API is typically admin-only and the query content is not inherently more sensitive than other logged data.

Low
Possible issue
Fix cross-thread ThreadLocal race condition

The setCancellableTask stores the task in a ThreadLocal, but doExecute runs on a
transport thread while submit (which reads and clears it) runs on a different thread
pool thread. This creates a race condition where the ThreadLocal value set on the
transport thread is not visible on the worker thread. The task should be passed
directly rather than through a ThreadLocal.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [113-115]

+// Pass cancelTask directly to submit or store it on the queryPlan instead of using ThreadLocal
+// across thread boundaries. For example, store it on the AbstractPlan before scheduling.
 if (task instanceof PPLQueryTask pplQueryTask) {
-  OpenSearchQueryManager.setCancellableTask(pplQueryTask);
+  queryPlan.setCancellableTask(pplQueryTask); // or pass via another thread-safe mechanism
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion claims a race condition, but looking at the code flow: setCancellableTask is called on the transport thread before submit, then in submit the value is read and cleared immediately (cancellableTask.get() + cancellableTask.remove()), and then passed directly as cancelTask to schedule. Inside schedule, setCancellableTask(cancelTask) is called again on the worker thread before task.run(). The OpenSearchIndexEnumerator constructor reads it on the same worker thread. The flow is actually correct. The suggestion's improved_code references a non-existent setCancellableTask method on AbstractPlan.

Low

Previous suggestions

Suggestions up to commit fd7642c
CategorySuggestion                                                                                                                                    Impact
General
Verify enumerator construction timing relative to ThreadLocal set

setCancellableTask is called inside the scheduled runnable, which sets the
ThreadLocal on the worker thread. However, OpenSearchIndexEnumerator reads
getCancellableTask() in its constructor, which is called during task.run(). This
ordering is correct only if the enumerator is constructed inside task.run(). If the
enumerator is constructed before task.run() is invoked, the cancellableTask will be
null. Verify that the enumerator is always constructed within the execution of
task.run().

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [81-84]

+setCancellableTask(cancelTask);
 
+try {
+  task.run();
+  timeoutTask.cancel();
Suggestion importance[1-10]: 5

__

Why: This is a valid concern about ordering — setCancellableTask must be called before the enumerator constructor runs. The suggestion correctly identifies that if the enumerator is constructed inside task.run(), the ordering is correct. However, the existing_code and improved_code are identical, making this more of a verification note than an actionable fix.

Low
Add null-check logging for missing cancellable task

submit is called on the transport thread, and cancellableTask.get() reads the
ThreadLocal of that thread. But setCancellableTask is called in doExecute on the
same transport thread, so this part is actually consistent. However, if submit is
ever called from a different thread than where setCancellableTask was called, the
value will be null. Consider adding a null-check or logging when cancelTask is null
to aid debugging.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [54-56]

 CancellableTask cancelTask = cancellableTask.get();
 cancellableTask.remove();
+if (cancelTask == null) {
+  LOG.debug("No cancellable task found in ThreadLocal for query submission.");
+}
 schedule(nodeClient, queryPlan::execute, timeout, cancelTask);
Suggestion importance[1-10]: 2

__

Why: Adding a debug log for a null cancelTask is a minor improvement with low impact. The code already handles null gracefully since cancelTask is passed to schedule and checked in the enumerator with a null guard.

Low
Possible issue
Ensure cancellation exception is unchecked in enumerator

TaskCancelledException is thrown inside moveNext(), but Calcite's Enumerator
interface does not declare checked exceptions. If TaskCancelledException is a
checked exception, this will cause a compilation error. Verify that
TaskCancelledException extends RuntimeException; if it does not, wrap it in an
unchecked exception.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java [121-123]

 if (cancellableTask != null && cancellableTask.isCancelled()) {
-  throw new TaskCancelledException("The task is cancelled.");
+  throw new NonFallbackCalciteException(new TaskCancelledException("The task is cancelled."));
 }
Suggestion importance[1-10]: 4

__

Why: This is a valid concern — if TaskCancelledException is a checked exception, it would cause a compilation error in moveNext(). However, org.opensearch.core.tasks.TaskCancelledException in OpenSearch extends RuntimeException, so this is not actually a problem in practice, making the suggestion less critical.

Low
ThreadLocal misused for cross-thread task passing

The setCancellableTask stores the task in a ThreadLocal, but doExecute runs on the
transport thread while submit (which reads and clears it) runs on a different thread
pool thread. This means the ThreadLocal value set here will never be visible to the
thread that calls cancellableTask.get() in submit. The task should be passed
directly rather than relying on ThreadLocal for cross-thread communication.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [113-115]

-if (task instanceof PPLQueryTask pplQueryTask) {
-  OpenSearchQueryManager.setCancellableTask(pplQueryTask);
-}
+// Pass pplQueryTask directly to the plan or service layer instead of using ThreadLocal
+// e.g., pplService.execute(request, pplQueryTask, listener);
Suggestion importance[1-10]: 2

__

Why: The suggestion claims setCancellableTask and submit run on different threads, but looking at the code, setCancellableTask is called in doExecute on the transport thread, and submit is also called synchronously from that same thread before scheduling. The ThreadLocal is read in submit on the same thread it was set, so the concern is largely unfounded. The improved_code is also not actionable code.

Low
Suggestions up to commit 024a8ca
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against pre-execution cancellation race condition

There is a race condition: if the task is cancelled between
onExecutionThreadAvailable and the start of actual query execution, the interrupt
may be missed. Additionally, if the task is already cancelled before
onExecutionThreadAvailable is called, the query will still proceed. You should check
callBack.isCancelled() immediately after registering the thread and throw early if
already cancelled.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [71-78]

 if (callBack != null) {
     callBack.onExecutionThreadAvailable(executionThread);
+    if (callBack.isCancelled()) {
+        LOG.info("Query was cancelled before execution started");
+        throw new OpenSearchException("Query was cancelled.");
+    }
 }
 
 Scheduler.ScheduledCancellable timeoutTask =
     threadPool.schedule(
         () -> {
           LOG.warn(
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a valid race condition where a task cancelled before execution starts would not be detected. Adding an isCancelled() check after onExecutionThreadAvailable is a reasonable defensive measure, though the window is small in practice.

Low
Ensure ThreadLocal cleanup on exception paths

If an exception is thrown between setCancellationCallback and the point where
cancellationCallBackThreadLocal.remove() is called in submit(), the ThreadLocal may
not be cleaned up, causing a memory leak or stale callback in the thread pool. A
try-finally block should be used to ensure cleanup, or the removal should be
guaranteed even on failure paths.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryAction.java [113-131]

 if (task instanceof SQLQueryTask sqlQueryTask) {
-
     OpenSearchQueryManager.setCancellationCallback(new OpenSearchQueryManager.CancellationCallBack() {
         @Override
         public void onExecutionThreadAvailable(Thread thread) {
             sqlQueryTask.setExecutionThread(thread);
         }
 
         @Override
         public void onExecutionComplete() {
             sqlQueryTask.clearExecutionThread();
         }
 
         @Override
         public boolean isCancelled() {
             return sqlQueryTask.isCancelled();
         }
     });
 }
+try {
Suggestion importance[1-10]: 3

__

Why: The concern about ThreadLocal cleanup is valid in theory, but the improved_code is incomplete (ends with an unclosed try {) and doesn't show the full solution. The suggestion is partially correct but the implementation is not actionable as provided.

Low
General
Restrict direct access to ThreadLocal field

The cancellationCallBackThreadLocal field is public and mutable, which allows
external code to directly manipulate the ThreadLocal bypassing the provided
setter/cleaner methods. It should be private to enforce encapsulation and prevent
misuse.

opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchQueryManager.java [43]

-public static ThreadLocal<CancellationCallBack> cancellationCallBackThreadLocal = new ThreadLocal<>();
+private static final ThreadLocal<CancellationCallBack> cancellationCallBackThreadLocal = new ThreadLocal<>();
Suggestion importance[1-10]: 5

__

Why: Making cancellationCallBackThreadLocal public and non-final is a valid encapsulation concern. Making it private static final enforces proper access through the provided setter/cleaner methods and prevents external misuse.

Low
Handle null query in description method

If pplQuery is null, the method returns prefix + null (i.e., the string "PPL:
null"), which is misleading. A null check should be added to handle this case
gracefully.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryRequest.java [167-176]

 public String getDescription()
 {
     String prefix = (queryId != null) ? "PPL [queryId=" + queryId + "]: " : "PPL: ";
 
-    if (pplQuery != null && pplQuery.length() > 512) {
-        return prefix + pplQuery.substring(0,512) + "...";
+    if (pplQuery == null) {
+        return prefix;
+    }
+
+    if (pplQuery.length() > 512) {
+        return prefix + pplQuery.substring(0, 512) + "...";
     }
 
     return prefix + pplQuery;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies that a null pplQuery would result in the string "PPL: null" being returned. Adding a null check improves robustness, though pplQuery being null is an edge case given the existing constructors.

Low

@ahkcs ahkcs added PPL Piped processing language enhancement New feature or request labels Mar 23, 2026
@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Mar 23, 2026

Hi @sunil9977, thanks for the changes! Please take a look at the CI failures

@@ -0,0 +1,44 @@
/*
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rename this file to PPLQueryTask

@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Mar 23, 2026

CI failure seems to be caused by spotless check, please run ./gradlew spotlessApply to fix this

@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Mar 23, 2026

During e2e testing with the OSD cancel button, we noticed that Thread.interrupt() only takes effect when the execution thread hits a blocking call (I/O, Future.get(), etc.). For CPU-bound queries like large joins, this can mean a 50+ second delay between clicking cancel and the query actually stopping.

Adding an interrupt check in OpenSearchIndexEnumerator.moveNext() would make cancellation near-instant, since moveNext() is called for every row:

// opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java

@Override
public boolean moveNext() {
    if (Thread.interrupted()) {
        throw new NonFallbackCalciteException(
            String.format("Query was cancelled after processing %d rows.", queryCount));
    }

    if (queryCount >= maxResponseSize) {
        return false;
    }
    // ... rest of method
}

This is outside the scope of this PR but would be a nice follow-up to make the cancel experience snappy. In our testing it brought cancel response time from ~50s down to <5s.

String prefix = (queryId != null) ? "PPL [queryId=" + queryId + "]: " : "PPL: ";

if (pplQuery != null && pplQuery.length() > 512) {
return prefix + pplQuery.substring(0,512) + "...";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I don't think this is worth truncating.

The only way to access this description is to directly check the task by ID. Seeing the full query might be useful debug context. Similar debug logic in OSD core exists and doesn't truncate.

if (Thread.interrupted() || e.getCause() instanceof InterruptedException) {
if (callBack != null && callBack.isCancelled()) {
LOG.info("Query was cancelled");
throw new OpenSearchException("Query was cancelled.");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This exception will propagate as a 500, which is bad for people monitoring errors.

OpenSearchException is the generic top-level exception class, it defaults as internal server error when the status isn't set. You want to use one of its subclasses.

TaskCancelledException seems like a better choice.

@Swiddis
Copy link
Copy Markdown
Collaborator

Swiddis commented Mar 23, 2026

During e2e testing with the OSD cancel button, we noticed that Thread.interrupt() only takes effect when the execution thread hits a blocking call (I/O, Future.get(), etc.). For CPU-bound queries like large joins, this can mean a 50+ second delay between clicking cancel and the query actually stopping.

That's on me lol, I thought when I implemented it that the CPU part of joins would be fast (the bulk of the time on most joins is waiting for batches), so it'd be overkill to check every iteration and we could just throw when we request new batches. Happy to take it up if there's a followup task. Do you have more precise repro steps? Since I haven't noticed this issue before.

I'm also interested in if doing this on every row would have a negative effect on benchmarks.


@Override
public boolean shouldCancelChildrenOnCancellation() {
return false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be false? Query tasks tend to spawn background IO tasks, but I don't know if those count as children.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, updated it to true.

@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Mar 23, 2026

Do you have more precise repro steps?

I was doing a join query from OSD using POC OSD side change for the cancel button. And after query cancelled, the stacktrace showed up 50 seconds later

@ahkcs
Copy link
Copy Markdown
Collaborator

ahkcs commented Mar 23, 2026

Based on @Swiddis's observation that almost no core code overrides onCancelled() — it looks like core's cancellation model is cooperative: cancel() flips isCancelled() to true, and the task is expected to poll that flag and stop itself (e.g., search tasks check task.isCancelled() per Lucene segment).

I think a simpler approach might work here: pass the CancellableTask reference to the worker thread, and check task.isCancelled() in OpenSearchIndexEnumerator.moveNext(). This would remove the need for the CancellationCallBack interface, the ThreadLocal callback, onCancelled() override, thread interruption, and the AtomicReference<Thread> — all replaced by one isCancelled() check per row.

@Swiddis
Copy link
Copy Markdown
Collaborator

Swiddis commented Mar 23, 2026

Oh, interesting. If I'm understanding right they implemented their own thread interruption mechanism? I wonder why not use regular interrupt(). Given we already implemented timeouts via interrupt, I wonder if it's worthwhile to refactor everything or we should just have cancellation point to interrupt. Open to either but given the current impl it'd be less work to stick with interrupts in the core places.

@Swiddis
Copy link
Copy Markdown
Collaborator

Swiddis commented Mar 23, 2026

Ok, talked with the core folks, task cancellation is the better approach. Since opensearch uses a thread pool system, interrupting the thread has potential to poison the whole pool. It works for us because we're careful to uninterrupt the thread when handling interruption errors, but there might theoretically be a path that leaves a perpetually-interrupted thread (either currently or in the future). Cancellation also has the benefit of propagating across clusters better (which is more relevant to Core which is fanning out requests to many data nodes regularly).

So the correct approach is change our thread interrupts to use the native task cancellation feature. If all that understanding is correct, it means we can also clean up the logic that un-interrupts the thread, but I think it's only like 2 lines anyway so not a massive win.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 24, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit b2f0dc0.

PathLineSeverityDescription
plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryRequest.java161lowUser-supplied 'queryId' from the request payload (parsed in PPLQueryRequestFactory.java) is embedded unsanitized into task description strings via getDescription(). While task descriptions are informational and not used for security decisions, arbitrary user input in system-visible task metadata could enable log injection or information disclosure if descriptions are surfaced in monitoring/audit tooling. This has a plausible legitimate explanation (task identification), but warrants review of how task descriptions are consumed downstream.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit fd7642c

@sunil9977
Copy link
Copy Markdown
Contributor Author

Thank you all for reviews.
I've refactored it to use existing methods rather than using thread.interrupt().
Please let me know if my understanding is correct.

LOG.warn(
"Query execution timed out after {}. Interrupting execution thread.",
timeout);
executionThread.interrupt();
Copy link
Copy Markdown
Collaborator

@Swiddis Swiddis Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future PR: this interrupt needs to be changed to cancellation as well (and all code paths that depend on it)

@Swiddis
Copy link
Copy Markdown
Collaborator

Swiddis commented Mar 24, 2026

@sunil9977 LGTM, one more chore: please add DCO signoffs to the commits as described in https://github.com/opensearch-project/sql/pull/5254/checks?check_run_id=68346380999 & https://github.com/opensearch-project/.github/blob/main/CONTRIBUTING.md#developer-certificate-of-origin.

…gestions.

Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b2f0dc0

@Swiddis Swiddis self-requested a review March 25, 2026 03:13
@Swiddis Swiddis enabled auto-merge (squash) March 25, 2026 07:43
@Swiddis Swiddis merged commit d903f24 into opensearch-project:main Mar 25, 2026
38 checks passed
ahkcs added a commit to ahkcs/OpenSearch-Dashboards that referenced this pull request Mar 25, 2026
Enable cancellation of in-flight PPL queries from the Discover UI cancel
button by leveraging the OpenSearch task framework.

- Generate a client-side UUID (queryId) for each synchronous PPL query
- Pass queryId through the full request pipeline to OpenSearch PPL API
- Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks
  via _tasks API, matches by queryId in task description, and cancels all
  matching tasks
- Wire the existing Discover abort flow to call the cancel route on abort
- Cancel all matching tasks (data + histogram queries share the same queryId)

Depends on opensearch-project/sql#5254 for backend task registration.

Signed-off-by: Andrew Kim <andrewkcs@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added a commit to ahkcs/OpenSearch-Dashboards that referenced this pull request Mar 25, 2026
Enable cancellation of in-flight PPL queries from the Discover UI cancel
button by leveraging the OpenSearch task framework.

- Generate a client-side UUID (queryId) for each synchronous PPL query
- Pass queryId through the full request pipeline to OpenSearch PPL API
- Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks
  via _tasks API, matches by queryId in task description, and cancels all
  matching tasks
- Wire the existing Discover abort flow to call the cancel route on abort
- Cancel all matching tasks (data + histogram queries share the same queryId)

Depends on opensearch-project/sql#5254 for backend task registration.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
mengweieric pushed a commit to opensearch-project/OpenSearch-Dashboards that referenced this pull request Mar 26, 2026
* [Query] Add PPL query cancellation support via Discover cancel button

Enable cancellation of in-flight PPL queries from the Discover UI cancel
button by leveraging the OpenSearch task framework.

- Generate a client-side UUID (queryId) for each synchronous PPL query
- Pass queryId through the full request pipeline to OpenSearch PPL API
- Add POST /api/enhancements/ppl/cancel server route that lists PPL tasks
  via _tasks API, matches by queryId in task description, and cancels all
  matching tasks
- Wire the existing Discover abort flow to call the cancel route on abort
- Cancel all matching tasks (data + histogram queries share the same queryId)

Depends on opensearch-project/sql#5254 for backend task registration.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Fire-and-forget PPL cancel request to avoid blocking UI

Remove await from the cancel HTTP call so the AbortError propagates
immediately. The cancel is a best-effort notification to the backend
and should not block the frontend from moving on.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Use exact PPL action name instead of wildcard in task filter

Use 'cluster:admin/opensearch/ppl' instead of '*ppl*' to avoid
matching unrelated tasks. The _tasks API does not support filtering
by description or queryId, so client-side matching remains necessary.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Revert action filter back to wildcard *ppl*

The exact action name is less resilient to future changes. The wildcard
is sufficient since we still match by queryId in the task description.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added a commit that referenced this pull request Mar 30, 2026
* Init CLAUDE.md (#5259)

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Add label to exempt specific PRs from stalled labeling (#5263)

* Implement `reverse` performance optimization (#4775)

Co-authored-by: Jialiang Liang <jiallian@amazon.com>

* Add songkant-aws as maintainer (#5244)

* Move some maintainers from active to Emeritus (#5260)

* Move inactive current maintainers to Emeritus

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Remove affiliation column for emeritus maintainers

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* formatted

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Fix formatting in MAINTAINERS.md

Signed-off-by: Simeon Widdis <sawiddis@gmail.com>

Signed-off-by: Simeon Widdis <sawiddis@gmail.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@gmail.com>
Co-authored-by: Simeon Widdis <sawiddis@gmail.com>

* Add query cancellation support via _tasks/_cancel API for PPL queries (#5254)

* Add query cancellation support via _tasks/_cancel API for PPL queries

Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>

* Refactor PPL query cancellation to cooperative model and other PR suggestions.

Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>

---------

Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>

* Add Calcite native SQL planning in UnifiedQueryPlanner (#5257)

* feat(api): Add Calcite native SQL planning path in UnifiedQueryPlanner

Add SQL support to the unified query API using Calcite's native parser
pipeline (SqlParser → SqlValidator → SqlToRelConverter → RelNode),
bypassing the ANTLR parser used by PPL.

Changes:
- UnifiedQueryPlanner: use PlanningStrategy to dispatch
  CalciteNativeStrategy vs CustomVisitorStrategy
- CalciteNativeStrategy: Calcite Planner with try-with-resources
  for ANSI SQL
- CustomVisitorStrategy: ANTLR-based path for PPL (and future SQL V2)
- UnifiedQueryContext: SqlParser.Config with Casing.UNCHANGED to
  preserve lowercase OpenSearch index names

Signed-off-by: Chen Dai <daichen@amazon.com>

* test(api): Add SQL planner tests and refactor test base for multi-language support

- Refactor UnifiedQueryTestBase with queryType() hook for subclass override
- Add UnifiedSqlQueryPlannerTest covering SELECT, WHERE, GROUP BY, JOIN,
  ORDER BY, subquery, case sensitivity, namespaces, and error handling
- Update UnifiedQueryContextTest to verify SQL context creation

Signed-off-by: Chen Dai <daichen@amazon.com>

* perf(benchmarks): Add SQL queries to UnifiedQueryBenchmark

Add language (PPL/SQL) and queryPattern param dimensions for
side-by-side comparison of equivalent queries across both languages.
Remove separate UnifiedSqlQueryBenchmark in favor of unified class.

Signed-off-by: Chen Dai <daichen@amazon.com>

* docs(api): Update README to reflect SQL support in UnifiedQueryPlanner

Signed-off-by: Chen Dai <daichen@amazon.com>

* fix(api): Normalize trailing whitespace in assertPlan comparison

RelOptUtil.toString() appends a trailing newline after the last plan
node, which doesn't match Java text block expectations. Also add
\r\n normalization for Windows CI compatibility, consistent with
the existing pattern in core module tests.

Signed-off-by: Chen Dai <daichen@amazon.com>

---------

Signed-off-by: Chen Dai <daichen@amazon.com>

* [Feature] Support graphLookup with literal value as its start (#5253)

* [Feature] Support graphLookup as top-level PPL command (#5243)

Add support for graphLookup as the first command in a PPL query with
literal start values, instead of requiring piped input from source=.

Syntax:
  graphLookup table start="value" edge=from-->to as output
  graphLookup table start=("v1", "v2") edge=from-->to as output

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Spotless check

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Ignore child pipe if using start value

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Add graphLookup integration tests per PPL command checklist

- Add explain plan tests in CalciteExplainIT with YAML assertions
- Add v2-unsupported tests in NewAddedCommandsIT
- Add CalcitePPLGraphLookupIT to CalciteNoPushdownIT suite
- Skip graphLookup tests when pushdown is disabled (required by impl)
- Add expected plan YAML files for piped and top-level graphLookup

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Remove brace of start value list

Signed-off-by: Heng Qian <qianheng@amazon.com>

---------

Signed-off-by: Heng Qian <qianheng@amazon.com>

* Apply docs website feedback to ppl functions (#5207)

* apply doc website feedback to ppl functions

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>

* take out comments

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>

* fix json_append example

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>

* fix json_append example

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>

* fix links

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>

---------

Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>
Signed-off-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com>

* feat(api): Add profiling support to unified query API (#5268)

Add query profiling infrastructure that measures time spent in each
query phase (analyze, optimize, execute, format). Profiling is opt-in
via UnifiedQueryContext.builder().profiling(true) and uses thread-local
context to avoid passing profiling state through every method.

Key changes:
- QueryProfiling/ProfileContext for thread-local profiling lifecycle
- UnifiedQueryContext.measure() API for timing arbitrary phases
- Auto-profiling in UnifiedQueryPlanner (analyze) and compiler (optimize)
- UnifiedQueryTestBase shared test fixture for unified query tests
- Comprehensive profiling tests with non-flaky >= 0 timing assertions

Signed-off-by: Chen Dai <daichen@amazon.com>

* Add UnifiedQueryParser with language-specific implementations (#5274)

Extract parsing logic from UnifiedQueryPlanner into a UnifiedQueryParser
interface with language-specific implementations: PPLQueryParser (returns
UnresolvedPlan) and CalciteSqlQueryParser (returns SqlNode).

UnifiedQueryContext owns the parser instance, created eagerly by the
builder which has direct access to query type and future SQL config.
Each implementation receives only its required dependencies:
PPLQueryParser takes Settings, CalciteSqlQueryParser takes
CalcitePlanContext. UnifiedQueryPlanner.CustomVisitorStrategy now obtains
the parser from the context via the interface type.

Signed-off-by: Chen Dai <daichen@amazon.com>

* Fix flaky TPC-H Q1 test due to bugs in `MatcherUtils.closeTo()` (#5283)

* Fix the flaky tpch Q1

Signed-off-by: Lantao Jin <ltjin@amazon.com>

* Change to ULP-aware to handle floating-point precision differences

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Lantao Jin <ltjin@amazon.com>

---------

Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@gmail.com>
Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>
Signed-off-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Co-authored-by: qianheng <qianheng@amazon.com>
Co-authored-by: Simeon Widdis <sawiddis@gmail.com>
Co-authored-by: Jialiang Liang <jiallian@amazon.com>
Co-authored-by: Lantao Jin <ltjin@amazon.com>
Co-authored-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
Co-authored-by: Chen Dai <daichen@amazon.com>
Co-authored-by: ritvibhatt <53196324+ritvibhatt@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Support task cancellation for long-running PPL queries.

3 participants