GH-32653: [C++] Cleanup error handling in execution engine #15253

westonpace · 2023-01-08T19:47:28Z

Simplifies error handling in exec plans.

There were several different places that exec plan failures could be reported. Now there is just one.

The ExecNode::ErrorReceived mechanism was removed. InputReceived and InputFinished now return a Status instead. This lets nodes use the existing macros instead of helpers like ErrorNotOk, removes the burden of error propagation from nodes, and removes the burden of error handling from sink nodes.
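To illustrate the idea of Status-returning input callbacks, here is a minimal, self-contained sketch. It does not use Arrow itself: `Status`, `SKETCH_RETURN_NOT_OK`, `Node`, `SumSink`, and `DoubleNode` are hypothetical stand-ins invented for this example, and batches are reduced to plain integers.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified stand-in for arrow::Status (hypothetical, not the real class).
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status() : ok_(true) {}
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Stand-in for a "return early on failure" macro (assumed shape).
#define SKETCH_RETURN_NOT_OK(expr) \
  do {                             \
    Status _st = (expr);           \
    if (!_st.ok()) return _st;     \
  } while (false)

// Hypothetical node interface: the input callback returns a Status.
struct Node {
  virtual ~Node() = default;
  virtual Status InputReceived(int batch) = 0;
};

// A sink that sums its input and rejects negative batches.
struct SumSink : Node {
  int total = 0;
  Status InputReceived(int batch) override {
    if (batch < 0) return Status::Invalid("negative batch");
    total += batch;
    return Status::OK();
  }
};

// A pass-through node: a downstream failure is returned to the caller
// unchanged, with no separate error callback involved.
struct DoubleNode : Node {
  explicit DoubleNode(Node* output) : output_(output) {}
  Status InputReceived(int batch) override {
    SKETCH_RETURN_NOT_OK(output_->InputReceived(batch * 2));
    return Status::OK();
  }
  Node* output_;
};
```

In this sketch the intermediate node has no error-specific code beyond the macro call, which is the simplification the paragraph above describes.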

ExecNode::finished has been removed. It could lead to deadlock if a node failed to mark the future complete (which was easy to do in error scenarios). In addition, it served no real purpose: a plan is done when all of its tasks have finished.

BREAKING CHANGE: ExecPlan::StartProducing now returns void. Errors that were returned from this method will now be returned from ExecPlan::finished.

BREAKING CHANGE: If a plan is stopped early (with ExecPlan::StopProducing) then it will complete with a cancelled status instead of an ok status (assuming no other errors). This is to reflect the fact that the plan did not produce complete data.

BREAKING CHANGE: Previously the sink node would push some plan errors onto the generator. Now, all errors will be output on ExecPlan::finished. The sink node will never push an error, only batches. Readers should make sure to check ExecPlan::finished.

BREAKING CHANGE: When a plan is cancelled it will no longer attempt to flush output. For example, a plan with an aggregate node will not produce an aggregation based on partial results after a cancel.
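The breaking changes above can be sketched with a toy plan. This is only an illustration under simplifying assumptions: `ToyPlan`, `Code`, and `Status` are hypothetical, execution is synchronous, and `finished()` returns the final status directly rather than a future as ExecPlan::finished does.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for arrow::Status (hypothetical).
enum class Code { kOk, kCancelled, kInvalid };

struct Status {
  Status() = default;
  Status(Code c, std::string m) : code(c), message(std::move(m)) {}
  bool ok() const { return code == Code::kOk; }
  Code code = Code::kOk;
  std::string message;
};

// Toy plan that "produces" a fixed list of integer batches.
class ToyPlan {
 public:
  explicit ToyPlan(std::vector<int> batches) : batches_(std::move(batches)) {}

  // Ask the plan to stop before it emits its next batch.
  void StopProducing() { stop_requested_ = true; }

  // Returns void: every outcome, including failure, is reported by finished().
  void StartProducing() {
    for (int batch : batches_) {
      if (stop_requested_) {
        // Stopped early: report cancellation instead of OK.
        finished_ = Status(Code::kCancelled, "plan stopped early");
        return;
      }
      if (batch < 0) {
        finished_ = Status(Code::kInvalid, "negative batch");
        return;
      }
      output_.push_back(batch);  // the output only ever holds batches
    }
  }

  // Single place to look for the plan's result (valid once StartProducing
  // has returned in this synchronous sketch).
  const Status& finished() const { return finished_; }
  const std::vector<int>& output() const { return output_; }

 private:
  std::vector<int> batches_;
  std::vector<int> output_;
  bool stop_requested_ = false;
  Status finished_;
};
```

A reader of this toy plan drains `output()` and then checks `finished()`. An empty or short output alone does not say whether the plan succeeded, failed, or was cancelled.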

Closes: [C++] Centralize Errors in ExecPlan #32653

github-actions · 2023-01-08T19:47:50Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of old issues on JIRA the title also supports:

ARROW-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2023-01-08T19:49:00Z

https://issues.apache.org/jira/browse/ARROW-17381

github-actions · 2023-01-08T19:49:02Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

westonpace · 2023-01-08T19:50:35Z

This was largely based on #13848 (thanks @save-buffer). In fact, I started by trying to rebase #13848 but it ended up being too much work.

@save-buffer do you want to take a look at this PR? It's mostly just an extension of your PR. Probably the most significant change is that I did not try to collect multiple errors and instead stuck with the pattern of "return the first error and ignore subsequent errors".
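For readers unfamiliar with the "first error wins" pattern, a minimal sketch follows. `FirstErrorCollector` and this `Status` type are hypothetical names for illustration only and are not how the PR implements it.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Simplified stand-in for arrow::Status (hypothetical).
struct Status {
  Status() = default;
  explicit Status(std::string msg) : ok(false), message(std::move(msg)) {}
  bool ok = true;
  std::string message;
};

// Keeps the first failure reported and silently drops any later ones.
class FirstErrorCollector {
 public:
  void Report(const Status& st) {
    if (st.ok) return;  // successes never overwrite anything
    std::lock_guard<std::mutex> lock(mutex_);
    if (first_error_.ok) first_error_ = st;  // only the first failure sticks
  }

  // Returns OK if nothing failed, otherwise the first recorded failure.
  Status result() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return first_error_;
  }

 private:
  mutable std::mutex mutex_;
  Status first_error_;
};
```

The trade-off is that secondary failures are lost, but callers get one stable status to inspect.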

westonpace · 2023-01-08T19:59:13Z

Hmm, it appears that OpenTelemetry was depending on the finished_ future to mark how long a node took to run. I'm not entirely sure I agree with the concept of span-per-node (I think a span should be tied to something more like a thread task).

However, this PR, as it stands, will very much break OT output (spans won't be finished and so simply will not output). I'll try and put up a PR on Monday to fix OT that can come before this one.

save-buffer

Overall seems good. A couple of nits here and there. At first I thought that making StartProducing return void was a bit contentious, but now I like that it makes a plan's Status always be stored in one place, no matter where it fails. Other than that, I'd like to see MapNode removed (lower priority) and to think more about our definition of a sink node (higher priority). But overall I like this.

save-buffer · 2023-01-09T21:52:20Z

cpp/src/arrow/compute/exec/map_node.cc

Would it make sense to remove MapNode while we're at it?

I poked at that a little. I think it still serves some benefit:

Forwards backpressure

Forwards guarantees (and soon, batch index)

Uses an AtomicCounter to trigger an optional finish signal when all processing is done (used by the tee node)

I think this justifies its existence.
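The counter-triggered finish signal mentioned above can be sketched roughly as follows. `CompletionCounter` is a hypothetical class written for this example. It is not Arrow's AtomicCounter and assumes the total number of batches is known up front and is positive.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical counter that fires an optional callback once all expected
// batches have been processed.
class CompletionCounter {
 public:
  CompletionCounter(int total, std::function<void()> on_finished)
      : remaining_(total), on_finished_(std::move(on_finished)) {}

  // Call once per processed batch. fetch_sub returns the previous value, so
  // exactly one caller observes 1 and triggers the finish signal, even when
  // several threads call this concurrently.
  void MarkOneDone() {
    if (remaining_.fetch_sub(1) == 1 && on_finished_) {
      on_finished_();
    }
  }

 private:
  std::atomic<int> remaining_;
  std::function<void()> on_finished_;
};
```

Passing an empty `std::function` makes the signal optional: the counter still counts down, but nothing is invoked at the end.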

save-buffer · 2023-01-09T21:53:53Z

cpp/src/arrow/compute/exec/query_context.cc

Can probably remove return from here

save-buffer · 2023-01-09T21:56:45Z

cpp/src/arrow/compute/exec/exec_plan.h

Why does a sink not have an output schema? Sinks still output batches. I think it would be more correct to define a node to be a sink if its output_ is null.

I only need is_sink for validation purposes at the moment. If a node is a sink we validate that output_ is null (e.g. we don't let you try and use it as an input during plan creation). So there are options here:

We can get rid of the validation check and leave it to the individual nodes (I think I may have already added this validation to SinkNode). This would probably be my preference.

We can pass in some is_sink bool to the ExecNode constructor (I'd rather not do this as it is tedious but it is essentially what we had before)

I'll make sure I can implement the first option and, if so, get rid of this concept.

That being said, I am a bit curious why this is important. I think it is valid to say that a sink node doesn't have an output schema. We don't rely on it having an output schema anywhere (or we shouldn't) that I know of.

@save-buffer ping

I guess it's a pretty semantic thing, but I feel like a sink node has an output schema, just not a node that it outputs to (since a sink node does output to an AsyncGenerator). Maybe it doesn't matter too much though

For the sake of simplicity I'm going to leave this as-is (I'm about to merge this if CI passes). However, if we want to amend this definition (and remove this check) later then I have no problem with that.

github-actions · 2023-01-23T19:13:27Z

Closes: [C++] Centralize Errors in ExecPlan #32653

github-actions · 2023-01-23T19:13:29Z

⚠️ GitHub issue #32653 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/compute/exec/exec_plan.h

lidavidm · 2023-01-24T14:35:52Z

cpp/src/arrow/compute/exec/asof_join_node.cc

-      Defer cleanup([this]() { finished_.MarkFinished(); });
-      outputs_[0]->InputFinished(this, batches_produced_);
-    }));
+    ARROW_UNUSED(


Should this instead be a DCHECK or a warning or similar?

If Spawn fails it means the thread pool has shut down. There is no point in marking the plan finished and we might as well avoid an abort here so we don't accidentally hide the true cause.

paleolimbot

Thanks for doing these R changes!

kou

+1 for the GLib part.

ursabot · 2023-01-26T11:55:00Z

Benchmark runs are scheduled for baseline = 6abe6b6 and contender = 295c664. 295c664 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.86% ⬆️0.19%] test-mac-arm
[Failed ⬇️2.31% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.12% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 295c6644 ec2-t3-xlarge-us-east-2
[Failed] 295c6644 test-mac-arm
[Failed] 295c6644 ursa-i9-9960x
[Finished] 295c6644 ursa-thinkcentre-m75q
[Finished] 6abe6b69 ec2-t3-xlarge-us-east-2
[Failed] 6abe6b69 test-mac-arm
[Failed] 6abe6b69 ursa-i9-9960x
[Finished] 6abe6b69 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-01-26T11:55:13Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

github-actions bot added Component: C++ Component: Python Component: R labels Jan 8, 2023

westonpace changed the title ~~ARROW_17381: [C++] Cleanup error handling~~ ARROW-17381: [C++] Cleanup error handling Jan 8, 2023

westonpace requested a review from bkietz January 8, 2023 20:00

westonpace force-pushed the feature/ARROW-17381--cleanup-error-handling branch from 441184d to 2fc72c2 Compare January 8, 2023 20:07

westonpace mentioned this pull request Jan 8, 2023

[CI][C++] arrow-compute ExecPlanExecution.StressSourceGroupedSumStop timeout #15243

Closed

save-buffer reviewed Jan 9, 2023

This was referenced Jan 11, 2023

[C++] Sporadic DCHECK failure in arrow-dataset-scanner-test (2) #32430

Closed

GH-15243: [C++] fix for potential deadlock in the group-by node #33700

Merged

pitrou changed the title ~~ARROW-17381: [C++] Cleanup error handling~~ ARROW-17381: [C++] Cleanup error handling in execution engine Jan 17, 2023

westonpace mentioned this pull request Jan 19, 2023

GH-33737: [C++] simplify exec plan tracing #33738

Merged

westonpace added 3 commits January 23, 2023 11:10

Removed ExecNode::finished and ExecNode::ErrorReceived. They were add…

0c688cd

…ing complexity without value and also error prone.

Update bindings to new API

1c9c49b

LINT

e63fc49

westonpace force-pushed the feature/ARROW-17381--cleanup-error-handling branch from 2fc72c2 to e63fc49 Compare January 23, 2023 19:12

westonpace requested review from AlenkaF, paleolimbot and thisisnic as code owners January 23, 2023 19:12

westonpace changed the title ~~ARROW-17381: [C++] Cleanup error handling in execution engine~~ GH-32653: [C++] Cleanup error handling in execution engine Jan 23, 2023

asfimport mentioned this pull request Jan 23, 2023

[C++] Centralize Errors in ExecPlan #32653

Closed

westonpace added 2 commits January 23, 2023 12:06

Cleanup some things missed during rebase

b515021

A few more pieces missed in the rebase'

fe83c28

westonpace requested a review from lidavidm as a code owner January 23, 2023 21:50

github-actions bot added the Component: FlightRPC label Jan 23, 2023

westonpace added 2 commits January 23, 2023 14:28

Update hash-join to expect a status from output batch and finished

1c2bda9

Patch glib

9ede805

westonpace requested a review from kou as a code owner January 24, 2023 14:29

github-actions bot added the Component: GLib label Jan 24, 2023

lidavidm approved these changes Jan 24, 2023

Slight tweak to how R consumes plans to ensure that the result of pla…

c2dc486

…n_->finished() is always looked at.

paleolimbot approved these changes Jan 25, 2023

kou approved these changes Jan 25, 2023

westonpace added 2 commits January 24, 2023 21:39

Mark plan status as finished when exiting the normal path in the reader

daa0f22

Minor docfix per PR review

c6cf35d

westonpace merged commit 295c664 into apache:master Jan 25, 2023

js8544 mentioned this pull request Feb 2, 2023

MINOR: [C++] Fix compilation error in hash_join_benchmark #34000

Merged
