
Conversation


@westonpace westonpace commented Aug 27, 2021

This PR adds a write node. The write node takes in `FileSystemDatasetWriteOptions` and the projected schema and writes the incoming data to disk. It is a sink node, but it is a bit different from the existing sink node. The existing sink node transferred batches via an `AsyncGenerator`, which puts ownership of the batches outside the push-based flow of the exec plan. I added a new `ConsumingSinkNode` which consumes the batches as part of the push-based flow. This makes it possible to block the exec plan from finishing until all data has been written to disk.
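The distinction can be illustrated with a heavily simplified, hypothetical sketch (this is not the real Arrow API; `MiniConsumingSink`, `InputReceived`, and `InputFinished` are stand-in names): the sink consumes each batch inside the push-based flow and only finishes once the producer signals completion, so the plan cannot complete before all consumption has happened.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical, simplified sketch of the ConsumingSinkNode idea.
class MiniConsumingSink {
 public:
  // Called by the plan for each input batch; consumption (e.g. a disk
  // write) happens here, as part of the push-based flow.
  void InputReceived(int batch) { consumed_.push_back(batch); }

  // Called once the producer is done; only now does the sink finish,
  // so the plan's "finished" future stays open until then.
  void InputFinished() { finished_.set_value(); }

  std::future<void> finished() { return finished_.get_future(); }
  const std::vector<int>& consumed() const { return consumed_; }

 private:
  std::vector<int> consumed_;
  std::promise<void> finished_;
};
```

A `SinkNode`, by contrast, hands batches out through an `AsyncGenerator`, so whoever pulls from the generator owns the remaining work and the plan does not wait for it.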

In addition, this PR refines the `AsyncTaskGroup` a little. `WaitForTasksToFinish` was not a very clearly named method: once called, it actually transitioned the task group from a "top-level tasks can be added" state to a "no more top-level tasks can be added" state, and that was not clear from the name. The new name (`End`) is hopefully clearer.
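The state transition behind the rename can be sketched as follows (a hypothetical, synchronous stand-in, not Arrow's actual `AsyncTaskGroup`; `MiniAsyncTaskGroup` and its members are invented for illustration):

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: End() moves the group into a state where new
// top-level tasks are rejected; the group counts as finished once it
// has ended and no tasks remain running.
class MiniAsyncTaskGroup {
 public:
  // Returns false after End() has been called: the transition to the
  // "no more top-level tasks" state is what the old name hid.
  bool AddTask(const std::function<void()>& task) {
    if (ended_) return false;
    ++running_;
    task();  // the real group would run this asynchronously
    --running_;
    return true;
  }

  void End() { ended_ = true; }
  bool Finished() const { return ended_ && running_ == 0; }

 private:
  bool ended_ = false;
  int running_ = 0;
};
```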

This PR does not solve the backpressure problem. Instead, a serial async task group was created. This runs all tasks in order (the default task group allows them to run in parallel) but not necessarily on the calling thread (i.e. unlike `SerialTaskGroup`, we do not block in the `AddTask` method). This allows tasks to pile up in a queue and, if the write is slow, the queue becomes a pile-up point that will eventually run out of memory (provided enough data is being written). That problem is solved in a follow-up, #11286.
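The serial-ordering idea can be sketched with a minimal stand-in (again hypothetical, not the real `AsyncSerializedTaskGroup`): `AddTask` never blocks the caller; a single worker drains the queue strictly in submission order, and with no backpressure a slow consumer lets that queue grow without bound.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch of a serialized task group.
class MiniSerializedTaskGroup {
 public:
  MiniSerializedTaskGroup() : worker_([this] { Run(); }) {}

  ~MiniSerializedTaskGroup() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // drains any queued tasks before returning
  }

  // Never blocks: the task is queued for the worker thread.
  void AddTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
      if (queue_.empty()) return;  // done_ is set and all tasks have run
      auto task = std::move(queue_.front());
      queue_.pop_front();
      lock.unlock();
      task();  // tasks run one at a time, in submission order
      lock.lock();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;
};
```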

The `AsyncSerializedTaskGroup` and `AsyncTaskGroup` classes have very similar APIs, but I did not create an interface / abstract base class because I don't yet envision any case where they would be interchangeable. The distinction is "can these tasks run in parallel or not", not a performance / resource question.

As a consequence of using the ExecPlan, dataset writes no longer have reliable ordering. If you pass in three batches and they are all destined for the same file, then the batches may be written to the destination file in any order. This is because the ExecPlan creates a thread task for each input batch, so the batches can arrive at the write node in any order.

westonpace force-pushed the feature/ARROW-13542-write-node branch from c651c4a to 4240953 on October 4, 2021
westonpace marked this pull request as ready for review on October 4, 2021
@lidavidm left a comment:

LGTM. There are a lot of files touched, but things look straightforward.

///
/// The future will be marked complete if all `futures` complete
/// successfully. Otherwise, it will be marked failed with the status of
/// the first failing future.
A reviewer commented:

I left a nit about the docstring on the MakeTask PR: #11210 (comment)

westonpace (author) replied:

I clarified the comment according to that review and also changed the name to AllFinished.
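The docstring semantics quoted above (complete if all futures succeed, otherwise fail with the status of the first failing future) can be sketched with a plain stand-in status type rather than Arrow's real `Future`/`Status` machinery (`MiniStatus` and this `AllFinished` signature are invented for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::Status.
struct MiniStatus {
  bool ok;
  std::string message;
};

// Success only if every input succeeded; otherwise the first
// failure (in input order) determines the result.
MiniStatus AllFinished(const std::vector<MiniStatus>& statuses) {
  for (const auto& st : statuses) {
    if (!st.ok) return st;
  }
  return {true, ""};
}
```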

// is finished. Use SinkNode if you are transferring the ownership of the data to another
// system. Use ConsumingSinkNode if the data is being consumed within the exec plan (i.e.
// the exec plan should not complete until the consumption has completed).
class ConsumingSinkNode : public ExecNode {
A reviewer commented:

Should we add an explicit test of this node in plan_test.cc, as with the other nodes?

westonpace (author) replied:

I added a few unit tests to plan_test.cc

aocsa commented Oct 6, 2021

LGTM from me too. I reviewed the code, and there are some common patterns for handling signals, the input counter, and "future finished" variables that recur across different ExecNodes (such as MapExecNode). I think these patterns should be documented somewhere, especially for newcomers who don't want to run into race conditions.

@lidavidm left a comment:

Thanks for the updates. LGTM

westonpace closed this in 4b8ffe4 on Oct 8, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
…ng rows from an ExecPlan to disk


Closes apache#11017 from westonpace/feature/ARROW-13542-write-node

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
westonpace deleted the feature/ARROW-13542-write-node branch on January 6, 2022
