Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Oct 4, 2021

No description provided.

@bkietz bkietz requested a review from pitrou October 4, 2021 16:35
@github-actions
Copy link

github-actions bot commented Oct 4, 2021

@apache apache deleted a comment from github-actions bot Oct 4, 2021
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice doc!

compute functions <invoking compute functions>` is not feasible
in either memory or computation time. Doing so causes all intermediate
data to be fully materialized. To facilitate arbitrarily large inputs
and more efficient resource usage, arrow also provides a streaming query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say "Arrow C++" rather than "arrow" here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer, but I think that's implicit in the declared namespace of this doc


.. image:: simple_graph.svg

:class:`ExecNode` is provided to reify the graph of operations in a query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe state explicitly that ExecNode represent the nodes of the graph which can perform processing, ....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's implied by the name ExecNode and the following sentence. I'm not sure how to make this more clear without making the sentence confusingly wordy

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good addition. In the future we might want to add a bit more introduction. At the moment a fairly naive user will probably be a bit lost at "is provided to reify the graph of operations in a query" (e.g. "what do graphs have to do with streaming execution"?)

Although a naive user is admittedly probably going to be using SQL or some other front end. It probably still wouldn't hurt to set the context a little. I don't think we need to address that today though. This adds valuable information and it would be good to get it in place for the 6.0.0 release.

My comments are all nits, so take them or leave them. This could be merged as is and I would be content.

Comment on lines 158 to 161
if (need_stop) {
// stop all nodes in the graph
plan->StopProducing();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of confusing. What is need_stop? Why would it be used here? If this is an example of how you would cancel a running plan then I think the pseudocode would be more along the lines of...

while (!plan->finished() && !user_requested_cancellation) {
  WaitForSignal();
}

if (!plan->finished()) {
  plan->StopProducing();
}

...but I'm not sure that is any more clear. Maybe it is easiest to just remove this block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I'll make this a bit more clear by phrasing it as an optional callback

Comment on lines +275 to +276
to the default registry with the name ``"scan"`` by calling
``arrow::dataset::internal::Initialize()``::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you also need to call arrow::dataset::internal::Initialize() before you use the write factory used above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more calls to Initialize() so that any of these snippets should be well formed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we can resolve ARROW-13773 and remove these :)

:class:`ExecNodeOptions`.

:struct:`Declaration`
``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the user has an R background the dplyr reference isn't helping much. Maybe just "A helper for efficient construction of sequences of ExecNodes"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think giving something to look up is preferable; if anyone isn't familiar the worst that can happen is they read the example to see what Declaration can be used for

MakeExecNode("write", plan.get(), {project_node},
WriteNodeOptions{/*base_dir=*/"/dat", /*...*/});

:struct:`Declaration` is a `dplyr <https://dplyr.tidyverse.org>`-inspired
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about dplyr here.

``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`.

:struct:`ExecBatch`
A lightweight container for a single chunk of data in the Arrow format. In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A lightweight container for a single chunk of data in the Arrow format. In
A lightweight container for columns of data in the Arrow format. In

"A single chunk" makes me think "contiguous"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecBatch's columns are a contiguous chunk- if you have a float32 column with no nulls that's stored in a single buffer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but you have multiple columns and each one is its own set of contiguous chunks. I.e. it's not a pandas block.

bkietz and others added 9 commits October 15, 2021 09:57
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@bkietz
Copy link
Member Author

bkietz commented Oct 15, 2021

+1, merging

@bkietz bkietz closed this in 8650c23 Oct 15, 2021
@bkietz bkietz deleted the 13227-Document-ExecNode-ExecPla branch October 15, 2021 18:43
@ursabot
Copy link

ursabot commented Oct 15, 2021

Benchmark runs are scheduled for baseline = 444cdac and contender = 8650c23. 8650c23 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.18%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants