-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13227: [Documentation][Compute] Document ExecNode #11309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice doc!
| compute functions <invoking compute functions>` is not feasible | ||
| in either memory or computation time. Doing so causes all intermediate | ||
| data to be fully materialized. To facilitate arbitrarily large inputs | ||
| and more efficient resource usage, arrow also provides a streaming query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say "Arrow C++" rather than "arrow" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer, but I think that's implicit in the declared namespace of this doc
|
|
||
| .. image:: simple_graph.svg | ||
|
|
||
| :class:`ExecNode` is provided to reify the graph of operations in a query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe state explicitly that ExecNode represent the nodes of the graph which can perform processing, ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's implied by the name ExecNode and the following sentence. I'm not sure how to make this more clear without making the sentence confusingly wordy
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good addition. In the future we might want to add a bit more introduction. At the moment a fairly naive user will probably be a bit lost at "is provided to reify the graph of operations in a query" (e.g. "what do graphs have to do with streaming execution"?)
Although a naive user is admittedly probably going to be using SQL or some other front end. It probably still wouldn't hurt to set the context a little. I don't think we need to address that today though. This adds valuable information and it would be good to get it in place for the 6.0.0 release.
My comments are all nits, so take them or leave them. This could be merged as is and I would be content.
| if (need_stop) { | ||
| // stop all nodes in the graph | ||
| plan->StopProducing(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of confusing. What is need_stop? Why would it be used here? If this is an example of how you would cancel a running plan then I think the pseudocode would be more along the lines of...
while (!plan->finished() && !user_requested_cancellation) {
WaitForSignal();
}
if (!plan->finished()) {
plan->StopProducing();
}
...but I'm not sure that is any more clear. Maybe it is easiest to just remove this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I'll make this a bit more clear by phrasing it as an optional callback
| to the default registry with the name ``"scan"`` by calling | ||
| ``arrow::dataset::internal::Initialize()``:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically you also need to call arrow::dataset::internal::Initialize() before you use the write factory used above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added more calls to Initialize() so that any of these snippets should be well formed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we can resolve ARROW-13773 and remove these :)
| :class:`ExecNodeOptions`. | ||
|
|
||
| :struct:`Declaration` | ||
| ``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless the user has an R background the dplyr reference isn't helping much. Maybe just "A helper for efficient construction of sequences of ExecNodes"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think giving something to look up is preferable; if anyone isn't familiar the worst that can happen is they read the example to see what Declaration can be used for
| MakeExecNode("write", plan.get(), {project_node}, | ||
| WriteNodeOptions{/*base_dir=*/"/dat", /*...*/}); | ||
|
|
||
| :struct:`Declaration` is a `dplyr <https://dplyr.tidyverse.org>`-inspired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about dplyr here.
| ``dplyr``-inspired helper for efficient construction of an :class:`ExecPlan`. | ||
|
|
||
| :struct:`ExecBatch` | ||
| A lightweight container for a single chunk of data in the Arrow format. In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| A lightweight container for a single chunk of data in the Arrow format. In | |
| A lightweight container for columns of data in the Arrow format. In |
"A single chunk" makes me think "contiguous"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExecBatch's columns are a contiguous chunk- if you have a float32 column with no nulls that's stored in a single buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but you have multiple columns and each one is its own set of contiguous chunks. I.e. it's not a pandas block.
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
|
+1, merging |
|
Benchmark runs are scheduled for baseline = 444cdac and contender = 8650c23. 8650c23 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.