GH-34136: [C++] Add a concept of ordering to ExecPlan (#34137)
westonpace merged 6 commits into apache:main
Conversation
Force-pushed from 4ffb32f to 7d55bc5
Force-pushed from 7d55bc5 to 3556eb3
@lidavidm Sorry, I requested your review a bit too early. I started working on the order-by node and realized that the ordering I had in place didn't support the ability to specify null placement. So I changed it to a proper class similar to SortNodeOptions. The change is pretty minor, but I'll leave it up in case you want to take another look.
jorisvandenbossche left a comment
Didn't look in detail at the C++ code, mostly took a look to see if I could follow the new logic, and added a few questions.
One more question: assume you would use something like DeclarationToTable, does that automatically use the ordering / batch indices if there is one, or do you still need to manually indicate that you want to use it? (The generic SinkNodeOptions has a sequence_delivery parameter with a default of false, but I don't see that exposed through the DeclarationTo.. versions.)
And looking forward to have this!
If there is an order based on a column "x", does that also guarantee something about the order within each batch? (or only between batches as this paragraph explains)
Yes, it should guarantee that the ordering exists within the batch as well.
I've updated the wording here to mention ordering within a batch.
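To make the guarantee concrete, here is a minimal stdlib-Python sketch (not Arrow code; the function name and list-of-lists batch model are made up for illustration) of what "ordered on a column" means for a stream of batches: values must be non-decreasing within each batch, and the last value of one batch must not exceed the first value of the next.

```python
# Hypothetical model: a "stream" is a list of batches, each batch a list of
# key-column values. Ordering must hold within batches AND across boundaries.
from typing import Iterable, List


def is_stream_ordered(batches: Iterable[List[int]]) -> bool:
    """Check ascending order within each batch and between batches."""
    prev_last = None
    for batch in batches:
        if any(a > b for a, b in zip(batch, batch[1:])):
            return False  # order violated inside a batch
        if batch:
            if prev_last is not None and prev_last > batch[0]:
                return False  # order violated at a batch boundary
            prev_last = batch[-1]
    return True


print(is_stream_ordered([[1, 2, 3], [3, 5], [6]]))  # True
print(is_stream_ordered([[1, 4], [2, 3]]))          # False (boundary)
print(is_stream_ordered([[1, 3, 2]]))               # False (within batch)
```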
What exactly is meant by "map node"? (I don't find this term used anywhere else in the compute / Acero code or docs.) Do we mean a node that uses typical element-wise scalar kernels? Also, something like a filter node will always preserve ordering.
OK, I searched for "map node" and not for "MapNode" ;)
And I see that indeed both Project and Filter node inherit from MapNode (it's still a term that is not used in our documentation though)
Ah, yes, this is probably an internal term. I'll update this.
I changed this to "A filter or project node..."
Just wondering: assume you would do a filter operation that filters out a large part of the data, so you might end up with several empty batches. Does that affect sink nodes like writing files? (Do we skip empty batches there, or do we then potentially write empty files for them?)
Good question. The dataset writer will discard empty batches without writing anything. However, the sink node still respects empty batches. For example, if one were doing dataset.to_batches(...) then they might see an empty batch.
I'm fairly certain this is consistent with the current implementation and not a change in behavior.
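The behavior described above can be sketched in plain Python (a hypothetical model, not the dataset writer's actual code): a filter may emit empty batches, a sink-style consumer sees them, and a writer-style consumer discards them.

```python
# Toy model: batches are lists of values; a filter keeps per-batch shape,
# so a batch whose rows are all filtered out becomes an empty batch.
def filter_batches(batches, predicate):
    for batch in batches:
        yield [row for row in batch if predicate(row)]


batches = [[1, 2], [10, 11], [3]]
filtered = list(filter_batches(batches, lambda x: x < 5))
print(filtered)                       # [[1, 2], [], [3]] -- a sink sees the empty batch
written = [b for b in filtered if b]  # a writer-style consumer drops empties
print(written)                        # [[1, 2], [3]]
```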
Ah, I see there is a new variant of … Follow-up question on this: the default for …
Correct. Maybe I should change this to an optional, where the default (nullopt) would sequence when the input to the sink node is ordered. This would mean we only default to false if there is an aggregate or join. Given that the cost of this sequencing should generally be pretty reasonable, I think it would be an ok default (and users could still disable it if they wanted).
I've done this. The default (nullopt) means "sequence if there is any ordering". It can be set to true to get "fail validation if there is no meaningful ordering" or false to get "never sequence and maximize performance even if there is a meaningful ordering".
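The tri-state default described above can be sketched as follows (a stdlib-Python model with a made-up function name; the real option is `sequence_output` on the C++ QueryOptions, where nullopt maps to `None` here):

```python
# None  -> sequence only if the input is ordered
# True  -> require an ordering; fail validation without one
# False -> never sequence, even if an ordering exists
from typing import Optional


def resolve_sequencing(sequence_output: Optional[bool],
                       input_is_ordered: bool) -> bool:
    if sequence_output is None:
        return input_is_ordered
    if sequence_output and not input_is_ordered:
        raise ValueError(
            "sequence_output=True but input has no meaningful ordering")
    return sequence_output


print(resolve_sequencing(None, True))    # True  -- ordered input: sequence
print(resolve_sequencing(None, False))   # False -- e.g. after a join/aggregate
print(resolve_sequencing(False, True))   # False -- user opted out for speed
```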
…ources from an iterator factory) to run without an I/O executor (useful if the source is something like a vector)
…ou can' behavior. Clarified some comments
Force-pushed from e1d25c1 to 9d1faf2
This is ready to be merged?
Yes, thanks.
Benchmark runs are scheduled for baseline = ef21008 and contender = 762329b. 762329b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…ith use_threads=True (#34766)

### Rationale for this change

Thanks to #34137, the ExecPlan now has a concept of ordering. When the source node is a Table, the order of the batches in the table is used as the implicit order. And when executing a plan and producing a resulting Table, the default for QueryOptions' `sequence_output` is to honor an order if there is one. Given that the `Table.filter` method only consists of a table source node (which adds implicit order) and a filter node (which preserves any ordering), the output will now always be ordered by default, also with the default of `use_threads=True`.

### Are these changes tested?

The existing test `test_exec_plan.py::test_filter_table_ordering` still passes.

* Closes: #31880

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
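The pipeline the rationale describes can be modeled in a few lines of plain Python (a sketch, not pyarrow/Acero code; all names are hypothetical): the table source tags each batch with an implicit index, the filter preserves that index, and the sink sequences the possibly out-of-order results by it.

```python
# Toy model of: table source (implicit order) -> filter (order-preserving)
# -> sink (sequences by batch index when sequence_output is enabled).
def source(table_batches):
    # Tag each batch with its implicit position in the table.
    return [(i, b) for i, b in enumerate(table_batches)]


def filter_node(indexed, predicate):
    # Filtering changes batch contents but keeps each batch's index.
    return [(i, [v for v in b if predicate(v)]) for i, b in indexed]


def sink(indexed, sequence_output=True):
    if sequence_output:
        indexed = sorted(indexed, key=lambda ib: ib[0])
    return [v for _, b in indexed for v in b]


batches = [[1, 2], [3, 4], [5, 6]]
# Simulate worker threads delivering filtered batches out of order:
shuffled = list(reversed(filter_node(source(batches), lambda v: v % 2 == 0)))
print(sink(shuffled))  # [2, 4, 6] -- original table order restored
```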
In addition, it is now possible to bypass the I/O executor on `record_batch_source`, `exec_batch_source`, and `array_vector_source`. It is now possible to create a source node from a `gen::Gen` generator.

BREAKING CHANGE: The default executor for `record_batch_source`, `exec_batch_source`, and `array_vector_source` was (erroneously) the plan's CPU executor. It now defaults properly to the I/O executor.