Skip to content

[C++] Add the concept of "ordering" to an exec node, reject non-sensible plans #34136

@westonpace

Description

@westonpace

Describe the enhancement requested

Every node has an "ordering" which describes what the batch index of the batches produced by that node corresponds to.

Source nodes will generally have an ordering based on the data they are generating. For in-memory data this will be implicit "row number" ordering (done in this PR). For the scanner this will generally be implicit "file number / row number" ordering (future PR). In both cases the user should eventually be able to assert an explicit ordering if they know for a fact that their data is ordered (future PR). If the data is partitioned the scanner could generate an explicit ordering based on the partition columns (future PR).

Mapping nodes like filter/project/fetch will simply propagate the ordering of their input.

Some nodes, such as order_by (future PR) and asof join (this PR) will establish a new ordering.

Other nodes, such as the aggregate node, the hash-join node, and the union node will destroy the ordering (this PR). Their output will have no meaningful order. Although, in all these cases one could devise an alternative order-preserving approach (future PR):

  • An aggregate node could pretty easily establish an ordering based on the keys or an implicit ordering if there are no keys (if you only have 1 row then it is ordered :)
  • A hash-join node could similarly choose to order by one or more of the join keys (unclear at the moment if there would be much cost associated with this)
  • To preserve ordering, a union node would need to read from all inputs at the same time and output in round-robin fashion. Alternatively, a union node could choose to fully drain one input before starting on the next input. Both approaches would have performance implications

The fetch node will now fail to validate if it is placed in a plan that does not make sense (e.g. immediately following a hash-join node).

The sink node, if sequenced output is requested, will similarly fail to validate if the plan does not have an established ordering.

This all sounds far more complex than it is :)

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions