Skip to content

Implementation of WITH ORDINALITY#9014

Closed
niykko wants to merge 16 commits intoduckdb:mainfrom
niykko:with_ordinality
Closed

Implementation of WITH ORDINALITY#9014
niykko wants to merge 16 commits intoduckdb:mainfrom
niykko:with_ordinality

Conversation

@niykko
Copy link
Contributor

@niykko niykko commented Sep 20, 2023

As part of my Bachelor's Thesis for the Database Research Group at the University of Tübingen, I have implemented the WITH ORDINALITY feature for the core table functions 1 :

  • unnest
  • repeat
  • repeat_row
  • generate_series
  • range
  • glob
  • read_csv
  • read_csv_auto

In short, this feature optionally adds a one-based "index"-column, called Ordinality2, to the output of all of table functions listed above, e.g.:

SELECT * FROM range(5) WITH ORDINALITY;

produces

┌───────┬────────────┐
│ range │ Ordinality │
│ int64 │   int32    │
├───────┼────────────┤
│     0 │          1 │
│     1 │          2 │
│     2 │          3 │
│     3 │          4 │
│     4 │          5 │
└───────┴────────────┘

This can be used to create or maintain an explicit order of elements and can, for example, find its uses when reading from files. When there are multiple possible behaviours, I made sure to mimic the behaviour implemented in Postgres, more specifically:

  • When a function consumes rows from more than one source, WITH ORDINALITY yields seperate indices for each source, e.g., a UNION of multiple range-calls:
    SELECT u, Ordinality
    FROM   (SELECT range(5) UNION SELECT range(5,10)) AS _(l),
           unnest(l) WITH ORDINALITY AS __(u)
    ORDER BY l,u;
    produces
    ┌───────┬────────────┐
    │   u   │ Ordinality │
    │ int64 │   int32    │
    ├───────┼────────────┤
    │     0 │          1 │
    │     1 │          2 │
    │     2 │          3 │
    │     3 │          4 │
    │     4 │          5 │
    │     5 │          1 │  <-- index reset!
    │     6 │          2 │
    │     7 │          3 │
    │     8 │          4 │
    │     9 │          5 │
    ├───────┴────────────┤
    │ 10 rows  2 columns │
    └────────────────────┘
    
  • When read_csv reads from multiple .csv files, WITH ORDINALITY considers all .csv files to be the same row source, i.e., the counter doesn't reset at the start of each file.

Footnotes

  1. I also intend to add functionality for read_parquet, but I first wanted to make sure my implementation of read_csv is sensible before I touch the extension.

  2. The name is up for debate, suggestions are most certainly welcome. 😁

@Tishj
Copy link
Contributor

Tishj commented Sep 20, 2023

This seems similar to #6636 but is probably a lot more battle tested 💪
Nice work 👍

@github-actions github-actions bot marked this pull request as draft September 21, 2023 18:00
@niykko niykko marked this pull request as ready for review September 21, 2023 18:27
@github-actions github-actions bot marked this pull request as draft September 22, 2023 12:26
@niykko niykko marked this pull request as ready for review September 22, 2023 12:26
@github-actions github-actions bot marked this pull request as draft September 25, 2023 10:44
@niykko niykko marked this pull request as ready for review September 26, 2023 13:08
@Mytherin Mytherin changed the base branch from main to feature September 27, 2023 06:41
@github-actions github-actions bot marked this pull request as draft September 27, 2023 12:37
@niykko niykko marked this pull request as ready for review October 17, 2023 06:39
Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @niykko, thanks for the PR! Nice work so far, I picked this up and added a review. :)

Also, to simplify further work on this PR, can you resolve all review comments (@Tishj and mine) that you've addressed? Just so that we know what has been done and what's still missing.

There are a few points that still have to be addressed, and I've left comments to address them. Most importantly, WITH ORDINALITY has to be a parallelism-capable operator.

Can you also please add PRAGMA enable_verification to all test files? Furthermore, I am missing a lot of tests, namely:

  • tests with more complex scenarios: subqueries, order by ordinality, grouping
  • more tests with failing scenarios (statement error)
  • more read_csv[_auto] tests

@niykko niykko force-pushed the with_ordinality branch 2 times, most recently from cda9e11 to 1a58aac Compare May 27, 2024 13:41
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 27, 2024 13:41
@Mytherin Mytherin marked this pull request as ready for review June 17, 2024 07:41
@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 22, 2024 13:20
@niykko niykko marked this pull request as ready for review July 22, 2024 13:21
@niykko
Copy link
Contributor Author

niykko commented Jul 22, 2024

The TableInOutFunctions now also work as expected (i.e. the ordinality value resets after reading new input). An example query for this would be:

select * from (select range(1,3) UNION select range(3,5)) as _(l), unnest(l) with ordinality;
Result:

l unnest ordinality
[3, 4] 3 1
[3, 4] 4 2
[1, 2] 1 1
[1, 2] 2 2

The resulting ordinality column resets whenever a value from a new source is being read (mimicking postgres).

I still think TableFunctions and TableInOutFunctions are not optimally handled right now and would like to revamp these functions and their inheritance hierarchy as my next project.

Please let me know if you need me to change anything for this PR to finally make the cut :).

@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 22, 2024 13:35
@niykko niykko marked this pull request as ready for review July 22, 2024 13:35
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! The code is looking a lot cleaner now - some more comments:

//! (Optional) allows injecting a custom MultiFileReader implementation
table_function_get_multi_file_reader_t get_multi_file_reader;
//! Boolean used to determine if WITH ORDINALITY clause has been invoked
bool with_ordinality = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we turn this into an enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this is a dumb question, but do I understand you correctly that this should be turned into an Enum with the possible values being either true (requested) or false (not requested)?
If so, would there be any benefit (besides readability)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for readability and extensibility purposes boolean values are sub-optimal so we prefer to avoid them.

}

return_types.push_back(LogicalType::BIGINT);
bind_context.AddGenericBinding(projection_index, function_name, return_names, return_types);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the binder is the wrong place to push the window operator - I think the (physical) planner is a better location to push these operators. I think all that should happen in the binder is that we push the ordinality column name and BIGINT return type so that they can be used by the rest of the plan.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft July 29, 2024 11:33
@Mytherin Mytherin changed the base branch from main to feature August 26, 2024 07:30
@Mytherin Mytherin changed the base branch from feature to main November 12, 2024 09:50
@niykko
Copy link
Contributor Author

niykko commented Mar 10, 2025

Hello!

I want to let you know that I will resume work on this feature now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants