Conversation
|
This seems similar to #6636 but is probably a lot more battle tested 💪 |
0b516b0 to
908a7a3
Compare
src/execution/operator/projection/physical_tableinout_function.cpp
Outdated
Show resolved
Hide resolved
src/execution/operator/projection/physical_tableinout_function.cpp
Outdated
Show resolved
Hide resolved
src/execution/operator/projection/physical_tableinout_function.cpp
Outdated
Show resolved
Hide resolved
src/include/duckdb/execution/operator/projection/OrdinalityData.h
Outdated
Show resolved
Hide resolved
src/include/duckdb/execution/operator/projection/OrdinalityData.h
Outdated
Show resolved
Hide resolved
src/execution/operator/projection/physical_tableinout_function.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Hey @niykko, thanks for the PR! Nice work so far, I picked this up and added a review. :)
Also, to simplify further work on this PR, can you resolve all review comments (@Tishj and mine) that you've addressed? Just so that we know what has been done and what's still missing.
There are a few points that still have to be addressed, and I've left comments to address them. Most importantly, WITH ORDINALITY has to be a parallelism-capable operator.
Can you also please add PRAGMA enable_verification to all test files? Furthermore, I am missing a lot of tests, namely:
- tests with more complex scenarios: subqueries, order by ordinality, grouping
- more tests with failing scenarios (
statement error) - more
read_csv[_auto]tests
src/execution/operator/projection/physical_tableinout_function.cpp
Outdated
Show resolved
Hide resolved
9e57fd6 to
9383dbe
Compare
cda9e11 to
1a58aac
Compare
|
The TableInOutFunctions now also work as expected (i.e. the ordinality value resets after reading new input). An example query for this would be:
The resulting ordinality column resets whenever a value from a new source is being read (mimicking postgres). I still think TableFunctions and TableInOutFunctions are not optimally handled right now and would like to revamp these functions and their inheritance hierarchy as my next project. Please let me know if you need me to change anything for this PR to finally make the cut :). |
Mytherin
left a comment
There was a problem hiding this comment.
Thanks for the fixes! The code is looking a lot cleaner now - some more comments:
| //! (Optional) allows injecting a custom MultiFileReader implementation | ||
| table_function_get_multi_file_reader_t get_multi_file_reader; | ||
| //! Boolean used to determine if WITH ORDINALITY clause has been invoked | ||
| bool with_ordinality = false; |
There was a problem hiding this comment.
Can we turn this into an enum?
There was a problem hiding this comment.
Sorry if this is a dumb question, but do I understand you correctly that this should be turned into an Enum with the possible values being either true (requested) or false (not requested)?
If so, would there be any benefit (besides readability)?
There was a problem hiding this comment.
Yes, for readability and extensibility purposes boolean values are sub-optimal so we prefer to avoid them.
| } | ||
|
|
||
| return_types.push_back(LogicalType::BIGINT); | ||
| bind_context.AddGenericBinding(projection_index, function_name, return_names, return_types); |
There was a problem hiding this comment.
I think the binder is the wrong place to push the window operator - I think the (physical) planner is a better location to push these operators. I think all that should happen in the binder is that we push the ordinality column name and BIGINT return type so that they can be used by the rest of the plan.
|
Hello! I want to let you know that I will resume work on this feature now. |
As part of my Bachelor's Thesis for the Database Research Group at the University of Tübingen, I have implemented the
WITH ORDINALITYfeature for the core table functions 1 :unnestrepeatrepeat_rowgenerate_seriesrangeglobread_csvread_csv_autoIn short, this feature optionally adds a one-based "index"-column, called
Ordinality2, to the output of all of table functions listed above, e.g.:produces
This can be used to create or maintain an explicit order of elements and can, for example, find its uses when reading from files. When there are multiple possible behaviours, I made sure to mimic the behaviour implemented in Postgres, more specifically:
WITH ORDINALITYyields seperate indices for each source, e.g., aUNIONof multiplerange-calls:read_csvreads from multiple .csv files,WITH ORDINALITYconsiders all .csv files to be the same row source, i.e., the counter doesn't reset at the start of each file.Footnotes
I also intend to add functionality for
read_parquet, but I first wanted to make sure my implementation ofread_csvis sensible before I touch the extension. ↩The name is up for debate, suggestions are most certainly welcome. 😁 ↩