-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
The way that DataFusion names the output columns can be quite messy. For example, the query below produces a column named "FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)" (yes that whole thing)
Arrow requires column names to be distinct strings and DataFusion creates the name based on the content of the expression.
Here is the entire query
❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
0 rows in set. Query took 0.003 seconds.
❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3 |
| 3 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.It also means you can't select the same expression without an alias, which while annoying is likely not a critical usecase
❯ select free / 1, free / 1 from foo;
Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
❯ select free / 1, free / 1 as col2 from foo;
+---------------------+------+
| foo.free / Int64(1) | col2 |
+---------------------+------+
| 1 | 1 |
| 3 | 3 |
+---------------------+------+Describe the solution you'd like
While the output column names can be controlled via adding an explicit alias, I think it would be good to have a more concise way of naming the default outputs, especially for window functions
Perhaps something like not adding the OVER clause's contents so FIRST_VALUE(foo.free) in the above example.
If we do this we need to be careful to allow multiple calls to FIRST_VALUE with different OVER clauses in the same query
Describe alternatives you've considered
No response
Additional context
If we make the names more concise, we will probably need a solution like #6543 to ensure they remain unique