Skip to content

Improve default output column names #6758

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

The way that DataFusion names the output columns can be quite messy. For example, the query below produces a column named "FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)" (yes that whole thing)

Arrow requires column names to be distinct strings and DataFusion creates the name based on the content of the expression.

Here is the entire query

❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
0 rows in set. Query took 0.003 seconds.
❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3                                                                                                                                                           |
| 3                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

It also means you can't select the same expression without an alias, which while annoying is likely not a critical usecase

select free / 1, free / 1 from foo;
Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
❯ select free / 1, free / 1 as col2 from foo;
+---------------------+------+
| foo.free / Int64(1) | col2 |
+---------------------+------+
| 1                   | 1    |
| 3                   | 3    |
+---------------------+------+

Describe the solution you'd like

While the output column names can be controlled via adding an explicit alias, I think it would be good to have a more concise way of naming the default outputs, especially for window functions

Perhaps something like not adding the OVER clause's contents so FIRST_VALUE(foo.free) in the above example.

If we do this we need to be careful to allow multiple calls to FIRST_VALUE with different OVER clauses in the same query

Describe alternatives you've considered

No response

Additional context

If we make the names more concise, we will probably need a solution like #6543 to ensure they remain unique

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions