Skip to content

A more sql-like group-by #2607

@fdncred

Description

@fdncred

Is your feature request related to a problem? Please describe.
I have a hard time remembering how to use group-by in nushell because I always end up with columns of rows or tables which I'm not looking for.

example:
open movies-2.csv | select 'Lead Studio' 'Worldwide Gross' | group-by '"Lead Studio"' | pivot company gross_values

╭────┬───────────────────────┬─────────────────╮
│ #  │ company               │ gross_values    │
├────┼───────────────────────┼─────────────────┤
│  0 │ The Weinstein Company │ [table 2 rows]  │
│  1 │ Independent           │ [table 19 rows] │
│  2 │ Disney                │ [table 8 rows]  │
│  3 │ Fox                   │ [table 6 rows]  │
│  4 │ 20th Century Fox      │ [table 2 rows]  │
│  5 │ Warner Bros.          │ [table 14 rows] │
│  6 │ Summit                │ [table 5 rows]  │
│  7 │ Paramount             │ [table 4 rows]  │
│  8 │ CBS                   │ [table 1 rows]  │
│  9 │ New Line              │ [table 1 rows]  │
│ 10 │ Sony                  │ [table 4 rows]  │
│ 11 │ Universal             │ [table 8 rows]  │
│ 12 │ Lionsgate             │ [table 2 rows]  │
╰────┴───────────────────────┴─────────────────╯

Describe the solution you'd like
What I'd like to see happen is for a new grouping command to be added to nushell that will work a little more like sql. I have found this rust tool which is fantastic. Here's an example:
gb.exe -f movies-2.csv -k 3 -s 7 --skip_header
produces this output which is close to a nushell looking output but not quite.

 k:3                   | count | sum:7
-----------------------+-------+--------------------
 20th Century Fox      | 2     | 156.75
 CBS                   | 1     | 77.09
 Disney                | 8     | 2214.5
 Fox                   | 5     | 661.85
 Independent           | 19    | 1546.9399999999998
 Lionsgate             | 2     | 152.59
 New Line              | 1     | 20.71
 Paramount             | 4     | 321.21999999999997
 Sony                  | 4     | 403.12
 Summit                | 5     | 1242.26
 The Weinstein Company | 2     | 27.880000000000003
 Universal             | 8     | 1792.04
 Warner Bros.          | 14    | 2462.14

What's going on is gb.exe is saying:

  1. Use this file with -f movies-2.csv
  2. Use column 3 as a group by key with -k 3
  3. Use column 7 as the summing aggregate with -s 7
    I'd rather use column names in place of column numbers because that's a more nu way of doing things.

Here's a couple more examples without much explanation.
This one shows adding two grouping keys. I'm always wanting to add more columns when using group-by and it just-work:tm:
gb.exe -f movies-2.csv -k 3,2 -s 7 --skip_header

 k:3                   | k:2       | count | sum:7
-----------------------+-----------+-------+--------------------
 20th Century Fox      | Drama     | 1     | 117.09
 20th Century Fox      | Romance   | 1     | 39.66
 CBS                   | Comedy    | 1     | 77.09
 Disney                | Animation | 4     | 1264.23
 Disney                | Comedy    | 4     | 950.27
 Fox                   | Comedy    | 5     | 661.85
 Independent           | Comedy    | 7     | 399.07
 Independent           | Drama     | 4     | 69.75
 Independent           | Romance   | 7     | 1048.75
 Independent           | romance   | 1     | 29.37
...

This one shows grouping by column 2 alone.
gb.exe -f movies-2.csv -k 2 -s 7 --skip_header

 k:2       | count | sum:7
-----------+-------+--------------------
 Action    | 1     | 93.4
 Animation | 4     | 1264.23
 Comdy     | 1     | 105.96
 Comedy    | 41    | 6081.730000000002
 Drama     | 13    | 1287.15
 Fantasy   | 1     | 285.43
 Romance   | 12    | 1783.1600000000003
 romance   | 1     | 29.37
 Romence   | 1     | 148.66

There are dozens of command line parameters in gb and I'm not requesting all of them be impemented. Maybe we can just get this added with only multiple columns and sum. Then at a later point look at avg or what additional functionality would be interesting.

Describe alternatives you've considered
This does what I want to do but I find it too foreign and having to remember to use reduce when I'm grouping is a bummer because I never remember.
open movies-2.csv | select 'Lead Studio' 'Worldwide Gross' | group-by '"Lead Studio"' | pivot company gross_values | insert total {get gross_values | reduce -f 0 { = $acc + $it.'"Worldwide Gross"' }} | reject gross_values

╭────┬───────────────────────┬───────────╮
│ #  │ company               │ total     │
├────┼───────────────────────┼───────────┤
│  0 │ The Weinstein Company │   27.8800 │
│  1 │ Independent           │ 1546.9400 │
│  2 │ Disney                │ 2214.5000 │
│  3 │ Fox                   │  722.5700 │
│  4 │ 20th Century Fox      │  156.7500 │
│  5 │ Warner Bros.          │ 2462.1399 │
│  6 │ Summit                │ 1242.2600 │
│  7 │ Paramount             │  321.2200 │
│  8 │ CBS                   │   77.0900 │
│  9 │ New Line              │   20.7100 │
│ 10 │ Sony                  │  403.1200 │
│ 11 │ Universal             │ 1792.0400 │
│ 12 │ Lionsgate             │  152.5900 │
╰────┴───────────────────────┴───────────╯

Additional context
Attached is the zipped csv that I used.
movies-2.zip

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions