Skip to content

add multiple grouper support to group-by#14337

Merged
fdncred merged 9 commits intonushell:mainfrom
Bahex:group-by-multiple
Nov 15, 2024
Merged

add multiple grouper support to group-by#14337
fdncred merged 9 commits intonushell:mainfrom
Bahex:group-by-multiple

Conversation

@Bahex
Copy link
Copy Markdown
Member

@Bahex Bahex commented Nov 14, 2024

Related:

Description

This PR changes group-by to support grouping by multiple grouper arguments.

Changes

  • No grouper: no change in behavior
  • Single grouper
    • --to-table=false: no change in behavior
    • --to-table=true:
      • closure grouper: named group0
      • cell-path grouper: named after the cell-path
  • Multiple groupers:
    • --to-table=false: nested groups
    • --to-table=true: one column for each grouper argument, followed by the items column
      • columns corresponding to cell-paths are named after them
      • columns corresponding to closure groupers are named group{i} where i is the index of the grouper argument

Examples

> [1 3 1 3 2 1 1] | group-by
╭───┬───────────╮
    ╭───┬───╮ 
 1   0  1  
     1  1  
     2  1  
     3  1  
    ╰───┴───╯ 
    ╭───┬───╮ 
 3   0  3  
     1  3  
    ╰───┴───╯ 
    ╭───┬───╮ 
 2   0  2  
    ╰───┴───╯ 
╰───┴───────────╯

> [1 3 1 3 2 1 1] | group-by --to-table
╭─#─┬─group─┬───items───╮
 0  1      ╭───┬───╮ 
            0  1  
            1  1  
            2  1  
            3  1  
           ╰───┴───╯ 
 1  3      ╭───┬───╮ 
            0  3  
            1  3  
           ╰───┴───╯ 
 2  2      ╭───┬───╮ 
            0  2  
           ╰───┴───╯ 
╰─#─┴─group─┴───items───╯

> [1 3 1 3 2 1 1] | group-by { $in >= 2 }
╭───────┬───────────╮
        ╭───┬───╮ 
 false   0  1  
         1  1  
         2  1  
         3  1  
        ╰───┴───╯ 
        ╭───┬───╮ 
 true    0  3  
         1  3  
         2  2  
        ╰───┴───╯ 
╰───────┴───────────╯

> [1 3 1 3 2 1 1] | group-by { $in >= 2 } --to-table
╭─#─┬─group0─┬───items───╮
 0  false   ╭───┬───╮ 
             0  1  
             1  1  
             2  1  
             3  1  
            ╰───┴───╯ 
 1  true    ╭───┬───╮ 
             0  3  
             1  3  
             2  2  
            ╰───┴───╯ 
╰─#─┴─group0─┴───items───╯
let data = [
    [name, lang, year];
    [andres, rb, "2019"],
    [jt, rs, "2019"],
    [storm, rs, "2021"]
]

> $data
╭─#─┬──name──┬─lang─┬─year─╮
 0  andres  rb    2019 
 1  jt      rs    2019 
 2  storm   rs    2021 
╰─#─┴──name──┴─lang─┴─year─╯
> $data | group-by lang
╭────┬──────────────────────────────╮
     ╭─#─┬──name──┬─lang─┬─year─╮ 
 rb   0  andres  rb    2019  
     ╰─#─┴──name──┴─lang─┴─year─╯ 
     ╭─#─┬─name──┬─lang─┬─year─╮  
 rs   0  jt     rs    2019   
      1  storm  rs    2021   
     ╰─#─┴─name──┴─lang─┴─year─╯  
╰────┴──────────────────────────────╯

Group column is now named after the grouper, to allow multiple groupers.

> $data | group-by lang --to-table  # column names changed!
╭─#─┬─lang─┬────────────items─────────────╮
 0  rb    ╭─#─┬──name──┬─lang─┬─year─╮ 
           0  andres  rb    2019  
          ╰─#─┴──name──┴─lang─┴─year─╯ 
 1  rs    ╭─#─┬─name──┬─lang─┬─year─╮  
           0  jt     rs    2019   
           1  storm  rs    2021   
          ╰─#─┴─name──┴─lang─┴─year─╯  
╰─#─┴─lang─┴────────────items─────────────╯

Grouping by multiple columns makes finer grained aggregations possible.

> $data | group-by lang year --to-table
╭─#─┬─lang─┬─year─┬────────────items─────────────╮
 0  rb    2019  ╭─#─┬──name──┬─lang─┬─year─╮ 
                 0  andres  rb    2019  
                ╰─#─┴──name──┴─lang─┴─year─╯ 
 1  rs    2019  ╭─#─┬─name─┬─lang─┬─year─╮   
                 0  jt    rs    2019    
                ╰─#─┴─name─┴─lang─┴─year─╯   
 2  rs    2021  ╭─#─┬─name──┬─lang─┬─year─╮  
                 0  storm  rs    2021   
                ╰─#─┴─name──┴─lang─┴─year─╯  
╰─#─┴─lang─┴─year─┴────────────items─────────────╯

Grouping by multiple columns, without --to-table returns a nested structure.
This is equivalent to $data | group-by year | split-by lang, making split-by obsolete.

> $data | group-by lang year
╭────┬─────────────────────────────────────────╮
     ╭──────┬──────────────────────────────╮ 
 rb         ╭─#─┬──name──┬─lang─┬─year─╮  
      2019   0  andres  rb    2019   
            ╰─#─┴──name──┴─lang─┴─year─╯  
     ╰──────┴──────────────────────────────╯ 
     ╭──────┬─────────────────────────────╮  
 rs         ╭─#─┬─name─┬─lang─┬─year─╮    
      2019   0  jt    rs    2019     
            ╰─#─┴─name─┴─lang─┴─year─╯    
            ╭─#─┬─name──┬─lang─┬─year─╮   
      2021   0  storm  rs    2021    
            ╰─#─┴─name──┴─lang─┴─year─╯   
     ╰──────┴─────────────────────────────╯  
╰────┴─────────────────────────────────────────╯

From #2607:

Here's a couple more examples without much explanation. This one shows adding two grouping keys. I'm always wanting to add more columns when using group-by and it just-work:tm: gb.exe -f movies-2.csv -k 3,2 -s 7 --skip_header

 k:3                   | k:2       | count | sum:7
-----------------------+-----------+-------+--------------------
 20th Century Fox      | Drama     | 1     | 117.09
 20th Century Fox      | Romance   | 1     | 39.66
 CBS                   | Comedy    | 1     | 77.09
 Disney                | Animation | 4     | 1264.23
 Disney                | Comedy    | 4     | 950.27
 Fox                   | Comedy    | 5     | 661.85
 Independent           | Comedy    | 7     | 399.07
 Independent           | Drama     | 4     | 69.75
 Independent           | Romance   | 7     | 1048.75
 Independent           | romance   | 1     | 29.37
...

This example can be achieved like this:

> open movies-2.csv
  | group-by "Lead Studio" Genre --to-table
  | insert count {get items | length}
  | insert sum { get items."Worldwide Gross" | math sum}
  | reject items
  | sort-by "Lead Studio" Genre
╭─#──┬──────Lead Studio──────┬───Genre───┬─count─┬───sum───╮
 0   20th Century Fox       Drama          1   117.09 
 1   20th Century Fox       Romance        1    39.66 
 2   CBS                    Comedy         1    77.09 
 3   Disney                 Animation      4  1264.23 
 4   Disney                 Comedy         4   950.27 
 5   Fox                    Comedy         5   661.85 
 6   Fox                    comedy         1    60.72 
 7   Independent            Comedy         7   399.07 
 8   Independent            Drama          4    69.75 
 9   Independent            Romance        7  1048.75 
 10  Independent            romance        1    29.37 
...

@Bahex Bahex changed the title Group by multiple add multiple grouper support to group-by Nov 14, 2024
@sholderbach sholderbach added notes:breaking-changes This PR implies a change affecting users and has to be noted in the release notes deprecated:pr-commands (deprecated: too vague) This PR changes our commands in some way labels Nov 14, 2024
@fdncred fdncred merged commit b6e8487 into nushell:main Nov 15, 2024
@github-actions github-actions bot added this to the v0.101.0 milestone Nov 15, 2024
@fdncred fdncred added the notes:mention Include the release notes summary in the "Hall of Fame" section label Nov 15, 2024
@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Nov 15, 2024

Could you write something up in the release notes for 0.101 to go with this PR when you have time? (i'm not even sure if someone has created the 101 release notes yet)

@IanManske
Copy link
Copy Markdown
Member

With the --to-table flag, group-by should probably return two colums: group and items. Because currently, the columns can conflict:

ls | rename items | group-by --to-table items
╭────┬─────────────────────╮
│  # │        items        │
├────┼─────────────────────┤
│  0CITATION.cff        │
│  1CODE_OF_CONDUCT.md  │
│  2CONTRIBUTING.md     │
│  3Cargo.lock          │
│  4Cargo.toml          │
│  5Cross.toml          │
│  6LICENSE             │
│  7README.md           │
│  8SECURITY.md         │
│  9 │ assets              │
│ 10 │ benches             │
│ 11 │ crates              │
│ 12 │ devdocs             │
│ 13 │ docker              │
│ 14 │ rust-toolchain.toml │
│ 15 │ scripts             │
│ 16 │ src                 │
│ 17 │ target              │
│ 18 │ tests               │
│ 19 │ toolkit.nu          │
│ 20 │ typos.toml          │
│ 21 │ wix                 │
╰────┴─────────────────────╯

@Bahex
Copy link
Copy Markdown
Member Author

Bahex commented Nov 16, 2024

With the --to-table flag, group-by should probably return two colums: group and items. Because currently, the columns can conflict:

ls | rename items | group-by --to-table items
╭────┬─────────────────────╮
│  # │        items        │
├────┼─────────────────────┤
│  0CITATION.cff        │
│  1CODE_OF_CONDUCT.md  │
...

@IanManske, thanks for bringing this up, that specific name conflict didn't come to my mind while implementing this, and now I can think of some other (though less likely) cases that can happen when using closures as groupers as well.

I thought of a few ways to solve this, but the most predictable behavior would be to not use groupers for any kind of column naming, and just name group columns group{i}, same as when using closures to group.

With that change output would be like this.

> $data | group-by lang year --to-table
╭─#─┬─group0─┬─group1─┬────────────items─────────────╮
 0  rb      2019    ╭─#─┬──name──┬─lang─┬─year─╮ 
                     0  andres  rb    2019  
                    ╰─#─┴──name──┴─lang─┴─year─╯ 
 1  rs      2019    ╭─#─┬─name─┬─lang─┬─year─╮   
                     0  jt    rs    2019    
                    ╰─#─┴─name─┴─lang─┴─year─╯   
 2  rs      2021    ╭─#─┬─name──┬─lang─┬─year─╮  
                     0  storm  rs    2021   
                    ╰─#─┴─name──┴─lang─┴─year─╯  
╰─#─┴─group0─┴─group1─┴────────────items─────────────╯

While I find naming the columns after the grouper arguments to be more intuitive, this is the more consistent and predictable way. I'll make a new PR when I'm free.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Nov 16, 2024

boo! i'd rather be more intuitive with column names.

@IanManske
Copy link
Copy Markdown
Member

IanManske commented Nov 16, 2024

Just realized I should have worded my previous comment more clearly. I was thinking that group-by should always return two columns. This would get rid of the potential conflict with items.

ls | group-by --to-table { get name | path parse | get extension } { get type }
╭───┬───────────────────┬─────────────────────────────────────────────────────────────╮
│ # │       group       │                            items                            │
├───┼───────────────────┼─────────────────────────────────────────────────────────────┤
│ 0 │ ╭────────┬──────╮ │ ╭───┬─────────────────────┬──────┬─────────┬──────────────╮ │
│   │ │ group0 │ toml │ │ │ # │        name         │ type │  size   │   modified   │ │
│   │ │ group1 │ file │ │ ├───┼─────────────────────┼──────┼─────────┼──────────────┤ │
│   │ ╰────────┴──────╯ │ │ 0Cargo.toml          │ file │ 9.0 KiB4 hours ago  │ │
│   │                   │ │ 1Cross.toml          │ file │   666 B6 months ago │ │
│   │                   │ │ 2 │ rust-toolchain.toml │ file │ 1.1 KiB │ a day ago    │ │
│   │                   │ │ 3 │ typos.toml          │ file │   513 B2 months ago │ │
│   │                   │ ╰───┴─────────────────────┴──────┴─────────┴──────────────╯ │
│ 1 │ ╭────────┬──────╮ │ ╭───┬─────────┬──────┬─────────┬────────────╮               │
│   │ │ group0 │      │ │ │ # │  name   │ type │  size   │  modified  │               │
│   │ │ group1 │ file │ │ ├───┼─────────┼──────┼─────────┼────────────┤               │
│   │ ╰────────┴──────╯ │ │ 0LICENSE │ file │ 1.1 KiB │ a year ago │               │
│   │                   │ ╰───┴─────────┴──────┴─────────┴────────────╯               │
│ 2 │ ╭────────┬─────╮  │ ╭────┬───────────┬──────┬─────────┬──────────────╮          │
│   │ │ group0 │     │  │ │  # │   name    │ type │  size   │   modified   │          │
│   │ │ group1 │ dir │  │ ├────┼───────────┼──────┼─────────┼──────────────┤          │
│   │ ╰────────┴─────╯  │ │  0 │ assets    │ dir  │ 4.0 KiB6 months ago │          │
│   │                   │ │  1 │ benches   │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  2 │ crates    │ dir  │ 4.0 KiB3 weeks ago  │          │
│   │                   │ │  3 │ devdocs   │ dir  │ 4.0 KiB2 months ago │          │
│   │                   │ │  4 │ docker    │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  5 │ empty_dir │ dir  │ 4.0 KiB3 months ago │          │
│   │                   │ │  6 │ scripts   │ dir  │ 4.0 KiB2 months ago │          │
│   │                   │ │  7 │ src       │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  8 │ target    │ dir  │ 4.0 KiB2 months ago │          │
│   │                   │ │  9 │ tests     │ dir  │ 4.0 KiB5 months ago │          │
│   │                   │ │ 10 │ wix       │ dir  │ 4.0 KiB5 months ago │          │
│   │                   │ ╰────┴───────────┴──────┴─────────┴──────────────╯          │
╰───┴───────────────────┴─────────────────────────────────────────────────────────────╯

However, this doesn't solve the issue where the column names generated for the closures can conflict with a cell path. Not sure what the best solution is. Some ideas:

  • If any closure is provided, change the groups to be lists instead of a records. E.g, [toml, file] instead of {group0: toml, group1: file}.
  • The above might be slightly confusing though (the return structure/type is changed based on arguments). Alternatively, we disallow mixing cell paths and closures, returning record group keys for cell paths and list group keys for closures.
  • When provided a single closure, both of the above create groups keys that are lists of length 1. This seems annoying -- if I provide a single closure, I would expect the group key to be the same as the return value of the closure instead of it being wrapped in a list. So, perhaps we should only allow providing one closure to compute the group key.

So, right now I'm leaning towards only allowing two ways of providing arguments to group-by:

# any number of cell paths
$data | group-by --to-table path1 path2 path3.nested
# returns a table with a `group` column and an `items` column. `group` will contain records like:
# { path1: _, path2: _, path3: { nested: _ } }

# one closure
$data | group-by --to-table { path parse | get extension }
# returns a table with a `group` and an `items` column. `group` will contain the closure return value,
# in this case, a file extension string.

This is nice and simple, but closures and cell paths cannot be mixed. IMO this is fine, since you can use select in the closure to choose additional keys using cell paths. I.e., group-by path1 path2 is shorthand for / equivalent to group-by { select path1 path2 }

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Nov 16, 2024

I'm not a fan of only two columns. It makes it much harder to understand.

Bahex added a commit to Bahex/nushell that referenced this pull request Nov 17, 2024
fdncred pushed a commit that referenced this pull request Nov 17, 2024
A more involved solution to the issue pointed out
[here](#14337 (comment))

# Description

With `--to-table`
- cell-path groupers are used to create column names, similar to
`select`
- closure groupers result in columns named `closure_{i}` where `i` is
the index of argument, with regards to other closures i.e. first closure
grouper results in a column named `closure_0`

  Previously
  - `group-by foo {...} {...}` => `table<foo, group1, group2, items>`
  - `group-by {...} foo {...}` => `table<group0, foo, group2, items>`
  
  With this PR
- `group-by foo {...} {...}` => `table<foo, closure_0, closure_1,
items>`
- `group-by {...} foo {...}` => `table<closure_0, foo, closure_1,
items>`
- no grouper argument results in a `table<group, items>` as previously

On naming conflicts caused by cell-path groupers named `items` or
`closure_{i}`, an error is thrown, suggesting to use a closure in place
of a cell-path.

```nushell
❯ ls | rename items | group-by items --to-table 
Error:   × grouper arguments can't be named `items`
   ╭─[entry #3:1:29]
 1 │ ls | rename items | group-by items --to-table 
   ·                             ────────┬────────
   ·                                     ╰── contains `items`
   ╰────
  help: instead of a cell-path, try using a closure
```
And following the suggestion:
```nushell
❯ ls | rename items | group-by { get items } --to-table 
╭─#──┬──────closure_0──────┬───────────────────────────items────────────────────────────╮
│ 0  │ CITATION.cff        │ ╭─#─┬────items─────┬─type─┬─size──┬───modified───╮         │
│    │                     │ │ 0 │ CITATION.cff │ file │ 812 B │ 3 months ago │         │
│    │                     │ ╰─#─┴────items─────┴─type─┴─size──┴───modified───╯         │
│ 1  │ CODE_OF_CONDUCT.md  │ ╭─#─┬───────items────────┬─type─┬──size───┬───modified───╮ │
...
```
fdncred pushed a commit to nushell/nu_scripts that referenced this pull request Dec 31, 2024
# Description

Add `aggregate`, a command that operates on the output of `group-by
--to-table` to help aggregate to do quick inspections.

# Related
- nushell/nushell#14316 (comment)
- nushell/nushell#2607
- nushell/nushell#14337

# Examples

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross
  # | first 4
  # | to md
```


|Lead_Studio|Genre|count|Worldwide_Gross_min|Worldwide_Gross_avg|Worldwide_Gross_max|Worldwide_Gross_sum|
|-|-|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|19.62|19.62|19.62|
|The Weinstein Company|Drama|1|8.26|8.26|8.26|8.26|
|Independent|Comedy|7|14.31|57.01|205.3|399.07|
|Independent|Romance|7|0.03|149.82142857142858|702.17|1048.75|

---

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross --ops {avg: {math avg}, std: {math stddev}}
  # | first 4
  # | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Worldwide_Gross_std|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|0|
|The Weinstein Company|Drama|1|8.26|0|
|Independent|Comedy|7|57.01|66.1709932134704|
|Independent|Romance|7|149.82142857142858|229.79475832816996|

---

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross Audience_score_% --ops {avg: {math avg}}
  # | first 4
  # | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Audience_score_%_avg|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|52|
|The Weinstein Company|Drama|1|8.26|84|
|Independent|Comedy|7|57.01|60.142857142857146|
|Independent|Romance|7|149.82142857142858|59.857142857142854|
@Bahex Bahex deleted the group-by-multiple branch March 22, 2026 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deprecated:pr-commands (deprecated: too vague) This PR changes our commands in some way notes:breaking-changes This PR implies a change affecting users and has to be noted in the release notes notes:mention Include the release notes summary in the "Hall of Fame" section

Projects

None yet

Development

Successfully merging this pull request may close these issues.

group-by should support multiple grouper arguments

4 participants