add multiple grouper support to `group-by` #14337
along with some variable name changes
|
Could you write something up in the release notes for 0.101 to go with this PR when you have time? (I'm not even sure if someone has created the 0.101 release notes yet.)
|
With the group column named after the grouper, a cell path called `items` collides with the `items` column:

```nushell
ls | rename items | group-by --to-table items
╭────┬─────────────────────╮
│  # │        items        │
├────┼─────────────────────┤
│  0 │ CITATION.cff        │
│  1 │ CODE_OF_CONDUCT.md  │
│  2 │ CONTRIBUTING.md     │
│  3 │ Cargo.lock          │
│  4 │ Cargo.toml          │
│  5 │ Cross.toml          │
│  6 │ LICENSE             │
│  7 │ README.md           │
│  8 │ SECURITY.md         │
│  9 │ assets              │
│ 10 │ benches             │
│ 11 │ crates              │
│ 12 │ devdocs             │
│ 13 │ docker              │
│ 14 │ rust-toolchain.toml │
│ 15 │ scripts             │
│ 16 │ src                 │
│ 17 │ target              │
│ 18 │ tests               │
│ 19 │ toolkit.nu          │
│ 20 │ typos.toml          │
│ 21 │ wix                 │
╰────┴─────────────────────╯
```
@IanManske, thanks for bringing this up. That specific name conflict didn't come to my mind while implementing this, and now I can think of some other (though less likely) cases that can happen when using closures as groupers as well.

I thought of a few ways to solve this, but the most predictable behavior would be to not use groupers for any kind of column naming, and just name group columns `group0`, `group1`, and so on. With that change, output would be like this:

```nushell
> $data | group-by lang year --to-table
╭─#─┬─group0─┬─group1─┬────────────items─────────────╮
│ 0 │ rb     │ 2019   │ ╭─#─┬──name──┬─lang─┬─year─╮ │
│   │        │        │ │ 0 │ andres │ rb   │ 2019 │ │
│   │        │        │ ╰─#─┴──name──┴─lang─┴─year─╯ │
│ 1 │ rs     │ 2019   │ ╭─#─┬─name─┬─lang─┬─year─╮   │
│   │        │        │ │ 0 │ jt   │ rs   │ 2019 │   │
│   │        │        │ ╰─#─┴─name─┴─lang─┴─year─╯   │
│ 2 │ rs     │ 2021   │ ╭─#─┬─name──┬─lang─┬─year─╮  │
│   │        │        │ │ 0 │ storm │ rs   │ 2021 │  │
│   │        │        │ ╰─#─┴─name──┴─lang─┴─year─╯  │
╰─#─┴─group0─┴─group1─┴────────────items─────────────╯
```

While I find naming the columns after the grouper arguments to be more intuitive, this is the more consistent and predictable way. I'll make a new PR when I'm free.
|
Boo! I'd rather be more intuitive with column names.
|
Just realized I should have worded my previous comment more clearly. I was thinking of something like this:

```nushell
ls | group-by --to-table { get name | path parse | get extension } { get type }
╭───┬───────────────────┬─────────────────────────────────────────────────────────────╮
│ # │       group       │                            items                            │
├───┼───────────────────┼─────────────────────────────────────────────────────────────┤
│ 0 │ ╭────────┬──────╮ │ ╭───┬─────────────────────┬──────┬─────────┬──────────────╮ │
│   │ │ group0 │ toml │ │ │ # │        name         │ type │  size   │   modified   │ │
│   │ │ group1 │ file │ │ ├───┼─────────────────────┼──────┼─────────┼──────────────┤ │
│   │ ╰────────┴──────╯ │ │ 0 │ Cargo.toml          │ file │ 9.0 KiB │ 4 hours ago  │ │
│   │                   │ │ 1 │ Cross.toml          │ file │   666 B │ 6 months ago │ │
│   │                   │ │ 2 │ rust-toolchain.toml │ file │ 1.1 KiB │ a day ago    │ │
│   │                   │ │ 3 │ typos.toml          │ file │   513 B │ 2 months ago │ │
│   │                   │ ╰───┴─────────────────────┴──────┴─────────┴──────────────╯ │
│ 1 │ ╭────────┬──────╮ │ ╭───┬─────────┬──────┬─────────┬────────────╮               │
│   │ │ group0 │      │ │ │ # │  name   │ type │  size   │  modified  │               │
│   │ │ group1 │ file │ │ ├───┼─────────┼──────┼─────────┼────────────┤               │
│   │ ╰────────┴──────╯ │ │ 0 │ LICENSE │ file │ 1.1 KiB │ a year ago │               │
│   │                   │ ╰───┴─────────┴──────┴─────────┴────────────╯               │
│ 2 │ ╭────────┬─────╮  │ ╭────┬───────────┬──────┬─────────┬──────────────╮          │
│   │ │ group0 │     │  │ │  # │   name    │ type │  size   │   modified   │          │
│   │ │ group1 │ dir │  │ ├────┼───────────┼──────┼─────────┼──────────────┤          │
│   │ ╰────────┴─────╯  │ │  0 │ assets    │ dir  │ 4.0 KiB │ 6 months ago │          │
│   │                   │ │  1 │ benches   │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  2 │ crates    │ dir  │ 4.0 KiB │ 3 weeks ago  │          │
│   │                   │ │  3 │ devdocs   │ dir  │ 4.0 KiB │ 2 months ago │          │
│   │                   │ │  4 │ docker    │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  5 │ empty_dir │ dir  │ 4.0 KiB │ 3 months ago │          │
│   │                   │ │  6 │ scripts   │ dir  │ 4.0 KiB │ 2 months ago │          │
│   │                   │ │  7 │ src       │ dir  │ 4.0 KiB │ a day ago    │          │
│   │                   │ │  8 │ target    │ dir  │ 4.0 KiB │ 2 months ago │          │
│   │                   │ │  9 │ tests     │ dir  │ 4.0 KiB │ 5 months ago │          │
│   │                   │ │ 10 │ wix       │ dir  │ 4.0 KiB │ 5 months ago │          │
│   │                   │ ╰────┴───────────┴──────┴─────────┴──────────────╯          │
╰───┴───────────────────┴─────────────────────────────────────────────────────────────╯
```

However, this doesn't solve the issue where the column names generated for the closures can conflict with a cell path. Not sure what the best solution is. Some ideas:
So, right now I'm leaning towards only allowing two ways of providing arguments to `group-by`:

```nushell
# any number of cell paths
$data | group-by --to-table path1 path2 path3.nested
# returns a table with a `group` column and an `items` column. `group` will contain records like:
# { path1: _, path2: _, path3: { nested: _ } }

# one closure
$data | group-by --to-table { path parse | get extension }
# returns a table with a `group` and an `items` column. `group` will contain the closure return value,
# in this case, a file extension string.
```

This is nice and simple, but closures and cell paths cannot be mixed. IMO this is fine, since you can use
|
I'm not a fan of only two columns. It makes it much harder to understand.
A more involved solution to the issue pointed out [here](#14337 (comment))

# Description

With `--to-table`:
- cell-path groupers are used to create column names, similar to `select`
- closure groupers result in columns named `closure_{i}`, where `i` is the index of the argument with regard to other closures, i.e. the first closure grouper results in a column named `closure_0`

Previously:
- `group-by foo {...} {...}` => `table<foo, group1, group2, items>`
- `group-by {...} foo {...}` => `table<group0, foo, group2, items>`

With this PR:
- `group-by foo {...} {...}` => `table<foo, closure_0, closure_1, items>`
- `group-by {...} foo {...}` => `table<closure_0, foo, closure_1, items>`
- no grouper argument results in a `table<group, items>` as previously

On naming conflicts caused by cell-path groupers named `items` or `closure_{i}`, an error is thrown, suggesting to use a closure in place of a cell-path.

```nushell
❯ ls | rename items | group-by items --to-table
Error:   × grouper arguments can't be named `items`
   ╭─[entry #3:1:29]
 1 │ ls | rename items | group-by items --to-table
   ·                             ────────┬────────
   ·                                     ╰── contains `items`
   ╰────
  help: instead of a cell-path, try using a closure
```

And following the suggestion:

```nushell
❯ ls | rename items | group-by { get items } --to-table
╭─#──┬──────closure_0──────┬───────────────────────────items────────────────────────────╮
│  0 │ CITATION.cff        │ ╭─#─┬────items─────┬─type─┬─size──┬───modified───╮         │
│    │                     │ │ 0 │ CITATION.cff │ file │ 812 B │ 3 months ago │         │
│    │                     │ ╰─#─┴────items─────┴─type─┴─size──┴───modified───╯         │
│  1 │ CODE_OF_CONDUCT.md  │ ╭─#─┬───────items────────┬─type─┬──size───┬───modified───╮ │
...
```
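A minimal sketch of mixing a cell-path grouper with a closure grouper, to make the naming rules above concrete. The sample rows mirror the output shown earlier in the thread, and storing `year` as a string is an assumption. The name of the closure column depends on the Nushell version (`group1` under the original PR, `closure_0` under the follow-up), so the sketch only inspects the column list rather than assuming one name.

```nushell
# Sample rows mirroring the output shown earlier in the thread.
# Assumption: `year` is stored as a string.
let data = [
    [name, lang, year];
    [andres, rb, "2019"]
    [jt, rs, "2019"]
    [storm, rs, "2021"]
]

# A cell-path grouper (`lang`) followed by a closure grouper.
# The first column takes the cell-path name; the closure column name
# is version dependent; `items` comes last.
let cols = ($data | group-by lang { get year } --to-table | columns)

$cols
```

If a cell-path grouper would collide with `items`, wrapping it in a closure, as the error message suggests, avoids the conflict.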
# Description

Add `aggregate`, a command that operates on the output of `group-by --to-table` to help aggregate data for quick inspections.

# Related

- nushell/nushell#14316 (comment)
- nushell/nushell#2607
- nushell/nushell#14337

# Examples

```nushell
open ~/Downloads/movies.csv
| group-by Lead_Studio Genre --to-table
| aggregate Worldwide_Gross
# | first 4
# | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_min|Worldwide_Gross_avg|Worldwide_Gross_max|Worldwide_Gross_sum|
|-|-|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|19.62|19.62|19.62|
|The Weinstein Company|Drama|1|8.26|8.26|8.26|8.26|
|Independent|Comedy|7|14.31|57.01|205.3|399.07|
|Independent|Romance|7|0.03|149.82142857142858|702.17|1048.75|

---

```nushell
open ~/Downloads/movies.csv
| group-by Lead_Studio Genre --to-table
| aggregate Worldwide_Gross --ops {avg: {math avg}, std: {math stddev}}
# | first 4
# | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Worldwide_Gross_std|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|0|
|The Weinstein Company|Drama|1|8.26|0|
|Independent|Comedy|7|57.01|66.1709932134704|
|Independent|Romance|7|149.82142857142858|229.79475832816996|

---

```nushell
open ~/Downloads/movies.csv
| group-by Lead_Studio Genre --to-table
| aggregate Worldwide_Gross Audience_score_% --ops {avg: {math avg}}
# | first 4
# | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Audience_score_%_avg|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|52|
|The Weinstein Company|Drama|1|8.26|84|
|Independent|Comedy|7|57.01|60.142857142857146|
|Independent|Romance|7|149.82142857142858|59.857142857142854|
- `group-by` should support multiple grouper arguments #14330

Related:
- `split-by` command #14019
- `agg-by` as a command that helps aggregate data with sum, avg, and count #14316

# Description

This PR changes `group-by` to support grouping by multiple `grouper` arguments.

# Changes

- `--to-table=false`: no change in behavior
- `--to-table=true`:
- `--to-table=false`: nested groups
- `--to-table=true`: one column for each grouper argument, followed by the `items` column
  - `group{i}` where `i` is the index of the grouper argument

# Examples

Group column is now named after the grouper, to allow multiple groupers.
Grouping by multiple columns makes finer grained aggregations possible.
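As a rough illustration of such a finer-grained aggregation, the sketch below counts rows per `lang`/`year` pair. The sample rows mirror the output shown earlier in the thread; storing `year` as a string and the variable names `grouped` and `counts` are assumptions made for the example. It relies on cell-path groupers producing columns named after them, as described in this PR.

```nushell
# Sample rows mirroring the output shown earlier in the thread.
# Assumption: `year` is stored as a string.
let data = [
    [name, lang, year];
    [andres, rb, "2019"]
    [jt, rs, "2019"]
    [storm, rs, "2021"]
]

# One row per (lang, year) pair; `items` holds the matching rows.
let grouped = ($data | group-by lang year --to-table)

# Replace each group's rows with a simple count.
let counts = (
    $grouped
    | insert count {|row| $row.items | length }
    | reject items
)

$counts
```

Any other per-group computation (for example `math avg` over a numeric column of `items`) could take the place of `length` in the `insert` closure.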
Grouping by multiple columns without `--to-table` returns a nested structure.

This is equivalent to `$data | group-by year | split-by lang`, making `split-by` obsolete.

From #2607:
This example can be achieved like this: