add agg-by as a command that helps aggregate data with sum, avg, and count#14316
Closed
fdncred wants to merge 6 commits intonushell:mainfrom
Closed
add agg-by as a command that helps aggregate data with sum, avg, and count#14316fdncred wants to merge 6 commits intonushell:mainfrom
agg-by as a command that helps aggregate data with sum, avg, and count#14316fdncred wants to merge 6 commits intonushell:mainfrom
Conversation
This was referenced Nov 13, 2024
fdncred
pushed a commit
that referenced
this pull request
Nov 15, 2024
- closes #14330 Related: - #2607 - #14019 - #14316 # Description This PR changes `group-by` to support grouping by multiple `grouper` arguments. # Changes - No grouper: no change in behavior - Single grouper - `--to-table=false`: no change in behavior - `--to-table=true`: - closure grouper: named group0 - cell-path grouper: named after the cell-path - Multiple groupers: - `--to-table=false`: nested groups - `--to-table=true`: one column for each grouper argument, followed by the `items` column - columns corresponding to cell-paths are named after them - columns corresponding to closure groupers are named `group{i}` where `i` is the index of the grouper argument # Examples ```nushell > [1 3 1 3 2 1 1] | group-by ╭───┬───────────╮ │ │ ╭───┬───╮ │ │ 1 │ │ 0 │ 1 │ │ │ │ │ 1 │ 1 │ │ │ │ │ 2 │ 1 │ │ │ │ │ 3 │ 1 │ │ │ │ ╰───┴───╯ │ │ │ ╭───┬───╮ │ │ 3 │ │ 0 │ 3 │ │ │ │ │ 1 │ 3 │ │ │ │ ╰───┴───╯ │ │ │ ╭───┬───╮ │ │ 2 │ │ 0 │ 2 │ │ │ │ ╰───┴───╯ │ ╰───┴───────────╯ > [1 3 1 3 2 1 1] | group-by --to-table ╭─#─┬─group─┬───items───╮ │ 0 │ 1 │ ╭───┬───╮ │ │ │ │ │ 0 │ 1 │ │ │ │ │ │ 1 │ 1 │ │ │ │ │ │ 2 │ 1 │ │ │ │ │ │ 3 │ 1 │ │ │ │ │ ╰───┴───╯ │ │ 1 │ 3 │ ╭───┬───╮ │ │ │ │ │ 0 │ 3 │ │ │ │ │ │ 1 │ 3 │ │ │ │ │ ╰───┴───╯ │ │ 2 │ 2 │ ╭───┬───╮ │ │ │ │ │ 0 │ 2 │ │ │ │ │ ╰───┴───╯ │ ╰─#─┴─group─┴───items───╯ > [1 3 1 3 2 1 1] | group-by { $in >= 2 } ╭───────┬───────────╮ │ │ ╭───┬───╮ │ │ false │ │ 0 │ 1 │ │ │ │ │ 1 │ 1 │ │ │ │ │ 2 │ 1 │ │ │ │ │ 3 │ 1 │ │ │ │ ╰───┴───╯ │ │ │ ╭───┬───╮ │ │ true │ │ 0 │ 3 │ │ │ │ │ 1 │ 3 │ │ │ │ │ 2 │ 2 │ │ │ │ ╰───┴───╯ │ ╰───────┴───────────╯ > [1 3 1 3 2 1 1] | group-by { $in >= 2 } --to-table ╭─#─┬─group0─┬───items───╮ │ 0 │ false │ ╭───┬───╮ │ │ │ │ │ 0 │ 1 │ │ │ │ │ │ 1 │ 1 │ │ │ │ │ │ 2 │ 1 │ │ │ │ │ │ 3 │ 1 │ │ │ │ │ ╰───┴───╯ │ │ 1 │ true │ ╭───┬───╮ │ │ │ │ │ 0 │ 3 │ │ │ │ │ │ 1 │ 3 │ │ │ │ │ │ 2 │ 2 │ │ │ │ │ ╰───┴───╯ │ ╰─#─┴─group0─┴───items───╯ ``` ```nushell let data = [ [name, lang, year]; [andres, rb, "2019"], [jt, rs, "2019"], [storm, rs, "2021"] ] > $data ╭─#─┬──name──┬─lang─┬─year─╮ │ 0 │ andres │ rb │ 2019 │ │ 1 │ jt │ rs │ 2019 │ │ 2 │ storm │ rs │ 2021 │ ╰─#─┴──name──┴─lang─┴─year─╯ ``` ```nushell > $data | group-by lang ╭────┬──────────────────────────────╮ │ │ ╭─#─┬──name──┬─lang─┬─year─╮ │ │ rb │ │ 0 │ andres │ rb │ 2019 │ │ │ │ ╰─#─┴──name──┴─lang─┴─year─╯ │ │ │ ╭─#─┬─name──┬─lang─┬─year─╮ │ │ rs │ │ 0 │ jt │ rs │ 2019 │ │ │ │ │ 1 │ storm │ rs │ 2021 │ │ │ │ ╰─#─┴─name──┴─lang─┴─year─╯ │ ╰────┴──────────────────────────────╯ ``` Group column is now named after the grouper, to allow multiple groupers. ```nushell > $data | group-by lang --to-table # column names changed! ╭─#─┬─lang─┬────────────items─────────────╮ │ 0 │ rb │ ╭─#─┬──name──┬─lang─┬─year─╮ │ │ │ │ │ 0 │ andres │ rb │ 2019 │ │ │ │ │ ╰─#─┴──name──┴─lang─┴─year─╯ │ │ 1 │ rs │ ╭─#─┬─name──┬─lang─┬─year─╮ │ │ │ │ │ 0 │ jt │ rs │ 2019 │ │ │ │ │ │ 1 │ storm │ rs │ 2021 │ │ │ │ │ ╰─#─┴─name──┴─lang─┴─year─╯ │ ╰─#─┴─lang─┴────────────items─────────────╯ ``` Grouping by multiple columns makes finer grained aggregations possible. ```nushell > $data | group-by lang year --to-table ╭─#─┬─lang─┬─year─┬────────────items─────────────╮ │ 0 │ rb │ 2019 │ ╭─#─┬──name──┬─lang─┬─year─╮ │ │ │ │ │ │ 0 │ andres │ rb │ 2019 │ │ │ │ │ │ ╰─#─┴──name──┴─lang─┴─year─╯ │ │ 1 │ rs │ 2019 │ ╭─#─┬─name─┬─lang─┬─year─╮ │ │ │ │ │ │ 0 │ jt │ rs │ 2019 │ │ │ │ │ │ ╰─#─┴─name─┴─lang─┴─year─╯ │ │ 2 │ rs │ 2021 │ ╭─#─┬─name──┬─lang─┬─year─╮ │ │ │ │ │ │ 0 │ storm │ rs │ 2021 │ │ │ │ │ │ ╰─#─┴─name──┴─lang─┴─year─╯ │ ╰─#─┴─lang─┴─year─┴────────────items─────────────╯ ``` Grouping by multiple columns, without `--to-table` returns a nested structure. This is equivalent to `$data | group-by year | split-by lang`, making `split-by` obsolete. ```nushell > $data | group-by lang year ╭────┬─────────────────────────────────────────╮ │ │ ╭──────┬──────────────────────────────╮ │ │ rb │ │ │ ╭─#─┬──name──┬─lang─┬─year─╮ │ │ │ │ │ 2019 │ │ 0 │ andres │ rb │ 2019 │ │ │ │ │ │ │ ╰─#─┴──name──┴─lang─┴─year─╯ │ │ │ │ ╰──────┴──────────────────────────────╯ │ │ │ ╭──────┬─────────────────────────────╮ │ │ rs │ │ │ ╭─#─┬─name─┬─lang─┬─year─╮ │ │ │ │ │ 2019 │ │ 0 │ jt │ rs │ 2019 │ │ │ │ │ │ │ ╰─#─┴─name─┴─lang─┴─year─╯ │ │ │ │ │ │ ╭─#─┬─name──┬─lang─┬─year─╮ │ │ │ │ │ 2021 │ │ 0 │ storm │ rs │ 2021 │ │ │ │ │ │ │ ╰─#─┴─name──┴─lang─┴─year─╯ │ │ │ │ ╰──────┴─────────────────────────────╯ │ ╰────┴─────────────────────────────────────────╯ ``` From #2607: > Here's a couple more examples without much explanation. This one shows adding two grouping keys. I'm always wanting to add more columns when using group-by and it just-work:tm: `gb.exe -f movies-2.csv -k 3,2 -s 7 --skip_header` > > ``` > k:3 | k:2 | count | sum:7 > -----------------------+-----------+-------+-------------------- > 20th Century Fox | Drama | 1 | 117.09 > 20th Century Fox | Romance | 1 | 39.66 > CBS | Comedy | 1 | 77.09 > Disney | Animation | 4 | 1264.23 > Disney | Comedy | 4 | 950.27 > Fox | Comedy | 5 | 661.85 > Independent | Comedy | 7 | 399.07 > Independent | Drama | 4 | 69.75 > Independent | Romance | 7 | 1048.75 > Independent | romance | 1 | 29.37 > ... > ``` This example can be achieved like this: ```nushell > open movies-2.csv | group-by "Lead Studio" Genre --to-table | insert count {get items | length} | insert sum { get items."Worldwide Gross" | math sum} | reject items | sort-by "Lead Studio" Genre ╭─#──┬──────Lead Studio──────┬───Genre───┬─count─┬───sum───╮ │ 0 │ 20th Century Fox │ Drama │ 1 │ 117.09 │ │ 1 │ 20th Century Fox │ Romance │ 1 │ 39.66 │ │ 2 │ CBS │ Comedy │ 1 │ 77.09 │ │ 3 │ Disney │ Animation │ 4 │ 1264.23 │ │ 4 │ Disney │ Comedy │ 4 │ 950.27 │ │ 5 │ Fox │ Comedy │ 5 │ 661.85 │ │ 6 │ Fox │ comedy │ 1 │ 60.72 │ │ 7 │ Independent │ Comedy │ 7 │ 399.07 │ │ 8 │ Independent │ Drama │ 4 │ 69.75 │ │ 9 │ Independent │ Romance │ 7 │ 1048.75 │ │ 10 │ Independent │ romance │ 1 │ 29.37 │ ... ```
Member
|
Maybe Edit: hold up, still reading the command... |
Contributor
Author
|
It is a little like |
Contributor
Author
|
closing in favor of script, hoping it gets into stdlib eventually def default-agg-ops [] {
{
min: {math min},
avg: {math avg},
max: {math max},
sum: {math sum},
}
}
def col_name [col: cell-path, op_name: string]: [nothing -> string] {
$col | split cell-path | get value | str join "." | $"($in)_($op_name)"
}
# group and aggregate by multiple columns
export def agg-by [
grouper: list<cell-path>,
--agg-ops: record, # default = {min: {math min}, avg: {math avg}, max: {math max}, sum: {math sum}}
column: cell-path, # column to perform agg-ops on
...rest: cell-path, # additional columns
] {
group-by ...$grouper --to-table
| update items {|group|
let agg_ops = $agg_ops | default (default-agg-ops)
[$column ...$rest] | each {|col| # cell-path
$agg_ops | items {|op_name, op| # string, closure
$group.items | get $col | do $op | wrap (col_name $col $op_name)
}
| into record
}
| into record
| insert count ($group.items | length)
| roll right # put count as the first column
}
| flatten items
} |
fdncred
pushed a commit
to nushell/nu_scripts
that referenced
this pull request
Dec 31, 2024
# Description Add `aggregate`, a command that operates on the output of `group-by --to-table` to help aggregate to do quick inspections. # Related - nushell/nushell#14316 (comment) - nushell/nushell#2607 - nushell/nushell#14337 # Examples ```nushell open ~/Downloads/movies.csv | group-by Lead_Studio Genre --to-table | aggregate Worldwide_Gross # | first 4 # | to md ``` |Lead_Studio|Genre|count|Worldwide_Gross_min|Worldwide_Gross_avg|Worldwide_Gross_max|Worldwide_Gross_sum| |-|-|-|-|-|-|-| |The Weinstein Company|Comedy|1|19.62|19.62|19.62|19.62| |The Weinstein Company|Drama|1|8.26|8.26|8.26|8.26| |Independent|Comedy|7|14.31|57.01|205.3|399.07| |Independent|Romance|7|0.03|149.82142857142858|702.17|1048.75| --- ```nushell open ~/Downloads/movies.csv | group-by Lead_Studio Genre --to-table | aggregate Worldwide_Gross --ops {avg: {math avg}, std: {math stddev}} # | first 4 # | to md ``` |Lead_Studio|Genre|count|Worldwide_Gross_avg|Worldwide_Gross_std| |-|-|-|-|-| |The Weinstein Company|Comedy|1|19.62|0| |The Weinstein Company|Drama|1|8.26|0| |Independent|Comedy|7|57.01|66.1709932134704| |Independent|Romance|7|149.82142857142858|229.79475832816996| --- ```nushell open ~/Downloads/movies.csv | group-by Lead_Studio Genre --to-table | aggregate Worldwide_Gross Audience_score_% --ops {avg: {math avg}} # | first 4 # | to md ``` |Lead_Studio|Genre|count|Worldwide_Gross_avg|Audience_score_%_avg| |-|-|-|-|-| |The Weinstein Company|Comedy|1|19.62|52| |The Weinstein Company|Drama|1|8.26|84| |Independent|Comedy|7|57.01|60.142857142857146| |Independent|Romance|7|149.82142857142858|59.857142857142854|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR tries to make it easier to group and sum/avg/count data. This all could probably be done in a custom command, but I wanted to see if there's an appetite for such a command. This command is pretty naive and doesn't do much checking/validation. It's just meant as a "starter command" that people could add to. Most of the code is ripped off of group-by.
I also wrote this for my mental health during a time of struggle.
I created it because a long time ago I opened this issue, and this PR would close it #2607
Before
After
More details in my spec
User-Facing Changes
Tests + Formatting
After Submitting