Skip to content

add agg-by as a command that helps aggregate data with sum, avg, and count#14316

Closed
fdncred wants to merge 6 commits intonushell:mainfrom
fdncred:add_agg_by_command
Closed

add agg-by as a command that helps aggregate data with sum, avg, and count#14316
fdncred wants to merge 6 commits intonushell:mainfrom
fdncred:add_agg_by_command

Conversation

@fdncred
Copy link
Copy Markdown
Contributor

@fdncred fdncred commented Nov 12, 2024

Description

This PR tries to make it easier to group and sum/avg/count data. This all could probably be done in a custom command, but I wanted to see if there's an appetite for such a command. This command is pretty naive and doesn't do much checking/validation. It's just meant as a "starter command" that people could add to. Most of the code is ripped off of group-by.

I also wrote this for my mental health during a time of struggle.

I created it because a long time ago I opened this issue, and this PR would close it #2607

Before

open ~/Downloads/movies.csv |
    select Lead_Studio Worldwide_Gross |
    group-by Lead_Studio --to-table |
    upsert Worldwide_Gross_sum {|r| $r.items.Worldwide_Gross | math sum} |
    upsert Worldwide_Gross_avg {|r| $r.items.Worldwide_Gross | math avg} |
    upsert Worldwide_Gross_min {|r| $r.items.Worldwide_Gross | math min} |
    upsert Worldwide_Gross_max {|r| $r.items.Worldwide_Gross | math max} |
    update items {|i| $i.items.Lead_Studio | length} |
    sort-by Worldwide_Gross_sum | rename Lead_Studio count

After

open ~/Downloads/movies.csv |
    agg-by Lead_Studio --agg-column Worldwide_Gross |
    sort-by Worldwide_Gross

More details in my spec

image

User-Facing Changes

Tests + Formatting

After Submitting

fdncred pushed a commit that referenced this pull request Nov 15, 2024
- closes #14330 

Related:
- #2607 
- #14019
- #14316 

# Description
This PR changes `group-by` to support grouping by multiple `grouper`
arguments.

# Changes

- No grouper: no change in behavior 
- Single grouper
  - `--to-table=false`: no change in behavior
  - `--to-table=true`:
    - closure grouper: named group0
    - cell-path grouper: named after the cell-path
- Multiple groupers:
  - `--to-table=false`: nested groups
- `--to-table=true`: one column for each grouper argument, followed by
the `items` column
    - columns corresponding to cell-paths are named after them
- columns corresponding to closure groupers are named `group{i}` where
`i` is the index of the grouper argument

# Examples
```nushell
> [1 3 1 3 2 1 1] | group-by
╭───┬───────────╮
│   │ ╭───┬───╮ │
│ 1 │ │ 0 │ 1 │ │
│   │ │ 1 │ 1 │ │
│   │ │ 2 │ 1 │ │
│   │ │ 3 │ 1 │ │
│   │ ╰───┴───╯ │
│   │ ╭───┬───╮ │
│ 3 │ │ 0 │ 3 │ │
│   │ │ 1 │ 3 │ │
│   │ ╰───┴───╯ │
│   │ ╭───┬───╮ │
│ 2 │ │ 0 │ 2 │ │
│   │ ╰───┴───╯ │
╰───┴───────────╯

> [1 3 1 3 2 1 1] | group-by --to-table
╭─#─┬─group─┬───items───╮
│ 0 │ 1     │ ╭───┬───╮ │
│   │       │ │ 0 │ 1 │ │
│   │       │ │ 1 │ 1 │ │
│   │       │ │ 2 │ 1 │ │
│   │       │ │ 3 │ 1 │ │
│   │       │ ╰───┴───╯ │
│ 1 │ 3     │ ╭───┬───╮ │
│   │       │ │ 0 │ 3 │ │
│   │       │ │ 1 │ 3 │ │
│   │       │ ╰───┴───╯ │
│ 2 │ 2     │ ╭───┬───╮ │
│   │       │ │ 0 │ 2 │ │
│   │       │ ╰───┴───╯ │
╰─#─┴─group─┴───items───╯

> [1 3 1 3 2 1 1] | group-by { $in >= 2 }
╭───────┬───────────╮
│       │ ╭───┬───╮ │
│ false │ │ 0 │ 1 │ │
│       │ │ 1 │ 1 │ │
│       │ │ 2 │ 1 │ │
│       │ │ 3 │ 1 │ │
│       │ ╰───┴───╯ │
│       │ ╭───┬───╮ │
│ true  │ │ 0 │ 3 │ │
│       │ │ 1 │ 3 │ │
│       │ │ 2 │ 2 │ │
│       │ ╰───┴───╯ │
╰───────┴───────────╯

> [1 3 1 3 2 1 1] | group-by { $in >= 2 } --to-table
╭─#─┬─group0─┬───items───╮
│ 0 │ false  │ ╭───┬───╮ │
│   │        │ │ 0 │ 1 │ │
│   │        │ │ 1 │ 1 │ │
│   │        │ │ 2 │ 1 │ │
│   │        │ │ 3 │ 1 │ │
│   │        │ ╰───┴───╯ │
│ 1 │ true   │ ╭───┬───╮ │
│   │        │ │ 0 │ 3 │ │
│   │        │ │ 1 │ 3 │ │
│   │        │ │ 2 │ 2 │ │
│   │        │ ╰───┴───╯ │
╰─#─┴─group0─┴───items───╯
```

```nushell
let data = [
    [name, lang, year];
    [andres, rb, "2019"],
    [jt, rs, "2019"],
    [storm, rs, "2021"]
]

> $data
╭─#─┬──name──┬─lang─┬─year─╮
│ 0 │ andres │ rb   │ 2019 │
│ 1 │ jt     │ rs   │ 2019 │
│ 2 │ storm  │ rs   │ 2021 │
╰─#─┴──name──┴─lang─┴─year─╯
```

```nushell
> $data | group-by lang
╭────┬──────────────────────────────╮
│    │ ╭─#─┬──name──┬─lang─┬─year─╮ │
│ rb │ │ 0 │ andres │ rb   │ 2019 │ │
│    │ ╰─#─┴──name──┴─lang─┴─year─╯ │
│    │ ╭─#─┬─name──┬─lang─┬─year─╮  │
│ rs │ │ 0 │ jt    │ rs   │ 2019 │  │
│    │ │ 1 │ storm │ rs   │ 2021 │  │
│    │ ╰─#─┴─name──┴─lang─┴─year─╯  │
╰────┴──────────────────────────────╯
```

Group column is now named after the grouper, to allow multiple groupers.
```nushell
> $data | group-by lang --to-table  # column names changed!
╭─#─┬─lang─┬────────────items─────────────╮
│ 0 │ rb   │ ╭─#─┬──name──┬─lang─┬─year─╮ │
│   │      │ │ 0 │ andres │ rb   │ 2019 │ │
│   │      │ ╰─#─┴──name──┴─lang─┴─year─╯ │
│ 1 │ rs   │ ╭─#─┬─name──┬─lang─┬─year─╮  │
│   │      │ │ 0 │ jt    │ rs   │ 2019 │  │
│   │      │ │ 1 │ storm │ rs   │ 2021 │  │
│   │      │ ╰─#─┴─name──┴─lang─┴─year─╯  │
╰─#─┴─lang─┴────────────items─────────────╯
```

Grouping by multiple columns makes finer grained aggregations possible.
```nushell
> $data | group-by lang year --to-table
╭─#─┬─lang─┬─year─┬────────────items─────────────╮
│ 0 │ rb   │ 2019 │ ╭─#─┬──name──┬─lang─┬─year─╮ │
│   │      │      │ │ 0 │ andres │ rb   │ 2019 │ │
│   │      │      │ ╰─#─┴──name──┴─lang─┴─year─╯ │
│ 1 │ rs   │ 2019 │ ╭─#─┬─name─┬─lang─┬─year─╮   │
│   │      │      │ │ 0 │ jt   │ rs   │ 2019 │   │
│   │      │      │ ╰─#─┴─name─┴─lang─┴─year─╯   │
│ 2 │ rs   │ 2021 │ ╭─#─┬─name──┬─lang─┬─year─╮  │
│   │      │      │ │ 0 │ storm │ rs   │ 2021 │  │
│   │      │      │ ╰─#─┴─name──┴─lang─┴─year─╯  │
╰─#─┴─lang─┴─year─┴────────────items─────────────╯
```

Grouping by multiple columns, without `--to-table` returns a nested
structure.
This is equivalent to `$data | group-by year | split-by lang`, making
`split-by` obsolete.
```nushell
> $data | group-by lang year
╭────┬─────────────────────────────────────────╮
│    │ ╭──────┬──────────────────────────────╮ │
│ rb │ │      │ ╭─#─┬──name──┬─lang─┬─year─╮ │ │
│    │ │ 2019 │ │ 0 │ andres │ rb   │ 2019 │ │ │
│    │ │      │ ╰─#─┴──name──┴─lang─┴─year─╯ │ │
│    │ ╰──────┴──────────────────────────────╯ │
│    │ ╭──────┬─────────────────────────────╮  │
│ rs │ │      │ ╭─#─┬─name─┬─lang─┬─year─╮  │  │
│    │ │ 2019 │ │ 0 │ jt   │ rs   │ 2019 │  │  │
│    │ │      │ ╰─#─┴─name─┴─lang─┴─year─╯  │  │
│    │ │      │ ╭─#─┬─name──┬─lang─┬─year─╮ │  │
│    │ │ 2021 │ │ 0 │ storm │ rs   │ 2021 │ │  │
│    │ │      │ ╰─#─┴─name──┴─lang─┴─year─╯ │  │
│    │ ╰──────┴─────────────────────────────╯  │
╰────┴─────────────────────────────────────────╯
```

From #2607:
> Here's a couple more examples without much explanation. This one shows
adding two grouping keys. I'm always wanting to add more columns when
using group-by and it just-work:tm: `gb.exe -f movies-2.csv -k 3,2 -s 7
--skip_header`
> 
> ```
>  k:3                   | k:2       | count | sum:7
> -----------------------+-----------+-------+--------------------
>  20th Century Fox      | Drama     | 1     | 117.09
>  20th Century Fox      | Romance   | 1     | 39.66
>  CBS                   | Comedy    | 1     | 77.09
>  Disney                | Animation | 4     | 1264.23
>  Disney                | Comedy    | 4     | 950.27
>  Fox                   | Comedy    | 5     | 661.85
>  Independent           | Comedy    | 7     | 399.07
>  Independent           | Drama     | 4     | 69.75
>  Independent           | Romance   | 7     | 1048.75
>  Independent           | romance   | 1     | 29.37
> ...
> ```

This example can be achieved like this:
```nushell
> open movies-2.csv
  | group-by "Lead Studio" Genre --to-table
  | insert count {get items | length}
  | insert sum { get items."Worldwide Gross" | math sum}
  | reject items
  | sort-by "Lead Studio" Genre
╭─#──┬──────Lead Studio──────┬───Genre───┬─count─┬───sum───╮
│ 0  │ 20th Century Fox      │ Drama     │     1 │  117.09 │
│ 1  │ 20th Century Fox      │ Romance   │     1 │   39.66 │
│ 2  │ CBS                   │ Comedy    │     1 │   77.09 │
│ 3  │ Disney                │ Animation │     4 │ 1264.23 │
│ 4  │ Disney                │ Comedy    │     4 │  950.27 │
│ 5  │ Fox                   │ Comedy    │     5 │  661.85 │
│ 6  │ Fox                   │ comedy    │     1 │   60.72 │
│ 7  │ Independent           │ Comedy    │     7 │  399.07 │
│ 8  │ Independent           │ Drama     │     4 │   69.75 │
│ 9  │ Independent           │ Romance   │     7 │ 1048.75 │
│ 10 │ Independent           │ romance   │     1 │   29.37 │
...
```
@sholderbach sholderbach added new-command deprecated:pr-commands (deprecated: too vague) This PR changes our commands in some way labels Nov 15, 2024
@IanManske
Copy link
Copy Markdown
Member

IanManske commented Nov 18, 2024

Maybe math stats would be a better name for this command?

Edit: hold up, still reading the command...

@fdncred
Copy link
Copy Markdown
Contributor Author

fdncred commented Nov 18, 2024

It is a little like str stats but this is more about grouping and aggregating data, which kind of takes too many steps with nushell, imo. Just looking for a QOL win.

@fdncred
Copy link
Copy Markdown
Contributor Author

fdncred commented Dec 2, 2024

closing in favor of script, hoping it gets into stdlib eventually

def default-agg-ops [] {
    {
        min: {math min},
        avg: {math avg},
        max: {math max},
        sum: {math sum},
    }
}

def col_name [col: cell-path, op_name: string]: [nothing -> string] {
    $col | split cell-path | get value | str join "." | $"($in)_($op_name)"
}

# group and aggregate by multiple columns
export def agg-by [
    grouper: list<cell-path>,
    --agg-ops: record, # default = {min: {math min}, avg: {math avg}, max: {math max}, sum: {math sum}}  
    column: cell-path, # column to perform agg-ops on
    ...rest: cell-path, # additional columns
] {
    group-by ...$grouper --to-table
    | update items {|group|
        let agg_ops = $agg_ops | default (default-agg-ops)

        [$column ...$rest] | each {|col| # cell-path
            $agg_ops | items {|op_name, op| # string, closure
                $group.items | get $col | do $op | wrap (col_name $col $op_name)
            }
            | into record
        }
        | into record
        | insert count ($group.items | length)
        | roll right  # put count as the first column
    }
    | flatten items 
}

@fdncred fdncred closed this Dec 2, 2024
@fdncred fdncred deleted the add_agg_by_command branch December 2, 2024 12:23
fdncred pushed a commit to nushell/nu_scripts that referenced this pull request Dec 31, 2024
# Description

Add `aggregate`, a command that operates on the output of `group-by
--to-table` to help aggregate to do quick inspections.

# Related
- nushell/nushell#14316 (comment)
- nushell/nushell#2607
- nushell/nushell#14337

# Examples

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross
  # | first 4
  # | to md
```


|Lead_Studio|Genre|count|Worldwide_Gross_min|Worldwide_Gross_avg|Worldwide_Gross_max|Worldwide_Gross_sum|
|-|-|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|19.62|19.62|19.62|
|The Weinstein Company|Drama|1|8.26|8.26|8.26|8.26|
|Independent|Comedy|7|14.31|57.01|205.3|399.07|
|Independent|Romance|7|0.03|149.82142857142858|702.17|1048.75|

---

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross --ops {avg: {math avg}, std: {math stddev}}
  # | first 4
  # | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Worldwide_Gross_std|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|0|
|The Weinstein Company|Drama|1|8.26|0|
|Independent|Comedy|7|57.01|66.1709932134704|
|Independent|Romance|7|149.82142857142858|229.79475832816996|

---

```nushell
open ~/Downloads/movies.csv
  | group-by Lead_Studio Genre --to-table
  | aggregate Worldwide_Gross Audience_score_% --ops {avg: {math avg}}
  # | first 4
  # | to md
```

|Lead_Studio|Genre|count|Worldwide_Gross_avg|Audience_score_%_avg|
|-|-|-|-|-|
|The Weinstein Company|Comedy|1|19.62|52|
|The Weinstein Company|Drama|1|8.26|84|
|Independent|Comedy|7|57.01|60.142857142857146|
|Independent|Romance|7|149.82142857142858|59.857142857142854|
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deprecated:pr-commands (deprecated: too vague) This PR changes our commands in some way new-command

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants