-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Track column sizes in Statistics; propagate through projections #19113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
af67b08 to
874a51d
Compare
53b5875 to
23ff3e9
Compare
| let sum_scan_bytes: Option<usize> = self | ||
| .column_statistics | ||
| .iter() | ||
| .map(|cs| cs.byte_size.get_value().copied()) | ||
| .try_fold(0usize, |acc, val| val.map(|v| acc + v)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to handle exactness I think
| s | ||
| }; | ||
| let s = if cs.byte_size != Precision::Absent { | ||
| format!("{} ScanBytes={}", s, cs.byte_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to change to just Bytes
| // 300/1000 = 0.3, so 8000 * 0.3 = 2400 | ||
| assert_eq!(result.total_byte_size, Precision::Inexact(2400)); | ||
| // Column 1: byte_size 800 * (300/500) = 240, Sum = 240 | ||
| assert_eq!(result.total_byte_size, Precision::Inexact(240)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to double check these numbers
| // This is the same logic as parquet_column but we start from arrow schema index | ||
| // instead of looking up by name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comment
mbutrovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adriangb! Just a minor comment nit.
This sent me down a rabbit hole of how we might support stats when there are expressions that change a column's size, but that's way out of scope here.
datafusion/common/src/stats.rs
Outdated
| /// The number of table rows. | ||
| /// The number of rows estimated to be scanned. | ||
| pub num_rows: Precision<usize>, | ||
| /// Total bytes of the table rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you reworded "table rows" above, we should do something similar here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sent 7a15d26 which actually does a bit more rewording than even you suggested. Could you take a look at the new wording?
Yes agreed, I think that's a beneficial but orthogonal change. There's already some stuff present in |
23ff3e9 to
7a15d26
Compare
mbutrovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @adriangb! Looks like it just needs a cargo fmt and for CI to go green.
Closes #19098, follow up to #19094. Related to #14936