Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Dec 5, 2025

Closes #19098, follow up to #19094. Related to #14936

@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate common Related to common crate proto Related to proto crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Dec 5, 2025
@adriangb adriangb force-pushed the add-stats-column-sizes branch from af67b08 to 874a51d Compare December 7, 2025 19:28
@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Dec 7, 2025
@adriangb adriangb force-pushed the add-stats-column-sizes branch from 53b5875 to 23ff3e9 Compare December 8, 2025 15:26
@adriangb adriangb marked this pull request as ready for review December 8, 2025 17:24
Comment on lines +520 to +528
let sum_scan_bytes: Option<usize> = self
.column_statistics
.iter()
.map(|cs| cs.byte_size.get_value().copied())
.try_fold(0usize, |acc, val| val.map(|v| acc + v));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to handle exactness I think

s
};
let s = if cs.byte_size != Precision::Absent {
format!("{} ScanBytes={}", s, cs.byte_size)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change to just Bytes

Comment on lines -1456 to +1524
// 300/1000 = 0.3, so 8000 * 0.3 = 2400
assert_eq!(result.total_byte_size, Precision::Inexact(2400));
// Column 1: byte_size 800 * (300/500) = 240, Sum = 240
assert_eq!(result.total_byte_size, Precision::Inexact(240));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to double check these numbers

Comment on lines +516 to +517
// This is the same logic as parquet_column but we start from arrow schema index
// instead of looking up by name.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment

@mbutrovich mbutrovich self-requested a review December 10, 2025 16:34
Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb! Just a minor comment nit.

This sent me down a rabbit hole of how we might support stats when there are expressions that change a column's size, but that's way out of scope here.

/// The number of table rows.
/// The number of rows estimated to be scanned.
pub num_rows: Precision<usize>,
/// Total bytes of the table rows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you reworded "table rows" above, we should do something similar here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent 7a15d26 which actually does a bit more rewording than even you suggested. Could you take a look at the new wording?

@adriangb
Copy link
Contributor Author

This sent me down a rabbit hole of how we might support stats when there are expressions that change a column's size, but that's way out of scope here.

Yes agreed, I think that's a beneficial but orthogonal change. There's already some stuff present in PhysicalExpression but I forge the current state of it / why it isn't used.

@adriangb adriangb force-pushed the add-stats-column-sizes branch from 23ff3e9 to 7a15d26 Compare December 10, 2025 20:24
Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @adriangb! Looks like it just needs a cargo fmt and for CI to go green.

@mbutrovich mbutrovich added this pull request to the merge queue Dec 10, 2025
Merged via the queue into apache:main with commit c1aa1b5 Dec 10, 2025
32 checks passed
@adriangb adriangb deleted the add-stats-column-sizes branch December 10, 2025 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track individual column sizes in Statistics

2 participants