feat: moarstats add bivariate stats
#3247
Pull request overview
This PR adds bivariate statistics computation to the moarstats command, enabling analysis of relationships between pairs of columns in CSV datasets. For each field pair, the feature computes five correlation and covariance statistics (Pearson, Spearman, Kendall's tau, sample covariance, population covariance) plus mutual information.
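For orientation, here is a minimal sketch of how the covariance and Pearson statistics named above can be computed for one pair of numeric columns. It is not the code in src/cmd/moarstats.rs; the function name `bivariate_summary` and its return shape are assumptions made for illustration only.

```rust
/// Hypothetical helper (not from the PR): two-pass covariance and Pearson
/// correlation for two equal-length numeric columns.
/// Returns (sample_cov, population_cov, pearson_r), or None when the lengths
/// differ or fewer than two pairs are available.
fn bivariate_summary(x: &[f64], y: &[f64]) -> Option<(f64, f64, f64)> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let nf = n as f64;

    // First pass: column means.
    let mean_x = x.iter().sum::<f64>() / nf;
    let mean_y = y.iter().sum::<f64>() / nf;

    // Second pass: sums of centered products and squares.
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&a, &b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }

    let sample_cov = sxy / (nf - 1.0);
    let population_cov = sxy / nf;
    // Pearson r is undefined when either column is constant; report NaN.
    let pearson = if sxx > 0.0 && syy > 0.0 {
        sxy / (sxx * syy).sqrt()
    } else {
        f64::NAN
    };
    Some((sample_cov, population_cov, pearson))
}
```

Calling `bivariate_summary(&[1.0, 2.0, 3.0, 4.0, 5.0], &[2.0, 4.0, 6.0, 8.0, 10.0])` on perfectly linear data gives a sample covariance of 5, a population covariance of 4, and a Pearson coefficient of 1.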
Key changes:
- Implements bivariate statistics with parallel chunked processing for large files and sequential processing for smaller datasets
- Adds multi-dataset join capability to compute bivariate statistics across joined datasets
- Includes comprehensive test coverage with 9 new test cases covering various scenarios (basic correlation, negative correlation, string fields, mixed types, joins, etc.)
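The parallel chunked strategy in the first bullet relies on per-chunk partial results that can be combined afterwards. The sketch below shows one common way to do that for covariance: each chunk keeps a running co-moment, and chunks are merged with the standard pairwise update formula. The `CoMoment` type, `chunked_covariance`, and the one-thread-per-chunk scheduling are all assumptions for illustration; the PR may chunk, schedule, and merge differently.

```rust
use std::thread;

/// Hypothetical accumulator (not from the PR) for one pair of columns.
#[derive(Clone, Copy, Debug, Default)]
struct CoMoment {
    n: f64,
    mean_x: f64,
    mean_y: f64,
    c_xy: f64, // running sum of (x - mean_x) * (y - mean_y)
}

impl CoMoment {
    /// Single-pass online update for one (x, y) observation.
    fn push(&mut self, x: f64, y: f64) {
        self.n += 1.0;
        let dx = x - self.mean_x; // deviation from the old mean of x
        self.mean_x += dx / self.n;
        self.mean_y += (y - self.mean_y) / self.n;
        self.c_xy += dx * (y - self.mean_y); // uses the updated mean of y
    }

    /// Combine two partial results computed on disjoint chunks.
    fn merge(self, other: CoMoment) -> CoMoment {
        if self.n == 0.0 {
            return other;
        }
        if other.n == 0.0 {
            return self;
        }
        let n = self.n + other.n;
        let dx = other.mean_x - self.mean_x;
        let dy = other.mean_y - self.mean_y;
        CoMoment {
            n,
            mean_x: self.mean_x + dx * other.n / n,
            mean_y: self.mean_y + dy * other.n / n,
            c_xy: self.c_xy + other.c_xy + dx * dy * self.n * other.n / n,
        }
    }

    fn sample_covariance(&self) -> Option<f64> {
        if self.n < 2.0 {
            None
        } else {
            Some(self.c_xy / (self.n - 1.0))
        }
    }
}

/// Process each chunk on its own scoped thread, then merge the partials.
/// One thread per chunk keeps the sketch short; a real implementation
/// would more likely use a bounded thread pool.
fn chunked_covariance(x: &[f64], y: &[f64], chunk_size: usize) -> Option<f64> {
    if x.len() != y.len() || chunk_size == 0 {
        return None;
    }
    let total = thread::scope(|s| {
        let handles: Vec<_> = x
            .chunks(chunk_size)
            .zip(y.chunks(chunk_size))
            .map(|(cx, cy)| {
                s.spawn(move || {
                    let mut acc = CoMoment::default();
                    for (&a, &b) in cx.iter().zip(cy) {
                        acc.push(a, b);
                    }
                    acc
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(CoMoment::default(), CoMoment::merge)
    });
    total.sample_covariance()
}
```

Because the merge step is exact up to floating-point rounding, the chunked result matches a sequential pass regardless of chunk size. Rank-based statistics such as Spearman and Kendall's tau do not decompose this simply, so this sketch covers only the covariance case.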
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/cmd/moarstats.rs | Core implementation of bivariate statistics computation including correlation algorithms, mutual information calculation, multi-dataset join support, parallel/sequential processing strategies, and extensive optimizations (date parsing cache, string interning, batch conversions) |
| tests/test_moarstats.rs | Comprehensive test suite with 9 test cases covering positive/negative correlations, string fields, multiple fields, all statistics, mixed types, joins, and index auto-creation |
| docs/STATS_DEFINITIONS.md | Documentation for bivariate statistics including definitions of Pearson/Spearman/Kendall correlations, covariance, mutual information, and multi-dataset join usage |
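
For reference, the textbook definitions of the statistics listed in the table are given below. These are the standard forms, not quotations from docs/STATS_DEFINITIONS.md; the Kendall variant shown is tau-a and the logarithm base for mutual information is left unspecified, since the PR may use a tie-corrected variant or a particular base.

```latex
% Sample and population covariance of n pairs (x_i, y_i)
\operatorname{cov}_{s}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{cov}_{p}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

% Pearson correlation coefficient
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Spearman correlation: Pearson correlation applied to the ranks R(x_i), R(y_i)
\rho = r\bigl(R(x), R(y)\bigr)

% Kendall's tau-a, with C concordant and D discordant pairs
\tau_a = \frac{C - D}{n(n-1)/2}

% Mutual information of discrete variables X and Y
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}
```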
No description provided.