Commit b443ccc
docs: STATS_DEFINITION.md comprehensive update
Key Updates:
1. stats section corrections:
- Added sqlp and joinp to the list of "smart" commands that use the stats cache
- Fixed the skewness formula to match actual implementation: (q3 - (2.0 * q2) + q1) / iqr
- Added information about memory-aware chunking and the QSV_STATS_CHUNK_MEMORY_MB environment variable
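
For reference, the corrected skewness formula has the form of the quartile (Bowley) skewness coefficient. A minimal sketch follows, assuming `iqr` means `q3 - q1` and that the quartiles are already computed. The function name and the zero-IQR handling are illustrative assumptions, not taken from the qsv source.

```python
from typing import Optional


def quartile_skewness(q1: float, q2: float, q3: float) -> Optional[float]:
    """Quartile-based skewness: (q3 - 2*q2 + q1) / iqr.

    Assumes iqr = q3 - q1. Returns None when the IQR is zero
    (an illustrative choice; the real implementation may differ).
    """
    iqr = q3 - q1
    if iqr == 0:
        return None
    return (q3 - (2.0 * q2) + q1) / iqr


# Median closer to q1 than to q3: positive (right) skew.
print(quartile_skewness(2.0, 3.0, 6.0))
# Median exactly halfway between q1 and q3: zero skew.
print(quartile_skewness(2.0, 4.0, 6.0))
```

The result always lies in [-1, 1] when q1 <= q2 <= q3, which makes it easy to sanity-check against the documented values.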
2. moarstats section enhancements:
- Added new "Bivariate Statistics" section documenting the 6 bivariate statistics:
  - Pearson correlation
  - Spearman correlation
  - Kendall's tau
  - Sample and population covariance
  - Mutual information
  - Normalized mutual information
- Documented performance optimizations (date parsing cache, string interning, early termination, streaming algorithms)
- Documented multi-dataset join capabilities
- Updated xsd_type definition to include Gregorian date type detection (gYear, gYearMonth, etc.) with confidence markers (? vs ??)
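
To illustrate two of the bivariate statistics listed above, here is a rough sketch of how sample and population covariance differ, and how Pearson correlation relates to them. It uses the textbook definitions only. The function names are hypothetical and this is not the moarstats implementation, which is described as using streaming algorithms.

```python
import math
from typing import Optional, Sequence


def covariance(xs: Sequence[float], ys: Sequence[float], sample: bool = True) -> float:
    """Textbook covariance: divide by n - 1 (sample) or n (population)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation = cov(x, y) / (sd(x) * sd(y)).

    Returns None if either column has zero variance (illustrative choice).
    """
    var_x = covariance(xs, xs, sample=False)
    var_y = covariance(ys, ys, sample=False)
    if var_x == 0 or var_y == 0:
        return None
    return covariance(xs, ys, sample=False) / math.sqrt(var_x * var_y)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covariance(xs, ys, sample=True))   # divides by n - 1
print(covariance(xs, ys, sample=False))  # divides by n
print(pearson(xs, ys))                   # perfectly linear data
```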
3. New frequency section:
Created comprehensive documentation for the frequency command including:
- Frequency table output format (field, value, count, percentage, rank)
- Ranking strategies (dense, min, max, ordinal, average) with examples
- Weighted frequencies support and weight handling rules
- Stats cache integration explaining ID column detection and memory optimization
- JSON/TOON output structure with example and list of 17 additional stats
- Memory-aware processing with chunking behavior and environment variable configuration
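
The five ranking strategies named above differ only in how tied counts are ranked. The sketch below uses the conventional definitions of dense, min, max, ordinal, and average ranking. The helper name `rank_counts` is hypothetical, and the frequency command's exact tie handling is an assumption here; check it against the documentation.

```python
from typing import List, Sequence, Union


def rank_counts(counts: Sequence[int], strategy: str = "dense") -> List[Union[int, float]]:
    """Rank counts that are already sorted in descending order.

    Ties (equal counts) are handled according to `strategy`.
    """
    ranks: List[Union[int, float]] = []
    dense_rank = 0
    i = 0
    n = len(counts)
    while i < n:
        # Find the end of the current run of tied counts.
        j = i
        while j < n and counts[j] == counts[i]:
            j += 1
        dense_rank += 1
        first, last = i + 1, j  # 1-based positions covered by the tie group
        for k in range(i, j):
            if strategy == "dense":
                ranks.append(dense_rank)          # no gaps after ties
            elif strategy == "min":
                ranks.append(first)               # lowest position in the group
            elif strategy == "max":
                ranks.append(last)                # highest position in the group
            elif strategy == "ordinal":
                ranks.append(k + 1)               # ties broken by order of appearance
            elif strategy == "average":
                ranks.append((first + last) / 2)  # mean position of the group
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        i = j
    return ranks


counts = [5, 3, 3, 1]
for strategy in ("dense", "min", "max", "ordinal", "average"):
    print(strategy, rank_counts(counts, strategy))
```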
[skip ci]
Co-Authored-By: Claude <81847+claude@users.noreply.github.com>

1 parent f7644ad, commit b443ccc
1 file changed, +238 -9 lines changed