|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +**collapse** is a high-performance C/C++-based R package for advanced data transformation and statistical computing. The package provides: |
| 8 | +- Fast grouped and weighted statistical functions with OpenMP multithreading |
| 9 | +- Class-agnostic architecture supporting base R, tibble, data.table, sf, plm, and other extensions |
| 10 | +- ~13,775 lines of R code and ~28,625 lines of C/C++ code |
| 11 | +- Comprehensive test suite with 48 test files |
| 12 | + |
| 13 | +## Build and Test Commands |
| 14 | + |
| 15 | +### Development Workflow |
| 16 | +```r |
| 17 | +# Load package for development |
| 18 | +devtools::load_all() |
| 19 | + |
| 20 | +# Run all tests |
| 21 | +devtools::test() |
| 22 | + |
| 23 | +# Run specific test file |
| 24 | +testthat::test_file("tests/testthat/test-GRP.R") |
| 25 | + |
| 26 | +# Run package check |
| 27 | +devtools::check() |
| 28 | + |
| 29 | +# Build documentation |
| 30 | +devtools::document() |
| 31 | +``` |
| 32 | + |
| 33 | +### Command Line Build |
| 34 | +```bash |
| 35 | +# Build package tarball |
| 36 | +R CMD build . |
| 37 | + |
| 38 | +# Check package (comprehensive tests) |
| 39 | +R CMD check collapse_*.tar.gz |
| 40 | + |
| 41 | +# Install from source |
| 42 | +R CMD INSTALL collapse_*.tar.gz |
| 43 | + |
| 44 | +# Run tests directly |
| 45 | +Rscript -e "testthat::test_dir('tests/testthat')" |
| 46 | +``` |
| 47 | + |
| 48 | +### CI/CD |
| 49 | +- GitHub Actions runs R-CMD-check on macOS, Windows, and Ubuntu |
| 50 | +- Tests against R-devel, R-release, and R-oldrel-1 |
| 51 | +- Code coverage tracked via codecov |
| 52 | +- Documentation auto-deployed via pkgdown |
| 53 | + |
| 54 | +## Architecture |
| 55 | + |
| 56 | +### Code Organization |
| 57 | + |
| 58 | +**R/ directory (52 files)** |
| 59 | +- `f*` prefix: Fast statistical functions (fmean.R, fmedian.R, fvar_fsd.R, etc.) |
| 60 | +- `GRP.R` (1,428 lines): Core grouping object using radix sort or hash methods |
| 61 | +- `fsubset_ftransform_fmutate.R` (1,220 lines): Data manipulation (subset, transform, mutate) |
| 62 | +- `collap.R` (793 lines): Advanced aggregation framework |
| 63 | +- `pivot.R` (823 lines): Pivot operations (wider/longer/recast) |
| 64 | +- `join.R` (473 lines): Fast join operations |
| 65 | +- `fbetween_fwithin.R`, `fhdbetween_fhdwithin.R`: Between/within group transformations |
| 66 | +- `fdiff_fgrowth.R`, `flag.R`: Time series operations (differences, growth rates, lags/leads) |
| 67 | +- `qsu.R`, `descr.R`: Statistical summaries |
| 68 | +- `list_functions.R` (647 lines): Recursive list operations |
| 69 | +- `indexing.R` (741 lines): Time series and panel data indexing |
| 70 | +- `zzz.R` (195 lines): Package initialization, namespace masking system |
| 71 | + |
| 72 | +**src/ directory (94 files)** |
| 73 | +- `collapse_c.h` (195 lines): Main C header with all function signatures and OpenMP macros |
| 74 | +- `base_radixsort.c` (2,158 lines): Fast radix sorting (adapted from base R) |
| 75 | +- `fdiff_fgrowth.cpp` (1,854 lines): Differencing and growth rate calculations |
| 76 | +- `fvar_fsd.cpp` (1,689 lines): Variance and standard deviation |
| 77 | +- `fnth_fmedian_fquantile.c` (1,627 lines): Quantile functions |
| 78 | +- `TRA.c` (1,416 lines): Transformation operations framework |
| 79 | +- `programming.c` (1,284 lines): Programming helper functions |
| 80 | +- `fmode.c` (1,248 lines): Mode calculation |
| 81 | +- `kit_dup.c` (1,169 lines): Duplicate handling (adapted from kit package) |
| 82 | +- `match.c` (1,161 lines): Fast matching operations |
| 83 | +- `flag.cpp` (1,083 lines): Lag/lead operations |
| 84 | +- `data.table_*.c`: Integration with data.table internals (rbindlist, subset, utils) |
| 85 | +- `Makevars`, `Makevars.win`: OpenMP compilation configuration |
| 86 | + |
| 87 | +**tests/testthat/ (48 files)** |
| 88 | +- `test-<function>.R` pattern for each major function |
| 89 | +- Comprehensive tests against base R equivalents |
| 90 | +- Tests with NA/Inf values, different data types, grouping, and weights |
| 91 | + |
| 92 | +### Key Architectural Patterns |
| 93 | + |
| 94 | +**1. Two-Layer Design** |
| 95 | +- R layer: Input validation, S3 method dispatch, attribute handling |
| 96 | +- C/C++ layer: Core computation with OpenMP parallelization |
| 97 | +- Pattern: `.Call()` invokes C functions named with `_C`, `_mC` (matrix), `_lC` (list/data.frame) suffixes |
| 98 | + |
| 99 | +**2. Function Signature Pattern** |
| 100 | +Most statistical functions follow: |
| 101 | +```r |
| 102 | +f<name>(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, ...) |
| 103 | +``` |
| 104 | +- `x`: data (vector, matrix, or data.frame) |
| 105 | +- `g`: grouping variable(s) - converted to GRP object internally |
| 106 | +- `w`: weights vector |
| 107 | +- `TRA`: transformation operation (see TRA framework below) |
| 108 | +- S3 methods (the set varies by function) for: default, matrix, data.frame, grouped_df, zoo, units, pdata.frame, pseries, sf |
| 109 | + |
| 110 | +**3. GRP (Grouping) Object** |
| 111 | +Central to all grouped operations: |
| 112 | +- Created via `GRP()` or automatically from factors/lists |
| 113 | +- Contains: `N.groups`, `group.sizes`, `groups` (names), `group.id`, `group.starts` |
| 114 | +- Two algorithms: radix ordering (used by default when `sort = TRUE`) or hashing (used by default when `sort = FALSE`, keeping groups in first-appearance order) |
| 115 | +- `fgroup_by()` attaches a GRP object to a data frame for use by the `grouped_df` methods |
| 116 | + |
| 117 | +**4. TRA (Transformation) Framework** |
| 118 | +11 transformation operations applicable to all statistical functions: |
| 119 | +- `"-"`: center (subtract statistic) |
| 120 | +- `"+"`: add statistic |
| 121 | +- `"*"`: multiply by statistic |
| 122 | +- `"/"`: divide by statistic (scale) |
| 123 | +- `"%"`: compute percentage of statistic |
| 124 | +- `"%%"`: modulus |
| 125 | +- `"-+"`: center and add overall mean |
| 126 | +- `"-%%"`: subtract modulus (floor data to multiples of the statistic) |
| 127 | +- `"replace"`: replace values with statistic, preserving missing values |
| 128 | +- `"replace_fill"`: replace values and missing values with statistic; `"replace_na"`: replace only missing values |
| 129 | + |
| 130 | +Example: `fmean(x, g, TRA = "-")` centers x by group means |
| 131 | + |
| 132 | +**5. Namespace Masking System** |
| 133 | +- Functions can be exported without the 'f' prefix via `options(collapse_mask = "all")`, set before the package is loaded |
| 134 | +- Controlled via `.fastverse` config file or `set_collapse(mask = "manip")` |
| 135 | +- Allows `mean()`, `sum()`, etc. to use collapse's fast versions |
| 136 | +- Keywords: "all", "fast-fun", "fast-stat-fun", "helper", "manip", "special" |
| 137 | + |
| 138 | +**6. Class-Agnostic Design** |
| 139 | +- Consistent S3 methods across all data structures |
| 140 | +- Attribute preservation system maintains class-specific attributes |
| 141 | +- Functions work identically on base R, tibble, data.table, sf, plm objects |
| 142 | + |
| 143 | +**7. OpenMP Parallelization** |
| 144 | +- Controlled via `set_collapse(nthreads = n)` or `options(collapse_nthreads = n)` |
| 145 | +- Automatic fallback if OpenMP not available |
| 146 | +- Used in: fsum, fmean, fmode, and other computationally intensive functions |
| 147 | +- Compilation flags in Makevars handle platform differences |
| 148 | + |
| 149 | +**8. data.table Integration** |
| 150 | +- Reuses core algorithms from data.table (radixsort, rbindlist, subset) under MPL 2.0 |
| 151 | +- Works natively on data.table objects |
| 152 | +- Compatible with `:=` operator |
| 153 | +- Modified for collapse's specific needs |
| 154 | + |
| 155 | +## Common Development Patterns |
| 156 | + |
| 157 | +### Adding a New Statistical Function |
| 158 | + |
| 159 | +1. **Create R file** `R/f<name>.R`: |
| 160 | + - Define S3 generic: `f<name> <- function(x, ...) UseMethod("f<name>")` |
| 161 | + - Implement methods: `f<name>.default`, `f<name>.matrix`, `f<name>.data.frame` |
| 162 | + - Add grouping support via GRP object |
| 163 | + - Add TRA argument for transformations |
| 164 | + - Handle attributes with `copyAttrib()` or `copyMostAttrib()` |
| 165 | + |
| 166 | +2. **Create C implementation** in `src/`: |
| 167 | + - Add function signature to `collapse_c.h` |
| 168 | + - Implement ungrouped version: `<name>_C()` |
| 169 | + - Implement grouped version with g, starts, sizes parameters |
| 170 | + - Add OpenMP parallelization where beneficial |
| 171 | + - Register in `ExportSymbols.c` or use Rcpp |
| 172 | + |
| 173 | +3. **Add tests** in `tests/testthat/test-f<name>.R`: |
| 174 | + - Compare against base R equivalent |
| 175 | + - Test with NA values, Inf, different types |
| 176 | + - Test grouped and weighted versions |
| 177 | + - Test with different data structures |
| 178 | + |
| 179 | +4. **Update documentation**: |
| 180 | + - Add roxygen2 comments |
| 181 | + - Update `collapse-documentation.Rd` if adding new category |
| 182 | + - Add examples |
| 183 | + |
| 184 | +### Working with C/C++ Code |
| 185 | + |
| 186 | +- All C functions callable from R are declared in `collapse_c.h` |
| 187 | +- Use `PROTECT`/`UNPROTECT` for R objects in C code |
| 188 | +- Check existing patterns in similar functions before implementing |
| 189 | +- OpenMP pragmas: `#pragma omp parallel for num_threads(nthreads)` |
| 190 | +- Rcpp functions auto-generate interfaces via `RcppExports.cpp` |
| 191 | + |
| 192 | +### Memory Management |
| 193 | + |
| 194 | +- Use `settransform()`, `setv()` for in-place modifications (reference semantics) |
| 195 | +- Regular `ftransform()`, `fmutate()` copy on modify |
| 196 | +- `qDF()`, `qDT()`, `qM()` for fast conversions without attribute copying |
| 197 | +- Avoid unnecessary copies in tight loops |
| 198 | + |
| 199 | +### Testing Philosophy |
| 200 | + |
| 201 | +- Every function needs comprehensive tests |
| 202 | +- Test against base R or established packages |
| 203 | +- Use random data and `set.seed()` for reproducibility |
| 204 | +- Test edge cases: empty data, single row/column, all NA |
| 205 | +- Test with grouped_df, data.table, pseries objects |
| 206 | + |
| 207 | +## Package-Specific Conventions |
| 208 | + |
| 209 | +### Function Naming |
| 210 | +- `f` prefix: Fast statistical functions |
| 211 | +- `q` prefix: Quick conversions (qDF, qDT, qM) |
| 212 | +- Short operator names: B (between), W (within), D (diff), G (growth), L (lag), HDB/HDW (high-dim between/within), STD (standardize) |
| 213 | +- Many functions have shorter aliases in documentation |
| 214 | + |
| 215 | +### Global Options |
| 216 | +Access via `get_collapse()` and `set_collapse()`: |
| 217 | +- `nthreads`: OpenMP thread count (default: 1) |
| 218 | +- `na.rm`: Default NA removal (default: TRUE) |
| 219 | +- `sort`: Default sorting in GRP (default: TRUE) |
| 220 | +- `mask`: Namespace masking level |
| 221 | +- `verbose`: Verbosity level for operations |
| 222 | + |
| 223 | +### Attribute Handling |
| 224 | +- `copyAttrib()`: Copy all attributes |
| 225 | +- `copyMostAttrib()`: Copy all except names, dim, dimnames |
| 226 | +- `setAttrib()`: Set the attribute list of an object |
| 227 | +- Class attributes preserved automatically in most functions |
| 228 | + |
| 229 | +### Error Handling |
| 230 | +- R layer validates inputs, provides informative errors |
| 231 | +- C layer assumes validated inputs for performance |
| 232 | +- Use `ckmatch()` for matching with good error messages |
| 233 | +- `unused_arg_action()` for handling unexpected arguments |
| 234 | + |
| 235 | +## Key Files to Understand |
| 236 | + |
| 237 | +- `R/zzz.R`: Package initialization, namespace setup |
| 238 | +- `R/GRP.R`: Core grouping system |
| 239 | +- `R/global_macros.R`: Global options and constants |
| 240 | +- `src/collapse_c.h`: Complete C API |
| 241 | +- `src/base_radixsort.c`: Fast ordering algorithm |
| 242 | +- `tests/testthat/test-GRP.R`: Comprehensive grouping tests |
| 243 | +- `vignettes/collapse_documentation.Rmd`: Documentation guide |
| 244 | + |
| 245 | +## Performance Considerations |
| 246 | + |
| 247 | +- collapse functions are highly optimized; avoid calling base R equivalents in hot paths |
| 248 | +- Use GRP objects explicitly for repeated grouping operations |
| 249 | +- Set `nthreads` appropriately for your system |
| 250 | +- Use in-place modification functions when appropriate |
| 251 | +- Vector operations are faster than grouped operations on small data |
| 252 | +- For large data, collapse typically outperforms base R by 2-100x |
| 253 | + |
| 254 | +## Documentation Resources |
| 255 | + |
| 256 | +- Built-in documentation: `help('collapse-documentation')` |
| 257 | +- Vignettes: https://fastverse.org/collapse/articles/ |
| 258 | +- arXiv article: https://arxiv.org/abs/2403.05038 |
| 259 | +- GitHub: https://github.com/fastverse/collapse |
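
The GRP and TRA patterns described under "Key Architectural Patterns" and "Performance Considerations" can be illustrated with a minimal sketch. It assumes collapse is installed; the variable names (`x`, `f`, `g`, `m`, `s`, `xc`) are made up for illustration and do not come from the repository.

```r
library(collapse)

set.seed(42)                      # reproducible random data
x <- rnorm(1000)
f <- sample(letters[1:5], 1000, replace = TRUE)

# Compute the grouping once and reuse it across several calls
g <- GRP(f)

m <- fmean(x, g)                  # one mean per group
s <- fsd(x, g)                    # one standard deviation per group

# TRA = "-" subtracts each observation's group mean (centering),
# returning a vector of the same length as x
xc <- fmean(x, g, TRA = "-")

# Cross-check against base R equivalents
all.equal(unname(m), as.vector(tapply(x, f, mean)))
all.equal(xc, x - ave(x, f))
```

Passing the precomputed `g` avoids regrouping `f` on every call, which is the point of the "use GRP objects explicitly" advice above.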