Commit 015fdaa

Merge pull request #823 from fastverse/development
Update
2 parents 99a6b96 + 9c75620 commit 015fdaa

2 files changed

Lines changed: 267 additions & 2 deletions

.github/workflows/claude-code-review.yml

Lines changed: 8 additions & 2 deletions
@@ -1,8 +1,14 @@
 name: Claude Code Review
 
 on:
-  pull_request:
-    types: [opened, synchronize, ready_for_review, reopened]
+  workflow_dispatch:
+    inputs:
+      pr_number:
+        description: 'Pull Request number to review'
+        required: true
+        default: '1'
+  # pull_request:
+  #   types: [opened, synchronize, ready_for_review, reopened]
     # Optional: Only run on specific file changes
     # paths:
     #   - "src/**/*.ts"

CLAUDE.md

Lines changed: 259 additions & 0 deletions
@@ -0,0 +1,259 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+**collapse** is a high-performance C/C++-based R package for advanced data transformation and statistical computing. The package provides:
+- Fast grouped and weighted statistical functions with OpenMP multithreading
+- Class-agnostic architecture supporting base R, tibble, data.table, sf, plm, and other extensions
+- ~13,775 lines of R code and ~28,625 lines of C/C++ code
+- Comprehensive test suite with 48 test files
+
+## Build and Test Commands
+
+### Development Workflow
+```r
+# Load package for development
+devtools::load_all()
+
+# Run all tests
+devtools::test()
+
+# Run specific test file
+testthat::test_file("tests/testthat/test-GRP.R")
+
+# Run package check
+devtools::check()
+
+# Build documentation
+devtools::document()
+```
+
+### Command Line Build
+```bash
+# Build package tarball
+R CMD build .
+
+# Check package (comprehensive tests)
+R CMD check collapse_*.tar.gz
+
+# Install from source
+R CMD INSTALL collapse_*.tar.gz
+
+# Run tests directly
+Rscript -e "testthat::test_dir('tests/testthat')"
+```
+
+### CI/CD
+- GitHub Actions runs R-CMD-check on macOS, Windows, and Ubuntu
+- Tests against R-devel, R-release, and R-oldrel-1
+- Code coverage tracked via codecov
+- Documentation auto-deployed via pkgdown
+
+## Architecture
+
+### Code Organization
+
+**R/ directory (52 files)**
+- `f*` prefix: Fast statistical functions (fmean.R, fmedian.R, fvar_fsd.R, etc.)
+- `GRP.R` (1,428 lines): Core grouping object using radix sort or hash methods
+- `fsubset_ftransform_fmutate.R` (1,220 lines): Data manipulation (subset, transform, mutate)
+- `collap.R` (793 lines): Advanced aggregation framework
+- `pivot.R` (823 lines): Pivot operations (wider/longer/recast)
+- `join.R` (473 lines): Fast join operations
+- `fbetween_fwithin.R`, `fhdbetween_fhdwithin.R`: Between/within group transformations
+- `fdiff_fgrowth.R`, `flag.R`: Time series operations (differences, growth rates, lags/leads)
+- `qsu.R`, `descr.R`: Statistical summaries
+- `list_functions.R` (647 lines): Recursive list operations
+- `indexing.R` (741 lines): Time series and panel data indexing
+- `zzz.R` (195 lines): Package initialization, namespace masking system
+
+**src/ directory (94 files)**
+- `collapse_c.h` (195 lines): Main C header with all function signatures and OpenMP macros
+- `base_radixsort.c` (2,158 lines): Fast radix sorting (adapted from base R)
+- `fdiff_fgrowth.cpp` (1,854 lines): Differencing and growth rate calculations
+- `fvar_fsd.cpp` (1,689 lines): Variance and standard deviation
+- `fnth_fmedian_fquantile.c` (1,627 lines): Quantile functions
+- `TRA.c` (1,416 lines): Transformation operations framework
+- `programming.c` (1,284 lines): Programming helper functions
+- `fmode.c` (1,248 lines): Mode calculation
+- `kit_dup.c` (1,169 lines): Duplicate handling (adapted from kit package)
+- `match.c` (1,161 lines): Fast matching operations
+- `flag.cpp` (1,083 lines): Lag/lead operations
+- `data.table_*.c`: Integration with data.table internals (rbindlist, subset, utils)
+- `Makevars`, `Makevars.win`: OpenMP compilation configuration
+
+**tests/testthat/ (48 files)**
+- `test-<function>.R` pattern for each major function
+- Comprehensive tests against base R equivalents
+- Tests with NA/Inf values, different data types, grouping, and weights
+
+### Key Architectural Patterns
+
+**1. Two-Layer Design**
+- R layer: Input validation, S3 method dispatch, attribute handling
+- C/C++ layer: Core computation with OpenMP parallelization
+- Pattern: `.Call()` invokes C functions named with `_C`, `_mC` (matrix), `_lC` (list/data.frame) suffixes
+
+**2. Function Signature Pattern**
+Most statistical functions follow:
+```r
+f<name>(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, ...)
+```
+- `x`: data (vector, matrix, or data.frame)
+- `g`: grouping variable(s) - converted to GRP object internally
+- `w`: weights vector
+- `TRA`: transformation operation (see TRA framework below)
+- S3 methods for: default, matrix, data.frame, grouped_df, zoo, units, pdata.frame, pseries, sf
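
As a rough illustration of this signature pattern, the sketch below calls `fmean()` in ungrouped, grouped, and weighted form on the built-in `mtcars` data. It assumes collapse is installed; the columns used for data, groups, and weights are arbitrary choices for the example.

```r
library(collapse)

x <- mtcars$mpg   # data
g <- mtcars$cyl   # grouping variable (converted to a grouping internally)
w <- mtcars$wt    # weights

overall  <- fmean(x)         # single overall mean
by_group <- fmean(x, g)      # one mean per cylinder class
weighted <- fmean(x, g, w)   # weighted group means

# The data.frame method applies the same statistic column-wise
by_group_df <- fmean(mtcars[c("mpg", "hp")], g)
```

The positional order `x, g, w` mirrors the signature, so the same call shape should carry over to the other `f*` statistics that accept weights.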
+
+**3. GRP (Grouping) Object**
+Central to all grouped operations:
+- Created via `GRP()` or automatically from factors/lists
+- Contains: `N.groups`, `group.sizes`, `groups` (names), `group.id`, `group.starts`
+- Two algorithms: radix sort (default, stable) or hash-based (faster for many groups)
+- Access via `GRP()`, `fgroup_by()`, or automatic conversion from factors
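
A minimal sketch of building a GRP object once and reusing it, assuming collapse is installed. The grouping columns (`cyl` and `am` from `mtcars`) are chosen only for illustration.

```r
library(collapse)

# Group once by two columns; the one-sided formula names columns of the data frame
g <- GRP(mtcars, ~ cyl + am)

g$N.groups      # number of distinct cyl/am combinations
g$group.sizes   # observations per group
g$groups        # the unique group keys

# Reuse the same grouping for several statistics without re-grouping
mpg_mean <- fmean(mtcars$mpg, g)
mpg_sd   <- fsd(mtcars$mpg, g)
```

Passing the precomputed `g` avoids repeating the grouping step in each call, which is the usual reason to create the object explicitly.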
+
+**4. TRA (Transformation) Framework**
+11 transformation operations applicable to all statistical functions:
+- `"-"`: center (subtract statistic)
+- `"+"`: add statistic
+- `"*"`: multiply by statistic
+- `"/"`: divide by statistic (scale)
+- `"%"`: compute percentage of statistic (divide and multiply by 100)
+- `"%%"`: modulus (remainder from division by statistic)
+- `"-%%"`: subtract modulus (make data divisible by statistic)
+- `"-+"`: subtract group statistic and add overall average (center on the overall statistic)
+- `"replace"`: replace values with statistic, preserving missing values
+- `"replace_fill"` (or `"fill"`): replace values with statistic, including missing values
+- `"replace_na"` (or `"na"`): replace only missing values with statistic
+
+Example: `fmean(x, g, TRA = "-")` centers x by group means
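
To make the `TRA` argument concrete, here is a small sketch comparing three operations with base R equivalents. It assumes collapse is installed; the base R lines are reference computations for comparison, not how collapse implements the operations.

```r
library(collapse)

x <- mtcars$mpg
g <- mtcars$cyl

centered <- fmean(x, g, TRA = "-")        # x minus its group mean
replaced <- fmean(x, g, TRA = "replace")  # each value replaced by its group mean
scaled   <- fsd(x, g, TRA = "/")          # x divided by its group standard deviation

# Base R equivalents, shown for comparison only
centered_base <- x - ave(x, g)
replaced_base <- ave(x, g)
scaled_base   <- x / ave(x, g, FUN = sd)
```

With `TRA` set, the result has the same length as `x` rather than one value per group.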
+
+**5. Namespace Masking System**
+- Functions can be exported without 'f' prefix via `options(collapse_mask = "all")`, set before the package is loaded
+- Controlled via `.fastverse` config file or `set_collapse(mask = "manip")`
+- Allows `mean()`, `sum()`, etc. to use collapse's fast versions
+- Keywords: "all", "fast-fun", "fast-stat-fun", "helper", "manip", "special"
+
+**6. Class-Agnostic Design**
+- Consistent S3 methods across all data structures
+- Attribute preservation system maintains class-specific attributes
+- Functions work identically on base R, tibble, data.table, sf, plm objects
+
+**7. OpenMP Parallelization**
+- Controlled via `set_collapse(nthreads = n)`, or `options(collapse_nthreads = n)` set before the package is loaded
+- Automatic fallback if OpenMP not available
+- Used in: fsum, fmean, fmode, and other computationally intensive functions
+- Compilation flags in Makevars handle platform differences
+
+**8. data.table Integration**
+- Reuses core algorithms from data.table (radixsort, rbindlist, subset) under MPL 2.0
+- Works natively on data.table objects
+- Compatible with `:=` operator
+- Modified for collapse's specific needs
+
+## Common Development Patterns
+
+### Adding a New Statistical Function
+
+1. **Create R file** `R/f<name>.R`:
+   - Define S3 generic: `f<name> <- function(x, ...) UseMethod("f<name>")`
+   - Implement methods: `f<name>.default`, `f<name>.matrix`, `f<name>.data.frame`
+   - Add grouping support via GRP object
+   - Add TRA argument for transformations
+   - Handle attributes with `copyAttrib()` or `copyMostAttrib()`
+
+2. **Create C implementation** in `src/`:
+   - Add function signature to `collapse_c.h`
+   - Implement ungrouped version: `<name>_C()`
+   - Implement grouped version with g, starts, sizes parameters
+   - Add OpenMP parallelization where beneficial
+   - Register in `ExportSymbols.c` or use Rcpp
+
+3. **Add tests** in `tests/testthat/test-f<name>.R`:
+   - Compare against base R equivalent
+   - Test with NA values, Inf, different types
+   - Test grouped and weighted versions
+   - Test with different data structures
+
+4. **Update documentation**:
+   - Add roxygen2 comments
+   - Update `collapse-documentation.Rd` if adding new category
+   - Add examples
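
The R-layer part of these steps can be sketched as follows. `frange_width()` is a hypothetical function invented for this example and is not part of collapse. To stay runnable without new compiled code, it is built from the existing `fmax()` and `fmin()`, whereas a real addition would call its own C routine through `.Call()`.

```r
library(collapse)

# Hypothetical function for illustration only: it is NOT part of collapse.
# S3 generic, following the f<name> convention
frange_width <- function(x, ...) UseMethod("frange_width")

# Default method: atomic vectors, optionally grouped
frange_width.default <- function(x, g = NULL, na.rm = TRUE, ...) {
  if (is.null(g)) return(fmax(x, na.rm = na.rm) - fmin(x, na.rm = na.rm))
  g <- GRP(g)  # accepts vectors, factors, or an existing GRP object
  fmax(x, g, na.rm = na.rm) - fmin(x, g, na.rm = na.rm)
}

# data.frame method: group once, then apply the default method to each column
frange_width.data.frame <- function(x, g = NULL, na.rm = TRUE, ...) {
  if (!is.null(g)) g <- GRP(g)
  qDF(lapply(x, frange_width.default, g = g, na.rm = na.rm))
}

# Check against a base R reference, in the spirit of the testing step
res <- frange_width(mtcars$mpg, mtcars$cyl)
ref <- tapply(mtcars$mpg, mtcars$cyl, function(v) diff(range(v)))
stopifnot(isTRUE(all.equal(unname(res), as.vector(ref))))
```

The sketch leaves out the `TRA` argument, weights, and attribute handling, all of which a complete implementation following the steps above would need.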
+
+### Working with C/C++ Code
+
+- All C functions callable from R are declared in `collapse_c.h`
+- Use `PROTECT`/`UNPROTECT` for R objects in C code
+- Check existing patterns in similar functions before implementing
+- OpenMP pragmas: `#pragma omp parallel for num_threads(nthreads)`
+- Rcpp functions auto-generate interfaces via `RcppExports.cpp`
+
+### Memory Management
+
+- Use `settransform()`, `setv()` for in-place modifications (reference semantics)
+- Regular `ftransform()`, `fmutate()` copy on modify
+- `qDF()`, `qDT()`, `qM()` for fast conversions without attribute copying
+- Avoid unnecessary copies in tight loops
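
A short sketch of the difference between `ftransform()` and `settransform()`, assuming collapse is installed. The derived column `kpl` and its conversion factor are made up for the example.

```r
library(collapse)

df <- mtcars

# ftransform() returns a modified copy; the result must be assigned explicitly
df_copy <- ftransform(df, kpl = mpg * 0.425)

# settransform() updates df in the calling environment,
# so no explicit `df <- ...` assignment is needed
settransform(df, kpl = mpg * 0.425)

# qM() converts the selected columns to a matrix
m <- qM(df[c("mpg", "kpl")])
```

Only `df` gains the new column here; the original `mtcars` data is left untouched.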
+
+### Testing Philosophy
+
+- Every function needs comprehensive tests
+- Test against base R or established packages
+- Use random data and `set.seed()` for reproducibility
+- Test edge cases: empty data, single row/column, all NA
+- Test with grouped_df, data.table, pseries objects
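
A test written in this spirit might look like the sketch below. It assumes testthat and collapse are installed; the test name, seed, and data sizes are arbitrary and do not come from the package's actual test files.

```r
library(testthat)
library(collapse)

test_that("fmean agrees with base R on random data with NAs", {
  set.seed(101)                            # reproducible random input
  x <- rnorm(100)
  x[sample.int(100, 10)] <- NA             # inject missing values
  g <- sample.int(5, 100, replace = TRUE)  # five groups

  # Ungrouped, against mean()
  expect_equal(fmean(x), mean(x, na.rm = TRUE))

  # Grouped, against tapply()
  expect_equal(fmean(x, g, use.g.names = FALSE),
               as.vector(tapply(x, g, mean, na.rm = TRUE)))

  # Edge cases: all-NA input and a single observation
  expect_true(is.na(fmean(c(NA_real_, NA_real_))))
  expect_equal(fmean(3), 3)
})
```

The same structure extends to weighted calls and to other data structures by swapping the input and the reference computation.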
+
+## Package-Specific Conventions
+
+### Function Naming
+- `f` prefix: Fast statistical functions
+- `q` prefix: Quick conversions (qDF, qDT, qM)
+- Short transformation operators: B (between), W (within), D (diff), G (growth), L (lag), HDB/HDW (high-dim between/within), STD (standardize)
+- Many functions have shorter aliases in documentation
+
+### Global Options
+Access via `get_collapse()` and `set_collapse()`:
+- `nthreads`: OpenMP thread count (default: 1)
+- `na.rm`: Default NA removal (default: TRUE)
+- `sort`: Default sorting in GRP (default: TRUE)
+- `mask`: Namespace masking level
+- `verbose`: Verbosity level for operations
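
The following sketch shows one way to change and then restore global options, assuming collapse is installed. It relies on `get_collapse()` returning the bare value when a single option name is requested; treat that as an assumption to verify against the installed version.

```r
library(collapse)

# Remember the current settings so they can be restored afterwards
old_threads <- get_collapse("nthreads")
old_na_rm   <- get_collapse("na.rm")

set_collapse(nthreads = 2L, na.rm = FALSE)

# With na.rm now defaulting to FALSE, the missing value should propagate
res_default  <- fmean(c(1, NA, 3))

# An explicit argument still overrides the global default
res_explicit <- fmean(c(1, NA, 3), na.rm = TRUE)

# Restore the previous settings
set_collapse(nthreads = old_threads, na.rm = old_na_rm)
```

Restoring the options at the end matters in tests and scripts, because the settings otherwise persist for the rest of the session.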
+
+### Attribute Handling
+- `copyAttrib()`: Copy all attributes
+- `copyMostAttrib()`: Copy all except names, dim, dimnames
+- `setAttrib()`: Set the attribute list
+- Class attributes preserved automatically in most functions
+
+### Error Handling
+- R layer validates inputs, provides informative errors
+- C layer assumes validated inputs for performance
+- Use `ckmatch()` for matching with good error messages
+- `unused_arg_action()` for handling unexpected arguments
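
As a hedged illustration of `ckmatch()`, the sketch below matches column names against `mtcars` and captures the error raised for an unknown name. It assumes collapse is installed and that `ckmatch()` is exported in that version; `not_a_column` is a deliberately invalid name.

```r
library(collapse)

cols <- names(mtcars)

# All names present: returns their integer positions, like match()
idx <- ckmatch(c("mpg", "cyl"), cols)

# A missing name raises an error that mentions the offending value
msg <- tryCatch(
  ckmatch(c("mpg", "not_a_column"), cols),
  error = function(e) conditionMessage(e)
)
```

Failing at the matching step keeps invalid column names from reaching the C layer, which assumes validated inputs.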
+
+## Key Files to Understand
+
+- `R/zzz.R`: Package initialization, namespace setup
+- `R/GRP.R`: Core grouping system
+- `R/global_macros.R`: Global options and constants
+- `src/collapse_c.h`: Complete C API
+- `src/base_radixsort.c`: Fast ordering algorithm
+- `tests/testthat/test-GRP.R`: Comprehensive grouping tests
+- `vignettes/collapse_documentation.Rmd`: Documentation guide
+
+## Performance Considerations
+
+- collapse functions are highly optimized; avoid calling base R equivalents in hot paths
+- Use GRP objects explicitly for repeated grouping operations
+- Set `nthreads` appropriately for your system
+- Use in-place modification functions when appropriate
+- Vector operations are faster than grouped operations on small data
+- For large data, collapse typically outperforms base R by 2-100x
+
+## Documentation Resources
+
+- Built-in documentation: `help('collapse-documentation')`
+- Vignettes: https://fastverse.org/collapse/articles/
+- arXiv article: https://arxiv.org/abs/2403.05038
+- GitHub: https://github.com/fastverse/collapse
