add gtars-lola crate, support in-memory db in IGD struct#242
add gtars-lola crate, support in-memory db in IGD struct#242sanghoonio merged 33 commits intodevfrom
Conversation
…gine Phase 0: New gtars-igd/src/igd.rs with unified Igd struct that supports both construction and querying in-memory, replacing the split between igd_t (creation-only) and igd_t_from_disk (query-only). Includes builder APIs (from_bed_dir, from_region_sets), in-memory query with min_overlap bp filtering, and optional disk persistence compatible with legacy format. Legacy create.rs/search.rs untouched. Phase 1: New gtars-lola crate with core LOLA enrichment engine - ContingencyTable with Fisher's exact test (via statrs Hypergeometric CDF), run_lola() entry point, and result ranking (pValueLog, oddsRatio, support, meanRnk, maxRnk). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 2: Universe validation (check_universe_appropriateness with coverage/many-to-many diagnostics), user set redefinition (redefine_user_sets matching R LOLA's redefineUserSets), and restricted universe construction (build_restricted_universe with interval merging). Phase 3: RegionDB struct wrapping Igd + original region sets + annotations. Loads LOLA-format folder structure (collection.txt, index.txt, regions/*.bed), builds IGD in one pass. Supports collection filtering, file limits, merge, and BEDbase integration via from_igd_with_regions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 4: Rust extendr bindings (lola.rs) exposing 7 functions - loadRegionDB, loadRegionDBFromBeds, runLOLA, listRegionSets, checkUniverseAppropriateness, redefineUserSets, buildRestrictedUniverse. R wrappers (lola.R) match original LOLA package signatures for drop-in compatibility. Accepts RegionSet S4 objects, returns data.frames with LOLA result columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 5: PyO3 bindings as feature-gated lola submodule (from gtars.lola import RegionDB, run_lola). RegionDB class with from_folder/from_bed_files constructors. Regions passed as native Python tuples for ergonomic API. Returns column-oriented dicts for easy DataFrame conversion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements Benjamini-Hochberg FDR correction per user set, TSV output matching R LOLA format, and adds qValue to Python and R bindings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Load all BED files from regions/ directory, not just index.txt entries - Use binary per-region overlap counting (count_region_hits) to match R LOLA's countOverlaps semantics - Use survival function sf() instead of 1-cdf() to avoid catastrophic cancellation in p-value computation for highly significant results - Fix R wrapper symbol names and regenerate extendr/roxygen docs - Add compare_lola.Rmd test script All contingency table values now match R LOLA exactly. P-values match to within 2.4e-11 (machine epsilon). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exposes LolaRegionDB, runLOLA, and checkUniverseAppropriateness via wasm-bindgen for browser-based enrichment testing. All in-memory, no filesystem access needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3 of interval operations: IGD-backed overlap queries on RegionSet. - Add Igd::from_single_region_set, find_overlaps_regionset, and count_overlaps_per_query to gtars-igd with deduplication across tiles - Add find_overlaps and count_overlaps to IntervalRanges trait (gtars-genomicdist now depends on gtars-igd) - Add R FFI wrappers and S4 methods importing generics from IRanges - Refactor gtars-lola universe.rs to use RegionSet methods instead of hand-rolled overlap code (~80 lines removed) - All 245 Rust tests and 7/7 LOLA comparison checks pass Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose regiondb_anno(), regiondb_collection_anno(), and regiondb_region_set() so R consumers can access annotations and individual region sets from a gtars-loaded RegionDB without needing to re-read from disk. This enables episcope and other packages to use gtars as a drop-in backend while retaining access to metadata needed by post-enrichment functions like extractEnrichmentOverlaps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # gtars-genomicdist/src/interval_ranges.rs # gtars-python/Cargo.toml
…tions - gaps(): add leading gap from position 0 to first region per chromosome, matching R's gaps() behavior (Fix 1) - Add CoordinateMode enum (Bed/GRanges) in gtars-core for coordinate convention differences between R (1-based closed) and BED (0-based half-open) - Region::mid_point_with_mode() uses banker's rounding for GRanges mode (Fix 3) - calc_summary_signal() formats labels with 1-based start in GRanges mode (Fix 2) - Thread CoordinateMode through TssIndex, calc_tss_distances, calc_feature_distances; R bindings use GRanges, all others default to Bed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- calc_neighbor_distances: sort regions by (start, end) per chromosome before windowing; filter out non-positive distances (Fix 4) - calc_nearest_neighbors: sort regions per chromosome; skip single-region chromosomes instead of pushing u32::MAX sentinels (Fix 5) - Update tests to reflect new behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add region_distribution_with_chrom_sizes() that creates bins based on
each chromosome's actual size instead of uniform global bin size (Fix 6)
- R binding accepts optional chrom_names/chrom_lengths parameters
- Tourney dispatch passes chromSizes from getChromSizes("hg19") to gtars
- R distribution() generic gains optional chromSizes parameter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
calc_expected_partitions now computes partition genome fractions using priority resolution: each genomic bp is assigned to the first partition that covers it (via reduce + setdiff), preventing double-counting when partitions overlap. This matches R's approach. (Fix 7) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fall back to collection-level description when file-level description is empty, matching R LOLA's behavior (Fix 8) - Add 'size' column (region count per file) to regionDBAnno output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sums Three bugs in calcExpectedPartitions that caused expected value mismatches: 1. Strand-aware setdiff for proximal promoters: into_regionset() was called before setdiff(), dropping strand info. R's setdiff is strand-aware, so +/- strand promoters at the same position were incorrectly subtracted. 2. trim() silently dropped regions on chromosomes not in chrom_sizes (e.g., chrMT vs chrM naming mismatch). Now keeps such regions unchanged, matching R's trim() behavior. 3. R binding pre-reduced genes before computing promoters, collapsing overlapping genes and losing their individual promoters. R computes promoters(rawGenes) first, then reduces. Removed the pre-reduce. Also adds promoters_stranded() to preserve strand through the promoter pipeline, documents the R neighbordt() bug in an Rmd, and updates the dispatch to pass chromSizes for calcExpectedPartitions. Results: 5/8 tourney functions exact match, 3 intentional divergences (2 where gtars is more correct than R, 1 deliberate design choice). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- IGD: 3 new tests comparing old vs new API (single-file, multi-file, disk format compatibility) - LOLA: 10 new tests for enrichment edge cases, FDR correction, and database loading - gtars-py: disambiguate count_overlaps/find_overlaps where both IntervalRanges and RegionSetOverlaps traits provide them Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… RegionSetOverlaps - Remove count_overlaps/find_overlaps from IntervalRanges trait (genomicdist) - Remove gtars-igd dependency from gtars-genomicdist - Add min_overlap: Option<i32> to all 4 RegionSetOverlaps methods (overlaprs) - Rewrite universe.rs to accept pre-built &Igd instead of rebuilding per call - Switch R bindings to RegionSetOverlaps with min_overlap for countOverlaps/findOverlaps - Update Python, WASM bindings for new signatures - Fix loadRegionDB description fallback to use collection folder name (matches R) Resolves E0034 trait ambiguity in gtars-py. Tourney: 24/27 pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ings - Fix iter_chroms() to preserve insertion order instead of HashSet non-deterministic order (fixes GC content correlation bug) - Fix S4 dispatch: register methods for character/data.frame instead of ANY for Bioconductor generics (narrow, shift, etc.) to avoid hijacking GRanges dispatch - Switch LOLA odds ratio from sample OR (a*d)/(b*c) to Fisher conditional MLE matching R's fisher.test()$estimate, using Brent root-finding on the noncentral hypergeometric distribution - Add missing IntervalRanges wasm bindings (shift, flank, resize, narrow, disjoin, gaps, intersect) to match R bindings - Update compare_lola.Rmd for LOLACore compatibility and oddsRatio check - Update tutorial_regionset.Rmd with tabset comparisons and known diffs - Gitignore LOLACore database and ext/ test data - Remove test_dropin.Rmd - Regenerate roxygen docs via rextendr::document() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Igd::from_named_region_sets (takes &RegionSet references) instead of from_region_sets (takes copied (String, i32, i32) tuples), avoiding a full allocation pass over all regions during database loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add shift, flank, resize, narrow, disjoin, gaps, and intersect subcommands to match R binding coverage of IntervalRanges methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LOLA Benchmarks & Test CoverageSetup
Validation against R LOLAAll 7 checks PASS:
Performance
Odds ratioUses the Fisher conditional maximum likelihood estimate (Brent root-finding on the noncentral hypergeometric distribution), matching R's Test coverage
Other work on this branchOverlap trait unificationResolved E0034 ambiguity between
Tourney fixes (genomicdist)8 original tourney failures addressed:
IntervalRanges
R binding fixes
|
Resolve conflicts from gdist-edits PR (#244) merge: - regionset_ops: Keep min_overlap on RegionSetOverlaps trait (needed by R bindings), delegate to MCO batch methods for fast path (min_bp <= 1), fall back to in-place iteration with overlap-size filtering for min_bp > 1. Add 7 new tests for min_overlap edge cases. - genomicdist.rs: Combine gtars-lola's chrom_sizes params and no-reduce partition fix with dev's checked_u32 input validation. - multi_chrom_overlapper.rs: Fix MCO consistency test to pass min_overlap. - region_set.rs (Python): Update callers for min_overlap signature. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…API safety Batch 1 (correctness): - LOLA: Replace saturating_sub with validated subtraction + NegativeContingency error - IGD: Guard from_region_sets/from_named_region_sets against end < start intervals - LOLA: Use from_named_region_sets in merge() to avoid tuple copy overhead - LOLA: Warn on skipped BED files instead of silent continue Batch 2 (API safety): - IGD: Guard add() against negative coordinates - IGD: Clamp negative query coords in count_overlaps() - IGD: Remove chr-prefix filter from parse_bed_line (support non-UCSC assemblies) - IGD: Remove 321M coordinate cap from from_bed_dir Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c comments Batch 3 (code quality): - IGD: Merge validation into parse pass, eliminating double file read - LOLA: Remove unused EmptyUserSet error variant - IGD: Fix TSV format (remove extra spaces, write avg_region_width as f64) - genomicdist: Document bp-mode double-counting behavior in partitions - IGD: Add # Panics doc comments for add(), save(), count_overlaps(), find_overlaps_regionset(), count_overlaps_per_query() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Factor out the tile-walk + binary-search + overlap-check pattern that was copy-pasted across count_overlaps, find_overlaps_regionset, and count_overlaps_per_query into a single walk_tile_overlaps method that accepts a closure for per-hit processing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs
- Fix resize("center") midpoint overflow and chromosome_statistics width
sum overflow (u32→u64 accumulator)
- Fix boxplot_stats NaN panic and narrow() underflow with saturating_sub
- Replace disjoin() O(n²) coverage check with O(n log n) sweep-line
- Fix NaN odds ratio ranking (worst rank instead of arbitrary placement)
- Add FDR correction to Python and R LOLA bindings
- Fix R LOLA contingency values (i32→f64) to avoid truncation
- Fix R io.rs unwrap → proper error propagation
- Fix R genomePartitionList: cast f64 coordinates to integer for stranded path
- Revert R output types for counts/widths/distances back to i32 (only
coordinates need f64 for u32 range)
- Fix R r_narrow type error propagation
- Remove dead sample_odds_ratio, leftover println
- Add IGD hits debug_assert, contigs pub(crate) with accessor
- Document IGD i32 coordinate limit and uniwig sorted-input requirement
- Add regression tests for overflow, NaN, depletion direction
- Add Python LOLA binding tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch CLI handlers and R bindings from legacy create_igd_f/igd_search to the new unified Igd struct. Add Igd::from_bed_files() constructor that accepts explicit file paths, enabling directory, .txt filelist, and stdin input modes. R search now returns structured data (named list) instead of TSV strings. Remove dead CLI search flags that were never wired up. Legacy create.rs/search.rs kept in place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@nsheff I may want to merge gtars-lola into dev today. An earlier round of implementing used genomicdist as a lola dep using genomicdist intervalranges trait findoverlaps and countoverlaps (which used igd vs. ailist in overlaprs), but since the discussion yesterday I removed find/countoverlaps from genomicdist and updated gtars-lola to use gtars-igd directly, and the R bindings now use the overlaprs trait versions for those granges operations. So in addition to adding the lola crate, there are quite a few bug fixes to genomicdist in this branch itself because I was testing genomicdist and lola concurrently. |
- Add Python 3.12 + uv + maturin + pytest steps to CI workflow - Create proper testthat framework with 4 test files (82 tests): test_lola.R (12), test_refget.R (30), test_regionset.R (28), test_genomicdist.R (12) — all using tracked fixture data - Add tracked LOLA test fixture (lola_multi_db from LOLA R extdata) - Migrate ad-hoc R test scripts into tests/testthat/, remove originals - Move Rmd examples to tests/examples/, update paths - Update README with LOLA/IGD mentions, add LOLA Python example - Add testthat to DESCRIPTION Suggests - Clean up .gitignore for new directory structure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates the unified Igd struct against legacy igd_t/igd_t_from_disk using real genomic data (6 BED files: CpG islands, lamin B1 LADs, VISTA enhancers across 2 collections) for broader coverage before merge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.