Skip to content

add gtars-lola crate, support in-memory db in IGD struct#242

Merged
sanghoonio merged 33 commits intodevfrom
gtars-lola
Mar 12, 2026
Merged

add gtars-lola crate, support in-memory db in IGD struct#242
sanghoonio merged 33 commits intodevfrom
gtars-lola

Conversation

@sanghoonio
Copy link
Copy Markdown
Member

No description provided.

sanghoonio and others added 6 commits March 8, 2026 13:29
…gine

Phase 0: New gtars-igd/src/igd.rs with unified Igd struct that supports
both construction and querying in-memory, replacing the split between
igd_t (creation-only) and igd_t_from_disk (query-only). Includes builder
APIs (from_bed_dir, from_region_sets), in-memory query with min_overlap
bp filtering, and optional disk persistence compatible with legacy format.
Legacy create.rs/search.rs untouched.

Phase 1: New gtars-lola crate with core LOLA enrichment engine -
ContingencyTable with Fisher's exact test (via statrs Hypergeometric CDF),
run_lola() entry point, and result ranking (pValueLog, oddsRatio, support,
meanRnk, maxRnk).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 2: Universe validation (check_universe_appropriateness with
coverage/many-to-many diagnostics), user set redefinition
(redefine_user_sets matching R LOLA's redefineUserSets), and
restricted universe construction (build_restricted_universe with
interval merging).

Phase 3: RegionDB struct wrapping Igd + original region sets +
annotations. Loads LOLA-format folder structure (collection.txt,
index.txt, regions/*.bed), builds IGD in one pass. Supports
collection filtering, file limits, merge, and BEDbase integration
via from_igd_with_regions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 4: Rust extendr bindings (lola.rs) exposing 7 functions -
loadRegionDB, loadRegionDBFromBeds, runLOLA, listRegionSets,
checkUniverseAppropriateness, redefineUserSets, buildRestrictedUniverse.

R wrappers (lola.R) match original LOLA package signatures for
drop-in compatibility. Accepts RegionSet S4 objects, returns
data.frames with LOLA result columns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 5: PyO3 bindings as feature-gated lola submodule
(from gtars.lola import RegionDB, run_lola). RegionDB class
with from_folder/from_bed_files constructors. Regions passed
as native Python tuples for ergonomic API. Returns column-oriented
dicts for easy DataFrame conversion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements Benjamini-Hochberg FDR correction per user set, TSV output
matching R LOLA format, and adds qValue to Python and R bindings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Load all BED files from regions/ directory, not just index.txt entries
- Use binary per-region overlap counting (count_region_hits) to match
  R LOLA's countOverlaps semantics
- Use survival function sf() instead of 1-cdf() to avoid catastrophic
  cancellation in p-value computation for highly significant results
- Fix R wrapper symbol names and regenerate extendr/roxygen docs
- Add compare_lola.Rmd test script

All contingency table values now match R LOLA exactly. P-values match
to within 2.4e-11 (machine epsilon).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sanghoonio and others added 19 commits March 8, 2026 15:04
Exposes LolaRegionDB, runLOLA, and checkUniverseAppropriateness via
wasm-bindgen for browser-based enrichment testing. All in-memory,
no filesystem access needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3 of interval operations: IGD-backed overlap queries on RegionSet.

- Add Igd::from_single_region_set, find_overlaps_regionset, and
  count_overlaps_per_query to gtars-igd with deduplication across tiles
- Add find_overlaps and count_overlaps to IntervalRanges trait
  (gtars-genomicdist now depends on gtars-igd)
- Add R FFI wrappers and S4 methods importing generics from IRanges
- Refactor gtars-lola universe.rs to use RegionSet methods instead of
  hand-rolled overlap code (~80 lines removed)
- All 245 Rust tests and 7/7 LOLA comparison checks pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose regiondb_anno(), regiondb_collection_anno(), and
regiondb_region_set() so R consumers can access annotations and
individual region sets from a gtars-loaded RegionDB without needing
to re-read from disk. This enables episcope and other packages to
use gtars as a drop-in backend while retaining access to metadata
needed by post-enrichment functions like extractEnrichmentOverlaps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	gtars-genomicdist/src/interval_ranges.rs
#	gtars-python/Cargo.toml
…tions

- gaps(): add leading gap from position 0 to first region per chromosome,
  matching R's gaps() behavior (Fix 1)
- Add CoordinateMode enum (Bed/GRanges) in gtars-core for coordinate
  convention differences between R (1-based closed) and BED (0-based half-open)
- Region::mid_point_with_mode() uses banker's rounding for GRanges mode (Fix 3)
- calc_summary_signal() formats labels with 1-based start in GRanges mode (Fix 2)
- Thread CoordinateMode through TssIndex, calc_tss_distances,
  calc_feature_distances; R bindings use GRanges, all others default to Bed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- calc_neighbor_distances: sort regions by (start, end) per chromosome
  before windowing; filter out non-positive distances (Fix 4)
- calc_nearest_neighbors: sort regions per chromosome; skip single-region
  chromosomes instead of pushing u32::MAX sentinels (Fix 5)
- Update tests to reflect new behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add region_distribution_with_chrom_sizes() that creates bins based on
  each chromosome's actual size instead of uniform global bin size (Fix 6)
- R binding accepts optional chrom_names/chrom_lengths parameters
- Tourney dispatch passes chromSizes from getChromSizes("hg19") to gtars
- R distribution() generic gains optional chromSizes parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
calc_expected_partitions now computes partition genome fractions using
priority resolution: each genomic bp is assigned to the first partition
that covers it (via reduce + setdiff), preventing double-counting when
partitions overlap. This matches R's approach. (Fix 7)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fall back to collection-level description when file-level description
  is empty, matching R LOLA's behavior (Fix 8)
- Add 'size' column (region count per file) to regionDBAnno output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sums

Three bugs in calcExpectedPartitions that caused expected value mismatches:

1. Strand-aware setdiff for proximal promoters: into_regionset() was called
   before setdiff(), dropping strand info. R's setdiff is strand-aware, so
   +/- strand promoters at the same position were incorrectly subtracted.

2. trim() silently dropped regions on chromosomes not in chrom_sizes (e.g.,
   chrMT vs chrM naming mismatch). Now keeps such regions unchanged, matching
   R's trim() behavior.

3. R binding pre-reduced genes before computing promoters, collapsing
   overlapping genes and losing their individual promoters. R computes
   promoters(rawGenes) first, then reduces. Removed the pre-reduce.

Also adds promoters_stranded() to preserve strand through the promoter
pipeline, documents the R neighbordt() bug in an Rmd, and updates the
dispatch to pass chromSizes for calcExpectedPartitions.

Results: 5/8 tourney functions exact match, 3 intentional divergences
(2 where gtars is more correct than R, 1 deliberate design choice).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- IGD: 3 new tests comparing old vs new API (single-file, multi-file,
  disk format compatibility)
- LOLA: 10 new tests for enrichment edge cases, FDR correction, and
  database loading
- gtars-py: disambiguate count_overlaps/find_overlaps where both
  IntervalRanges and RegionSetOverlaps traits provide them

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… RegionSetOverlaps

- Remove count_overlaps/find_overlaps from IntervalRanges trait (genomicdist)
- Remove gtars-igd dependency from gtars-genomicdist
- Add min_overlap: Option<i32> to all 4 RegionSetOverlaps methods (overlaprs)
- Rewrite universe.rs to accept pre-built &Igd instead of rebuilding per call
- Switch R bindings to RegionSetOverlaps with min_overlap for countOverlaps/findOverlaps
- Update Python, WASM bindings for new signatures
- Fix loadRegionDB description fallback to use collection folder name (matches R)

Resolves E0034 trait ambiguity in gtars-py. Tourney: 24/27 pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ings

- Fix iter_chroms() to preserve insertion order instead of HashSet
  non-deterministic order (fixes GC content correlation bug)
- Fix S4 dispatch: register methods for character/data.frame instead of
  ANY for Bioconductor generics (narrow, shift, etc.) to avoid hijacking
  GRanges dispatch
- Switch LOLA odds ratio from sample OR (a*d)/(b*c) to Fisher conditional
  MLE matching R's fisher.test()$estimate, using Brent root-finding on the
  noncentral hypergeometric distribution
- Add missing IntervalRanges wasm bindings (shift, flank, resize, narrow,
  disjoin, gaps, intersect) to match R bindings
- Update compare_lola.Rmd for LOLACore compatibility and oddsRatio check
- Update tutorial_regionset.Rmd with tabset comparisons and known diffs
- Gitignore LOLACore database and ext/ test data
- Remove test_dropin.Rmd
- Regenerate roxygen docs via rextendr::document()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Igd::from_named_region_sets (takes &RegionSet references) instead of
from_region_sets (takes copied (String, i32, i32) tuples), avoiding a
full allocation pass over all regions during database loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add shift, flank, resize, narrow, disjoin, gaps, and intersect
subcommands to match R binding coverage of IntervalRanges methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio
Copy link
Copy Markdown
Member Author

LOLA Benchmarks & Test Coverage

Setup

  • Region database: LOLACore hg19 — 3,012 region sets (all 5 collections: cistrome_cistrome, cistrome_epigenome, codex, encode_segmentation, encode_tfbs)
  • Query: LOLA's built-in sample_input — 2 user sets (3,142 and 5,831 regions), universe of 26,829 regions
  • Result: 6,024 rows (2 user sets × 3,012 DB sets)

Validation against R LOLA

All 7 checks PASS:

Check Status
Database filenames match PASS
Support values match (exact) PASS
b values match (exact) PASS
c values match (exact) PASS
d values match (exact) PASS
pValueLog correlation > 0.99 PASS (correlation = 1.0, max diff = 3.5e-11)
oddsRatio max relative diff < 0.1% PASS

Performance

R LOLA gtars
loadRegionDB 18.3s 30.1s
runLOLA 13.1s 0.85s
Total 31.4s 31.0s
  • runLOLA speedup: 14.7x (median over 5 iterations)
  • Load is slower because gtars builds the IGD index upfront; R defers overlap indexing to query time
  • The IGD cost is amortized over repeated queries (multiple user sets, parameter sweeps)

Odds ratio

Uses the Fisher conditional maximum likelihood estimate (Brent root-finding on the noncentral hypergeometric distribution), matching R's fisher.test()$estimate.

Test coverage

  • gtars-lola: 50 tests (enrichment, database loading, universe validation, ranking, contingency table arithmetic, min_overlap, edge cases)
  • gtars-igd: 29 tests (in-memory IGD construction, overlap queries, disk roundtrip)

Other work on this branch

Overlap trait unification

Resolved E0034 ambiguity between IntervalRanges and RegionSetOverlaps — both defined count_overlaps/find_overlaps on RegionSet. Fixed by:

  • Removing overlap methods from IntervalRanges, dropping gtars-igd dep from genomicdist
  • Rewriting universe.rs to use Igd directly (build once, query all user sets)
  • Adding min_overlap parameter to RegionSetOverlaps
  • R bindings switched to RegionSetOverlaps; Python trait ambiguity resolved

Tourney fixes (genomicdist)

8 original tourney failures addressed:

Test Status Fix
gaps PASS Leading gap from position 0 per chromosome
calcSummarySignal PASS CoordinateMode enum — R uses 1-based labels
calcFeatureDist PASS CoordinateMode — R uses banker's rounding midpoints
calcNeighborDist INTENTIONAL gtars more correct — R's neighbordt() has a bug with contained regions (reports spurious positive distance for fully overlapping regions)
calcNearestNeighbors INTENTIONAL gtars computes per-region per-chromosome; R cross-contaminates across chromosome boundaries
calcChromBins INTENTIONAL Different binning design (uniform 250-bin vs R's per-chrom sizing)
calcExpectedPartitions PASS Strand-aware promoter setdiff, trim keeps unknown chroms, pre-reduce fix
loadRegionDB PASS Description fallback to collection folder name, added size column

IntervalRanges

  • Added CoordinateMode enum (Bed/GRanges) in gtars-core/src/models/coords.rs, defaults to Bed so existing core crate consumers are unaffected. Threaded through midpoint and label-formatting functions in genomicdist. This ensures identical biological results regardless of input coordinate notation — the same genomic interval produces the same midpoint and distance values whether described in BED or GRanges conventions.
  • Added 7 missing WASM bindings (shift, flank, resize, narrow, disjoin, gaps, intersect) matching R coverage
  • Added 7 missing CLI subcommands (same set) to gtars ranges

R binding fixes

  • S4 dispatch: ANYcharacter/data.frame signatures for Bioconductor generic compatibility
  • iter_chroms() preserves insertion order (was non-deterministic via HashMap)
  • GRanges coercion fix in LOLA R bindings

sanghoonio and others added 3 commits March 11, 2026 12:12
Resolve conflicts from gdist-edits PR (#244) merge:

- regionset_ops: Keep min_overlap on RegionSetOverlaps trait (needed by
  R bindings), delegate to MCO batch methods for fast path (min_bp <= 1),
  fall back to in-place iteration with overlap-size filtering for min_bp > 1.
  Add 7 new tests for min_overlap edge cases.

- genomicdist.rs: Combine gtars-lola's chrom_sizes params and no-reduce
  partition fix with dev's checked_u32 input validation.

- multi_chrom_overlapper.rs: Fix MCO consistency test to pass min_overlap.

- region_set.rs (Python): Update callers for min_overlap signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…API safety

Batch 1 (correctness):
- LOLA: Replace saturating_sub with validated subtraction + NegativeContingency error
- IGD: Guard from_region_sets/from_named_region_sets against end < start intervals
- LOLA: Use from_named_region_sets in merge() to avoid tuple copy overhead
- LOLA: Warn on skipped BED files instead of silent continue

Batch 2 (API safety):
- IGD: Guard add() against negative coordinates
- IGD: Clamp negative query coords in count_overlaps()
- IGD: Remove chr-prefix filter from parse_bed_line (support non-UCSC assemblies)
- IGD: Remove 321M coordinate cap from from_bed_dir

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c comments

Batch 3 (code quality):
- IGD: Merge validation into parse pass, eliminating double file read
- LOLA: Remove unused EmptyUserSet error variant
- IGD: Fix TSV format (remove extra spaces, write avg_region_width as f64)
- genomicdist: Document bp-mode double-counting behavior in partitions
- IGD: Add # Panics doc comments for add(), save(), count_overlaps(),
  find_overlaps_regionset(), count_overlaps_per_query()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sanghoonio and others added 3 commits March 11, 2026 12:49
Factor out the tile-walk + binary-search + overlap-check pattern that was
copy-pasted across count_overlaps, find_overlaps_regionset, and
count_overlaps_per_query into a single walk_tile_overlaps method that
accepts a closure for per-hit processing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs

- Fix resize("center") midpoint overflow and chromosome_statistics width
  sum overflow (u32→u64 accumulator)
- Fix boxplot_stats NaN panic and narrow() underflow with saturating_sub
- Replace disjoin() O(n²) coverage check with O(n log n) sweep-line
- Fix NaN odds ratio ranking (worst rank instead of arbitrary placement)
- Add FDR correction to Python and R LOLA bindings
- Fix R LOLA contingency values (i32→f64) to avoid truncation
- Fix R io.rs unwrap → proper error propagation
- Fix R genomePartitionList: cast f64 coordinates to integer for stranded path
- Revert R output types for counts/widths/distances back to i32 (only
  coordinates need f64 for u32 range)
- Fix R r_narrow type error propagation
- Remove dead sample_odds_ratio, leftover println
- Add IGD hits debug_assert, contigs pub(crate) with accessor
- Document IGD i32 coordinate limit and uniwig sorted-input requirement
- Add regression tests for overflow, NaN, depletion direction
- Add Python LOLA binding tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch CLI handlers and R bindings from legacy create_igd_f/igd_search
to the new unified Igd struct. Add Igd::from_bed_files() constructor
that accepts explicit file paths, enabling directory, .txt filelist,
and stdin input modes. R search now returns structured data (named list)
instead of TSV strings. Remove dead CLI search flags that were never
wired up. Legacy create.rs/search.rs kept in place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio
Copy link
Copy Markdown
Member Author

@nsheff I may want to merge gtars-lola into dev today. An earlier round of implementing used genomicdist as a lola dep using genomicdist intervalranges trait findoverlaps and countoverlaps (which used igd vs. ailist in overlaprs), but since the discussion yesterday I removed find/countoverlaps from genomicdist and updated gtars-lola to use gtars-igd directly, and the R bindings now use the overlaprs trait versions for those granges operations. So in addition to adding the lola crate, there are quite a few bug fixes to genomicdist in this branch itself because I was testing genomicdist and lola concurrently.

sanghoonio and others added 2 commits March 11, 2026 22:22
- Add Python 3.12 + uv + maturin + pytest steps to CI workflow
- Create proper testthat framework with 4 test files (82 tests):
  test_lola.R (12), test_refget.R (30), test_regionset.R (28),
  test_genomicdist.R (12) — all using tracked fixture data
- Add tracked LOLA test fixture (lola_multi_db from LOLA R extdata)
- Migrate ad-hoc R test scripts into tests/testthat/, remove originals
- Move Rmd examples to tests/examples/, update paths
- Update README with LOLA/IGD mentions, add LOLA Python example
- Add testthat to DESCRIPTION Suggests
- Clean up .gitignore for new directory structure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates the unified Igd struct against legacy igd_t/igd_t_from_disk
using real genomic data (6 BED files: CpG islands, lamin B1 LADs, VISTA
enhancers across 2 collections) for broader coverage before merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sanghoonio sanghoonio merged commit a0645eb into dev Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant