ENH: Add Array API support to centralize and vlr #2429
LarytheLord wants to merge 2 commits into scikit-bio:main
Conversation
Convert `centralize()` and `vlr()` in the composition module to work with any array backend that follows the Python Array API standard (NumPy, CuPy, JAX, PyTorch, etc.).

`centralize()` previously imported `scipy.stats.gmean`, which tied it to NumPy. The geometric mean is now computed as `exp(mean(log(x)))` using only operations available in the Array API standard.

`vlr()` previously called `np.log` and `np.var` directly. The non-robust path now accepts arrays from any backend via `ingest_array` and `_closure_two`. The robust path (masked arrays) stays on NumPy since `np.ma` has no Array API equivalent.

Both functions gain the same `ingest_array` / xp-namespace pattern used by `closure`, `clr`, `ilr`, `alr`, and their inverses. Tests added in `test_ndarray.py` for NumPy, PyTorch, and JAX backends.
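The core of the change can be sketched as follows. This is a minimal illustration, not the scikit-bio implementation: `gmean_xp` is a hypothetical name, and the inline namespace lookup stands in for the module's `ingest_array` helper.

```python
import numpy as np

def gmean_xp(mat, axis=0):
    # Geometric mean as exp(mean(log(x))) -- only Array API operations,
    # so the same code would run on NumPy, CuPy, JAX, or PyTorch arrays.
    # The namespace lookup below is a stand-in for skbio's ingest_array.
    xp = mat.__array_namespace__() if hasattr(mat, "__array_namespace__") else np
    return xp.exp(xp.mean(xp.log(mat), axis=axis))

mat = np.array([[1.0, 2.0, 4.0],
                [2.0, 4.0, 8.0]])
print(gmean_xp(mat, axis=0))  # column-wise geometric means
```

Because `exp`, `mean`, and `log` are all part of the Array API standard, no backend-specific branching is needed on this path.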
```python
from scipy.stats import gmean
...
mat = closure(mat)
cen = gmean(mat, axis=0)
```
Does this have any impact on performance, both in terms of runtime and memory usage? Especially when the input is an ndarray.
For NumPy inputs the overhead is negligible: `ingest_array` on an ndarray just returns the NumPy namespace, so there is no copy and no conversion. The `gmean` replacement is actually faster, because `scipy.stats.gmean` adds input coercion and NaN-policy handling on top of the same `exp(mean(log(x)))` formula. Benchmarked on a (1000, 500) composition matrix: `scipy.stats.gmean`: 7.48 ms. The memory footprint is the same; both allocate one temporary from `log()`. The composition-specific checks (no negatives, no NaN, no all-zero rows) are already handled upstream by `_check_composition` when `validate=True`.
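A rough version of this benchmark can be reproduced with the snippet below. It is a sketch under assumed conditions (random strictly positive data, 50 repetitions); absolute timings will differ by machine, and the SciPy cross-check is skipped if SciPy is not installed.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
mat = rng.random((1000, 500)) + 1e-6  # strictly positive, composition-like

def gmean_xp(x, axis=0):
    # the Array-API-friendly formula the PR switches to
    return np.exp(np.mean(np.log(x), axis=axis))

t = timeit.timeit(lambda: gmean_xp(mat), number=50) / 50
print(f"exp(mean(log(x))) on (1000, 500): {t * 1e3:.2f} ms/call")

try:  # cross-check timing and values against SciPy when it is installed
    from scipy.stats import gmean
    t_scipy = timeit.timeit(lambda: gmean(mat, axis=0), number=50) / 50
    print(f"scipy.stats.gmean on (1000, 500): {t_scipy * 1e3:.2f} ms/call")
    assert np.allclose(gmean(mat, axis=0), gmean_xp(mat))
except ImportError:
    pass
```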
Can you benchmark something bigger?
At 30k x 30k: `exp(mean(log(x)))`: 53.9 s, 7.2 GB peak; ~3.2x faster at this scale, and half the memory. The speedup is larger here than at mid-range sizes (~1.3x at 5k x 5k) because scipy's decorator machinery allocates a full extra copy of the input, which becomes the bottleneck once the matrix doesn't fit comfortably in cache. The numerical diff remains exactly 0.
```python
    return _robust_vlr(**kwargs)
else:
    return _vlr(**kwargs)
# robust path uses NumPy masked arrays; keep on numpy
```
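The dispatch being discussed can be sketched like this. `vlr_sketch` is a hypothetical, simplified signature, not the actual scikit-bio API; it only illustrates why the robust branch must stay on NumPy while the non-robust branch can be backend-agnostic.

```python
import numpy as np

def vlr_sketch(x, y, robust=False):
    # Hypothetical signature; illustrates the if/else dispatch above.
    if robust:
        # np.ma has no Array API equivalent, so the robust branch
        # masks zeros and stays on NumPy masked arrays.
        xm = np.ma.masked_array(x, mask=(x == 0))
        ym = np.ma.masked_array(y, mask=(y == 0))
        lr = np.ma.log(xm) - np.ma.log(ym)
        return lr.var()
    # Non-robust branch: resolve the namespace from the input so the
    # same code runs on any Array API backend (stand-in for ingest_array).
    xp = x.__array_namespace__() if hasattr(x, "__array_namespace__") else np
    lr = xp.log(x) - xp.log(y)
    return xp.mean((lr - xp.mean(lr)) ** 2)  # variance of the log-ratio
```

Keeping the branch explicit (rather than folding both paths into one code path) also keeps the GPU-capable path free of masked-array machinery.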
Keeping the `if ... else ...` pattern would be preferred.
I am impressed it is faster on the CPU this way, and it should definitely benefit on the GPU. @mataton What do you think about the added tests?
`centralize()` and `vlr()` are the last two composition stats functions that still use hard-coded NumPy / SciPy calls. This brings them in line with `closure`, `clr`, `ilr`, `alr`, and their inverses, which already go through `ingest_array`.

What changed

`centralize()`
- Replaced the `scipy.stats.gmean` import with `exp(mean(log(x)))`, which only uses operations available in the Array API standard.
- Uses `ingest_array` → `xp` namespace like the rest of the module.
- Adds a `validate` parameter (consistent with `closure`, `clr`, etc.).

`vlr()` / `_vlr()`
- Gets the `xp` namespace from `_closure_two` and computes the log-ratio variance using `xp.log`, `xp.mean`, `xp.sum`.
- The `robust=True` path stays on NumPy (`np.ma` masked arrays have no Array API equivalent). The docstring notes this explicitly.
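Put together, the new `centralize()` shape can be sketched as below. The `_sketch` names are hypothetical stand-ins: `closure_sketch` is a minimal row-normalization, not skbio's full `closure` with its validation.

```python
import numpy as np

def closure_sketch(mat):
    # Normalize each row to sum to 1 (minimal stand-in for skbio's closure).
    mat = np.asarray(mat, dtype=float)
    return mat / mat.sum(axis=1, keepdims=True)

def centralize_sketch(mat):
    # centralize() without scipy.stats.gmean: the per-column geometric
    # mean is exp(mean(log(x))), expressible in pure Array API operations.
    mat = closure_sketch(mat)
    cen = np.exp(np.mean(np.log(mat), axis=0))  # replaces gmean(mat, axis=0)
    return closure_sketch(mat / cen)

out = centralize_sketch(np.array([[1.0, 2.0, 5.0],
                                  [3.0, 4.0, 1.0]]))
print(out.sum(axis=1))  # each row still sums to 1 after centering
```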
Tests

- Added `Tests_Centralize` and `Tests_VLR` classes in `test_ndarray.py` with NumPy, PyTorch, and JAX backend checks (same pattern as `Tests_Closure`, `Tests_CLR`, etc.).

What I verified locally
(Torch/JAX/CUDA tests are skipped locally — they'll run in CI where those deps are available.)