
ENH: Add Array API support to centralize and vlr#2429

Open
LarytheLord wants to merge 2 commits into scikit-bio:main from
LarytheLord:enh/array-api-centralize-vlr

Conversation

@LarytheLord

centralize() and vlr() are the last two composition stats functions that still use
hard-coded NumPy / SciPy calls. This brings them in line with closure, clr, ilr,
alr, and their inverses, which already go through ingest_array.

What changed

centralize()

  • Replaced the scipy.stats.gmean import with exp(mean(log(x))), which only uses
    operations available in the Array API standard.
  • Obtains the xp namespace via ingest_array, like the rest of the module.
  • Added a validate parameter (consistent with closure, clr, etc.).
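For reference, the replacement geometric mean can be sketched as below. Here xp stands for whatever namespace ingest_array returns, and gmean_axis0 is an illustrative name, not the actual skbio internal:

```python
import numpy as np

def gmean_axis0(mat, xp=np):
    # exp(mean(log(x))) is algebraically identical to scipy.stats.gmean
    # for strictly positive input, and uses only operations defined in
    # the Array API standard (log, mean, exp).
    return xp.exp(xp.mean(xp.log(mat), axis=0))

mat = np.array([[1.0, 4.0], [4.0, 16.0]])
print(gmean_axis0(mat))  # [2. 8.]
```

Passing a different conforming namespace (e.g. the one array-api-compat returns for a torch tensor) as xp is all it takes to run this on another backend.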

vlr() / _vlr()

  • The non-robust path now receives the xp namespace from _closure_two and computes
    log-ratio variance using xp.log, xp.mean, xp.sum.
  • The robust=True path stays on NumPy (np.ma masked arrays have no Array API
    equivalent). The docstring notes this explicitly.
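A minimal sketch of the non-robust log-ratio variance using only those xp operations; vlr_pair is a hypothetical helper for illustration, not the actual _vlr signature:

```python
import numpy as np

def vlr_pair(x, y, xp=np):
    # Log-ratio variance of two components across samples: var(log(x / y)),
    # written with xp.log / xp.mean / xp.sum only, so any Array API
    # backend can supply xp.
    r = xp.log(x) - xp.log(y)
    m = xp.mean(r)
    return xp.sum((r - m) ** 2) / r.shape[0]  # population variance

x = np.array([2.0, 4.0, 8.0])
y = np.array([1.0, 2.0, 2.0])
print(vlr_pair(x, y))
```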

Tests

  • Added Tests_Centralize and Tests_VLR classes in test_ndarray.py with NumPy,
    PyTorch, and JAX backend checks (same pattern as Tests_Closure, Tests_CLR, etc.).
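As a rough illustration of the invariant such tests can assert on the NumPy backend (the real classes follow the existing Tests_Closure pattern in test_ndarray.py; everything here is illustrative):

```python
import numpy as np

# After dividing by the per-column geometric mean, every column's
# geometric mean becomes 1 -- a cheap sanity check for centralize().
mat = np.array([[1.0, 4.0], [4.0, 16.0]])
gm = np.exp(np.mean(np.log(mat), axis=0))  # [2., 8.]
centered = mat / gm
np.testing.assert_allclose(
    np.exp(np.mean(np.log(centered), axis=0)), np.ones(2)
)
```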

What I verified locally

$ python -m pytest skbio/stats/composition/tests/test_base.py \
                   skbio/stats/composition/tests/test_ndarray.py -q
40 passed, 40 skipped

(Torch/JAX/CUDA tests are skipped locally — they'll run in CI where those deps are available.)

Convert centralize() and vlr() in the composition module to work with
any array backend that follows the Python Array API standard (NumPy,
CuPy, JAX, PyTorch, etc.).

centralize() previously imported scipy.stats.gmean, which tied it to
NumPy. The geometric mean is now computed as exp(mean(log(x))) using
only operations available in the Array API standard.

vlr() previously called np.log and np.var directly. The non-robust path
now accepts arrays from any backend via ingest_array and _closure_two.
The robust path (masked arrays) stays on NumPy since np.ma has no Array
API equivalent.

Both functions gain the same ingest_array / xp-namespace pattern used by
closure, clr, ilr, alr, and their inverses.

Tests added in test_ndarray.py for NumPy, PyTorch, and JAX backends.
from scipy.stats import gmean

mat = closure(mat)
cen = gmean(mat, axis=0)
Contributor


Does this have any impact on performance?
Both in terms of runtime and memory usage?

Especially when input is nparray.

@LarytheLord
Author

For NumPy inputs the overhead is negligible: ingest_array on an ndarray just returns the NumPy namespace, so there is no copy and no conversion.

The gmean replacement is actually faster because scipy.stats.gmean adds input coercion and NaN policy handling on top of the same exp(mean(log(x))) formula. Benchmarked on a (1000, 500) composition matrix:

scipy.stats.gmean: 7.48 ms
exp(mean(log(x))): 1.59 ms (~4.7x faster)
Max abs diff: 0.00e+00

Memory footprint is the same: both allocate one temporary from log(). The composition-specific checks (no negatives, no NaN, no all-zero rows) are already handled upstream by _check_composition when validate=True.
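A reproducible sketch of this benchmark (absolute timings are machine-dependent, so treat the numbers above as one data point):

```python
import timeit
import numpy as np
from scipy.stats import gmean

# Strictly positive test matrix, roughly the shape discussed above.
rng = np.random.default_rng(0)
mat = np.abs(rng.standard_normal((1000, 500))) + 1e-6

t_scipy = timeit.timeit(lambda: gmean(mat, axis=0), number=20)
t_plain = timeit.timeit(lambda: np.exp(np.mean(np.log(mat), axis=0)), number=20)
print(f"scipy.stats.gmean: {t_scipy:.4f}s  exp(mean(log)): {t_plain:.4f}s")

# The two formulas agree for positive input.
diff = np.max(np.abs(gmean(mat, axis=0) - np.exp(np.mean(np.log(mat), axis=0))))
print(f"max abs diff: {diff:.2e}")
```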

@sfiligoi
Contributor

Can you benchmark something bigger?
Say 30k x 30k?

@LarytheLord
Author

At 30k x 30k:

exp(mean(log(x))): 53.9 s, 7.2 GB peak
scipy.stats.gmean: 173.2 s, 14.4 GB peak

~3.2x faster at this scale, half the memory. The speedup is larger here than at mid-range sizes (~1.3x at 5k x 5k) because scipy's decorator machinery allocates a full extra copy of the input, which becomes the bottleneck once the matrix doesn't fit comfortably in cache. Numerical diff remains exactly 0.

    return _robust_vlr(**kwargs)
else:
    return _vlr(**kwargs)
# robust path uses NumPy masked arrays; keep on numpy
Contributor


Keeping the if ... else ... pattern would be preferred.

@sfiligoi
Contributor

I am impressed it is faster on the CPU this way, and it should definitely benefit on the GPU.

@mataton What do you think about the added tests?
How do they fit in with the changes you are already working on?

