[WIP] Implement PCA on sparse noncentered data #24415
andportnoy wants to merge 37 commits into scikit-learn:main from
Conversation
This test is expected to fail at the moment. I will expand test coverage in the future.
Ran into an issue with the
This is an intermediate commit with a lot of debug print code. All tests are passing though.
I dodged the issue by using the transpose identity. That enables randomized SVD in addition to ARPACK.
I'll need to squash these intermediate commits later.
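For reference, the transpose identity in question is: if X = UΣVᵀ, then Xᵀ = VΣUᵀ, so the SVD can be run on Xᵀ and the roles of the singular vector matrices swapped afterwards. A quick dense NumPy illustration (my own sketch, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 30))

# SVD of X and of its transpose.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U2, S2, Vt2 = np.linalg.svd(X.T, full_matrices=False)

# Singular values are identical; left/right singular vectors swap roles
# (up to per-component sign flips, hence the abs comparison).
assert np.allclose(S, S2)
assert np.allclose(np.abs(U), np.abs(Vt2.T))
assert np.allclose(np.abs(Vt), np.abs(U2.T))
```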
Is this PR still WIP? What remains to be done?
Here are some suggestions to move it forward:
- test with larger data than iris (e.g. a few hundred data points and features);
- use the `global_random_seed` fixture in the new test (see #22827 for more details);
- parametrize the new test to also check with `whiten` set to `True`;
- please also check that transforming a batch of random test data points (ideally not from the training set) yields the same result with `assert_allclose`;
- check that it's possible to call transform on a dense array of points on a model that was trained with sparse data and vice versa;
- document the change in the changelog for 1.2 (we will move it to 1.3 if the PR is not ready to merge by then).
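A dense-only sketch of how the suggested `assert_allclose` check could be structured, comparing two solvers that already exist on `main` and transforming points not seen during fit. The variable names are mine and the PR's actual test will differ:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.standard_normal((300, 80))      # a few hundred points and features, as suggested
X_new = rng.standard_normal((20, 80))   # a batch of points not from the training set

ref = PCA(n_components=5, svd_solver="full").fit(X)
other = PCA(n_components=5, svd_solver="arpack", random_state=0).fit(X)

# Singular vectors are only defined up to sign, so align signs before comparing.
signs = np.sign(np.sum(ref.components_ * other.components_, axis=1))
assert_allclose(ref.components_, signs[:, None] * other.components_,
                rtol=1e-7, atol=1e-9)
assert_allclose(ref.transform(X_new), other.transform(X_new) * signs,
                rtol=1e-7, atol=1e-9)
```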
@ogrisel Thank you so much for taking a look and for the suggestions, I will implement those. I was also planning to add support for LOBPCG and PROPACK as sparse SVD methods. That could go in via this PR or as a follow up. When is the merge window closing for 1.2?
Soonish I think :) /cc @jeremiedbb
Uh oh. A couple of days?
@ogrisel Let me know if I interpreted the suggestions correctly, I put a TODO list at the top of the PR.
Force-pushed 76f7f32 to 1dff900
(re the force push) Had to remove some unwanted commits that were pulled from main directly rather than via a merge commit.
@ogrisel Only 2080 out of 16000 tests are passing when testing on 400x300 random sparse matrices of varying densities across the 100 global random seeds.

Command:

```shell
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse > test-pca-sparse-all-seeds.log
```

Test matrix: parametrized over seed, solver, layout, number of components, and density (the CSV columns below).

Looking at some of the results manually, the errors are due to 1-2% of elements mismatching; I'll try to gather better statistics on that in particular. Below is a high-level breakdown of the pass rate by parameter.

Plot repro:

```shell
grep -P 'PASSED|FAILED' test-pca-sparse-all-seeds.log \
  | sed -E -e 's/^.*(FAILED|PASSED).*\[(.*)\]/\2 \1/' -e 's/-/ /g' -e 's/ $//' -e 's/ /,/g' \
  > test-pca-sparse-all-seeds.csv
```

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(
    'test-pca-sparse-all-seeds.csv',
    header=None,
    names=['seed', 'solver', 'layout', 'ncomp', 'density', 'outcome']
)
df['pass'] = df.outcome == 'PASSED'

def passrate_by(x):
    # Fraction of passing tests within each group.
    return df.groupby(x)['pass'].mean()

fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=200)

seed = passrate_by('seed').hist(ax=axes[0][0])
seed.set_title('pass rate by seed (histogram)')
seed.set_xlabel('pass rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)

solver = passrate_by('solver').plot.bar(ax=axes[0][1])
solver.set_title('pass rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('pass rate')

density = passrate_by('density').plot.bar(ax=axes[1][0])
density.set_title('pass rate by density')
density.set_xlabel('density')
density.set_ylabel('pass rate')

ncomp = passrate_by('ncomp').plot.bar(ax=axes[1][1])
ncomp.set_title('pass rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('pass rate')

for bp in (solver, density, ncomp):
    bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
    bp.set_ylim(top=1)

fig.tight_layout()
fig.savefig('test-pca-sparse-pass-rate.png', facecolor='white', transparent=False)
```
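To quantify the "1-2% of elements mismatching", one option is to count entries that fall outside `np.isclose` tolerance. A small helper sketch (my own, not part of the PR):

```python
import numpy as np

def mismatch_fraction(a, b, rtol=1e-7, atol=1e-9):
    """Fraction of elements where two arrays disagree beyond tolerance."""
    return np.mean(~np.isclose(a, b, rtol=rtol, atol=atol))

# Toy example: 2 of 100 entries perturbed well beyond tolerance.
a = np.ones(100)
b = a.copy()
b[:2] += 1.0
assert mismatch_fraction(a, b) == 0.02
assert mismatch_fraction(a, a) == 0.0
```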
Updates are posted in the linked issue #12794.
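As I understand the approach discussed in #12794, the key is to center the data implicitly, so the sparse matrix is never densified: wrap X in a `LinearOperator` whose matvec/rmatvec subtract the column-mean contribution on the fly. A minimal sketch with SciPy (my own code and names, not the PR's implementation):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

X = sparse_random(400, 300, density=0.05, random_state=0, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (300,)

# (X - 1 mu^T) as a LinearOperator; matvec/rmatvec never materialize
# the dense centered matrix.
Xc = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X @ v.ravel() - mu @ v.ravel(),      # (X - 1 mu^T) v
    rmatvec=lambda v: X.T @ v.ravel() - mu * v.sum(),     # (X - 1 mu^T)^T v
    dtype=np.float64,
)

U, S, Vt = svds(Xc, k=5)

# Check against an explicitly centered dense SVD (top 5 singular values).
S_dense = np.linalg.svd(X.toarray() - mu, compute_uv=False)[:5]
assert np.allclose(np.sort(S), np.sort(S_dense))
```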
Previously it was completely ignored and as a result defaulted to 0.01.

Will fix #12794 when complete.
TODOs
- parametrize the new test to also check with `whiten` set to `True`
- use the `global_random_seed` fixture in the new test
- run `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py::test_pca_sparse`
- check `.transform` on dense input for a model trained on sparse data and vice versa