PCA with sparse data #393

@VolkerBergen

Description

Right now we get different PCs when using sparse data (applying TruncatedSVD) compared to using dense data (applying PCA).

TruncatedSVD(X - X.mean(0)) would be equivalent to PCA(X), but X - X.mean(0) is obviously no longer sparse, which is why it is currently implemented as TruncatedSVD(X). Without centering, the first PC mainly represents the vector of means and is therefore very different from the first component of zero-centered PCA. The following components approximately resemble PCA, but since all subsequent PCs are constrained to be orthogonal to that first one, we never reach the exact solution. Hence the PCs are questionable, in particular because the very first ones are quite misleading.
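The effect is easy to reproduce with scikit-learn directly (a minimal sketch on synthetic count-like data, not scanpy's actual code path):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 20)).astype(float)  # synthetic non-negative "counts"
Xs = sparse.csr_matrix(X)

# TruncatedSVD on the raw sparse matrix: no centering happens.
tsvd = TruncatedSVD(n_components=5, algorithm="arpack").fit(Xs)

# The first component is almost parallel to the vector of column means.
mean = X.mean(axis=0)
cos = abs(tsvd.components_[0] @ mean) / (
    np.linalg.norm(tsvd.components_[0]) * np.linalg.norm(mean)
)
print(f"cosine(first component, mean vector) = {cos:.3f}")  # close to 1

# TruncatedSVD on the explicitly centered (dense) matrix matches PCA
# up to the sign of each component.
pca = PCA(n_components=5).fit(X)
tsvd_c = TruncatedSVD(n_components=5, algorithm="arpack").fit(X - mean)
assert np.allclose(np.abs(pca.components_), np.abs(tsvd_c.components_), atol=1e-6)
```

So the uncentered sparse path and the dense PCA path genuinely disagree from the first component on.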

That's not desirable. I think we should obtain the same PCA representation regardless of the data type.

Don't we have to densify X at some point anyway, since we would have to compute X.dot(X.T)? It might therefore be worth thinking about some EM-style approach.
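One alternative to both densifying and an EM scheme is implicit centering: wrap X - mean in a scipy LinearOperator, so an iterative SVD sees the centered matrix without it ever being materialized. A sketch under that assumption (the function name `centered_svds` is mine, not an existing API):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

def centered_svds(X, k):
    """Truncated SVD of X - X.mean(0) without ever densifying X (sketch)."""
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])

    def matvec(v):    # (X - 1 mu^T) v  =  X v - (mu . v) 1
        v = np.ravel(v)
        return X @ v - (mu @ v) * ones

    def rmatvec(u):   # (X - 1 mu^T)^T u  =  X^T u - mu * sum(u)
        u = np.ravel(u)
        return X.T @ u - mu * u.sum()

    op = LinearOperator(X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
    return svds(op, k=k)

# Sanity check against a densified, explicitly centered matrix.
X = sparse.random(50, 30, density=0.3, random_state=1, format="csr")
_, s, _ = centered_svds(X, k=5)
Xd = X.toarray()
s_ref = np.linalg.svd(Xd - Xd.mean(axis=0), compute_uv=False)[:5]
assert np.allclose(np.sort(s)[::-1], s_ref, atol=1e-7)
```

Memory stays O(nnz(X)) plus a few dense vectors, so this would give the exact zero-centered solution even when the densified matrix does not fit into RAM.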

In any case, I think that as long as the data is manageable and fits into RAM, we should just use the densified X.

I don't quite understand line 486 in preprocessing/simple:

if zero_center is not None:
    zero_center = not issparse(adata_comp.X)

It no longer depends on the actual value of the zero_center argument. Is that a bug, or what is the rationale behind it?

For now, we can change that into something like

if zero_center is None:
    zero_center = not (issparse(adata.X) and adata.X.shape[0] > 1e4)
