PCA with sparse data #393
Description
Right now we get different PCs when using sparse data (applying TruncatedSVD) compared to using dense data (applying PCA).
TruncatedSVD(X - X.mean(0)) would be equivalent to PCA(X), but X - X.mean(0) is obviously no longer sparse, which is why it is currently implemented as TruncatedSVD(X). The first PC then mainly represents the vector of means and is therefore very different from the first PC of zero-centered PCA. The following components approximately resemble the PCA components, but since all subsequent PCs are constrained to be orthogonal to the first one, we never recover the exact solution. Hence the PCs are questionable, in particular because the very first ones are quite misleading.
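To make the discrepancy concrete, here is a small sketch (synthetic data, not the scanpy code path) showing that the first TruncatedSVD component of uncentered, non-negative data is nearly parallel to the vector of means:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
# Non-negative counts, so the column means are far from zero.
X = rng.poisson(2.0, size=(100, 20)).astype(float)

pca = PCA(n_components=3).fit(X)                     # centers internally
tsvd = TruncatedSVD(n_components=3).fit(csr_matrix(X))  # no centering

mean_dir = X.mean(0) / np.linalg.norm(X.mean(0))
# Cosine between the first TruncatedSVD component and the mean direction:
cos_mean = abs(tsvd.components_[0] @ mean_dir)
# Cosine between the first PCA and first TruncatedSVD components:
cos_pca = abs(pca.components_[0] @ tsvd.components_[0])
print(cos_mean, cos_pca)
```

On data like this, `cos_mean` comes out close to 1, i.e. the leading uncentered component is essentially the mean vector rather than a direction of variance.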
That's not desirable. I think we should obtain the same PCA representation regardless of the data type.
Don't we have to densify X at some point anyway, since we would have to compute X.dot(X.T)? So it might be worth thinking about some EM approach.
In any case, I think as long as the data is manageable and fits into RAM, we should just use the densified X.
I don't quite understand line 486 in preprocessing/simple:

```python
if zero_center is not None:
    zero_center = not issparse(adata_comp.X)
```

It no longer depends on the actual value of the zero_center argument at all. Is that a bug, or what is the rationale behind it?
For now, we can change that into something like

```python
zero_center = (
    zero_center if zero_center is not None
    else not (issparse(adata.X) and adata.X.shape[0] > 1e4)
)
```
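Spelled out as a small helper (illustrative only, `resolve_zero_center` is not a scanpy API), the proposed default respects an explicit user choice and skips centering only for large sparse matrices:

```python
import numpy as np
from scipy.sparse import issparse, random as sparse_random

def resolve_zero_center(zero_center, X):
    """Illustrative version of the proposed default logic."""
    if zero_center is not None:
        return zero_center  # respect an explicit user choice
    # Default: center unless the data is sparse AND large.
    return not (issparse(X) and X.shape[0] > 1e4)

X_small_sparse = sparse_random(100, 10, density=0.1, format='csr')
X_large_sparse = sparse_random(20_000, 10, density=0.1, format='csr')

print(resolve_zero_center(None, X_small_sparse))    # True
print(resolve_zero_center(None, X_large_sparse))    # False
print(resolve_zero_center(False, np.ones((5, 5))))  # False
```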