PCA with sparse data #393

@VolkerBergen

Description

Right now we get different PCs when using sparse data (applying TruncatedSVD) compared to using dense data (applying PCA).

TruncatedSVD(X - X.mean(0)) would be equivalent to PCA(X), but X - X.mean(0) is obviously no longer sparse, which is why it is currently implemented as TruncatedSVD(X). Without centering, the first PC mainly represents the vector of means and is therefore very different from the first component of zero-centered PCA. The following components approximately resemble PCA, but since all subsequent PCs are constrained to be orthogonal to that first one, we never reach the exact solution. Hence the PCs are questionable, in particular because the very first ones are quite misleading.
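The effect is easy to reproduce with scikit-learn directly (a minimal sketch on synthetic count-like data, not scanpy's actual code path):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 20)).astype(float)  # synthetic non-negative "counts"
Xs = sparse.csr_matrix(X)

# TruncatedSVD on the raw sparse matrix: no centering happens.
tsvd = TruncatedSVD(n_components=5, algorithm="arpack").fit(Xs)

# The first component is almost parallel to the vector of column means.
mean = X.mean(axis=0)
cos = abs(tsvd.components_[0] @ mean) / (
    np.linalg.norm(tsvd.components_[0]) * np.linalg.norm(mean)
)
print(f"cosine(first component, mean vector) = {cos:.3f}")  # close to 1

# TruncatedSVD on the explicitly centered (dense) matrix matches PCA
# up to the sign of each component.
pca = PCA(n_components=5).fit(X)
tsvd_c = TruncatedSVD(n_components=5, algorithm="arpack").fit(X - mean)
assert np.allclose(np.abs(pca.components_), np.abs(tsvd_c.components_), atol=1e-6)
```

So the uncentered sparse path and the dense PCA path genuinely disagree from the first component on.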

That's not desirable. I think we should obtain the same PCA representation regardless of the data type.

Don't we have to densify X at some point anyway, since we would have to compute X.dot(X.T)? It might therefore be worth thinking about some EM-style approach.
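One alternative to both densifying and an EM scheme is implicit centering: wrap X - mean in a scipy LinearOperator, so an iterative SVD sees the centered matrix without it ever being materialized. A sketch under that assumption (the function name `centered_svds` is mine, not an existing API):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

def centered_svds(X, k):
    """Truncated SVD of X - X.mean(0) without ever densifying X (sketch)."""
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])

    def matvec(v):    # (X - 1 mu^T) v  =  X v - (mu . v) 1
        v = np.ravel(v)
        return X @ v - (mu @ v) * ones

    def rmatvec(u):   # (X - 1 mu^T)^T u  =  X^T u - mu * sum(u)
        u = np.ravel(u)
        return X.T @ u - mu * u.sum()

    op = LinearOperator(X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
    return svds(op, k=k)

# Sanity check against a densified, explicitly centered matrix.
X = sparse.random(50, 30, density=0.3, random_state=1, format="csr")
_, s, _ = centered_svds(X, k=5)
Xd = X.toarray()
s_ref = np.linalg.svd(Xd - Xd.mean(axis=0), compute_uv=False)[:5]
assert np.allclose(np.sort(s)[::-1], s_ref, atol=1e-7)
```

Memory stays O(nnz(X)) plus a few dense vectors, so this would give the exact zero-centered solution even when the densified matrix does not fit into RAM.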

In any case, I think that as long as the data is manageable and fits into RAM, we should just use the densified X.

I don't quite understand line 486 in preprocessing/simple:

if zero_center is not None:
    zero_center = not issparse(adata_comp.X)

It no longer depends on the actual value of the zero_center argument. Is that a bug, or what is the rationale behind it?

For now, we can change that into something like

if zero_center is None:
    zero_center = not (issparse(adata.X) and adata.X.shape[0] > 1e4)
