Value error fix
```
# Conflicts:
#	dask_ml/utils.py
```
```python
self.batch_size = batch_size
self.svd_solver = svd_solver
self.iterated_power = iterated_power
self.random_state = random_state
```
svd_solver, iterated_power, random_state are added as new optional arguments.
```python
X = np.vstack((self.singular_values_.reshape((-1, 1)) *
               self.components_, X, mean_correction))

solver = self._get_solver(X, self.n_components_)
```
svd_solver is chosen based on the same strategy as PCA.
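For reference, a sketch of the "auto" policy that scikit-learn's PCA follows (the helper name `choose_solver` is hypothetical, not the PR's `_get_solver`): small problems get a full SVD, and large problems with relatively few requested components get a randomized solver.

```python
def choose_solver(n_samples, n_features, n_components):
    # Hypothetical helper sketching sklearn PCA's "auto" policy:
    # small matrices -> exact full SVD
    if max(n_samples, n_features) <= 500:
        return "full"
    # few components relative to the smaller dimension -> randomized SVD
    if 1 <= n_components < 0.8 * min(n_samples, n_features):
        return "randomized"
    # otherwise the randomized approximation buys little; use full SVD
    return "full"
```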
```python
solver = self._get_solver(X, self.n_components_)
if solver in {"full", "tsqr"}:
    if hasattr(X, 'rechunk'):
        X = da.rechunk(X, (X.chunks[0], -1))
```
If the array is chunked along the last axis, svd won't work.
I'm not sure rechunking is the right approach here; perhaps we should raise an exception instead?
Perhaps check what we do elsewhere in the library. It's not clear to me whether the estimator should implicitly rechunk or not, since this could cause surprisingly high memory usage.
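To illustrate the constraint being discussed (a standalone sketch, not the PR's code): dask's tall-and-skinny SVD requires all columns to live in a single chunk, so an array chunked along the last axis has to be rechunked before `da.linalg.svd` will accept it.

```python
import numpy as np
import dask.array as da

# An array chunked along both axes; da.linalg.svd cannot handle the
# column chunking directly.
x = da.from_array(np.random.rand(100, 20), chunks=(25, 10))

# Merge all column chunks into one; row chunking stays as-is.
# This is the implicit-rechunk option debated above, and it can
# increase per-chunk memory use.
x = x.rechunk((x.chunks[0], -1))

u, s, v = da.linalg.svd(x)  # works now that columns form a single chunk
```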
tests/test_incremental_pca.py
```python
    'auto',
    'randomized'
])
def test_compare_with_sklearn(svd_solver):
```
This is the only test I added aside from those stolen from scikit-learn.
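A sketch of the comparison pattern such a test can use (this is not the PR's actual test, and it uses only scikit-learn to stay self-contained): when a single batch covers all samples, IncrementalPCA reduces to an ordinary SVD of the centered data, so its components match PCA's up to per-component sign.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5)

# batch_size == n_samples: the incremental path sees one batch, so the
# result should agree with a full PCA.
ipca = IncrementalPCA(n_components=3, batch_size=len(X)).fit(X)
pca = PCA(n_components=3, svd_solver="full").fit(X)

# Compare absolute values to ignore per-component sign flips, which are
# allowed to differ between the two solvers.
np.testing.assert_allclose(
    np.abs(ipca.components_), np.abs(pca.components_), atol=1e-6
)
```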
I apologize for the delay @fujiisoup. Many of the Dask maintainers are at a workshop this week. At first glance I am surprised that the Scikit-Learn code works so well with only small modifications. I haven't looked at it deeply yet, but I have a few questions:

Again, I apologize for the delay, and for not yet giving this a thorough review.
TomAugspurger
left a comment
I see warnings like:

```
/Users/taugspurger/Envs/dask-dev/lib/python3.7/site-packages/dask/array/core.py:1324: FutureWarning: The `numpy.expand_dims` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  FutureWarning,
```
I think that could be implemented on dask.array.
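A workaround sketch for the warning above: instead of `np.expand_dims(x, 0)`, which older dask releases did not implement for dask arrays, the new axis can be inserted with `None` (`np.newaxis`) indexing, which dask has long supported.

```python
import dask.array as da

x = da.arange(6, chunks=3)

# Equivalent to np.expand_dims(x, 0) / np.expand_dims(x, -1), but using
# None-indexing, which avoids the FutureWarning on older dask versions.
y = x[None, :]   # shape (1, 6), stays lazy
z = x[:, None]   # shape (6, 1)
```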
```python
self.singular_values_ = S[:self.n_components_]
self.mean_ = col_mean
self.var_ = col_var
self.explained_variance_ = explained_variance[:self.n_components_]
```
Can you check the types of all these vs. those in dask_ml.decomposition.PCA? I notice some that differ.
Which do you find more useful here: a Dask Array or a concrete ndarray?
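The distinction behind the question, as a minimal sketch: a fitted attribute can be left as a lazy dask array (small task graph, computed on demand) or materialized into a concrete NumPy object with `.compute()`.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10.0), chunks=5)

lazy_mean = x.mean()                 # dask array: nothing computed yet
concrete_mean = lazy_mean.compute()  # concrete NumPy scalar, eager

assert isinstance(lazy_mean, da.Array)
assert float(concrete_mean) == 4.5
```

Keeping attributes lazy avoids an implicit compute inside `fit`, but means every later access (e.g. `explained_variance_ratio_`) re-triggers work unless the user persists them.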
Thank you for the comments ;) and sorry for my late reply. We tested this PR on our real data and found some problems. I'll push and post the update (probably later this week or next).
compute inside partial_fit
* bugfix computing noise_variance and explained_variance_ratio
* bugfix factor for previous total_var should be (n_samples_seen - 1)
* bugfix using sklearn formula for explained_variance_ratio
* bugfix correct formula to compute accumulated variance
* add tests for each svd_solver
* bugfix argument of svd_compressed should be self.n_components_
* bugfix add correct _fit method
* bugfix changing the formula to compute total variance
* A bug fix for explained_variance_ratio. The noise variance is still wrong
* A bug fix in noise_variance. But can be still wrong.
* more tests
* typo

Co-authored-by: Yuuichi ASAHI <asahi.yuuichi@qst.go.jp>
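Several of these fixes concern the pooled mean/variance update. As a sketch of the formula in question (the helper name is hypothetical; like scikit-learn's internal `_incremental_mean_and_var`, it uses the biased n-denominator variance, so the previous sum of squared deviations is `var * n_seen` — the factor the commit above corrects):

```python
import numpy as np

def incremental_mean_var(mean, var, n_seen, X_new):
    """Hypothetical helper: combine running column mean/variance with a
    new batch using the pooled-variance (Chan et al.) update."""
    n_new = X_new.shape[0]
    n_total = n_seen + n_new
    new_mean = X_new.mean(axis=0)
    delta = new_mean - mean

    combined_mean = mean + delta * n_new / n_total
    # m2 = sum of squared deviations; previous contribution is var * n_seen
    m2 = (var * n_seen + X_new.var(axis=0) * n_new
          + delta ** 2 * n_seen * n_new / n_total)
    return combined_mean, m2 / n_total, n_total

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
mean, var, n = incremental_mean_var(np.zeros(3), np.zeros(3), 0, X[:60])
mean, var, n = incremental_mean_var(mean, var, n, X[60:])
# mean/var now match a single pass over all 100 rows
```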
TomAugspurger
left a comment
Thanks for working on this. Still want to give it a closer look.
It'd be nice to document the places where we need to copy code from scikit-learn. This is a lot of code, when ideally we'd only need to override the few pieces relevant to Dask.
Thanks. I marked the parts updated from scikit-learn with comments.
Hello, I have made a performance analysis with performance_report and a naive timer. The testbed is an Intel Skylake CPU. Applying the incremental PCA to our large data (~20 GB), it produced the attached dask-report for the fit method. For some reason, if we call the method under performance_report, it gets 50 times slower. A naive timer gave 184 s, 148 s, and 63 s for incremental_mean_and_var, svd, and compute(), respectively. For the naive measurement, we deliberately compute() all the quantities for synchronization; other functions finish immediately.
Hrm, that is concerning. Thank you for reporting this. I don't immediately see why. I raised dask/distributed#3654 to make sure that we are not counting the time taken to generate the plots. However, after I looked more closely at your performance report I see that this is not the problem: the computations are actually taking many hours, which is surprising. There are tasks like

The performance report code doesn't run during execution; it only collects data afterwards. When I look at the administrative profiles, which observe Dask itself (not the numpy/xarray code), I don't see anything that would cause this. I do however see that it is taking some time to read excess data from disk (are you maybe running out of memory?), but this adds only a few minutes of total execution time across all workers. In summary, I am confused :/
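As an aside for local debugging (this is not what the benchmark above used — that was distributed's `performance_report`): core dask's `Profiler` from `dask.diagnostics` records per-task timings on the single-machine schedulers, which can help isolate whether the slowdown comes from particular tasks.

```python
import dask.array as da
from dask.diagnostics import Profiler

# A toy computation standing in for the fit; any .compute() works.
x = da.random.random((1_000, 100), chunks=(250, 100))
with Profiler() as prof:
    result = x.mean(axis=0).compute()

# prof.results is a list of TaskData records with start/end timestamps,
# so the slowest task can be picked out directly.
slowest = max(prof.results, key=lambda r: r.end_time - r.start_time)
```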
@fujiisoup apologies for the CI failures. If you merge master they should be resolved. |
tests/test_incremental_pca.py
```python
from sklearn.utils._testing import assert_almost_equal
from sklearn.utils._testing import assert_array_almost_equal
from sklearn.utils._testing import assert_allclose_dense_sparse
```
Seems to be causing an import error. Can these be avoided?
Just noticed that older scikit-learn does not have utils._testing but utils.testing instead.
Added a patch with try: except ImportError.
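The shim described here looks like the following (scikit-learn renamed `sklearn.utils.testing` to `sklearn.utils._testing` in 0.22, so the fallback covers older releases):

```python
# Newer scikit-learn ships the private module sklearn.utils._testing;
# older releases exposed the same helpers under sklearn.utils.testing.
try:
    from sklearn.utils._testing import assert_allclose_dense_sparse
except ImportError:  # scikit-learn < 0.22
    from sklearn.utils.testing import assert_allclose_dense_sparse
```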
@mrocklin In order to focus on the performance of Dask itself, I have simplified the script with a smaller input data size (~7.5 GB). The new performance report seems to show more reasonable performance. Unfortunately, I am not sure why adding the
TomAugspurger
left a comment
Thanks. This looks nice.
It's a bit unfortunate that we need so much surrounding code to make the small updates we need. It'd be a fun long-term project to see how much we can push on changes in Dask (array assignment, etc.) and scikit-learn (check_array, a few other places) to get more code reuse.
```python
# Update stats - they are 0 if this is the first step
# The next line is equivalent with np.repeat(self.n_samples_seen_, X.shape[1]),
# which dask-array does not support
```
Dask does support np.repeat. Is it just when repeats=0 that there's an issue? Opened dask/dask#6076 for that.
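A quick sketch of what dask does and doesn't cover here (the `repeats=0` corner mentioned above is the one dask/dask#6076 tracks, so it is not asserted on):

```python
import numpy as np
import dask.array as da

# da.repeat exists, but unlike NumPy it requires an explicit axis=
# (axis=None is not implemented for dask arrays).
x = da.from_array(np.array([5.0]), chunks=1)
y = da.repeat(x, 3, axis=0)

# For the pattern in the comment above — repeating a scalar n times —
# da.full sidesteps repeat entirely.
z = da.full(4, 7.0, chunks=2)
```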
closes #617
Most of the script is stolen from scikit-learn but updated to be compatible with dask arrays.
Tests are also copied.