Add blockwise ensemble meta-estimators #657
Merged: TomAugspurger merged 18 commits into dask:master on May 6, 2020
Conversation
TomAugspurger (Member, Author):
```python
In [1]: import sklearn.linear_model
   ...: import dask_ml.datasets
   ...: import dask_ml.ensemble
   ...:
   ...: X, y = dask_ml.datasets.make_classification(n_features=20, chunks=25)
   ...:
   ...: clf = dask_ml.ensemble.BlockwiseVotingClassifier(
   ...:     sklearn.linear_model.LogisticRegression(), voting="soft",
   ...:     classes=[0, 1]
   ...: )
   ...:
   ...: clf.fit(X, y)

In [2]: clf.estimators_
Out[2]:
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False)]
```
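To round out the example, a hedged sketch of predicting with the fitted ensemble (not from the original session; it assumes the usual dask-ml convention of returning lazy dask arrays, and that `predict_proba` is exposed when `voting="soft"`):

```python
# Sketch only: votes are combined blockwise, so the results stay lazy.
preds = clf.predict(X)        # dask array, chunked like X's rows
proba = clf.predict_proba(X)  # assumed available because voting="soft"
print(preds.compute()[:5])
print(proba.compute()[:2])
```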
TomAugspurger commented on May 5, 2020
dask_ml/ensemble/_blockwise.py (outdated), comment on lines +53 to +57:
```python
results = [
    X.map_blocks(_predict, dtype=dtype, estimator=estimator, drop_axis=1)
    for estimator in self.estimators_
]
combined = da.vstack(results).T.rechunk({1: -1})
```
TomAugspurger (Member, Author): This is fairly slow for many estimators (many partitions): we end up with `X.npartitions * len(self.estimators_)` tasks. Looking into doing a "batch" predict where each task represents the predictions from every estimator on that partition, stacked into a single ndarray.
TomAugspurger (Member, Author): Fixed for arrays. For dask DataFrames, `map_partitions` didn't especially like returning a 3D array. Looking into it a bit, but it's not a blocker right now.
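For reference, a minimal sketch of what the batched array path could look like. The helper name `_predict_stack` is illustrative, and the snippet assumes `X` has a single column chunk; it is not necessarily the final implementation:

```python
import numpy as np

def _predict_stack(part, estimators):
    # One task per block: every fitted estimator scores the same partition,
    # and the results are stacked into one (n_rows, n_estimators) ndarray.
    return np.vstack([est.predict(part) for est in estimators]).T

# This yields X.npartitions tasks total, rather than
# X.npartitions * len(self.estimators_) with one map_blocks per estimator.
combined = X.map_blocks(
    _predict_stack,
    estimators=self.estimators_,
    dtype=dtype,
    chunks=(X.chunks[0], (len(self.estimators_),)),
)
```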
This adds `BlockwiseVotingClassifier` and `BlockwiseVotingRegressor`, which are blockwise ensemble meta-estimators. Given an input array split into k partitions, k clones of the sub-estimator are fit independently, one per partition. This is efficient since we don't need to move any data (assuming the partitions of X and y are co-located). At prediction time we combine the predictions from each of the k fitted models: for classification we take either the class with the most votes (`voting="hard"`) or the class with the highest total probability (`voting="soft"`); see https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier. For regression we take the average, as in the toy illustration below.
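As a toy illustration (plain NumPy, not the dask-ml code) of how the k blockwise predictions could be combined under each scheme:

```python
import numpy as np

# Hard voting: each of the k models emits a class label; take the majority.
votes = np.array([[0, 1, 1, 0],            # shape (n_samples=2, k=4)
                  [1, 1, 0, 1]])
hard = np.array([np.bincount(row).argmax() for row in votes])  # -> [0, 1]

# Soft voting: average the k predict_proba outputs, then take the argmax.
probas = np.stack([                        # shape (k=2, n_samples=2, n_classes=2)
    np.array([[0.9, 0.1], [0.4, 0.6]]),
    np.array([[0.6, 0.4], [0.2, 0.8]]),
])
soft = probas.mean(axis=0).argmax(axis=1)  # -> [0, 1]

# Regression: simply average the k models' predictions.
preds = np.array([[1.0, 3.0],              # shape (k=2, n_samples=2)
                  [2.0, 5.0]])
reg = preds.mean(axis=0)                   # -> [1.5, 4.0]
```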