[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898
shivamgargsya wants to merge 6 commits into scikit-learn:master from
Conversation
When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame. The bug was fixed by setting max_nbytes to None.
thomasjpfan left a comment
Please add a Fix entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:
|
Thanks! I think it could be an acceptable workaround; however, the initial issue is that we share data between processes and then perform in-place modification of this data. Triggering inter-process serialization is one way around it; another could be to keep using mmap and trigger a copy manually. Ideally there may be a way to make a copy of the dataframe where only one column (the changed one) is a copy and the others are still views. Also, I'm not fully sure how the improvements to inter-process serialization in Python 3.8 would impact the choice of solution. Any other thoughts @jeremiedbb @thomasjpfan ? |
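The "copy only the changed column" idea above could be sketched roughly as follows. This is a hypothetical illustration, not scikit-learn's actual code; `permute_one_column` is an invented helper, and whether pandas keeps the untouched columns as true views depends on its internal block consolidation and copy-on-write settings, so this only approximates the "one copy, rest views" goal:

```python
import numpy as np
import pandas as pd

def permute_one_column(df, col, rng):
    # Hypothetical sketch: materialize a fresh copy of only the column
    # being permuted and rebuild the frame around it, leaving the
    # original (possibly memmap-backed) DataFrame untouched.
    permuted = df[col].to_numpy().copy()
    rng.shuffle(permuted)
    data = {c: (permuted if c == col else df[c].to_numpy())
            for c in df.columns}
    return pd.DataFrame(data, index=df.index)

rng = np.random.RandomState(0)
df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 10})
out = permute_one_column(df, "a", rng)
assert df["a"].tolist() == [0, 1, 2, 3, 4]   # original is never mutated
```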
Thanks @thomasjpfan |
|
Since it only seems to fail on DataFrames, would it be an idea to conditionally turn off memmapping based on whether or not the input is a dataframe? |
|
@andersbogsnes I have added the check for a DataFrame instance now. Please take a look; I don't think a check for the label column is necessary. |
|
@andersbogsnes aren't we checking for the same thing by checking the type of the input X in line 120? |
|
That doesn't allow for duck typing, so if I were to subclass a DataFrame (or use another DataFrame-like library) then |
|
Updated the check to be based on an attribute of the DataFrame class. |
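The attribute-based (duck-typed) check discussed above might look something like this. This is a sketch, not the exact code in the PR; `is_dataframe_like` and the attributes probed (`loc`/`iloc`) are illustrative assumptions:

```python
def is_dataframe_like(X):
    # Hypothetical duck-typed check: accept anything exposing the
    # pandas-style indexers, so DataFrame subclasses and other
    # DataFrame-like libraries match too, unlike isinstance().
    return hasattr(X, "loc") and hasattr(X, "iloc")

class MyFrame:
    """Stands in for a pandas.DataFrame subclass or look-alike."""
    loc = None
    iloc = None

assert is_dataframe_like(MyFrame())       # duck-typed look-alike passes
assert not is_dataframe_like([[1, 2]])    # a plain nested list does not
```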
|
Thanks, @andersbogsnes for the help! Should we have a way to include such costly test cases, but ordinarily skip them? |
|
I would always recommend having a test written for a known bug - that's how I caught this error when we swapped our implementation for scikit-learn's - but there is a cost, of course. The failing example above is fairly cheap though, as the cost is linear per example. Does the test suite have a
Alternatively, the root cause is that the implementation modifies in-place; wondering out loud if there is a way to test for that instead, which might be cheaper? |
|
Yes, I think it should be fine to add that test. It's not so slow, and we can make the RF fitting faster (by using iris without repetition, for example) |
|
Well, you need the repetition to hit the 1M limit where joblib turns on the memmap - I didn’t do any calculations on where that was, just that 100 wasn’t enough but 1000 worked :-) |
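The arithmetic behind hitting that limit can be checked quickly. Assuming joblib parses its default max_nbytes of "1M" as 1 MiB (the conclusion also holds if it means 1e6 bytes), and that iris is 150 samples of 4 float64 features:

```python
# Back-of-the-envelope check for when joblib's auto-memmapping kicks in.
limit = 1024 ** 2             # assumed ~1 MB threshold ("1M")
one_copy = 150 * 4 * 8        # 4800 bytes for one copy of the iris data

print(one_copy * 100 < limit)   # 100 repeats: 480000 bytes, under the limit
print(one_copy * 1000 > limit)  # 1000 repeats: 4800000 bytes, memmapped
```

This matches the observation that repeating the data 100 times was not enough to trigger the bug, while 1000 times was.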
|
For this specific case, the following (smallish) example also fails on master:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)
clf = DummyClassifier(strategy='prior')
clf.fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=2)
```

We could make this into a test. |
|
We should also check the output:

```python
def test_permutation_importance_memmaping_dataframe():
    pd = pytest.importorskip("pandas")
    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)
    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)
    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )
    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )
```

(we have something weird happening with the |
|
To make the test suggested in #15898 (comment) faster yet not too trivial, you can use
Or alternatively use a regression problem with and a
|
|
The issue with different results with |
I agree |
|
I'm +1 for closing. @shivamgargsya This is a bit unfortunate, but we discovered the underlying bug because of your work in this PR. Don't feel like it was a vain effort.
|
Thanks for your contribution @shivamgargsya! Closing. |
Reference Issues/PRs
Fixes #15810
What does this implement/fix? Explain your changes.
When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame.
The bug was fixed by setting max_nbytes to None.
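The effect of that joblib knob can be shown in isolation. This is only a sketch of the mechanism, not the actual call site inside permutation_importance: with max_nbytes=None, joblib never switches large inputs to read-only memmaps, so workers receive pickled copies that they are free to modify.

```python
from joblib import Parallel, delayed

def double(x):
    return 2 * x

# max_nbytes=None disables joblib's automatic memmapping of large
# arguments; everything is serialized to the worker processes instead.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(double)(i) for i in range(4)
)
print(results)  # [0, 2, 4, 6]
```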
Any other comments?
Verified using the snippet below:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
df = pd.DataFrame(data=data.data.repeat(1000, axis=0))
y = data.target.repeat(1000)
clf = RandomForestClassifier()
clf.fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=-1)
```