[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898
shivamgargsya wants to merge 6 commits into scikit-learn:master from
Conversation
When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame. The bug was fixed by setting max_nbytes to None.
thomasjpfan left a comment
Please add a Fix entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:
|
Thanks! I think it could be an acceptable workaround; however, the initial issue is that we share data between processes and then perform in-place modification of this data. Triggering inter-process serialization is one way around it; another could be to keep using mmap and trigger a copy manually. Ideally there may be a way to make a copy of the dataframe where only one column (the changed one) is a copy and the others are still views. Also, I'm not fully sure how the improvements to inter-process serialization in Python 3.8 would impact the choice of solution. Any other thoughts @jeremiedbb @thomasjpfan ? |
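The "copy only the changed column" idea above could be sketched roughly as follows. This is a hypothetical illustration, not scikit-learn's actual code; `permute_one_column` is an invented helper, and whether pandas keeps the untouched columns as true views depends on its internal block consolidation and copy-on-write settings, so this only approximates the "one copy, rest views" goal:

```python
import numpy as np
import pandas as pd

def permute_one_column(df, col, rng):
    # Hypothetical sketch: materialize a fresh copy of only the column
    # being permuted and rebuild the frame around it, leaving the
    # original (possibly memmap-backed) DataFrame untouched.
    permuted = df[col].to_numpy().copy()
    rng.shuffle(permuted)
    data = {c: (permuted if c == col else df[c].to_numpy())
            for c in df.columns}
    return pd.DataFrame(data, index=df.index)

rng = np.random.RandomState(0)
df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 10})
out = permute_one_column(df, "a", rng)
assert df["a"].tolist() == [0, 1, 2, 3, 4]   # original is never mutated
```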
Thanks @thomasjpfan |
|
Since it only seems to fail on DataFrames, would it be an idea to conditionally turn off memmapping based on whether or not the input is a dataframe? |
|
@andersbogsnes I have added the check for a DataFrame instance now. Please take a look; I don't think a check for the label column is necessary. |
|
@andersbogsnes aren't we checking for the same thing by checking the type of the input X in line 120? |
|
That doesn't allow for duck typing, so if I were to subclass a DataFrame (or use another DataFrame-like library) then |
|
Updated the check to be based on an attribute of the DataFrame class. |
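The attribute-based (duck-typed) check discussed above might look something like this. This is a sketch, not the exact code in the PR; `is_dataframe_like` and the attributes probed (`loc`/`iloc`) are illustrative assumptions:

```python
def is_dataframe_like(X):
    # Hypothetical duck-typed check: accept anything exposing the
    # pandas-style indexers, so DataFrame subclasses and other
    # DataFrame-like libraries match too, unlike isinstance().
    return hasattr(X, "loc") and hasattr(X, "iloc")

class MyFrame:
    """Stands in for a pandas.DataFrame subclass or look-alike."""
    loc = None
    iloc = None

assert is_dataframe_like(MyFrame())       # duck-typed look-alike passes
assert not is_dataframe_like([[1, 2]])    # a plain nested list does not
```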
|
Thanks, @andersbogsnes for the help! Should we have a way to include such costly test cases, but ordinarily skip them? |
|
I would always recommend having a test written for a known bug - that's how I caught this error when we swapped our implementation for scikit-learn's - but there is a cost, of course. The failing example above is fairly cheap though, as the cost is linear per example. Does the test suite have a
Alternatively, the root cause is that the implementation modifies in-place; wondering out loud if there is a way to test for that instead, which might be cheaper? |
|
Yes, I think it should be fine to add that test. It's not so slow, and we can make the RF fitting faster (by using iris without repetition, for example) |
|
Well, you need the repetition to hit the 1M limit where joblib turns on the memmap - I didn’t do any calculations on where that was, just that 100 wasn’t enough but 1000 worked :-) |
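The arithmetic behind hitting that limit can be checked quickly. Assuming joblib parses its default max_nbytes of "1M" as 1 MiB (the conclusion also holds if it means 1e6 bytes), and that iris is 150 samples of 4 float64 features:

```python
# Back-of-the-envelope check for when joblib's auto-memmapping kicks in.
limit = 1024 ** 2             # assumed ~1 MB threshold ("1M")
one_copy = 150 * 4 * 8        # 4800 bytes for one copy of the iris data

print(one_copy * 100 < limit)   # 100 repeats: 480000 bytes, under the limit
print(one_copy * 1000 > limit)  # 1000 repeats: 4800000 bytes, memmapped
```

This matches the observation that repeating the data 100 times was not enough to trigger the bug, while 1000 times was.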
|
For this specific case, the following (smallish) example also fails on master:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)
clf = DummyClassifier(strategy='prior')
clf.fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=2)
```

We could make this into a test. |
|
We should also check the output:

```python
def test_permutation_importance_memmaping_dataframe():
    pd = pytest.importorskip("pandas")
    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)
    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)
    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )
    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )
```

(we have something weird happening with the |
|
To make the test suggested in #15898 (comment) faster yet not too trivial, you can use
Or alternatively use a regression problem with and a
|
|
The issue with different results with |
I agree |
|
I'm +1 for closing. @shivamgargsya This is a bit unfortunate, but we discovered the underlying bug because of your work in this PR. Don't feel like it was a vain effort.
|
Thanks for your contribution @shivamgargsya! Closing. |
Reference Issues/PRs
Fixes #15810
What does this implement/fix? Explain your changes.
When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame.
The bug was fixed by setting max_nbytes to None.
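The effect of that joblib knob can be shown in isolation. This is only a sketch of the mechanism, not the actual call site inside permutation_importance: with max_nbytes=None, joblib never switches large inputs to read-only memmaps, so workers receive pickled copies that they are free to modify.

```python
from joblib import Parallel, delayed

def double(x):
    return 2 * x

# max_nbytes=None disables joblib's automatic memmapping of large
# arguments; everything is serialized to the worker processes instead.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(double)(i) for i in range(4)
)
print(results)  # [0, 2, 4, 6]
```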
Any other comments?
Verified using the snippet below:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
df = pd.DataFrame(data=data.data.repeat(1000, axis=0))
y = data.target.repeat(1000)
clf = RandomForestClassifier()
clf.fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=-1)
```