
Fix ColumnTransformer in parallel with joblib's auto memmapping#28822

Merged
ogrisel merged 5 commits into
scikit-learn:mainfrom
jeremiedbb:column-transfo-memmap
Apr 22, 2024


@jeremiedbb

Member

Fixes #28781

When running in parallel, ColumnTransformer can crash if joblib's auto memmapping triggers and no writeable copy is made before the transformer runs.

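Currently we index X when declaring the jobs, so the order of operations is copy (from the column indexing), then read-only memmap (from joblib). The transformer then receives a read-only array and fails when it attempts an in-place transformation, or fails even earlier when X is a dataframe (see #28781 (comment)).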

The fix proposed here indexes X within each job instead. The order becomes read-only memmap, then copy, so the transformer receives a writeable copy and can transform it in place.
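To make the ordering argument concrete, here is a minimal NumPy-only sketch. It does not use joblib or ColumnTransformer: `as_readonly` is a hypothetical stand-in for what auto memmapping does to an argument (the worker sees it as read-only), and `inplace_transform` is a hypothetical transformer that modifies its input in place.

```python
import numpy as np


def as_readonly(a):
    """Hypothetical stand-in for joblib's auto memmapping:
    the worker sees the argument as a read-only array."""
    view = a.view()
    view.setflags(write=False)
    return view


def inplace_transform(block):
    """Hypothetical transformer that modifies its input in place."""
    block *= 2.0
    return block


X = np.arange(12, dtype=np.float64).reshape(4, 3)
cols = [0, 2]  # selecting by list triggers advanced indexing, i.e. a copy

# Old order: index when declaring the job (copy), then the copy
# itself is what gets turned into a read-only argument.
old_arg = as_readonly(X[:, cols])
try:
    inplace_transform(old_arg)
    old_failed = False
except ValueError:  # NumPy refuses to write into a read-only array
    old_failed = True

# New order: X is what becomes read-only, and the indexing happens
# inside the job, producing a fresh writeable copy.
new_arg = as_readonly(X)[:, cols]
result = inplace_transform(new_arg)
```

In this sketch the old order fails on the in-place multiplication, while the new order succeeds and leaves `X` untouched because the job works on its own copy.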

Disclaimer: this doesn't solve the underlying problem completely. Selecting columns by slice still fails because slicing creates a view, not a copy. I'm starting to think that the issue runs deeper and lies in the interaction between the copy parameter and check_array, for all estimators. I think check_array should always make a copy if the array is read-only, even if copy=False, because when an estimator has a copy parameter, it's because it wants to do inplace modifications.
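The slice caveat can be illustrated with plain NumPy. This is only a sketch: the read-only flag is set by hand rather than by joblib, and `ensure_writeable` is a hypothetical helper that mimics the behaviour suggested above for check_array; it is not part of scikit-learn.

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(4, 3)
X.setflags(write=False)  # mimic a read-only memmapped input

by_list = X[:, [0, 1]]   # advanced indexing: new, writeable array
by_slice = X[:, 0:2]     # basic slicing: view that stays read-only

slice_is_view = np.shares_memory(by_slice, X)


def ensure_writeable(a):
    """Hypothetical helper (not scikit-learn API): copy only when
    the array cannot be written to, as proposed for check_array."""
    return a if a.flags.writeable else a.copy()


safe = ensure_writeable(by_slice)
safe += 1.0  # works on the copy; X is left unchanged
```

Here `by_list` is already writeable, whereas `by_slice` inherits the read-only flag from `X` and only becomes usable for in-place work after the explicit copy.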

@github-actions

github-actions Bot commented Apr 12, 2024


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f558534. Link to the linter CI: here

@jeremiedbb

Member Author

I opened #28824 to discuss the read-only situation more globally.

Comment on lines +2465 to +2467
with parallel_backend("loky", max_nbytes=1):
    Xt = transformer.fit_transform(X)

@jeremiedbb jeremiedbb Apr 12, 2024

Member Author


CI is failing because this is only doable in joblib>=1.3 and our min is 1.2.
I can use a bigger array for now.

Member


We could skip the test on that joblib version though.
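A version gate for such a skip could look roughly like the sketch below. Both `release_tuple` and `should_skip` are hypothetical helpers, and the version strings in the usage lines are placeholders, not a statement about which joblib release supports what.

```python
import re


def release_tuple(version):
    """Turn a version string into a tuple of its leading numeric
    components, e.g. "1.4.dev0" -> (1, 4)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def should_skip(installed, minimum):
    """Return True when the installed version is older than the
    minimum version the test needs."""
    return release_tuple(installed) < release_tuple(minimum)


# Placeholder versions, for illustration only.
skip_old = should_skip("1.2.0", "1.3")
skip_new = should_skip("1.3.0", "1.3")
```

The resulting boolean could then serve as the condition of a `pytest.mark.skipif` marker. Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "1.10" as older than "1.9".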

@adrinjalali adrinjalali left a comment

Member


I agree this might be a bigger issue, but this is a minimal change that fixes a few cases. So LGTM.


@ogrisel ogrisel left a comment

Member


LGTM as well.


Development

Successfully merging this pull request may close these issues.

ColumnTransformer throws error with n_jobs > 1 input dataframes and joblib auto-memmapping (regression in 1.4.1.post1)

3 participants