Fix ColumnTransformer in parallel with joblib's auto memmapping by jeremiedbb · Pull Request #28822 · scikit-learn/scikit-learn

jeremiedbb · 2024-04-12T14:22:10Z

When running in parallel, ColumnTransformer will crash if joblib's auto memmap triggers and copies are not made in time.

Currently we index X when declaring the jobs, so the order is copy first, then read-only memmap. The transformer then fails when it tries to transform in place, or fails even earlier in the case of a dataframe (see #28781 (comment)).

The fix proposed here indexes X within each job instead. This way the order is read-only memmap first, then copy, and the transformer can transform in place.
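To make the ordering argument concrete, here is a minimal sketch that does not use joblib at all. It assumes a read-only NumPy view is a fair stand-in for the read-only memmap a worker would receive; `as_worker_input` and `double_inplace` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def double_inplace(block):
    """Hypothetical transformer step that modifies its input in place."""
    block *= 2.0
    return block


def as_worker_input(arr):
    """Stand-in (assumption) for auto memmapping: the worker sees a read-only array."""
    read_only = arr.view()
    read_only.setflags(write=False)
    return read_only


X = np.ones((6, 4))
cols = [0, 2]

# Old order: index when declaring the job (copy), then the copy becomes read-only.
job_arg = as_worker_input(X[:, cols])
try:
    double_inplace(job_arg)
    old_order_ok = True
except ValueError:
    # NumPy refuses to write into a read-only array.
    old_order_ok = False

# New order: pass X read-only, then index inside the job.
# Indexing with a list of columns returns a fresh, writeable copy.
X_read_only = as_worker_input(X)
new_result = double_inplace(X_read_only[:, cols])
```

In this sketch the old order fails on the in-place multiplication, while the new order succeeds and leaves X untouched.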

Disclaimer: this doesn't solve the underlying problem completely. If you select columns by slice it still fails, because slicing creates a view and not a copy. I'm starting to think that the issue runs deeper and lies in the interaction between the copy parameter and check_array, for all estimators. I think check_array should always make a copy if the array is read-only, even if copy=False, because when an estimator has a copy parameter, it's because it wants to do inplace modifications.
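The slice caveat can be checked directly in NumPy. The sketch below also includes `ensure_writeable`, a hypothetical helper that illustrates the copy-if-read-only idea from the comment; it is not scikit-learn's check_array and makes no claim about how check_array behaves.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)
X.setflags(write=False)  # stand-in (assumption) for a read-only memmap

by_list = X[:, [0, 1]]   # list of columns: a new, writeable copy
by_slice = X[:, 0:2]     # slice: a view that stays read-only


def ensure_writeable(arr, copy=False):
    """Hypothetical helper: copy when asked to, or when arr is read-only."""
    if copy or not arr.flags.writeable:
        return arr.copy()
    return arr


safe = ensure_writeable(by_slice)
safe += 1.0  # works on the copy; X itself is unchanged
```

Here `by_slice` shares memory with X and inherits its read-only flag, which is why selecting columns by slice still fails after the fix.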

github-actions · 2024-04-12T14:23:26Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f558534. Link to the linter CI: here

jeremiedbb · 2024-04-12T15:07:09Z

I opened #28824 to discuss the read-only situation more globally.

jeremiedbb · 2024-04-12T15:19:49Z

+    with parallel_backend("loky", max_nbytes=1):
+        Xt = transformer.fit_transform(X)
+


CI is failing because this is only doable in joblib>=1.3 and our min is 1.2.
I can use a bigger array for now.
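A rough sketch of the "bigger array" workaround follows. It assumes that joblib's default auto-memmapping threshold is "1M" and that this means 1024 ** 2 bytes; both are assumptions to verify against the joblib version in use. The feature count is arbitrary.

```python
import numpy as np

# Assumption: default threshold of "1M", read as 1024 ** 2 bytes.
ASSUMED_THRESHOLD_BYTES = 1024 ** 2

n_features = 4  # arbitrary choice for illustration
bytes_per_row = n_features * np.dtype(np.float64).itemsize

# Smallest row count whose total size exceeds the assumed threshold.
n_rows = ASSUMED_THRESHOLD_BYTES // bytes_per_row + 1
X = np.zeros((n_rows, n_features))
```

An array of this size should trigger memmapping without passing max_nbytes, at the cost of a larger test input.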


adrinjalali

I agree this might be a bigger issue, but this is a minimal change that fixes a few cases. So LGTM.

adrinjalali · 2024-04-15T07:50:05Z

+    with parallel_backend("loky", max_nbytes=1):
+        Xt = transformer.fit_transform(X)
+


we could skip the test on that joblib though.
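One way to express such a skip is a small version gate, sketched below. The 1.3 cutoff is taken from the commit message "test only if joblib >= 1.3"; `can_set_max_nbytes_on_backend` and `version_tuple` are hypothetical names, and the parsing is deliberately simplistic rather than a full version parser.

```python
def version_tuple(version):
    """Parse the major and minor parts of a version string such as '1.3.0.dev0'."""
    parts = []
    for piece in version.split(".")[:2]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)


def can_set_max_nbytes_on_backend(joblib_version):
    """Hypothetical gate: True when the installed joblib is at least 1.3."""
    return version_tuple(joblib_version) >= (1, 3)
```

In a test module, the result could feed `pytest.mark.skipif(not can_set_max_nbytes_on_backend(joblib.__version__), reason="requires joblib >= 1.3")`.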

ogrisel

LGTM as well.

fix column transfo parallel auto memmap

17c6dbd

github-actions Bot added module:compose module:pipeline labels Apr 12, 2024

lint

e482baf

jeremiedbb commented Apr 12, 2024


adrinjalali approved these changes Apr 15, 2024


jeremiedbb added 3 commits April 15, 2024 11:16

test only if joblib >= 1.3

171eb7c

what's new entry

9b7fc9c

Merge remote-tracking branch 'upstream/main' into pr/jeremiedbb/28822

f558534

ogrisel approved these changes Apr 22, 2024


ogrisel merged commit 51fca39 into scikit-learn:main Apr 22, 2024

0xbe7a mentioned this pull request Jun 10, 2024

Performance Regression in scikit-learn 1.5.0: Execution Time for ColumnTransformer Scales Quadratically with the Number of Transformers when n_jobs > 1 #29229

Closed

jeremiedbb mentioned this pull request Jun 21, 2024

Fix performance regression in ColumnTransformer #29330

Merged

ogrisel merged 5 commits into scikit-learn:main from jeremiedbb:column-transfo-memmap
