Describe the bug
Original issue: kedro-org/kedro#3674
Relates to #28781
We use multiprocessing managers to work with shared memory for pipeline parallelisation. After this validation step was added, we get "ValueError: cannot set WRITEABLE flag to True of this array" when objects are retrieved from shared memory and passed to scikit-learn functions that include this validation step, for example fit.
The only workaround that works for us so far is making a deep copy of the objects before passing them to those methods, which is not a desirable solution.
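For reference, the copy-based workaround can be sketched as follows. This is a minimal illustration, not the code we run in production: `ensure_writeable` is a hypothetical helper, and the read-only array built with `np.frombuffer` is only an assumed stand-in for data retrieved from shared memory. Copying only when the array is not writeable is one possible way to limit the cost of the workaround.

```python
import numpy as np


def ensure_writeable(array):
    """Return `array` unchanged if it is writeable, otherwise a fresh copy.

    Hypothetical helper: a copy always owns its memory and is writeable,
    so the copy is only paid for when it is actually needed.
    """
    array = np.asarray(array)
    return array if array.flags.writeable else array.copy()


# Assumed stand-in for shared-memory data: an array backed by an
# immutable bytes buffer is read-only.
y_shared = np.frombuffer(bytes(8 * 4), dtype=np.float64)
y_local = ensure_writeable(y_shared)

# An ordinary array is passed through without copying.
y_plain = np.zeros(4)
y_same = ensure_writeable(y_plain)
```

For pandas objects the same idea would have to be applied to the underlying values, which is why a plain deep copy is the simpler (but more expensive) option.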
Steps/Code to Reproduce
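Some findings:
- The error depends on n_samples. When n_samples is relatively small (~100), the error does not happen, so it may be related to #28781 (comment), "ColumnTransformer throws error with n_jobs > 1 input dataframes and joblib auto-memmapping (regression in 1.4.1.post1)".
- Replacing pd.Series with pd.DataFrame solves the issue, but we don't know why.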
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    """Minimal in-memory dataset that is served through a manager proxy."""

    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    # load() goes through the manager proxy, so the data is pickled on the way.
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    # Replacing pd.Series with pd.DataFrame solves the issue
    # y_train = pd.DataFrame(rng.random((n_samples, 1)), columns=list('E'))

    futures = set()
    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))
    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


# The guard is required when worker processes are started with "spawn".
if __name__ == "__main__":
    main()
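To narrow down why the target behaves differently from the features, it may help to print the NumPy flags of the arrays inside train_model right before fit is called. The sketch below is only a diagnostic idea: `describe_array` is a hypothetical helper, and the two sample arrays are assumptions chosen to show the two kinds of output, not data taken from the reproducer.

```python
import numpy as np


def describe_array(array):
    """Summarise the memory-related properties of an array-like object."""
    array = np.asarray(array)
    base = array.base
    return {
        "writeable": array.flags.writeable,
        "owndata": array.flags.owndata,
        # Type of the object that owns the memory, if the array is a view.
        "base_type": None if base is None else type(base).__name__,
    }


# An array that owns its memory.
owned = np.zeros(4)
# An array viewing an immutable bytes buffer (assumed stand-in for
# data that arrives read-only in the worker process).
borrowed = np.frombuffer(bytes(8 * 4), dtype=np.float64)

owned_info = describe_array(owned)
borrowed_info = describe_array(borrowed)
print(owned_info)
print(borrowed_info)
```

In the reproducer, one could call such a helper on `y_train` and on `X_train` inside `train_model` to compare the pd.Series and pd.DataFrame cases.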
Expected Results
No error is thrown.
Actual Results
Traceback (most recent call last):
  File "/pr-scikit-learn/main.py", line 48, in train_model
    regressor.fit(X_train, y_train)
  File "/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1292, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1100, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
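The same message can be produced with NumPy alone, which may help to separate the NumPy behaviour from the scikit-learn validation step. The sketch below shows that the flag can be flipped back on an array that owns its memory, but not on an array that views a read-only buffer. It does not claim that this is exactly what happens to the pd.Series in the reproducer; `try_make_writeable` is a hypothetical helper, and the bytes-backed array is an assumed stand-in.

```python
import numpy as np


def try_make_writeable(array):
    """Try to set the WRITEABLE flag; return the error message on failure."""
    try:
        array.flags.writeable = True
    except ValueError as exc:
        return str(exc)
    return None


# Case 1: the array owns its memory, so the flag can be restored.
locked = np.zeros(4)
locked.flags.writeable = False
locked_result = try_make_writeable(locked)

# Case 2: the array views an immutable bytes object, so NumPy refuses.
borrowed = np.frombuffer(bytes(8 * 4), dtype=np.float64)
borrowed_result = try_make_writeable(borrowed)

print(locked_result)
print(borrowed_result)
```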
Versions
System:
    python: 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ]
    executable: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.5.dev0
    pip: 23.3.1
    setuptools: 68.2.2
    numpy: 1.26.4
    scipy: 1.13.0
    Cython: None
    pandas: 2.2.2
    matplotlib: None
    joblib: 1.4.0
    threadpoolctl: 3.4.0

Built with OpenMP: False

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 10
    prefix: libopenblas
    filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
    version: 0.3.23.dev
    threading_layer: pthreads
    architecture: Nehalem

    user_api: blas
    internal_api: openblas
    num_threads: 10
    prefix: libopenblas
    filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
    version: 0.3.26.dev
    threading_layer: pthreads
    architecture: Nehalem