-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
GridsearchCV error "Access is denied" after finishing all folds of cv but before the final fit and joblib error in the next run. #6820
Description
Hi , I am running gridsearchcv on a 16 core and 112GB RAM Azure VM (rdp). While the same code works fine on my PC with 16 cores and 8GB RAM, I am getting the following error on the VM inconsistently like 8 out of 10 times.The shape of the train data is 800000 * 20. The grid search model is fitting all the folds as I can see from verbose(=10) but before calculating the "best_score_" and "best_estimator_".Below is the traceback. (I am getting the same error for n_jobs=-1 or a number less than 16 and refit is True and for different values of cv ranging from 4-10)
CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False
[CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False, score=0.620114 - 47.6s
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in
model.fit(x_train, y_train)
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 818, in call
self._terminate_pool()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 549, in _terminate_pool
self._pool.terminate() # terminate does a join()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 583, in terminate
super(MemmapingPool, self).terminate()
File "C:\Anaconda2\lib\multiprocessing\pool.py", line 465, in terminate
self._terminate()
File "C:\Anaconda2\lib\multiprocessing\util.py", line 207, in call
res = self._callback(_self._args, *_self._kwargs)
File "C:\Anaconda2\lib\multiprocessing\pool.py", line 513, in _terminate_pool
p.terminate()
File "C:\Anaconda2\lib\multiprocessing\process.py", line 137, in terminate
self._popen.terminate()
File "C:\Anaconda2\lib\multiprocessing\forking.py", line 312, in terminate
_subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied
ALSO, in the very next run of the same code, I am getting the following Joblib error.
Fitting 4 folds for each of 3 candidates, totalling 12 fits
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in
model.fit(x_train, y_train)
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 766, in call
n_jobs = self._initialize_pool()
File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 515, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if name == 'main'". Please see the joblib documentation on Parallel for more information.
This is the code,
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, precision_score, recall_score, precision_recall_curve, accuracy_score,confusion_matrix
import sklearn.grid_search
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import pipeline,metrics, grid_search
if name=="main":
train_data=pd.read_csv("train.csv",delimiter=",")
test_data=pd.read_csv("test.csv",delimiter=",")
n_cv=int(raw_input("Please tell us how many fold cross validation you would like to perform,please give a value between 3-10"))
assert (n_cv>=3) and (n_cv<=10), "Looks like the number of folds is not in between 3 and 10"
#scl = StandardScaler()
y_train=train_data["class"]
y_test=test_data["class"]
x_train=train_data
x_test=test_data
#x_test=list(x_test)
y_train=list(y_train)
y_train=[1 if i==True else 0 for i in y_train]
y_test=list(y_test)
y_test=[1 if i==True else 0 for i in y_test]
type_of_ml_model=raw_input("Please input rf for random forests, lr for logistic regression, svm for support vector machines and gbdt for gradient boosting")
if type_of_ml_model=="rf":
ml_model=RandomForestClassifier(random_state=5)
clf = pipeline.Pipeline([('ml', ml_model)])
param_grid = {
'ml__n_estimators':[100,500],
'ml__max_features':["auto",None,"log2"],
'ml__max_depth':[5,4,3],
'ml__min_samples_split':[1,2],
'ml__oob_score':[True,False],
'ml__class_weight':["balanced","balanced_subsample"]
}
elif type_of_ml_model=="lr":
ml_model=LogisticRegression(class_weight="balanced",random_state=5)
clf = pipeline.Pipeline([('ml', ml_model)])
param_grid = {#'ml__fit_intercept':[True],'ml__intercept_scaling':[1,2],
'ml__tol':[0.01],'ml__max_iter':[300],'ml__warm_start':[False],
'ml__C': [0.5,0.2,0.6],
'ml__penalty':["l1"]}
elif type_of_ml_model=="svm":
ml_model=LinearSVC(class_weight="balanced",random_state=5)
clf = pipeline.Pipeline([('ml', ml_model)])
param_grid = {'ml__loss':["squared_hinge"],'ml__dual':[False],'ml__fit_intercept':[True],
'ml__intercept_scaling':[2,3],
'ml__tol':[0.001,0.0001],'ml__max_iter':[200],
'ml__C': [1,0.8],
'ml__penalty':["l1"]}
elif type_of_ml_model=="gbdt":
ml_model=GradientBoostingClassifier(random_state=5)
clf = pipeline.Pipeline([('ml', ml_model)])
param_grid = {'ml__loss':["deviance"],
'ml__n_estimators':[100,200],
'ml__max_features':["auto",None],
'ml__max_depth':[4,3],
'ml__min_samples_split':[1,2],
'ml__subsample':[1.0],'ml__warm_start':[False],
'ml__min_samples_leaf':[1,2]
}
precision_scorer = metrics.make_scorer(precision_score, greater_is_better = True)
# Initialize Grid Search Model
model = grid_search.GridSearchCV(estimator = clf, param_grid=param_grid, scoring=precision_scorer,
verbose=10,n_jobs=-1, iid=True, refit=True, cv=n_cv)
# Fit Grid Search Model
model.fit(x_train, y_train)
Is there something in parameters that we need to change when we are running on a VM?