Multi-label Label Binarizer Memory Error #2441
Closed
I am working on a text classification problem. I have a largish dataset with about 5 million documents and close to 50,000 classes. I have used the TfidfVectorizer to extract features (again, about 1 million features) from the documents.
The problem obviously arises when I try to run any classifier, since OVR uses the label_binarize method, which creates an empty zeros array of shape (6mil x 50k). This is obviously not going to fit in my memory.
The question I have is: Is there a built-in way around this or should I modify the code for label_binarize to write to a sparse matrix instead? Is this doable?
I am open to any suggestions as well.
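To illustrate the kind of workaround I have in mind: a minimal sketch of binarizing multi-label targets directly into a scipy.sparse CSR matrix instead of the dense np.zeros array that label_binarize allocates. Note that `sparse_label_binarize` is a hypothetical helper written for this issue, not an existing scikit-learn function; whether the downstream OVR code accepts a sparse Y is a separate question.

```python
import numpy as np
from scipy import sparse


def sparse_label_binarize(y, classes):
    """Binarize multi-label targets into a sparse CSR indicator matrix.

    y: sequence of label collections, one collection per sample.
    classes: sequence of all possible labels.

    Only the nonzero entries are materialized, so memory scales with the
    number of (sample, label) pairs rather than n_samples * n_classes.
    """
    class_index = {c: i for i, c in enumerate(classes)}
    rows, cols = [], []
    for i, labels in enumerate(y):
        for label in labels:
            rows.append(i)
            cols.append(class_index[label])
    data = np.ones(len(rows), dtype=np.int8)
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(y), len(classes)))


# Tiny usage example: three samples, three classes.
Y = sparse_label_binarize([("a",), ("a", "c"), ("b",)], ["a", "b", "c"])
print(Y.toarray())
# [[1 0 0]
#  [1 0 1]
#  [0 1 0]]
```

For 5 million documents with a handful of labels each, this stores tens of millions of int8 entries instead of a 6mil x 50k dense array.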
ver 0.14.1
```
>>> clf = OneVsRestClassifier(LinearSVC())
>>> clf.fit(xtrain, np_array[:,1])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 201, in fit
n_jobs=self.n_jobs)
File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 88, in fit_ovr
Y = lb.fit_transform(y)
File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
neg_label=self.neg_label)
File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
Y = np.zeros((len(y), len(classes)), dtype=np.int)
MemoryError
```