Multi-label Label Binarizer Memory Error #2441

@rsivapr

Description

I am working on a text classification problem. I have a largish dataset with about 5 million documents and close to 50000 classes. I have used the TfidfVectorizer to extract features (again about 1 million features) from the documents.

The problem obviously arises when I try to run any classifier, since OVR uses the label_binarize method, which creates an empty zeros array of shape (6mil x 50k). This is obviously not going to fit in my memory.

The question I have is: is there a built-in way around this, or should I modify the code for label_binarize to write to a sparse matrix instead? Is that doable?

I am open to any suggestions as well.
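As a rough illustration of the sparse-matrix idea, here is a minimal sketch of a sparse-output binarizer built on scipy.sparse. Note that `sparse_label_binarize` is a hypothetical helper written for this issue, not a scikit-learn function; it assumes each sample carries a single label, as in the `np_array[:,1]` call below.

```python
import numpy as np
from scipy import sparse

def sparse_label_binarize(y, classes):
    """Build a sparse (n_samples x n_classes) indicator matrix,
    avoiding the dense np.zeros allocation that label_binarize makes."""
    class_index = {c: i for i, c in enumerate(classes)}
    rows = np.arange(len(y))                                  # one row per sample
    cols = np.array([class_index[label] for label in y])      # column of each label
    data = np.ones(len(y), dtype=np.int8)                     # 1 byte per nonzero
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(y), len(classes)))

Y = sparse_label_binarize(['b', 'a', 'b'], classes=['a', 'b'])
# Y.toarray() -> [[0, 1], [1, 0], [0, 1]]
```

Memory then scales with the number of nonzeros (one per sample) rather than n_samples * n_classes. If upgrading is an option, later scikit-learn releases added a `sparse_output` parameter to `LabelBinarizer` that does essentially this.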

scikit-learn version 0.14.1

>>> clf = OneVsRestClassifier(LinearSVC())
>>> clf.fit(xtrain, np_array[:,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 201, in fit
    n_jobs=self.n_jobs)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 88, in fit_ovr
    Y = lb.fit_transform(y)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)
MemoryError
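Back-of-the-envelope arithmetic confirms why the `np.zeros` allocation in the traceback fails: with the sizes quoted in the issue (about 5 million samples, 50000 classes) and `np.int` at 8 bytes per entry on a 64-bit platform, the dense label matrix alone would need on the order of terabytes.

```python
# Dense cost of np.zeros((n_samples, n_classes), dtype=np.int),
# using the approximate sizes from the issue description.
n_samples = 5 * 10**6
n_classes = 50 * 10**3
bytes_needed = n_samples * n_classes * 8   # 8 bytes per int64 entry
print(bytes_needed / 2**40, "TiB")         # roughly 1.8 TiB
```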
