Multi-label Label Binarizer Memory Error #2441

@rsivapr

Description

I am working on a text classification problem. I have a largish dataset with about 5 million documents and close to 50000 classes. I have used the TfidfVectorizer to extract features (again about 1 million features) from the documents.

The problem obviously arises when I try to run any classifier, since OVR uses the label_binarize method, which creates an empty zeros array of shape (6mil x 50k). This is obviously not going to fit in my memory.

The question I have is: is there a built-in way around this, or should I modify the code for label_binarize to write to a sparse matrix instead? Is that doable?

I am open to any suggestions as well.
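As a rough illustration of the sparse-matrix idea, here is a minimal sketch of a sparse-output binarizer built on scipy.sparse. Note that `sparse_label_binarize` is a hypothetical helper written for this issue, not a scikit-learn function; it assumes each sample carries a single label, as in the `np_array[:,1]` call below.

```python
import numpy as np
from scipy import sparse

def sparse_label_binarize(y, classes):
    """Build a sparse (n_samples x n_classes) indicator matrix,
    avoiding the dense np.zeros allocation that label_binarize makes."""
    class_index = {c: i for i, c in enumerate(classes)}
    rows = np.arange(len(y))                                  # one row per sample
    cols = np.array([class_index[label] for label in y])      # column of each label
    data = np.ones(len(y), dtype=np.int8)                     # 1 byte per nonzero
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(y), len(classes)))

Y = sparse_label_binarize(['b', 'a', 'b'], classes=['a', 'b'])
# Y.toarray() -> [[0, 1], [1, 0], [0, 1]]
```

Memory then scales with the number of nonzeros (one per sample) rather than n_samples * n_classes. If upgrading is an option, later scikit-learn releases added a `sparse_output` parameter to `LabelBinarizer` that does essentially this.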

scikit-learn version 0.14.1

>>> clf = OneVsRestClassifier(LinearSVC())
>>> clf.fit(xtrain, np_array[:,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 201, in fit
    n_jobs=self.n_jobs)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 88, in fit_ovr
    Y = lb.fit_transform(y)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)
MemoryError
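Back-of-the-envelope arithmetic confirms why the `np.zeros` allocation in the traceback fails: with the sizes quoted in the issue (about 5 million samples, 50000 classes) and `np.int` at 8 bytes per entry on a 64-bit platform, the dense label matrix alone would need on the order of terabytes.

```python
# Dense cost of np.zeros((n_samples, n_classes), dtype=np.int),
# using the approximate sizes from the issue description.
n_samples = 5 * 10**6
n_classes = 50 * 10**3
bytes_needed = n_samples * n_classes * 8   # 8 bytes per int64 entry
print(bytes_needed / 2**40, "TiB")         # roughly 1.8 TiB
```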
