
Fast loader for the svmlight / libsvm format#209

Merged
mblondel merged 30 commits into scikit-learn:master from mblondel:svmlight_format
Jun 14, 2011

Conversation

@mblondel
Member

Here's a pull request that implements a fast and memory-efficient loader for the svmlight / libsvm sparse dataset format. By memory-efficient, I mean that it loads the dataset directly into a scipy sparse CSR matrix without any memory copying.

It should make playing with scikit-learn much easier, as a wealth of datasets is available in this format; see e.g. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

You can use it like this:

import sys
import numpy as np
from scikits.learn.datasets import load_svmlight_format
from scikits.learn.linear_model.sparse import SGDClassifier

X_train, y_train, X_test, y_test = load_svmlight_format(sys.argv[1], sys.argv[2])

y_pred = SGDClassifier().fit(X_train, y_train).predict(X_test)
print "Accuracy", np.mean(y_pred == y_test)

Timing on the MNIST dataset:

Accuracy 0.8731

real    0m3.104s
user    0m2.920s
sys     0m0.180s

Timing on the news20 dataset:

Accuracy 0.845730027548

real    0m1.134s
user    0m1.070s
sys     0m0.060s

(loading 2 datasets, learning and prediction)

The loader is coded in 250 lines of C++ and a few lines of Python (would be nice if experienced C++ programmers like @larsmans or @jakevdp could review).

Pure Python was out of the question for me, as it is really slow for parsing text (I think it would be 10 to 100 times slower, and it would require memory copying). I didn't use Cython, as I think the code would look just like a C++ dialect.

Also, I had to use the trick described by Travis Oliphant (http://blog.enthought.com/?p=62) to manage memory deallocation. As mentioned in the comments of that blog post, if we assume that the memory allocator is malloc, quite a few lines of code can be spared by using array.flags |= NPY_OWNDATA. However, that seems to me a dangerous assumption, so I went with the solution suggested by Travis.

What still needs to be done:

  • narrative doc and doctests [done]
  • check error handling [done]
  • return a scipy sparse matrix for Y if there's more than 1 label per row (multilabel) [for another pull request]

@ogrisel
Member

ogrisel commented Jun 10, 2011

Very interesting. Do you think it would be possible to extend it (possibly in another function / pull request) to also support the vowpal wabbit extensions? See for instance http://hunch.net/~vw/examples.html . Also, it would be nice if we could include a low-overhead C-level callback mechanism (based on function pointers, for instance) to be able to plug in custom feature hashing logic and/or cross products of features, as done by VW.

@mblondel
Member Author

Yes, I'm very open to that (as long as it is backward-compatible with the simpler format). And yes, it would be more manageable for me to do it in another pull request ;-)

mblondel and others added 3 commits June 11, 2011 02:04
There's no need for them, and _[A-Z] is reserved for the C++ implementation,
so the behavior is undefined per C++03.
@larsmans
Member

I've cloned your branch into mblondel-svmlight. I see some opportunities for improvement, including optimization (you've got two levels of indirection due to vector*).

@mblondel
Member Author

@larsmans: Very happy that you're looking into this! I have made a commit since then so you'll have to merge. I will concentrate on the doc now, so you can safely work on it.

@mblondel
Member Author

Error propagation and narrative documentation done. Ok, that's all for today.

@mblondel
Member Author

@larsmans: I merged your contributions, thanks a lot. Please do add your name to the credits.

@larsmans
Member

There are a few more issues with this code which I feel obliged to fix, now that you've made me one of the authors :)

But there's also a more fundamental problem: since we're reading from C++ IOstreams, there's no easy way to plug in support for compressed files. E.g., 20news.binary is 140MB of data compressed to 26MB; you'd want to bunzip2 it on the fly to save 80% of the disk accesses.

There is portable support for this in Boost, but in one of the few non-header only libraries. I'll see if I can fix something up with file descriptor I/O.

@mblondel
Member Author

How do you plan to do it? Gael won't be happy if we add boost as a dependency.

@larsmans
Member

I'd guessed as much. I think the GNU C++ library offers all the functionality I need. Let me hack for a while...

@larsmans
Member

Oh, by the way: if I understand the DeallocObject correctly, you've got a huge memory leak from relying on vector::clear to do deallocation (which it doesn't).

@mblondel
Member Author

Yes, you're completely right about vector::clear. I was aware of that issue but forgot to come back to it. I was planning to use swap(emptyvector). Is that the correct way to release the memory?

@larsmans
Member

No, swap won't do the trick either: even if it were to free the vector's contents, it would still leak the header (12 bytes + allocator overhead). I think the right way to handle this is to use placement new, which also gets rid of a level of indirection.

I'll push a commit once I've implemented that; in the meanwhile, I noticed that a message printed to std::cerr from dealloc doesn't show up in unit testing. Can you verify this and diagnose what's going on?

@ogrisel
Member

ogrisel commented Jun 11, 2011

Speaking of support for compressed formats, I think the whole data hacking community (hadoop, pig, avro, cassandra...) will soon be moving to snappy, recently open-sourced by Google: https://code.google.com/p/snappy/ . It does not compress as much as gzip or bz2, but it is meant to be much faster (much lower CPU overhead) while compressing the data enough to gain a dramatic speedup on the IO side.

The license is compatible with scikit-learn's, and it is self-contained and small enough that we could embed a copy of it within the source code of the datasets module:

https://code.google.com/p/snappy/source/browse/trunk (it's written in C++ with a C API compatibility layer).

That said, additional support for gzip and bzip2 would still be great if it does not add a new dependency.

larsmans added 2 commits June 11, 2011 12:35
Quick fix, maybe only partial: the deallocator seems not to be called
in unit testing.
@mblondel
Member Author

Last time I checked on my computer, dealloc did get called. I will check again on Monday as I'm not on my usual computer right now.

@larsmans
Member

Yes, the abstraction level went up, but consistency and safety increased accordingly and it's still only intermediate-level C++. I concur that the double vector owner type is a wart.

As for malloc and realloc: we could do that, but then

  • we'd have to do more pointer fiddling, resulting in lower-level, harder to read code, and
  • the parse_line function would become O(n²) unless we do our own dynamic array management, thus reinventing the wheel.

@mblondel
Member Author

Thanks, that indeed looks much better. If you send me a pull request on mblondel/scikit-learn I will probably be able to merge it from the web interface and this will appear in this pull request too.

We still need to figure out why the dealloc function doesn't get called on your machine.

Info from exceptions should be propagated up to the Python level,
but currently isn't.
@larsmans
Member

@mblondel: let's get back to the dealloc (renamed destroy_vector_owner) issue next week; I'm not behind my own workstation either, but working on an HPC cluster over SSH from a shabby old laptop. Not ideal for debugging :(

Pull request in a minute.

@mblondel
Member Author

@larsmans: No problem. Let's see that next week then.

I thought of a possible optimization. Even though vector allocates more memory than it needs so that it doesn't have to resize the internal array on every push_back, it would be nice if we could give vector some hints about its future size; vector has a reserve method for that. The idea would be to run the parser for, say, 1000 lines (1000 samples); then, given the file size, the current position in the file and the current vector size, it should be possible to make a rough estimate of how many elements will be needed for the remaining lines. Do you think we could expect improved performance from such an optimization? Together with the compressed file support, I think this should be addressed in a separate pull request, though.

@larsmans
Member

@mblondel: that may be worth a try! Maybe we can even do it in a statistically sound way by making the estimate once we've seen sqrt(filesize) bytes?

larsmans and others added 2 commits June 11, 2011 19:15
Changes behavior: trying to open a non-existent file will result in
an IOError rather than a ValueError.

Slapped my name onto scikits/learn/datasets/svmlight_format.py out
of sheer vanity.

svmlight_format.py is now pep8 and pyflakes-clean.
@mblondel
Member Author

I usually work from my Linux workstation in my lab but tonight I used my macbook and I got two compile errors. The first one is that vector.data() doesn't exist. The second comes from obj.~VectorOwner();. I get:

./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’:
./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token
./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’:
./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token
lipo: can't figure out the architecture type of: /var/folders/Lc/Lcm3JWXEErmlcY4v482kF++++TI/-Tmp-//ccoa14VW.out

I've found fixes that I will push in a minute. I also found the reason for the deallocation problem. It was just a ref counting issue. Fix coming too.

@mblondel
Member Author

Lars, are there other things that you want to see done in this pull-request?

Do we have an agreement on the function, parameter and module names?

@larsmans
Member

Maybe load_svmlight_file would be a better name; now we're suggesting the user can load a format, which doesn't make much sense.

Also, there are still a few memory management / exception-safety issues, I believe. Let me have a good look at this code again tomorrow; maybe then we can pull it.

@mblondel
Member Author

Are we good?

@ogrisel
Member

ogrisel commented Jun 13, 2011

AFAIK the function name comment by Lars is not yet addressed. Apart from that, this looks good (although I am not experienced enough in C++ to comment on the implementation).

Also I would appreciate additional comments from the usual reviewers: ping @fabianp @GaelVaroquaux and @agramfort !

@GaelVaroquaux
Member

I have a course to prepare, and will be travelling a lot in the next month. Simply move on without me. I am flat out. I'll try and keep an eye on what @vene does, but for something like this pull request, for which I have no expertise, I'll miss out.

@larsmans
Member

I think we're done, let the reviewers do their work.

@ogrisel
Member

ogrisel commented Jun 14, 2011

I have pushed the mldata pull request to master. It is very likely that this branch needs a refresh to resolve conflicts in the doc/modules/datasets.rst file.

Once this is done I am ok to merge if the others don't react.

Conflicts:
	doc/modules/datasets.rst
@mblondel
Member Author

The dataset generator module needs some love IMHO (inconsistent function names, missing doc, etc.).

I'll merge this pull request in a couple of hours if nobody reacts.

mblondel added a commit that referenced this pull request Jun 14, 2011
Fast loader for the svmlight / libsvm format
mblondel merged commit 20d7191 into scikit-learn:master on Jun 14, 2011
@larsmans
Member

I've been trying to rewrite the SVMlight reader in Cython to get rid of the boilerplate code. However, I can't implement the vector owner classes in Cython without an extra level of indirection, because:

C++ classes not allowed as members of an extension type, use a pointer or reference instead

I might try again next week.
