Fast loader for the svmlight / libsvm format#209
Fast loader for the svmlight / libsvm format#209mblondel merged 30 commits intoscikit-learn:masterfrom
Conversation
|
Very interesting. Do you think it would be possible to extend it (possibly in another function / pull-request) to also support the vowpal wabbit extensions? See for instance http://hunch.net/~vw/examples.html . Also that would be nice if we could include a low overhead C level callback mechanism (based on function pointers for instance) to be able to plug custom feature hashing logic and / or cross products of features as done by VW. |
|
Yes I'm very open to that (as long as it is backward-compatible with the more simple format). And yes, it would be more manageable for me to do it in another pull-request ;-) |
There's no need for them, and _[A-Z] is reserved for the C++ implementation, so the behavior is undefined per C++03.
|
I've cloned your branch into mblondel-svmlight. I see some opportunities for improvement, including optimization (you've got two levels of indirection due to |
|
@larsmans: Very happy that you're looking into this! I have made a commit since then so you'll have to merge. I will concentrate on the doc now, so you can safely work on it. |
|
Error propagation and narrative documentation done. Ok, that's all for today. |
…arn into mblondel-svmlight Conflicts: scikits/learn/datasets/_svmlight_format.cpp Added a comment.
|
@larsmans: I merged your contributions, thanks a lot. Please do add your name to the credits. |
|
There are a few more issues with this code which I feel obliged to fix, now that you've made me one of the authors :) But there's also a more fundamental problem: since we're reading from C++ IOstreams, there's no easy way to plug in support for compressed files. E.g., There is portable support for this in Boost, but in one of the few non-header only libraries. I'll see if I can fix something up with file descriptor I/O. |
|
How do you plan to do it? Gael won't be happy if we add boost as a dependency. |
|
I'd guessed as much. I think the GNU C++ library offers all the functionality I need. Let me hack for a while... |
|
Oh, by the way: if I understand the |
|
Yes you're completely right for |
|
No, I'll push a commit once I've implemented that; in the meanwhile, I noticed that a message printed to |
|
Speaking of support for compressed format, I think the whole data hacking community (hadoop, pig, avro, cassandra...) will soon be moving to snappy recently open sourced by Google: https://code.google.com/p/snappy/ . It does not compress as much as gzip or bz2 but it is meant to be much faster (much lower CPU overhead) while compressing the data enough to gain dramatic speedup on the IO side. The license is compatible with scikit-learn, is self contained and is small enough so that we can embed a copy of it within the source code of the datasets module: https://code.google.com/p/snappy/source/browse/trunk (it's written in C++ with a C API compatibility layer). That said, additional support for gzip and bzip2 would still be great if it does not add new dependency. |
Quick fix, maybe only partial: the deallocator seems not to be called in unit testing.
|
Last time I checked on my computer, |
|
Yes, the abstraction level went up, but consistency and safety increased accordingly and it's still only intermediate-level C++. I concur that the double vector owner type is a wart. As for
|
|
Thanks, that indeed looks much better. If you send me a pull request on mblondel/scikit-learn I will probably be able to merge it from the web interface and this will appear in this pull request too. We still need to figure why the dealloc function doesn't get called on your machine. |
Info from exceptions should be propagated up to the Python level, but currently isn't.
|
@mblondel: let's get back to the Pull request in a minute. |
SVMlight reader update
|
@larsmans: No problem. Let's see that next week then. I thought of a possible optim. Even though |
|
@mblondel: that may be worth a try! Maybe we can even do it statistically correct by estimating when we've seen sqrt(filesize)? |
Changes behavior: trying to open a non-existent file will result in an IOError rather than a ValueError. Slapped my name onto scikits/learn/datasets/svmlight_format.py out of sheer vanity. svmlight_format.py is now pep8 and pyflakes-clean.
…learn into svmlight_format
|
I usually work from my Linux workstation in my lab but tonight I used my macbook and I got two compile errors. The first one is that ./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’: ./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token ./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’: ./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token lipo: can't figure out the architecture type of: /var/folders/Lc/Lcm3JWXEErmlcY4v482kF++++TI/-Tmp-//ccoa14VW.out I've found fixes that I will push in a minute. I also found the reason for the deallocation problem. It was just a ref counting issue. Fix coming too. |
|
Lars, are there other things that you want to see done in this pull-request? Do we have an agreement on the function, parameter and module names? |
|
Maybe Also, there's still a few memory management/exception-safety issues, I believe. Let me have a good look at this code again tomorrow, maybe then we can pull it. |
Also some minor refactoring
Memory leak and exception safety in SVMlight reader
|
Are we good? |
|
AFAIK the function name comment by Lars is not yet addressed. Apart from this, this looks good (although I am not experienced in C++ enough to comment the implementation). Also I would appreciate additional comments from the usual reviewers: ping @fabianp @GaelVaroquaux and @agramfort ! |
|
I have a course to prepare, and will be travelling a lot in the next |
|
I think we're done, let the reviewers do their work. |
|
I have pushed the mldata pull-request to master. It is very likely that this branch needs a refresh to resolve conflicts on the Once this is done I am ok to merge if the others don't react. |
Conflicts: doc/modules/datasets.rst
|
The dataset generator module needs some love IMHO (inconsistent function names, missing doc, etc). I'll merge this pull request in a couple of hours if nobody reacts. |
Fast loader for the svmlight / libsvm format
|
I've been trying to rewrite the SVMlight reader in Cython to get rid of the boilerplate code. However, I can't implement the vector owner classes in Cython without an extra level of indirection because I might try again next week. |
Here's a pull-request that implements a fast and memory efficient loader for the svmlight / libsvm sparse dataset format. By memory efficient, I mean that it loads the dataset directly into a scipy sparse CSR without any memory copying.
It should make playing with scikit-learn much easier as a wealth of datasets are availabel in this format, see e.g. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
You can use it like this:
Timing on the MNIST dataset:
Timing on the news20 dataset:
(loading 2 datasets, learning and prediction)
The loader is coded in 250 lines of C++ and a few lines of Python (would be nice if experienced C++ programmers like @larsmans or @jakevdp could review).
Pure-python was out-of-the-question for me as it is really slow for parsing text (I think it would take 10 to 100 times slower and it would require memory copying). I didn't use Cython as I think the code would look just like a C++ dialect.
Also, I had to use the trick described by Travis Oliphant (http://blog.enthought.com/?p=62) to manage memory deallocation. As mentioned in the comments of this blog post, if we assume that the memory allocator is malloc, quite a few lines of code can be spared by using
array.flags |= NPY_OWNDATA. However, it seems to me that it is a dangerous assumption so I went for the solution suggested by Travis.What still needs to be done: