
Fast loader for the svmlight / libsvm format#209

Merged
mblondel merged 30 commits into scikit-learn:master from mblondel:svmlight_format
Jun 14, 2011

Conversation

@mblondel
Member

Here's a pull request that implements a fast and memory-efficient loader for the svmlight / libsvm sparse dataset format. By memory-efficient, I mean that it loads the dataset directly into a scipy sparse CSR matrix without any memory copying.

It should make playing with scikit-learn much easier, as a wealth of datasets is available in this format; see e.g. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

You can use it like this:

import sys
import numpy as np
from scikits.learn.datasets import load_svmlight_format
from scikits.learn.linear_model.sparse import SGDClassifier

X_train, y_train, X_test, y_test = load_svmlight_format(sys.argv[1], sys.argv[2])

y_pred = SGDClassifier().fit(X_train, y_train).predict(X_test)
print "Accuracy", np.mean(y_pred == y_test)

Timing on the MNIST dataset:

Accuracy 0.8731

real    0m3.104s
user    0m2.920s
sys     0m0.180s

Timing on the news20 dataset:

Accuracy 0.845730027548

real    0m1.134s
user    0m1.070s
sys     0m0.060s

(loading 2 datasets, learning and prediction)

The loader is coded in 250 lines of C++ and a few lines of Python (would be nice if experienced C++ programmers like @larsmans or @jakevdp could review).

Pure Python was out of the question for me, as it is really slow for parsing text (I think it would be 10 to 100 times slower, and it would require memory copying). I didn't use Cython, as I think the code would look just like a C++ dialect.

Also, I had to use the trick described by Travis Oliphant (http://blog.enthought.com/?p=62) to manage memory deallocation. As mentioned in the comments of that blog post, if we assume that the memory allocator is malloc, quite a few lines of code can be spared by using array.flags |= NPY_OWNDATA. However, that seems to me a dangerous assumption, so I went with the solution suggested by Travis.

What still needs to be done:

  • narrative doc and doctests [done]
  • check error handling [done]
  • return a scipy sparse matrix for Y if there's more than 1 label per row (multilabel) [for another pull request]

@ogrisel
Member

ogrisel commented Jun 10, 2011

Very interesting. Do you think it would be possible to extend it (possibly in another function / pull request) to also support the vowpal wabbit extensions? See for instance http://hunch.net/~vw/examples.html . Also, it would be nice if we could include a low-overhead C-level callback mechanism (based on function pointers, for instance) to be able to plug in custom feature hashing logic and/or cross products of features, as done by VW.

@mblondel
Member Author

Yes, I'm very open to that (as long as it is backward-compatible with the simpler format). And yes, it would be more manageable for me to do it in another pull request ;-)

mblondel and others added 3 commits June 11, 2011 02:04
There's no need for them, and _[A-Z] is reserved for the C++ implementation,
so the behavior is undefined per C++03.
@larsmans
Member

I've cloned your branch into mblondel-svmlight. I see some opportunities for improvement, including optimization (you've got two levels of indirection due to vector*).

@mblondel
Member Author

@larsmans: Very happy that you're looking into this! I have made a commit since then so you'll have to merge. I will concentrate on the doc now, so you can safely work on it.

@mblondel
Member Author

Error propagation and narrative documentation done. Ok, that's all for today.

@mblondel
Member Author

@larsmans: I merged your contributions, thanks a lot. Please do add your name to the credits.

@larsmans
Member

There are a few more issues with this code which I feel obliged to fix, now that you've made me one of the authors :)

But there's also a more fundamental problem: since we're reading from C++ IOstreams, there's no easy way to plug in support for compressed files. E.g., 20news.binary is 140MB of data compressed to 26MB; you'd want to bunzip2 it on the fly to save 80% of the disk accesses.

There is portable support for this in Boost, but in one of the few non-header only libraries. I'll see if I can fix something up with file descriptor I/O.

@mblondel
Member Author

How do you plan to do it? Gael won't be happy if we add boost as a dependency.

@larsmans
Member

I'd guessed as much. I think the GNU C++ library offers all the functionality I need. Let me hack for a while...

@larsmans
Member

Oh, by the way: if I understand the DeallocObject correctly, you've got a huge memory leak from relying on vector::clear to do deallocation (which it doesn't).

@mblondel
Member Author

Yes, you're completely right about vector::clear. I was aware of that issue but forgot to come back to it. I was planning to use swap(emptyvector). Is that the correct way to release the memory?

@larsmans
Member

No, swap won't do the trick either: even if it were to free the vector's contents, it would still leak the header (12 bytes + allocator overhead). I think the right way to handle this is to use placement new, which also gets rid of a level of indirection.

I'll push a commit once I've implemented that; in the meanwhile, I noticed that a message printed to std::cerr from dealloc doesn't show up in unit testing. Can you verify this and diagnose what's going on?

@ogrisel
Member

ogrisel commented Jun 11, 2011

Speaking of support for compressed formats, I think the whole data hacking community (hadoop, pig, avro, cassandra...) will soon be moving to snappy, recently open-sourced by Google: https://code.google.com/p/snappy/ . It does not compress as much as gzip or bz2, but it is meant to be much faster (much lower CPU overhead) while compressing the data enough to gain a dramatic speedup on the IO side.

The license is compatible with scikit-learn's, and it is self-contained and small enough that we could embed a copy of it within the source code of the datasets module:

https://code.google.com/p/snappy/source/browse/trunk (it's written in C++ with a C API compatibility layer).

That said, additional support for gzip and bzip2 would still be great if it does not add a new dependency.

larsmans added 2 commits June 11, 2011 12:35
Quick fix, maybe only partial: the deallocator seems not to be called
in unit testing.
@mblondel
Member Author

Last time I checked on my computer, dealloc did get called. I will check again on Monday as I'm not on my usual computer right now.

@larsmans
Member

Yes, the abstraction level went up, but consistency and safety increased accordingly and it's still only intermediate-level C++. I concur that the double vector owner type is a wart.

As for malloc and realloc: we could do that, but then

  • we'd have to do more pointer fiddling, resulting in lower-level, harder to read code, and
  • the parse_line function would become O(n²) unless we do our own dynamic array management, thus reinventing the wheel.

@mblondel
Member Author

Thanks, that indeed looks much better. If you send me a pull request on mblondel/scikit-learn I will probably be able to merge it from the web interface and this will appear in this pull request too.

We still need to figure out why the dealloc function doesn't get called on your machine.

Info from exceptions should be propagated up to the Python level,
but currently isn't.
@larsmans
Member

@mblondel: let's get back to the dealloc (renamed destroy_vector_owner) issue next week; I'm not behind my own workstation either, but working on an HPC cluster over SSH from a shabby old laptop. Not ideal for debugging :(

Pull request in a minute.

@mblondel
Member Author

@larsmans: No problem. Let's see that next week then.

I thought of a possible optimization. Even though vector allocates more memory than it needs so that it doesn't have to resize the internal array on every push_back, it would be nice if we could give vector some hints about its future size; vector has a reserve method for that. The idea would be to run the parser for, say, 1000 lines (1000 samples); then, given the file size, the current position in the file and the current vector size, it should be possible to make a rough estimate of how many elements will be needed for the remaining lines. Do you think we could expect improved performance from such an optimization? Together with the compressed file support, I think this should be addressed in a separate pull request, though.

@larsmans
Member

@mblondel: that may be worth a try! Maybe we can even do it in a statistically sound way by making the estimate once we've seen sqrt(filesize) bytes?

larsmans and others added 2 commits June 11, 2011 19:15
Changes behavior: trying to open a non-existent file will result in
an IOError rather than a ValueError.

Slapped my name onto scikits/learn/datasets/svmlight_format.py out
of sheer vanity.

svmlight_format.py is now pep8 and pyflakes-clean.
@mblondel
Member Author

I usually work from my Linux workstation in my lab but tonight I used my macbook and I got two compile errors. The first one is that vector.data() doesn't exist. The second comes from obj.~VectorOwner();. I get:

./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’:
./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token
./_svmlight_format.cpp: In function ‘void destroy_vector_owner(PyObject*)’:
./_svmlight_format.cpp:58: error: expected class-name before ‘(’ token
lipo: can't figure out the architecture type of: /var/folders/Lc/Lcm3JWXEErmlcY4v482kF++++TI/-Tmp-//ccoa14VW.out

I've found fixes that I will push in a minute. I also found the reason for the deallocation problem. It was just a ref counting issue. Fix coming too.

@mblondel
Member Author

Lars, are there other things that you want to see done in this pull-request?

Do we have an agreement on the function, parameter and module names?

@larsmans
Member

Maybe load_svmlight_file would be a better name; now we're suggesting the user can load a format, which doesn't make much sense.

Also, there are still a few memory management / exception-safety issues, I believe. Let me have a good look at this code again tomorrow; maybe then we can pull it.

@mblondel
Member Author

Are we good?

@ogrisel
Member

ogrisel commented Jun 13, 2011

AFAIK the function name comment by Lars is not yet addressed. Apart from that, this looks good (although I am not experienced enough in C++ to comment on the implementation).

Also I would appreciate additional comments from the usual reviewers: ping @fabianp @GaelVaroquaux and @agramfort !

@GaelVaroquaux
Member

I have a course to prepare, and will be travelling a lot in the next month. Simply move on without me. I am flat out. I'll try and keep an eye on what @vene does, but for something like this pull request, for which I have no expertise, I'll miss out.

@larsmans
Member

I think we're done, let the reviewers do their work.

@ogrisel
Member

ogrisel commented Jun 14, 2011

I have pushed the mldata pull request to master. It is very likely that this branch needs a refresh to resolve conflicts in the doc/modules/datasets.rst file.

Once this is done I am ok to merge if the others don't react.

Conflicts:
	doc/modules/datasets.rst
@mblondel
Member Author

The dataset generator module needs some love IMHO (inconsistent function names, missing doc, etc.).

I'll merge this pull request in a couple of hours if nobody reacts.

mblondel added a commit that referenced this pull request Jun 14, 2011
Fast loader for the svmlight / libsvm format
mblondel merged commit 20d7191 into scikit-learn:master on Jun 14, 2011
@larsmans
Member

I've been trying to rewrite the SVMlight reader in Cython to get rid of the boilerplate code. However, I can't implement the vector owner classes in Cython without an extra level of indirection, because:

C++ classes not allowed as members of an extension type, use a pointer or reference instead

I might try again next week.
