Conversation
Conflicts: sklearn/utils/setup.py
wrong build, incremental_isomap_utilities.cpp should be used
-remove double upload incremental_isomap_utilities
Why wouldn't this fit into a 'partial_fit' API, as many other models (see
Is there a reason you are not using the high-level API for nearest
- Change incremental_isomap_utilities to use long instead of int32 (buffer mismatch before)
The 'partial_fit' API seems to fit quite well. I was not aware of it and changed the class accordingly. Regarding the nearest neighbor API of scikit-learn: |
OK, it makes sense.
For consistency with other online learners, I think that we would call this "MiniBatchIsomap". What do you think?
Yes, we could do that.
But be aware that this "MiniBatchIsomap" still needs to store the training data for all samples, so it is not an option if the training data does not fit into memory.
Perhaps the GSoC work on Approximate NN by @maheshakya should take this application into account.
I think that the major concern in this kind of incremental learning is building the index in the nearest-neighbor data structure, since the training data changes frequently. The ANN work concentrates more on query speed. I guess that with some extra work I could speed up indexing as well.
The continuous integration build still has errors. The tests run fine on my local installation (Python 2.7 64-bit, numpy 1.7.1, Windows 7). This setup is compulsory for me, and it is hard to install all possible configurations. The main issue seems to be datatype incompatibilities, e.g.:

test_isomap.test_isomap_simple_grid ... Exception ValueError: "Buffer dtype mismatch, expected 'ITYPE_t' but got 'int'" in 'graph_shortest_path.dijkstra' ignored

Does anybody have experience with these kinds of problems? What is the correct datatype to use in Cython? Should I convert datatypes manually each time before making a call into Cython?
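One common fix for this class of error is to cast index arrays to the exact dtype the Cython buffer declares before calling into the compiled code, since buffer typing rejects implicit casts (e.g. platform-default `int64` vs a declared 32-bit index type). A minimal sketch, assuming the project's `ITYPE` is `np.int32` (check the actual definition in the `.pyx` file; the helper name `as_itype` is hypothetical):

```python
import numpy as np

# Assumption: the Cython module declares its index type as 32-bit ints.
# Verify against the ITYPE definition in graph_shortest_path.pyx.
ITYPE = np.int32

def as_itype(indices):
    """Return `indices` as a C-contiguous array of the exact Cython index dtype.

    np.ascontiguousarray is a no-op when the input already matches, so
    calling this defensively before every Cython call is cheap.
    """
    return np.ascontiguousarray(indices, dtype=ITYPE)

# Platform-default integer dtype (often int64) would trip the buffer check:
neighbors = np.array([[0, 1], [1, 2]])
neighbors = as_itype(neighbors)  # now matches the declared buffer dtype
```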
sklearn/manifold/isomap.py
Here we have a change of the public API. We need to use a property for backward compatibility for a couple of releases.
Ok that's bad. I'll revert the change.
- remove unnecessary returns - used kng_symmetric_ where possible
Conflicts: sklearn/manifold/__init__.py
The continuous integration problem seems to be fairly trivial: I believe that it is just a question of updating your local master branch of scikit-learn and then merging it into your branch.
Hi @yanlend, I am coming back to this after taking a step back to think a bit (and actually discussing it with Thomas Pohl). To summarize what I understand the value and limitations of this incremental isomap to be:

In scikit-learn, we do not try to solve the problem of streaming data, as the assumption of scikit-learn is that samples can be shuffled (e.g. when doing cross-validation). I am not saying that there is no value in solving these problems, which usually relate to time-flowing data. It is actually a problem set that comes up quite often, and I think that there would be value in a package that implements machine learning on time series as a derivative of scikit-learn. The people at Blue Yonder have a large internal codebase that does this.

We do have an interest in online learning, i.e. incremental fitting of models using the 'partial_fit' API. One reason is that on big data, it can often be faster to run partial_fit on mini-batches of the data than to fit the model on the whole data. The other reason is that if the data does not fit in memory, reading it by chunks and calling partial_fit can save memory. I am not sure that, in its current state, this pull request brings these benefits. In particular, the fact that you need to store the training data is somewhat against the purpose of partial_fit in scikit-learn.
Hi @GaelVaroquaux, if this is currently not in the scope of scikit-learn, the pull request can be closed. I will figure out another way to host it.
@yanlend: truly, thank you for submitting this work to scikit-learn. I am
Hi everybody,
I implemented incremental isomap as proposed in:
Law, Martin H. C., and Anil K. Jain. "Incremental nonlinear dimensionality reduction by manifold learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 28.3 (2006): 377-391.
As I'm not aware of an Estimator model in sklearn that does incremental fitting like this, I added fit_incremental and fit_transform_incremental. This is debatable.
Besides incremental_isomap.py, the most important changes are in graph_shortest_path.pyx, where the shortest paths are now returned in addition to the distances, and partial shortest-path computation is possible.
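The idea of returning paths alongside distances can be illustrated with a predecessor array: each node records the node it was reached from, and a path is reconstructed by walking those links backwards. This is a pure-Python sketch of the concept, not the actual Cython implementation in graph_shortest_path.pyx:

```python
import heapq

def dijkstra_with_paths(graph, source):
    """Dijkstra's algorithm returning both distances and predecessors.

    `graph` maps node -> list of (neighbor, weight). The predecessor map
    is what allows shortest paths (not just their lengths) to be recovered.
    """
    dist = {source: 0.0}
    pred = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], pred[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, pred

def reconstruct(pred, target):
    """Walk predecessor links back to the source and reverse the result."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]

g = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: []}
dist, pred = dijkstra_with_paths(g, 0)
print(dist[2], reconstruct(pred, 2))  # 2.0 [0, 1, 2]
```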
The speed of incremental learning depends heavily on the speed of the nearest-neighbor search. I implemented only brute-force search here. Adaptive incremental learning in particular can get slow.
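For reference, brute-force nearest-neighbor search computes the distance from the query to every training sample, which is what makes it the bottleneck as the training set grows. A minimal NumPy sketch of the strategy (not the PR's code):

```python
import numpy as np

def brute_force_knn(X_train, query, k):
    """Return indices of the k nearest training samples, closest first.

    O(n * d) per query: every stored sample is compared against the query,
    which is why index structures (or ANN) matter for incremental learning.
    """
    d2 = np.sum((X_train - query) ** 2, axis=1)  # squared Euclidean distances
    idx = np.argpartition(d2, k)[:k]             # k smallest, in arbitrary order
    return idx[np.argsort(d2[idx])]              # sort those k by distance

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.1]])
print(brute_force_knn(X, np.array([0.0, 0.0]), 2))  # [0 3]
```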