Incremental Isomap#3175

Closed
yanlend wants to merge 11 commits intoscikit-learn:masterfrom
yanlend:master

Conversation

@yanlend
Contributor

@yanlend yanlend commented May 21, 2014

Hi everybody,

I implemented incremental isomap as proposed in:
Law, Martin HC, and Anil K. Jain. "Incremental nonlinear dimensionality reduction by manifold learning." Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.3 (2006): 377-391.

As I'm not aware of an Estimator model in sklearn that does incremental fitting like this, I added fit_incremental and fit_transform_incremental. This is debatable.
Besides incremental_isomap.py, the most important changes are in graph_shortest_path.pyx, where the shortest paths are now returned in addition to the distances, and partial shortest-path computation is possible.

The speed of incremental learning depends heavily on the speed of nearest-neighbor search. I implemented only brute-force search here. Adaptive incremental learning in particular can get slow.
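For context, the geodesic-distance step that this PR makes incremental can be sketched as follows: build a k-nearest-neighbor graph, then take graph shortest paths as approximate geodesic distances. This is an illustrative SciPy sketch of the classic batch computation, not the PR's Cython code (which additionally returns the paths so they can be patched when a new point arrives); `knn_graph` is a made-up helper.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def knn_graph(X, k):
    """Dense k-nearest-neighbor distance graph (0 means no edge)."""
    D = cdist(X, X)
    G = np.zeros_like(D)
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest, skipping self
        G[i, nbrs] = D[i, nbrs]
    return G

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
# Dijkstra over the symmetrized k-NN graph gives the geodesic distances
# that Isomap then embeds with classical MDS.
geo = shortest_path(knn_graph(X, 5), method='D', directed=False)
```

When a point is added incrementally, only the shortest paths that pass through the new point's edges need updating, which is why the PR keeps the paths around instead of recomputing everything.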

@GaelVaroquaux
Member

As I'm not aware of a Estimator model in sklearn that does incremental fitting
like this, I added fit_incremental and fit_transform_incremental. This is
debatable.

Why wouldn't this fit into a 'partial_fit' API, as many other models (see
the SGDClassifier, for instance)?
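For reference, the partial_fit pattern mentioned here looks like this on SGDClassifier (a minimal sketch on synthetic data; on the first call the full set of classes must be supplied, since later batches may not contain them all):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)          # linearly separable toy labels

clf = SGDClassifier(random_state=0)
for start in range(0, 200, 50):        # four mini-batches of 50 samples
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

accuracy = clf.score(X, y)
```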

The speed of incremental learning depends heavily on the speed of nearest
neighbor search. I implemented only brute force search here.

Is there a reason you are not using the high-level API for nearest
neighbor search of scikit-learn? It would make automatically the right
choice of brute force vs a tree to have fast nearest neighbor search.
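A minimal sketch of that high-level API: with algorithm='auto', NearestNeighbors picks brute force, KDTree, or BallTree based on the data, so callers do not hard-code the search strategy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(100, 3)

# 'auto' lets scikit-learn choose the search structure from the data's
# size, dimensionality, and metric.
nn = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(X)
distances, indices = nn.kneighbors(X)
# Querying the training set itself: each point's nearest neighbor is
# the point itself, at distance 0.
```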

- Change incremental_isomap_utilities to use long instead of int32 (buffer mismatch before)
@yanlend
Contributor Author

yanlend commented May 22, 2014

The 'partial_fit' API seems to fit quite well. I was not aware of it and changed the class accordingly.

Regarding the nearest neighbor API of scikit-learn:
for incremental learning, I need to change the training data in each step. This is straightforward for 'brute', but as far as I can see it is not implemented and not so easy for "KDTree" and "BallTree". Creating a new tree in each iteration is very costly.
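To illustrate the point: with brute force, adding a sample is a single O(n) distance pass over the stored data plus appending a row, with no index to rebuild. This is an illustrative sketch, not the PR's code; `brute_knn_add` is a made-up helper.

```python
import numpy as np

def brute_knn_add(X_train, x_new, k):
    """Find the k nearest neighbors of x_new by brute force, then
    append x_new to the training set. A tree-based index would have
    to be rebuilt here instead."""
    d = np.linalg.norm(X_train - x_new, axis=1)   # one O(n) distance pass
    nbrs = np.argsort(d)[:k]
    return nbrs, np.vstack([X_train, x_new])

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
# A point almost identical to X[0]: its nearest neighbor should be 0.
nbrs, X2 = brute_knn_add(X, X[0] + 1e-6, k=3)
```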

@GaelVaroquaux
Member

Regarding the nearest neighbor API of scikit-learn: for incremental
learning, I need to change the training data in each step. This is
straightforward for 'brute', but as far as I can see it is not
implemented and not so easy for "KDTree" and "BallTree". Creating a
new tree in each iteration is very costly.

OK, it makes sense.

Member

For consistency with other online learners, I think we would call this "MiniBatchIsomap". What do you think?

Contributor Author

Yes, we could do that.
But be aware that this "MiniBatchIsomap" still needs to store the training data for all samples, so it is not an option if the training data does not fit into memory.

@jnothman
Member

Regarding the nearest neighbor API of scikit-learn: for incremental
learning, I need to change the training data in each step.

Perhaps the GSoC work on Approximate NN by @maheshakya should take this application into account.

@maheshakya
Contributor

I think the major concern in this incremental learning is building the index in the nearest-neighbor data structure, since the training data changes frequently. ANN work concentrates more on query speed; I guess that with some extra work I could speed up indexing as well.
At the current stage, we have not decided which ANN method we should implement here. But I think that with the LSH-based ANN methods we can expect speed improvements in indexing (hopefully better than building new trees in each iteration).

@yanlend
Contributor Author

yanlend commented Jul 1, 2014

The continuous integration build still has errors. The tests run fine on my local installation (Python 2.7 64-bit, numpy 1.7.1, Windows 7). This setup is the one I am stuck with, and it's hard to install all possible configurations.

The main issue seems to be datatype incompatibilities, e.g. like this:

test_isomap.test_isomap_simple_grid ... Exception ValueError: "Buffer dtype mismatch, expected 'ITYPE_t' but got 'int'" in 'graph_shortest_path.dijkstra' ignored
(I defined ITYPE_t as np.int_t.)

Does anybody have experience with these kinds of problems? What is the correct datatype to include in Cython? Should I convert datatypes manually each time before making a call to Cython?
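One common remedy for such mismatches is to normalize dtypes once at the Python/Cython boundary with np.asarray, which is a no-op when the dtype already matches. Note also that np.int_ maps to C long, which is 32-bit on 64-bit Windows, so an explicit index dtype such as np.intp or np.int32 tends to behave more consistently across platforms. The sketch below is illustrative; the names ITYPE/DTYPE and `as_cython_buffers` are assumptions, not the PR's actual code.

```python
import numpy as np

ITYPE = np.intp      # pointer-sized integer: matches what index arrays use
DTYPE = np.float64

def as_cython_buffers(dist_matrix, index_array):
    """Cast inputs to the dtypes a typed Cython function expects.
    np.asarray only copies when the dtype or layout actually differs."""
    dist = np.asarray(dist_matrix, dtype=DTYPE, order='C')
    idx = np.asarray(index_array, dtype=ITYPE, order='C')
    return dist, idx

d, i = as_cython_buffers([[0.0, 1.0], [1.0, 0.0]], [0, 1])
```

With wrappers like this, the Cython code itself never sees an unexpected dtype, whatever the caller passed in.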

Member

Here we have a change of the public API. We need to use a property for backward compatibility for a couple of releases.

Contributor Author

Ok that's bad. I'll revert the change.

yanlend added 2 commits July 2, 2014 10:19
- remove unnecessary returns
- used kng_symmetric_ where possible
Conflicts:
	sklearn/manifold/__init__.py
@GaelVaroquaux
Member

The continuous integration problem seems to be fairly trivial: I believe that it is just a question of updating your local master branch of scikit-learn, and then merging to it.

@GaelVaroquaux
Member

Hi @yanlend

I am coming back to this after taking a step back to think a bit (and actually discussing it with Thomas Pohl).

To summarize, as I understand them, the value and limitations of this incremental isomap are:

  • Pro: It makes it possible to add a new data point to an isomap model much more quickly than refitting on the whole dataset.
  • Con: It requires storing the whole dataset. For this reason, when working from a stream of data, a common strategy is to "forget" old data in order to avoid unbounded memory growth.

In scikit-learn, we do not try to solve the problem of streaming data, as the assumption of scikit-learn is that samples can be shuffled (e.g. when doing cross-validation). I am not saying that there is no value in solving these problems, which usually relate to time-ordered data. It's actually a problem setting that comes up quite often, and I think there would be value in a package that implements machine learning on time series as a derivative of scikit-learn. The people at Blue Yonder have a large internal codebase that does this.

We do have an interest in online learning, i.e. incremental fitting of models using the 'partial_fit' API. One reason is that on big data it can often be faster to run partial_fit on mini-batches of the data than to fit the model on the whole data. The other reason is that if the data does not fit in memory, reading it in chunks and calling partial_fit saves memory.
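The out-of-core pattern described here can be sketched with MiniBatchKMeans: stream the data in chunks and call partial_fit on each, so only one chunk is ever in memory. The chunk generator below stands in for reading from disk; the numbers are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def chunk_stream(n_chunks, chunk_size, rng):
    """Stand-in for reading a large dataset from disk chunk by chunk."""
    for _ in range(n_chunks):
        yield rng.randn(chunk_size, 2)

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=1)
for chunk in chunk_stream(n_chunks=10, chunk_size=100, rng=rng):
    km.partial_fit(chunk)           # only this chunk is in memory

labels = km.predict(rng.randn(5, 2))
```

The contrast with this PR is that the incremental isomap must keep every past sample around, so the memory saving of the pattern above does not apply.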

I am not sure that, in its current state, this pull request brings these benefits. In particular, the fact that you need to store the training data runs somewhat against the purpose of partial_fit in scikit-learn.
Thus, I am afraid the pull request does not fit in the scope of scikit-learn. I am really sorry, as this is beautiful work. Maybe you can host it elsewhere?

@yanlend
Contributor Author

yanlend commented Aug 31, 2014

Hi @GaelVaroquaux,
thanks a lot for your comment.
I agree with your view of the value and limitations of the method. It requires storing the whole dataset and is mainly useful for time-series data.

If this is currently not in the scope of scikit learn, the pull-request can be closed. I will figure out another way to host it.

@yanlend yanlend closed this Sep 2, 2014
@GaelVaroquaux
Member

@yanlend: truly, thank you for submitting this work to scikit-learn. I am
sorry that it took so long to identify that it was out of scope for
scikit-learn.
