Incremental Isomap#3175

Closed
yanlend wants to merge 11 commits intoscikit-learn:masterfrom
yanlend:master

Conversation

@yanlend
Contributor

@yanlend yanlend commented May 21, 2014

Hi everybody,

I implemented incremental isomap as proposed in:
Law, Martin HC, and Anil K. Jain. "Incremental nonlinear dimensionality reduction by manifold learning." Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.3 (2006): 377-391.

As I'm not aware of an Estimator model in sklearn that does incremental fitting like this, I added fit_incremental and fit_transform_incremental. This is debatable.
Besides incremental_isomap.py, the most important changes are in graph_shortest_path.pyx, where the shortest paths are now returned in addition to the distances, and partial shortest-path computation is possible.

The speed of incremental learning depends heavily on the speed of nearest-neighbor search. I implemented only brute-force search here. Adaptive incremental learning in particular can get slow.
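For context, the geodesic-distance step that this PR makes incremental can be sketched as follows: build a k-nearest-neighbor graph, then take graph shortest paths as approximate geodesic distances. This is an illustrative SciPy sketch of the classic batch computation, not the PR's Cython code (which additionally returns the paths so they can be patched when a new point arrives); `knn_graph` is a made-up helper.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def knn_graph(X, k):
    """Dense k-nearest-neighbor distance graph (0 means no edge)."""
    D = cdist(X, X)
    G = np.zeros_like(D)
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest, skipping self
        G[i, nbrs] = D[i, nbrs]
    return G

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
# Dijkstra over the symmetrized k-NN graph gives the geodesic distances
# that Isomap then embeds with classical MDS.
geo = shortest_path(knn_graph(X, 5), method='D', directed=False)
```

When a point is added incrementally, only the shortest paths that pass through the new point's edges need updating, which is why the PR keeps the paths around instead of recomputing everything.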

@GaelVaroquaux
Member

As I'm not aware of a Estimator model in sklearn that does incremental fitting
like this, I added fit_incremental and fit_transform_incremental. This is
debatable.

Why wouldn't this fit into a 'partial_fit' API, as many other models (see
the SGDClassifier, for instance)?
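For reference, the partial_fit pattern mentioned here looks like this on SGDClassifier (a minimal sketch on synthetic data; on the first call the full set of classes must be supplied, since later batches may not contain them all):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)          # linearly separable toy labels

clf = SGDClassifier(random_state=0)
for start in range(0, 200, 50):        # four mini-batches of 50 samples
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

accuracy = clf.score(X, y)
```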

The speed of incremental learning depends heavily on the speed of nearest
neighbor search. I implemented only brute force search here.

Is there a reason you are not using the high-level API for nearest
neighbor search of scikit-learn? It would make automatically the right
choice of brute force vs a tree to have fast nearest neighbor search.
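A minimal sketch of that high-level API: with algorithm='auto', NearestNeighbors picks brute force, KDTree, or BallTree based on the data, so callers do not hard-code the search strategy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(100, 3)

# 'auto' lets scikit-learn choose the search structure from the data's
# size, dimensionality, and metric.
nn = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(X)
distances, indices = nn.kneighbors(X)
# Querying the training set itself: each point's nearest neighbor is
# the point itself, at distance 0.
```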

- Change incremental_isomap_utilities to use long instead of int32 (buffer mismatch before)
@yanlend
Contributor Author

yanlend commented May 22, 2014

The 'partial_fit' API seems to fit quite well. I was not aware of it and changed the class accordingly.

Regarding the nearest neighbor API of scikit-learn:
for incremental learning, I need to change the training data in each step. This is straightforward for 'brute', but as far as I can see it is not implemented and not so easy for "KDTree" and "BallTree". Creating a new tree in each iteration is very costly.
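To illustrate the point: with brute force, adding a sample is a single O(n) distance pass over the stored data plus appending a row, with no index to rebuild. This is an illustrative sketch, not the PR's code; `brute_knn_add` is a made-up helper.

```python
import numpy as np

def brute_knn_add(X_train, x_new, k):
    """Find the k nearest neighbors of x_new by brute force, then
    append x_new to the training set. A tree-based index would have
    to be rebuilt here instead."""
    d = np.linalg.norm(X_train - x_new, axis=1)   # one O(n) distance pass
    nbrs = np.argsort(d)[:k]
    return nbrs, np.vstack([X_train, x_new])

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
# A point almost identical to X[0]: its nearest neighbor should be 0.
nbrs, X2 = brute_knn_add(X, X[0] + 1e-6, k=3)
```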

@GaelVaroquaux
Member

Regarding the nearest neighbor API of scikit-learn: for incremental
learning, I need to change the training data in each step. This is
straightforward for 'brute', but as far as I can see it is not
implemented and not so easy for "KDTree" and "BallTree". Creating a
new tree in each iteration is very costly.

OK, it makes sense.

Member

For consistency with other online learners, I think we would call this "MiniBatchIsomap". What do you think?

Contributor Author

Yes, we could do that.
But be aware that this "MiniBatchIsomap" still needs to store the training data for all samples, so it is not an option if the training data does not fit into memory.

@jnothman
Member

Regarding the nearest neighbor API of scikit-learn: for incremental
learning, I need to change the training data in each step.

Perhaps the GSoC work on Approximate NN by @maheshakya should take this application into account.

@maheshakya
Contributor

I think the major concern in this incremental learning is building the index in the nearest-neighbor data structure, since the training data changes frequently. ANN work concentrates more on query speed; I guess that with some extra work I could speed up indexing as well.
At the current stage, we have not decided which ANN method we should implement here. But I think that with the LSH-based ANN methods we can expect speed improvements in indexing (hopefully better than building new trees in each iteration).

@yanlend
Contributor Author

yanlend commented Jul 1, 2014

The continuous integration build still has errors. The tests run fine on my local installation (Python 2.7 64-bit, numpy 1.7.1, Windows 7). This setup is the one I am stuck with, and it's hard to install all possible configurations.

The main issue seems to be datatype incompatibilities, e.g. like this:

test_isomap.test_isomap_simple_grid ... Exception ValueError: "Buffer dtype mismatch, expected 'ITYPE_t' but got 'int'" in 'graph_shortest_path.dijkstra' ignored
(I defined ITYPE_t as np.int_t.)

Does anybody have experience with these kinds of problems? What is the correct datatype to include in Cython? Should I convert datatypes manually each time before making a call to Cython?
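One common remedy for such mismatches is to normalize dtypes once at the Python/Cython boundary with np.asarray, which is a no-op when the dtype already matches. Note also that np.int_ maps to C long, which is 32-bit on 64-bit Windows, so an explicit index dtype such as np.intp or np.int32 tends to behave more consistently across platforms. The sketch below is illustrative; the names ITYPE/DTYPE and `as_cython_buffers` are assumptions, not the PR's actual code.

```python
import numpy as np

ITYPE = np.intp      # pointer-sized integer: matches what index arrays use
DTYPE = np.float64

def as_cython_buffers(dist_matrix, index_array):
    """Cast inputs to the dtypes a typed Cython function expects.
    np.asarray only copies when the dtype or layout actually differs."""
    dist = np.asarray(dist_matrix, dtype=DTYPE, order='C')
    idx = np.asarray(index_array, dtype=ITYPE, order='C')
    return dist, idx

d, i = as_cython_buffers([[0.0, 1.0], [1.0, 0.0]], [0, 1])
```

With wrappers like this, the Cython code itself never sees an unexpected dtype, whatever the caller passed in.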

Member

Here we have a change of the public API. We need to use a property for backward compatibility for a couple of releases.

Contributor Author

Ok that's bad. I'll revert the change.

yanlend added 2 commits July 2, 2014 10:19
- remove unnecessary returns
- used kng_symmetric_ where possible
Conflicts:
	sklearn/manifold/__init__.py
@GaelVaroquaux
Member

The continuous integration problem seems to be fairly trivial: I believe that it is just a question of updating your local master branch of scikit-learn, and then merging to it.

@GaelVaroquaux
Member

Hi @yanlend

I am coming back to this after taking a step back to think a bit (and actually discussing it with Thomas Pohl).

To summarize, as I understand them, the value and limitations of this incremental isomap are:

  • Pro: It makes it possible to add a new data point to an isomap model much more quickly than refitting on the whole dataset.
  • Con: It requires storing the whole dataset. For this reason, when working from a stream of data, a common strategy is to "forget" old data in order to avoid unbounded memory growth.

In scikit-learn, we do not try to solve the problem of streaming data, as the assumption of scikit-learn is that samples can be shuffled (e.g. when doing cross-validation). I am not saying that there is no value in solving these problems, which usually relate to time-ordered data. It's actually a problem setting that comes up quite often, and I think there would be value in a package that implements machine learning on time series as a derivative of scikit-learn. The people at Blue Yonder have a large internal codebase that does this.

We do have an interest in online learning, i.e. incremental fitting of models using the 'partial_fit' API. One reason is that on big data it can often be faster to run partial_fit on mini-batches of the data than to fit the model on the whole data. The other reason is that if the data does not fit in memory, reading it in chunks and calling partial_fit saves memory.
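The out-of-core pattern described here can be sketched with MiniBatchKMeans: stream the data in chunks and call partial_fit on each, so only one chunk is ever in memory. The chunk generator below stands in for reading from disk; the numbers are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def chunk_stream(n_chunks, chunk_size, rng):
    """Stand-in for reading a large dataset from disk chunk by chunk."""
    for _ in range(n_chunks):
        yield rng.randn(chunk_size, 2)

rng = np.random.RandomState(0)
km = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=1)
for chunk in chunk_stream(n_chunks=10, chunk_size=100, rng=rng):
    km.partial_fit(chunk)           # only this chunk is in memory

labels = km.predict(rng.randn(5, 2))
```

The contrast with this PR is that the incremental isomap must keep every past sample around, so the memory saving of the pattern above does not apply.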

I am not sure that, in its current state, this pull request brings these benefits. In particular, the fact that you need to store the training data runs somewhat against the purpose of partial_fit in scikit-learn.
Thus, I am afraid the pull request does not fit in the scope of scikit-learn. I am really sorry, as this is beautiful work. Maybe you can host it elsewhere?

@yanlend
Contributor Author

yanlend commented Aug 31, 2014

Hi @GaelVaroquaux,
thanks a lot for your comment.
I agree with your view of the value and limitations of the method. It requires storing the whole dataset and is mainly useful for time-series data.

If this is currently not in the scope of scikit learn, the pull-request can be closed. I will figure out another way to host it.

@yanlend yanlend closed this Sep 2, 2014
@GaelVaroquaux
Member

@yanlend: truly, thank you for submitting this work to scikit-learn. I am
sorry that it took so long to identify that it was out of scope for
scikit-learn.
