Rationalize metric argument in T-SNE #9695
Description
The T-SNE API has a metric argument that allows changing the metric used in the input space. The original paper was developed using only the Euclidean distance, and changing the metric might slightly change the math.
The main issue is to understand what is going on with the computation of the conditional probabilities. The original paper states that if d_ij is the Euclidean distance between x_i and x_j, then the conditional probability is given by
p(x_j | x_i) = exp{-d_ij^2 / sig_i} / Z
where Z is a normalization constant and sig_i is a parameter, computed to fix the perplexity of the model. If d_ij is not the Euclidean distance, does it make sense to use this formula?
For metrics other than the Euclidean one, the current implementation uses the formula without the square, so
p(x_j | x_i) = exp{-d_ij / sig_i} / Z
If this is fine, it should be documented. Otherwise, should we change it to use the square, or another distribution? Another option is to deprecate the metric argument.
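To make the difference between the squared and unsquared variants concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: the function names (conditional_probabilities, perplexity, fit_sig) and the distances are made up for illustration, and sig plays the role of sig_i above, with any constant factor absorbed into it. The sketch fits sig by bisection so that both variants reach the same perplexity, then prints the two resulting distributions.

```python
import math


def conditional_probabilities(distances, sig, squared=True):
    """p(x_j | x_i) for one point i, given its distances d_ij to the other points."""
    # Exponent is -d_ij^2 / sig (squared=True) or -d_ij / sig (squared=False).
    energies = [(d * d if squared else d) / sig for d in distances]
    # Shift by the smallest energy for numerical stability;
    # the shift cancels in the normalization constant Z.
    shift = min(energies)
    weights = [math.exp(-(e - shift)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def perplexity(probs):
    """Perplexity 2^H, with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


def fit_sig(distances, target_perplexity, squared=True, n_iter=100):
    """Bisection on sig so the conditional distribution reaches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        # Geometric midpoint, since sig can span many orders of magnitude.
        mid = math.sqrt(lo * hi)
        probs = conditional_probabilities(distances, mid, squared)
        # Perplexity grows with sig (larger sig flattens the distribution).
        if perplexity(probs) > target_perplexity:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


# Hypothetical distances from x_i to four other points (e.g. Manhattan distances).
distances = [1.0, 2.0, 4.0, 8.0]
for squared in (True, False):
    sig = fit_sig(distances, target_perplexity=2.0, squared=squared)
    probs = conditional_probabilities(distances, sig, squared)
    print("squared" if squared else "unsquared", [round(p, 3) for p in probs])
```

Both variants are valid probability distributions with the same perplexity, but they put different weights on the neighbours, so the choice does affect the embedding's input similarities and is worth documenting either way.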
A second point is the use of init='pca'. It is a very natural initialization for the Euclidean metric, but it should be benchmarked for other metrics to see whether it gives a useful starting point.
This issue is a follow-up to #9623.