Rationalize metric argument in T-SNE #9695

@tomMoral

The T-SNE API has a metric argument that allows changing the metric used in the input space. The original paper was developed using only the euclidean distance, and changing the metric might change the math slightly.

The main issue is to understand what is going on with the computation of the conditional probabilities. The original paper states that if d_ij is the euclidean distance between x_i and x_j, then the conditional probability is given by

p(x_j | x_i) = exp{ -d_ij^2 / sig_i } / Z

where Z is a normalization constant and sig_i is a parameter, computed to fix the perplexity of the model. If d_ij is not the euclidean distance, does it make sense to use this formula?
The current implementation for metrics other than the euclidean uses the formula without the square, so

p(x_j | x_i) = exp{ -d_ij / sig_i } / Z

If this is fine, it should be documented. Otherwise, should we change it to use the square, or another distribution? Another option is to deprecate the metric argument.
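To make the difference between the two formulas concrete, here is a minimal sketch in plain Python that computes the conditional probabilities for a single point with and without the square. It is not the scikit-learn implementation: the function names, the fixed value `sig=1.0`, and the sample distances are assumptions chosen for illustration, and the exponent is taken to be negative as in the original paper. In the actual algorithm sig_i is tuned per point to reach the target perplexity, so the two variants would end up with different sig_i values.

```python
import math

def conditional_probabilities(distances, sig, squared):
    """Return p(x_j | x_i) for one point i, given its distances d_ij to the other points."""
    # Exponent is -d^2 / sig (squared variant) or -d / sig (unsquared variant).
    weights = [math.exp(-(d * d if squared else d) / sig) for d in distances]
    z = sum(weights)  # normalization constant Z
    return [w / z for w in weights]

def perplexity(probs):
    """Perplexity 2^H of a discrete distribution, with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

# Hypothetical distances from x_i to four neighbours, and a fixed sig for both variants.
distances = [0.5, 1.0, 2.0, 4.0]
p_squared = conditional_probabilities(distances, sig=1.0, squared=True)
p_plain = conditional_probabilities(distances, sig=1.0, squared=False)

print("squared:  ", [round(p, 4) for p in p_squared], perplexity(p_squared))
print("unsquared:", [round(p, 4) for p in p_plain], perplexity(p_plain))
```

For the same sig, the squared variant puts more mass on the nearest neighbour and has a lower perplexity, so the choice of formula changes which neighbours count as close once sig_i is fitted.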

A second point is the use of init='pca'. It is a very natural initialization for the euclidean metric, but it should be benchmarked for other metrics to see whether it gives a useful starting point.
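As a starting point for such a benchmark, the sketch below fits `TSNE` with `init='pca'` and `init='random'` under a non-euclidean metric and reports the final KL divergence of each run. The helper `compare_inits`, the synthetic `make_blobs` data, the cosine metric, and the perplexity value are assumptions made for illustration. A single seed on one toy dataset is not a benchmark in itself; it only shows the shape of the comparison.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

def compare_inits(X, metric, inits=("pca", "random"), seed=0):
    """Fit t-SNE once per initialization and return the final KL divergence of each run."""
    results = {}
    for init in inits:
        tsne = TSNE(n_components=2, metric=metric, init=init,
                    perplexity=10, random_state=seed)
        tsne.fit(X)
        # kl_divergence_ is the value of the objective after optimization.
        results[init] = float(tsne.kl_divergence_)
    return results

# Small synthetic dataset (hypothetical stand-in for real benchmark data).
X, _ = make_blobs(n_samples=60, n_features=5, centers=3, random_state=0)
results = compare_inits(X, metric="cosine")
for init, kl in results.items():
    print(f"init={init!r}: KL divergence = {kl:.4f}")
```

Because the input-space probabilities depend only on the metric and the perplexity, the KL values of the two runs are directly comparable for a given metric. A fuller comparison would repeat this over several metrics, datasets, and seeds.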

This issue is a follow-up to #9623.
