Currently the dtype of the distance matrix returned by pairwise_distances is inconsistent: it depends on the metric and on the value of n_jobs.
For float64 input, everything is consistent: the returned matrix is always in float64.
For mixed float64 X and float32 Y, the returned matrix is also always in float64, which is what should be expected imo.
The troubles come when both X and Y are float32.
- for sklearn metrics:
  - euclidean and cosine: result is always float32
  - manhattan: result is float64 if n_jobs=1 and float32 otherwise
- for scipy metrics: result is float64 if n_jobs=1 and float32 otherwise
Note that scipy cdist/pdist always returns float64.
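A minimal script for checking the dtypes described above on a given installation. This is only a sketch: the helper name `report_dtypes`, the array shapes, and the choice of `chebyshev` as a representative scipy metric are assumptions made for illustration, and the dtypes printed will depend on the installed scikit-learn version.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

# Small float32 inputs; the shapes are arbitrary.
rng = np.random.RandomState(0)
X = rng.random_sample((20, 5)).astype(np.float32)
Y = rng.random_sample((10, 5)).astype(np.float32)


def report_dtypes(X, Y, metrics=("euclidean", "cosine", "manhattan", "chebyshev"),
                  n_jobs_values=(1, 2)):
    """Hypothetical helper: map (metric, n_jobs) to the dtype of the result."""
    results = {}
    for metric in metrics:
        for n_jobs in n_jobs_values:
            D = pairwise_distances(X, Y, metric=metric, n_jobs=n_jobs)
            results[(metric, n_jobs)] = D.dtype
    return results


dtypes = report_dtypes(X, Y)
for (metric, n_jobs), dtype in dtypes.items():
    print(f"metric={metric!r:12} n_jobs={n_jobs} -> {dtype}")

# For comparison, scipy's cdist called directly on the same float32 inputs.
print("scipy cdist ->", cdist(X, Y, metric="euclidean").dtype)
```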
Hence the question: should pairwise_distances preserve float32?
My opinion is that it should, since pairwise_distances can be used as an intermediate step during fit and since there's ongoing work towards preserving float32 in estimators (see #11000 for transformers for instance).
An argument against preserving float32 is numerical stability: computing in float64 reduces rounding error. A potential solution could be to use float64 accumulators for the intermediate computations only, while still returning a float32 distance matrix. Note that with #23958 we might not need to use the scipy metrics anymore, in favor of the ones defined in dist_metrics, and using float64 accumulators would then be easier to implement generally.
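To make the float64-accumulator idea concrete, here is a rough NumPy sketch for the manhattan metric. It is not how scikit-learn or dist_metrics implements anything; the function name `manhattan_float64_accumulate` is hypothetical, and the loop is written for clarity rather than speed. The differences are computed in the input precision, the sum over features is accumulated in float64, and the result is cast back to float32 only when both inputs are float32.

```python
import numpy as np


def manhattan_float64_accumulate(X, Y):
    """Hypothetical sketch: manhattan distances with a float64 accumulator."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    # Preserve float32 only when both inputs are float32; otherwise use float64.
    both_float32 = X.dtype == np.float32 and Y.dtype == np.float32
    out_dtype = np.float32 if both_float32 else np.float64

    D = np.empty((X.shape[0], Y.shape[0]), dtype=out_dtype)
    for i in range(X.shape[0]):
        # Absolute differences stay in the input precision...
        diff = np.abs(X[i] - Y)
        # ...but the reduction over features is accumulated in float64,
        # then cast to the output dtype.
        D[i] = diff.sum(axis=1, dtype=np.float64).astype(out_dtype)
    return D


rng = np.random.RandomState(0)
X32 = rng.random_sample((6, 1000)).astype(np.float32)
Y32 = rng.random_sample((4, 1000)).astype(np.float32)

print(manhattan_float64_accumulate(X32, Y32).dtype)                     # both float32
print(manhattan_float64_accumulate(X32.astype(np.float64), Y32).dtype)  # mixed input
```

The design choice here is that only the accumulator is widened, so the memory cost of the returned matrix stays at float32 while the summation error no longer grows with the number of features in float32 precision.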
Answering this question will help us avoid going in the wrong direction in #23958.