FEA Introduce PairwiseDistances#23958
Conversation
ಠ_ರೃ
|
Failures on this PR might are fixed by #23990. |
Switch condition for safety
Comes with minor adaptations
This: - decreases the number of features by an order to magnetude because in the case of float32, the vectors gets entirely copied for the upcast to float64. This might use too much memory and crash the program - this now accepts the previously xfail parametrisation case by setting on absolute error (seen we are comparing small values)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
| f"usable for this case (EuclideanPairwiseDistances{{name_suffix}}) and will be ignored.", | ||
| UserWarning, | ||
| stacklevel=3, | ||
| ) |
There was a problem hiding this comment.
I would rather avoid introducing a new warning that we want to actually get rid off, so better do this as part of the current PR.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
|
Using scikit-learn/sklearn/metrics/_dist_metrics.pyx.tp Lines 1110 to 1120 in e41753e I am not entirely but looking at cycles spent on the assembly code line, we can see that loop unrolling + SIMD Percent│
│ Disassembly of section .text:
│
│ 00000000000104b0 <__pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist>:
│ __pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist():
0.30 │ test %rcx,%rcx
0.04 │ ↓ jle 40
0.00 │ movq __pyx_k_C+0x6b4b,%xmm2
0.01 │ xor %eax,%eax
│ pxor %xmm1,%xmm1
0.00 │ nop
0.00 │18: movsd (%rsi,%rax,8),%xmm0
0.28 │ subsd (%rdx,%rax,8),%xmm0
0.25 │ add $0x1,%rax
0.04 │ andpd %xmm2,%xmm0
0.18 │ addsd %xmm0,%xmm1
+98.76 │ cmp %rax,%rcx
0.10 │ ↑ jne 18
0.00 │ movapd %xmm1,%xmm0
0.04 │ ← retq
│ nop
│40: pxor %xmm1,%xmm1
│ movapd %xmm1,%xmm0
│ ← retq It's insightful to see how it's done, and to assess that 98.76% the time here is spent checking if |
|
As discussed in real life, it might be interesting to see of the chunking is not detrimental for this operation: indeed chunking will cause non-contiguous writing of the distance values in the distance matrix output array. It might be worthwhile to conduct dedicated benchmarks to use a chunk size of |
My first intuition is that simply setting I think What do you think? |
|
I am removing it from the 1.2 milestone as it needs more thought and thread-offs' assessments. |
|
Turning this into a draft for two reasons; to me:
|
|
Closing since this has been superseded by #25561. |
Reference Issues/PRs
Relates to #22587.
What does this implement/fix? Explain your changes.
This adds a new back-end to
pairwise_distancescomputations usingPairwiseDistanceswithout any reduction.Any other comments?
TODO: