Describe the issue:
The tests of the scikit-learn nightly wheels recently started to fail randomly (maybe since we started to build against numpy 2 (2.0.1), but I am not 100% sure).
Here is an example failure reproduced in an instrumented CI PR:
https://github.com/scikit-learn/scikit-learn/actions/runs/10267815411/job/28409279851?pr=29628#step:6:2328
Here are the relevant snippets from the Python-level faulthandler backtrace and an lldb native backtrace collected from a core dump:
Current thread 0x00000001ec854c00 (most recent call first):
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 356 in _unique1d
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 289 in unique
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 1142 in union1d
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/_array_api.py", line 212 in _union1d
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/metrics/_classification.py", line 119 in _check_targets
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/metrics/_classification.py", line 219 in accuracy_score
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 216 in wrapper
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/base.py", line 764 in score
File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/estimator_checks.py", line 1875 in check_pipeline_consistency
* thread #1
* frame #0: 0x00000001849e2a60 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x0000000184a1ac20 libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x00000001848f11e0 libsystem_c.dylib`raise + 32
frame #3: 0x0000000105e0d83c Python`faulthandler_fatal_error + 392
frame #4: 0x0000000184a4b584 libsystem_platform.dylib`_sigtramp + 56
frame #5: 0x000000010681654c _multiarray_umath.cpython-311-darwin.so`void hwy::N_NEON::detail::Recurse<(hwy::N_NEON::detail::RecurseMode)0, hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long>(hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long*, unsigned long, long long*, unsigned long long*, unsigned long, unsigned long) + 800
frame #6: 0x000000010681654c _multiarray_umath.cpython-311-darwin.so`void hwy::N_NEON::detail::Recurse<(hwy::N_NEON::detail::RecurseMode)0, hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long>(hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long*, unsigned long, long long*, unsigned long long*, unsigned long, unsigned long) + 800
frame #7: 0x0000000106810644 _multiarray_umath.cpython-311-darwin.so`void np::highway::qsort_simd::QSort_ASIMD<long long>(long long*, long) + 108
frame #8: 0x0000000106715d24 _multiarray_umath.cpython-311-darwin.so`quicksort_long + 68
frame #9: 0x00000001066d7184 _multiarray_umath.cpython-311-darwin.so`_new_sortlike + 688
frame #10: 0x00000001066eab20 _multiarray_umath.cpython-311-darwin.so`array_sort + 572
There are other threads in that process (see the full log above), but at the time of the failure they are all waiting and don't seem to be involved in the failure.
Here is a link to the dumped core file:
https://github.com/scikit-learn/scikit-learn/actions/runs/10267815411/artifacts/1781211741
Note: the zip file is 1.7 GB and contains a single 5.7 GB core file once decompressed.
This problem can happen in random tests of scikit-learn that call into _unique1d one way or another. 8 times out of 10 we can run the full test suite without any failure, but the other times we get a crash, always when calling numpy's _unique1d but in various caller contexts.
We also have feedback from a scikit-learn user who experienced a similar random bus error (or sometimes a segfault) when calling numpy's sort, according to the faulthandler backtrace. That crash was also reported on Apple arm64 macOS machines.
Maybe it's a SIMD-related problem in np::highway::qsort_simd. Maybe memory alignment comes into play?
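One way to probe the alignment hypothesis is to sort int64 buffers whose data pointer sits at controlled offsets from a 64-byte boundary. The following is only a sketch under that assumption: the helper names are hypothetical, it has not been shown to trigger the crash, and it only checks that the default sort agrees with a stable sort at each offset.

```python
import numpy as np


def int64_view_with_offset(n, offset_bytes, seed=0):
    """Hypothetical helper: int64 array of length n whose data pointer is
    `offset_bytes` past a 64-byte-aligned address (offset must be a
    multiple of 8 and smaller than 64)."""
    itemsize = np.dtype(np.int64).itemsize
    # Over-allocate so that any start position below 128 bytes fits.
    raw = np.empty(n * itemsize + 128, dtype=np.uint8)
    start = (-raw.ctypes.data) % 64 + offset_bytes
    arr = raw[start:start + n * itemsize].view(np.int64)
    rng = np.random.default_rng(seed)
    arr[:] = rng.integers(-1000, 1000, size=n)
    return arr


def check_sort_at_offsets(n=1000, offsets=(0, 8, 16, 24, 32, 40, 48, 56)):
    """Sort in place at each offset and compare against a stable sort."""
    for off in offsets:
        arr = int64_view_with_offset(n, off)
        assert arr.ctypes.data % 64 == off
        expected = np.sort(arr.copy(), kind="stable")
        arr.sort()  # default kind is quicksort
        assert np.array_equal(arr, expected)
    return True
```

Calling check_sort_at_offsets() with several values of n on an affected machine would at least tell whether 8-byte-aligned but not 16-byte-aligned inputs behave differently.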
Reproduce the code example:
I failed to write a reproducer. I even failed to reproduce the crash locally, despite running the same tests many times in a local env with the same versions of the dependencies on an Apple M1 machine.
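For anyone who wants to try on their own machine, here is a minimal stress-loop sketch that mimics the call path in the backtrace (np.union1d, which calls unique and then sorts). This is not a confirmed reproducer: the function name, array sizes and iteration count are arbitrary assumptions.

```python
import numpy as np


def stress_union1d(n_iter=2000, max_size=5000, seed=0):
    """Hypothetical stress loop: call np.union1d on random int64 label
    arrays of varying sizes and check that the result is strictly sorted."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        size = int(rng.integers(1, max_size))
        y_true = rng.integers(0, 10, size=size)  # int64 by default
        y_pred = rng.integers(0, 10, size=size)
        labels = np.union1d(y_true, y_pred)
        # union1d returns sorted unique values, so differences are positive.
        assert np.all(np.diff(labels) > 0)
    return n_iter


if __name__ == "__main__":
    print(stress_union1d())
```

Running the script with `python -X faulthandler` should print a Python-level traceback if a bus error or segfault happens inside the loop.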
Python and NumPy Versions:
2.0.1
Runtime Environment:
# output of sklearn.show_versions()
System:
python: 3.11.9 (v3.11.9:de54cf5be3, Apr 2 2024, 07:12:50) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/bin/python
machine: macOS-14.5-arm64-arm-64bit
Python dependencies:
sklearn: 1.6.dev0
pip: 24.2
setuptools: 70.1.0
numpy: 2.0.1
scipy: 1.14.0
Cython: None
pandas: 2.2.2
matplotlib: None
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 3
prefix: libomp
filepath: /private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None