BUG: random bus error in _unique1d and sort on macOS arm64 in long running Python programs #27121

Description

@ogrisel

Describe the issue:

scikit-learn nightly wheels recently started to fail randomly (possibly since we started building against numpy 2, specifically 2.0.1, but I am not 100% sure).

Here is an example failure reproduced in an instrumented CI PR:

https://github.com/scikit-learn/scikit-learn/actions/runs/10267815411/job/28409279851?pr=29628#step:6:2328

Here are the relevant snippets from the Python-level faulthandler backtrace and an lldb native backtrace collected from a core dump:

  Current thread 0x00000001ec854c00 (most recent call first):
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 356 in _unique1d
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 289 in unique
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/numpy/lib/_arraysetops_impl.py", line 1142 in union1d
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/_array_api.py", line 212 in _union1d
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/metrics/_classification.py", line 119 in _check_targets
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/metrics/_classification.py", line 219 in accuracy_score
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 216 in wrapper
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/base.py", line 764 in score
    File "/private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/utils/estimator_checks.py", line 1875 in check_pipeline_consistency
   * thread #1
    * frame #0: 0x00000001849e2a60 libsystem_kernel.dylib`__pthread_kill + 8
      frame #1: 0x0000000184a1ac20 libsystem_pthread.dylib`pthread_kill + 288
      frame #2: 0x00000001848f11e0 libsystem_c.dylib`raise + 32
      frame #3: 0x0000000105e0d83c Python`faulthandler_fatal_error + 392
      frame #4: 0x0000000184a4b584 libsystem_platform.dylib`_sigtramp + 56
      frame #5: 0x000000010681654c _multiarray_umath.cpython-311-darwin.so`void hwy::N_NEON::detail::Recurse<(hwy::N_NEON::detail::RecurseMode)0, hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long>(hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long*, unsigned long, long long*, unsigned long long*, unsigned long, unsigned long) + 800
      frame #6: 0x000000010681654c _multiarray_umath.cpython-311-darwin.so`void hwy::N_NEON::detail::Recurse<(hwy::N_NEON::detail::RecurseMode)0, hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long>(hwy::N_NEON::Simd<long long, 2ul, 0>, hwy::N_NEON::detail::SharedTraits<hwy::N_NEON::detail::TraitsLane<hwy::N_NEON::detail::OrderAscending<long long>>>, long long*, unsigned long, long long*, unsigned long long*, unsigned long, unsigned long) + 800
      frame #7: 0x0000000106810644 _multiarray_umath.cpython-311-darwin.so`void np::highway::qsort_simd::QSort_ASIMD<long long>(long long*, long) + 108
      frame #8: 0x0000000106715d24 _multiarray_umath.cpython-311-darwin.so`quicksort_long + 68
      frame #9: 0x00000001066d7184 _multiarray_umath.cpython-311-darwin.so`_new_sortlike + 688
      frame #10: 0x00000001066eab20 _multiarray_umath.cpython-311-darwin.so`array_sort + 572

There are other threads in that process (see the full log above), but at the time of the failure they are all waiting and do not seem to be involved in the crash.
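For readers who want to collect a similar Python-level trace covering all threads, here is a minimal sketch using the standard-library `faulthandler` module. The helper name `dump_all_thread_tracebacks` and the use of a temporary file are illustrative assumptions; this is not the instrumentation used in the scikit-learn CI.

```python
import faulthandler
import tempfile


def dump_all_thread_tracebacks():
    """Return the current Python tracebacks of all threads as text."""
    # faulthandler writes straight to a file descriptor, so a real file
    # (not an in-memory StringIO) is needed here.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    # Also dump tracebacks automatically on fatal signals such as
    # SIGSEGV or SIGBUS.
    faulthandler.enable(all_threads=True)
    print(dump_all_thread_tracebacks())
```

Setting the `PYTHONFAULTHANDLER=1` environment variable or passing `-X faulthandler` to the interpreter enables the same signal handler without any code change.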

Here is a link to the dumped core file:

https://github.com/scikit-learn/scikit-learn/actions/runs/10267815411/artifacts/1781211741

Note: the zip file is 1.7 GB and contains a single 5.7 GB core file once decompressed.

This problem can happen in random scikit-learn tests that call into _unique1d one way or another. 8 times out of 10 we can run the full test suite without any failure, but the other times we get a crash, always when calling numpy's _unique1d but from various caller contexts.

We also have feedback from a scikit-learn user who experienced a similar random bus error (or sometimes a SIGSEGV) when calling numpy's sort, according to the faulthandler backtrace. That crash report also comes from Apple arm64 macOS machines.

Maybe it's a SIMD-related problem in np::highway::qsort_simd. Could memory alignment come into play?
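One way to probe the alignment hypothesis is to sort int64 buffers whose data address is deliberately either a multiple of 16 bytes or offset by 8 bytes. The sketch below is only an assumption about what might be worth testing: the helper name `int64_array_at_offset`, the array size, and the value range are all made up for illustration, and nothing here is confirmed to trigger the crash.

```python
import numpy as np


def int64_array_at_offset(n, offset_mod_16=8, seed=0):
    """Return a contiguous int64 view whose data address equals
    offset_mod_16 modulo 16.

    offset_mod_16 must be 0 or 8 because int64 items are 8 bytes wide.
    """
    if offset_mod_16 not in (0, 8):
        raise ValueError("offset_mod_16 must be 0 or 8")
    rng = np.random.default_rng(seed)
    # One spare item so the view can be shifted by 8 bytes if needed.
    base = np.empty(n + 1, dtype=np.int64)
    start = 0 if base.ctypes.data % 16 == offset_mod_16 else 1
    view = base[start:start + n]
    view[:] = rng.integers(-1000, 1000, size=n, dtype=np.int64)
    return view


if __name__ == "__main__":
    for offset in (0, 8):
        a = int64_array_at_offset(100_000, offset)
        # Sort in place so that the sort should operate on this exact
        # buffer rather than on a freshly allocated copy.
        a.sort()
        assert np.all(a[:-1] <= a[1:])
        print(offset, a.ctypes.data % 16, a.flags.aligned)
```

Note that `np.sort(a)` would copy the data first, so the in-place `a.sort()` is used to keep the chosen address.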

Reproduce the code example:

I failed to write a reproducer. I even failed to reproduce the crash locally, despite running the same tests many times in a local env with the same versions of dependencies on an Apple M1 machine.
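For anyone who wants to hunt for the crash outside the scikit-learn test suite, a stress loop along the following lines could be a starting point. It is a sketch only, not a confirmed reproducer: the function name `stress_sort_unique`, the array sizes, the value range, and the iteration counts are arbitrary assumptions. It repeatedly calls `np.sort` and `np.unique` on int64 arrays, which matches the dtype seen in the native backtrace, and checks the results against pure-Python sorting.

```python
import faulthandler

import numpy as np


def stress_sort_unique(n_iter=200, max_size=10_000, seed=0):
    """Repeatedly sort and deduplicate random int64 arrays.

    Returns the number of completed iterations; raises AssertionError
    if numpy's result ever disagrees with the pure-Python reference.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        size = int(rng.integers(1, max_size))
        a = rng.integers(-1000, 1000, size=size, dtype=np.int64)
        values = a.tolist()
        expected_sorted = np.array(sorted(values), dtype=np.int64)
        expected_unique = np.array(sorted(set(values)), dtype=np.int64)
        assert np.array_equal(np.sort(a), expected_sorted)
        assert np.array_equal(np.unique(a), expected_unique)
    return n_iter


if __name__ == "__main__":
    # Print a Python traceback if the process dies from a fatal signal.
    faulthandler.enable()
    print(stress_sort_unique(n_iter=100_000))
```

Since the failures were observed in long-running processes, a short run finishing cleanly says little; the loop would likely need to run for a long time, possibly alongside other allocation-heavy work.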

Python and NumPy Versions:

2.0.1

Runtime Environment:

# output of sklearn.show_versions()

   System:
      python: 3.11.9 (v3.11.9:de54cf5be3, Apr  2 2024, 07:12:50) [Clang 13.0.0 (clang-1300.0.29.30)]
  executable: /private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/bin/python
     machine: macOS-14.5-arm64-arm-64bit
  
  Python dependencies:
        sklearn: 1.6.dev0
            pip: 24.2
     setuptools: 70.1.0
          numpy: 2.0.1
          scipy: 1.14.0
         Cython: None
         pandas: 2.2.2
     matplotlib: None
         joblib: 1.4.2
  threadpoolctl: 3.5.0
  
  Built with OpenMP: True
  
  threadpoolctl info:
         user_api: openmp
     internal_api: openmp
      num_threads: 3
           prefix: libomp
         filepath: /private/var/folders/zn/hj183dg15s713b47j2wlhwzw0000gn/T/cibw-run-rl1ximw0/cp311-macosx_arm64/venv-test-arm64/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
          version: None
