
Conversation

@Micky774 (Contributor) commented Jun 22, 2022

Reference Issues/PRs

Fixes #16803
Fixes #17554
Resolves #19676 (stalled)
Resolves #20524 (stalled)

What does this implement/fix? Explain your changes.

PR #20524: Calculates number of non-zero terms for each degree (row-wise) and creates dense arrays for data/indices/indptr to pass to Cython _csr_polynomial_expansion. Since the size is known a-priori, the appropriate dtype can be used during construction. The use of fused types in _csr_polynomial_expansion allows for only the minimally sufficient index dtype to be used, decreasing wasted memory when int32 is sufficient.

This PR: reconciles w/ main and makes minor changes.

Any other comments?

The full functionality of this PR is only enabled with SciPy > 1.8, since it depends on an upstream bug fix.

@Micky774 Micky774 changed the title from "ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures::_csr_polynomial_expansion" to "ENH Allow for appropriate dtype use in preprocessing.PolynomialFeatures for sparse matrices" Jun 22, 2022
@ogrisel (Member) left a comment:

Thanks for the PR. Here is a first pass of feedback.

Micky774 and others added 2 commits June 29, 2022 17:26
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan (Member) left a comment:

Thank you, @Micky774.

A few comments.

Note that a specific dtype and C type have been added for sparse matrix indices.

# scipy matrices indices dtype (namely for indptr and indices arrays)
#
# Note that indices might need to be represented as cnp.int64_t.
# Currently, we use Cython classes which do not handle fused types
# so we hardcode this type to cnp.int32_t, supporting all but edge
# cases.
#
# TODO: support cnp.int64_t for this case
# See: https://github.com/scikit-learn/scikit-learn/issues/23653
ctypedef cnp.int32_t SPARSE_INDEX_TYPE_t

Could we propagate this here for semantics?

@thomasjpfan (Member) left a comment:

LGTM

@Micky774 (Contributor Author):

@ogrisel wondering if you could take another look

@ogrisel (Member) left a comment:

Thanks for the update. Here are some more comments for a build problem that I am still investigating:

# On Windows, scikit-learn is typically compiled with MSVC that
# does not support int128 arithmetic (at the time of writing):
# https://stackoverflow.com/a/6761962/163740
if sys.maxsize <= 2**32 or sys.platform == "win32":
@ogrisel (Member) commented Apr 25, 2023:

The sys.maxsize <= 2**32 part was unexpected to me, but I checked: I added "Passed" to the status filter and test_sizeof_LARGEST_INT_t to the name filter, and indeed it passed on our 32-bit Linux build (with gcc).

So I think we need to check if this test would also pass with clang on 32 bit linux (at least manually, we don't necessarily need to configure a new CI entry for this).

clang (or clang-cl) on Windows would be challenging to test because I believe there is no documented way to use this compiler to build Python extensions at this time.

I tried the following Debian Docker image for i386 (32-bit):

docker run -ti -v `pwd`:/io --platform linux/i386 i386/debian:11.2 bash

Then I installed build dependencies:

apt update && apt install clang python3-pip python3-dev build-essential python3-scipy

Then I used update-alternatives --config cc and update-alternatives --config c++ to point to clang 11.0.1-2 instead of gcc (not sure if that's needed), and also set the following env variables:

export CC="clang"
export CXX="clang++"

and then installed cython with pip3 and built scikit-learn in editable mode with --no-build-isolation.

The beginning of the build works fine, but then it hits this error:

(...)
    clang -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-hfHQKB/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-hfHQKB/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/usr/include/python3.9 -c sklearn/preprocessing/_csr_polynomial_expansion.c -o build/temp.linux-i686-3.9/sklearn/preprocessing/_csr_polynomial_expansion.o -g0 -O2 -fopenmp
    sklearn/preprocessing/_csr_polynomial_expansion.c:767:25: error: expected parameter declarator
            typedef _BitInt(128) LARGEST_INT_t;
                            ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:767:25: error: expected ')'
    sklearn/preprocessing/_csr_polynomial_expansion.c:767:24: note: to match this '('
            typedef _BitInt(128) LARGEST_INT_t;
                           ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:767:30: error: expected function body after function declarator
            typedef _BitInt(128) LARGEST_INT_t;
                                 ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2061:22: error: unknown type name 'LARGEST_INT_t'
    static CYTHON_INLINE LARGEST_INT_t __Pyx_pow_LARGEST_INT_t(LARGEST_INT_t, LARGEST_INT_t);
                         ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2061:75: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE LARGEST_INT_t __Pyx_pow_LARGEST_INT_t(LARGEST_INT_t, LARGEST_INT_t);
                                                                              ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2061:60: error: a parameter list without types is only allowed in a function definition
    static CYTHON_INLINE LARGEST_INT_t __Pyx_pow_LARGEST_INT_t(LARGEST_INT_t, LARGEST_INT_t);
                                                               ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2079:22: error: unknown type name 'LARGEST_INT_t'
    static CYTHON_INLINE LARGEST_INT_t __Pyx_PyInt_As_LARGEST_INT_t(PyObject *);
                         ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2091:63: error: unknown type name 'LARGEST_INT_t'
    static CYTHON_INLINE PyObject* __Pyx_PyInt_From_LARGEST_INT_t(LARGEST_INT_t value);
                                                                  ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2149:154: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg2_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                             ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2149:169: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg2_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                                            ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2149:184: error: unexpected type name '__pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t': expected identifier
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg2_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                                                           ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2149:139: error: a parameter list without types is only allowed in a function definition
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg2_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                              ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2150:154: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg3_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                             ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2150:169: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg3_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                                            ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2150:184: error: redefinition of parameter 'LARGEST_INT_t'
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg3_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                                                           ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2150:199: error: unexpected type name '__pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t': expected identifier
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg3_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                                                                                          ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2150:139: error: a parameter list without types is only allowed in a function definition
    static CYTHON_INLINE __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__deg3_column(LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t); /*proto*/
                                                                                                                                              ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2151:146: error: unexpected type name '__pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t': expected identifier
    static __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__calc_expanded_nnz(LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t, LARGEST_INT_t, int __pyx_skip_dispatch); /*proto*/
                                                                                                                                                     ^
    sklearn/preprocessing/_csr_polynomial_expansion.c:2151:215: error: redefinition of parameter 'LARGEST_INT_t'
    static __pyx_t_7sklearn_5utils_9_typedefs_int64_t __pyx_f_7sklearn_13preprocessing_25_csr_polynomial_expansion__calc_expanded_nnz(LARGEST_INT_t, __pyx_t_7sklearn_13preprocessing_25_csr_polynomial_expansion_FLAG_t, LARGEST_INT_t, int __pyx_skip_dispatch); /*proto*/

The first error seems to be triggering the other errors. But this code works with clang on macOS. Maybe this version of clang (11) is too old?

I can try with a more recent version of debian or ubuntu. I used the same docker container as the one we use on the CI.

Member:

Actually maybe the problem above was caused by previously generated .c files. I need to try again from a clean build folder to make sure.

Member:

I tried again on 32-bit Linux on armv7 because I could find more recent Python/clang versions for that platform, and now the build works. But the resulting _csr_polynomial_expansion compiled extension is still broken:

root@4af20c1dad16:/io# pytest -lxk test_sizeof_LARGEST_INT_t sklearn/preprocessing/tests/test_polynomial.py 
ImportError while loading conftest '/io/sklearn/conftest.py'.
sklearn/conftest.py:18: in <module>
    from sklearn.datasets import fetch_20newsgroups
sklearn/datasets/__init__.py:8: in <module>
    from ._base import load_breast_cancer
sklearn/datasets/_base.py:20: in <module>
    from ..preprocessing import scale
sklearn/preprocessing/__init__.py:38: in <module>
    from ._polynomial import PolynomialFeatures
sklearn/preprocessing/_polynomial.py:22: in <module>
    from ._csr_polynomial_expansion import (
E   ImportError: /io/sklearn/preprocessing/_csr_polynomial_expansion.cpython-310-arm-linux-gnueabihf.so: undefined symbol: __divti3

@Micky774 (Contributor Author) commented Apr 25, 2023:

Indeed, I think we need to first check for 32-bit systems before eagerly trying __int128 or _BitInt(128), since neither is available on 32-bit targets. Probably the most portable way would be to check sizeof(void*) == 8, right? I'll implement that soon.

@ogrisel (Member) commented Apr 25, 2023:

Indeed:

> If you compile for a 32-bit architecture like ARM, or x86 with -m32, no 128-bit integer type is supported with even the newest version of any of these compilers. So you need to detect support before using, if it's possible for your code to work at all without it.

from: https://stackoverflow.com/a/54815033/163740

I think we should also link to that SO answer which is quite a good reference for this topic.

@ogrisel (Member) commented Apr 25, 2023:

Please also add a TODO note so that once all platforms have C23 compilers, we can switch to the new way to do this:

> SO C23 will let you typedef unsigned _BitInt(128) u128, modeled on clang's feature originally called _ExtInt() which works even on 32-bit machines; see a brief intro to it. Current GCC -std=gnu2x doesn't even support that syntax yet.

This will require a few years, but then it should work both for 32-bit and 64-bit target architectures.

Note that the clang + 32-bit target is not just a theoretical possibility: emscripten / WASM / pyodide is a concrete platform built on that combo. I think @lesteve has the required tooling to test this PR on that target.

BTW @lesteve if you do so, please copy and paste the command lines you use to do this because I am curious but too lazy to RTFM ;)

Member:

Quick and dirty testing of this PR on Pyodide seems to indicate that the test_sizeof_LARGEST_INT_t fails but the other added tests pass:

>>> pytest.main(['--pyargs', 'sklearn.preprocessing', '-k', 'test_csr_polynomial_expansion_index_overflow_non_regression or test_csr_polynomial_expansion_too_large_to_index or LARGEST_INT', '-v'])
============================= test session starts ==============================
platform emscripten -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /home/pyodide/this.program
cachedir: .pytest_cache
rootdir: /home/pyodide
collecting ... collected 1098 items / 1089 deselected / 9 selected                            
tests/test_polynomial.py::test_csr_polynomial_expansion_index_overflow_non_regression[True-True] PASSED [ 11%]
tests/test_polynomial.py::test_csr_polynomial_expansion_index_overflow_non_regression[True-False] PASSED [ 22%]
tests/test_polynomial.py::test_csr_polynomial_expansion_index_overflow_non_regression[False-True] PASSED [ 33%]
tests/test_polynomial.py::test_csr_polynomial_expansion_index_overflow_non_regression[False-False] PASSED [ 44%]
tests/test_polynomial.py::test_csr_polynomial_expansion_too_large_to_index[True-True] PASSED [ 55%]
tests/test_polynomial.py::test_csr_polynomial_expansion_too_large_to_index[True-False] PASSED [ 66%]
tests/test_polynomial.py::test_csr_polynomial_expansion_too_large_to_index[False-True] PASSED [ 77%]
tests/test_polynomial.py::test_csr_polynomial_expansion_too_large_to_index[False-False] PASSED [ 88%]
tests/test_polynomial.py::test_sizeof_LARGEST_INT_t FAILED               [100%]
=================================== FAILURES ===================================
__________________________ test_sizeof_LARGEST_INT_t ___________________________
    def test_sizeof_LARGEST_INT_t():
        # On Windows, scikit-learn is typically compiled with MSVC that
        # does not support int128 arithmetic (at the time of writing):
        # https://stackoverflow.com/a/6761962/163740
        if sys.maxsize <= 2**32 or sys.platform == "win32":
            expected_size = 8
        else:
            expected_size = 16
    
>       assert _get_sizeof_LARGEST_INT_t() == expected_size
E       assert 16 == 8
E        +  where 16 = _get_sizeof_LARGEST_INT_t()
/lib/python3.11/site-packages/sklearn/preprocessing/tests/test_polynomial.py:1094: AssertionError
=========================== short test summary info ============================
FAILED tests/test_polynomial.py::test_sizeof_LARGEST_INT_t - assert 16 == 8
================= 1 failed, 8 passed, 1089 deselected in 1.03s =================

> Note that the clang + 32-bit target is not just a theoretical possibility: emscripten / WASM / pyodide is a concrete platform that is built on that combo.

By the way I am not sure __clang__ is defined in the Pyodide use case, I think __EMSCRIPTEN__ is used instead.

@lesteve (Member) commented Apr 26, 2023:

> BTW @lesteve if you do so, please copy and paste the command lines you use to do this because I am curious but too lazy to RTFM ;)

Set up the correct emscripten version (it has to match the version Pyodide was built with: emscripten 3.1.32 for stable Pyodide, which is 0.23.1 at the time of writing; see https://github.com/pyodide/pyodide/blob/0.23.1/Makefile.envs):

git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install 3.1.32
./emsdk activate 3.1.32
source ./emsdk_env.sh

Build a Pyodide wheel; this should work from the scikit-learn repo root:

pip install pyodide-build
pyodide build

Now you need to use this wheel, and I would say one possibility is something like this:

  • put this wheel in a github repo
  • use the stable Pyodide console
  • in the console use await micropip.install('https://cdn.jsdelivr.net/gh/<owner>/<repo>/path/to/wheel.whl') (the cdn.jsdelivr.net/gh is a CORS proxy if you are wondering)

Contributor Author:

Okay so I have:

  1. Added a check for __EMSCRIPTEN__ as an alternative to __clang__
  2. Updated the runtime test for largest integer type to account for the odd (but beneficial) exception that is emscripten

Contributor Author:

Also, when trying to build pyodide/emscripten for myself, I get the following error. I'm not sure what's going on here 🤷:

* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (Cython>=0.29.24, setuptools<=61.0, wheel)
* Getting dependencies for wheel...
Partial import of sklearn during the build process.
running egg_info
writing scikit_learn.egg-info/PKG-INFO
writing dependency_links to scikit_learn.egg-info/dependency_links.txt
writing requirements to scikit_learn.egg-info/requires.txt
writing top-level names to scikit_learn.egg-info/top_level.txt
reading manifest file 'scikit_learn.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*' found under directory 'asv_benchmarks'
warning: no previously-included files matching '*' found under directory 'benchmarks'
warning: no previously-included files matching '*' found under directory 'build_tools'
warning: no previously-included files matching '*' found under directory 'maint_tools'
warning: no previously-included files matching '*' found under directory 'benchmarks'
warning: no previously-included files matching '*' found under directory '.binder'
warning: no previously-included files matching '*' found under directory '.circleci'
warning: no previously-included files found matching '.codecov.yml'
warning: no previously-included files found matching '.git-blame-ignore-revs'
warning: no previously-included files found matching '.mailmap'
warning: no previously-included files found matching '.pre-commit-config.yaml'
warning: no previously-included files found matching 'azure-pipelines.yml'
warning: no previously-included files found matching 'CODE_OF_CONDUCT.md'
warning: no previously-included files found matching 'CONTRIBUTING.md'
warning: no previously-included files found matching 'SECURITY.md'
warning: no previously-included files found matching 'PULL_REQUEST_TEMPLATE.md'
adding license file 'COPYING'
writing manifest file 'scikit_learn.egg-info/SOURCES.txt'
* Installing packages in isolated environment... (wheel)
* Building wheel...
Partial import of sklearn during the build process.
Traceback (most recent call last):
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/numpy/core/__init__.py", line 23, in <module>
    from . import multiarray
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/numpy/core/multiarray.py", line 10, in <module>
    from . import overrides
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/numpy/core/overrides.py", line 6, in <module>
    from numpy.core._multiarray_umath import (
ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 221, in check_package_status
    module = importlib.import_module(package)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/numpy/__init__.py", line 141, in <module>
    from . import core
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/numpy/core/__init__.py", line 49, in <module>
    raise ImportError(msg)
ImportError:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.11 from "/tmp/build-env-mnfj07nc/bin/python"
  * The NumPy version is: "1.24.2"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/pep517/in_process/_in_process.py", line 351, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/pep517/in_process/_in_process.py", line 333, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pep517/in_process/_in_process.py", line 249, in build_wheel
    return _build_backend().build_wheel(wheel_directory, config_settings,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/setuptools/build_meta.py", line 244, in build_wheel
    return self._build_with_temp_dir(['bdist_wheel'], '.whl',
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/setuptools/build_meta.py", line 229, in _build_with_temp_dir
    self.run_setup()
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/setuptools/build_meta.py", line 282, in run_setup
    self).run_setup(setup_script=setup_script)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/build-env-mnfj07nc/lib/python3.11/site-packages/setuptools/build_meta.py", line 174, in run_setup
    exec(compile(code, __file__, 'exec'), locals())
  File "setup.py", line 655, in <module>
    setup_package()
  File "setup.py", line 645, in setup_package
    check_package_status("numpy", min_deps.NUMPY_MIN_VERSION)
  File "setup.py", line 248, in check_package_status
    raise ImportError(
ImportError: numpy is not installed.
scikit-learn requires numpy >= 1.17.3.
Installation instructions are available on the scikit-learn website: http://scikit-learn.org/stable/install.html


ERROR Backend subproccess exited when trying to invoke build_wheel

@jjerphan jjerphan self-requested a review April 27, 2023 19:24
@jjerphan (Member) commented May 1, 2023

@Micky774: is this PR reviewable now? If it is not the case, could you ping me when it is? :)

@Micky774 (Contributor Author) commented May 1, 2023

> @Micky774: is this PR reviewable now? If it is not the case, could you ping me when it is? :)

It should be good for review! The last large point of interest was this thread: #23731 (comment)

which hopefully is resolved through: #23731 (comment)

Let me know if you have any questions or concerns

@ogrisel (Member) left a comment:

Alright, LGTM. If the wasm env triggers a new failure, we can fix it up in a follow-up PR.

At some point we will introduce a pyodide CI entry but later :)

@ogrisel ogrisel merged commit c6628a0 into scikit-learn:main May 4, 2023
@ogrisel (Member) commented May 4, 2023

Thanks @Micky774!

@Micky774 Micky774 deleted the csr_polynomial branch May 4, 2023 19:31
@jjerphan (Member) commented May 4, 2023

Thank you all (and especially @Micky774) for your efforts in this rocambolesque journey!

@jjerphan jjerphan removed their request for review May 4, 2023 19:34
@Micky774 (Contributor Author) commented May 4, 2023

It was definitely quite a tale of bugs (old and new) and niche features/interactions 😅

I really appreciate the patience and support of the reviewers as well ❤️

Edit: I should have known better than to say at the beginning of this PR "oh it shouldn't be too bad"

@ogrisel (Member) commented May 5, 2023

I tried to build scikit-learn with a local emscripten SDK on macOS and I get a build failure (seemingly unrelated to this PR; it fails when building sklearn/utils/_typedefs.cpython-311-wasm32-emscripten.so).

I suspect we might still want to test the code of this PR on Pyodide because, in retrospect, the emscripten condition seems a bit counterintuitive to me.


Successfully merging this pull request may close these issues:

  • Cython code for PolynomialFeatures should use int64s for indices
  • index type np.int32_t causes issue in _csr_polynomial_expansion