GH-41480: [Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++#41494
Conversation
…onents by default based on availability in Arrow C++
|
|
|
I have not yet cleaned up all our CI scripts that contain things like |
python/CMakeLists.txt
Outdated
| find_package(Arrow REQUIRED) | ||
|
|
||
| # Top level cmake dir | ||
| if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") |
There was a problem hiding this comment.
@kou for my education, could you explain what this if check is exactly doing or why we need it?
There was a problem hiding this comment.
It's for avoiding defining our options when this is used via add_subdirectory() (or FetchContent).
We don't need this because:
- This is never used via
add_subdirectory() - Our option definitions are harmless with
add_subdirectory()use
| # enforce module dependencies | ||
| if(PYARROW_BUILD_SUBSTRAIT) | ||
| set(PYARROW_BUILD_DATASET ON) | ||
| endif() | ||
| if(PYARROW_BUILD_DATASET) | ||
| set(PYARROW_BUILD_ACERO ON) | ||
| endif() |
There was a problem hiding this comment.
This is to ensure we keep the same logic as we have now in setup.py, but we could in theory also leave this out (and ensure the scripts don't explicitly disable Acero when enabling Dataset)
|
@github-actions submit -g python |
It would be nice if this could be avoided. @kou does that sound possible? |
paleolimbot
left a comment
There was a problem hiding this comment.
I gave this a try and it works great!
kou
left a comment
There was a problem hiding this comment.
One minor unfortunate consequence is that if you change your Arrow C++ build (eg enable a new component), you need to do a clean python build, otherwise CMake is still getting the old options (those option defaults are cached? Not sure if there is a way to solve this)
Cache variables defined by option() are written in ${build_directory}/CMakeCache.txt. If we remove only the file (not entire build directory), the next ninja re-generates it automatically with new default values based on ARROW_*.
Or how about accepting ON/OFF/auto(default) as PYARROW_BUILD_XXX? If auto is specified, we use the default value:
set(PYARROW_BUILD_ACERO "auto" CACHE STRING "Build the PyArrow Acero integration")
if("${PYARROW_BUILD_ACERO}" STREQUAL "auto")
set(PYARROW_BUILD_ACERO_REAL ${ARROW_ACERO})
else()
if(PYARROW_BUILD_ACERO)
set(PYARROW_BUILD_ACERO_REAL ON)
else()
set(PYARROW_BUILD_ACERO_REAL OFF)
endif()
endif()
python/CMakeLists.txt
Outdated
| find_package(Arrow REQUIRED) | ||
|
|
||
| # Top level cmake dir | ||
| if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") |
There was a problem hiding this comment.
It's for avoiding defining our options when this is used via add_subdirectory() (or FetchContent).
We don't need this because:
- This is never used via
add_subdirectory() - Our option definitions are harmless with
add_subdirectory()use
python/setup.py
Outdated
| def maybe_append_cmake_bool(varname): | ||
| env_var = varname.replace("PYARROW_BUILD", "PYARROW_WITH") | ||
| if env_var in os.environ: | ||
| value = strtobool(os.environ.get(env_var)) | ||
| append_cmake_bool(value, varname) |
There was a problem hiding this comment.
How about doing this in CMakeLists.txt too?
We can refer an environment variable by "$ENV{PYARROW_WITH_CUDA} in CMakeLists.txt.
See also: https://cmake.org/cmake/help/latest/variable/ENV.html
There was a problem hiding this comment.
I did that initially in a first version locally, but then moved it again to setup.py to avoid the same problem with it being cached. But with you suggestion above to use an AUTO default, then I can indeed move it back to the cmake.
|
@github-actions crossbow submit -g python |
|
Revision: f096999 Submitted crossbow builds: ursacomputing/crossbow @ actions-47aab0586a |
|
One "feature" of our setup.py that I forgot about is that you actually can pass command line flags to setup.py's build_ext like Is that something we think we should keep working? It's nowhere used in our own codebase, and it is also not documented, except if you run |
We did document it in the past, though ( arrow/docs/source/python/development.rst Lines 208 to 215 in 8ca4138 I will add it back here for now, and then we can still see if we want to deprecate this separately or not. |
…row-build-config
…row-build-config
|
@github-actions crossbow submit wheel-* python-sdist -python- |
|
Revision: f9b2b3c Submitted crossbow builds: ursacomputing/crossbow @ actions-92fbc43a42 |
|
@kou any other comments? |
.pre-commit-config.yaml
Outdated
| ?^go/.*/CMakeLists\.txt$| | ||
| ?^java/.*/CMakeLists\.txt$| | ||
| ?^matlab/.*/CMakeLists\.txt$| | ||
| ?^python/CMakeLists\.txt$| |
There was a problem hiding this comment.
Oh... These patterns are wrong...
Could you use ^XXX/.*CMakeLists\.txt$| for all existing patterns too?
| return True | ||
| return False | ||
| if os.path.exists(built_path): | ||
| self._found_names.append(name) |
There was a problem hiding this comment.
Do we need to prepare self._found_names?
It seems that nobody uses it.
There was a problem hiding this comment.
It's used in get_names -> get_outputs methods. We don't use those ourselves, but I am assuming this might be part of the setuptools build_ext class interface
There was a problem hiding this comment.
Hmm. It seems that the setuptools build_ext doesn't have get_names interface:
>>> import setuptools.command.build_ext
>>> setuptools.command.build_ext.build_ext
<class 'setuptools.command.build_ext.build_ext'>
>>> 'get_names' in dir(setuptools.command.build_ext.build_ext)
False
There was a problem hiding this comment.
It does have get_outputs:
In [9]: 'get_outputs' in dir(setuptools.command.build_ext.build_ext)
Out[9]: True
| os.environ.get('PYARROW_WITH_ORC', '0')) | ||
| self.with_gandiva = strtobool( | ||
| os.environ.get('PYARROW_WITH_GANDIVA', '0')) | ||
| self.with_azure = None |
There was a problem hiding this comment.
Do we still need self.with_XXX? It seems that nobody sets them because we remove os.environ.get(...).
There was a problem hiding this comment.
Yes, they are still used in case someone is passing that to setup.py. See #41494 (comment)
I think that's something we should deprecate, though. But planning to do a separate follow-up PR to that, since it's not actually related to this PR (it's just another way that someone can right now override the default, just as setting an environment variable)
There was a problem hiding this comment.
Ah, we need them for --with-XXX options!
I see.
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1c546fb. There were 2 benchmark results with an error:
There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…onents by default based on availability in Arrow C++ (apache#41494) ### Rationale for this change Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build. ### What changes are included in this PR? Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default. ### Are there any user-facing changes? No * GitHub Issue: apache#41480 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…eing enabled by default based on Arrow C++ (#41705) ### Rationale for this change Follow-up on #41494 to update the Python development guide to reflect the change in how PyArrow is build (defaults for the various `PYARROW_BUILD_<component>` are now set based on the `ARROW_<component>` setting. The current `PYARROW_WITH_<component>` environment variables are kept working to allow to override this default) * GitHub Issue: #41480 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
### Rationale for this change As mentioned in #41494 (comment) (while refactoring how to specify to the pyarrow build which components to build, i.e. to let it follow the Arrow C++ components by default), we do have a "feature" that you can specify which components to build directly to setup.py, like `python setup.py build_ext --with-parquet`. This is currently not used in our own codebase, and is also not documented anymore, but we did document it in the past. In general calling setup.py directly is not recommended (although for development installs, it is still useful), furthermore there are alternatives to those flags (relying on Arrow C++ or setting an environment variable), and this would go away anyhow in case we would move away from setuptools at some point. So I think it is better to deprecate those options. ### What changes are included in this PR? Whenever a user passes such a `--with-` flag, a warning is raised. ### Are these changes tested? Tested it locally ### Are there any user-facing changes? Only for developers building pyarrow from source, they have to potentially update their build instructions. * GitHub Issue: #43514 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
### Rationale for this change As mentioned in apache/arrow#41494 (comment) (while refactoring how to specify to the pyarrow build which components to build, i.e. to let it follow the Arrow C++ components by default), we do have a "feature" that you can specify which components to build directly to setup.py, like `python setup.py build_ext --with-parquet`. This is currently not used in our own codebase, and is also not documented anymore, but we did document it in the past. In general calling setup.py directly is not recommended (although for development installs, it is still useful), furthermore there are alternatives to those flags (relying on Arrow C++ or setting an environment variable), and this would go away anyhow in case we would move away from setuptools at some point. So I think it is better to deprecate those options. ### What changes are included in this PR? Whenever a user passes such a `--with-` flag, a warning is raised. ### Are these changes tested? Tested it locally ### Are there any user-facing changes? Only for developers building pyarrow from source, they have to potentially update their build instructions. * GitHub Issue: #43514 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
Rationale for this change
Currently, when building pyarrow from source, one needs to manually enable the optional components through setting
PYARROW_WITH_...environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.What changes are included in this PR?
Set defaults for the various
PYARROW_BUILD_<component>based on theARROW_<component>setting. Keep the currentPYARROW_WITH_<component>environment variables working to allow to override this default.Are there any user-facing changes?
No