Skip to content

FIX Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel#1561

Merged
tomMoral merged 18 commits intomainfrom
FIX/memmapping_endianness
Apr 2, 2025
Merged

FIX Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel#1561
tomMoral merged 18 commits intomainfrom
FIX/memmapping_endianness

Conversation

@fcharras
Copy link
Copy Markdown
Contributor

@fcharras fcharras commented Apr 4, 2024

Fixes #1545

Still todo:

@fcharras fcharras force-pushed the FIX/memmapping_endianness branch from 8e1deb4 to 395e9de Compare April 4, 2024 17:31
@fcharras
Copy link
Copy Markdown
Contributor Author

fcharras commented Apr 5, 2024

Regarding the second and third points, in fact it's not an issue because _ensure_native_byte_order is only used in joblib.Parallel calls when auto-memmapping is triggered. Byteorder for small arrays has always been preserved from main process to child process. I will add a unit test that show that in a separate PR.

@fcharras
Copy link
Copy Markdown
Contributor Author

fcharras commented Apr 5, 2024

@alexisthual do you want to try this branch on your usecase and confirm that it fixes it ?

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Here is some feedback. The big picture is that we should never pretend in the code that is possible to swap endianess of memory mapped arrays: the result would be a newly allocated array in memory and no longer backed by a memory mapped buffer managed by the OS.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Apr 5, 2024

Also we should not forget to document the fix in the changelog.

@fcharras fcharras force-pushed the FIX/memmapping_endianness branch from f14b07a to 15fa6a9 Compare February 26, 2025 14:11
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.49%. Comparing base (0beb688) to head (a9426f8).
Report is 19 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1561      +/-   ##
==========================================
- Coverage   95.53%   95.49%   -0.04%     
==========================================
  Files          46       46              
  Lines        7743     7775      +32     
==========================================
+ Hits         7397     7425      +28     
- Misses        346      350       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fcharras
Copy link
Copy Markdown
Contributor Author

fcharras commented Feb 26, 2025

@ogrisel thanks for the review.

Regarding the default values for occurences of ensure_native_byte_order parameter, I must admit I didn't try to see the big picture and lazily chose to set default values that reproduced the historical behavior, to ensure compatibility with code that could have been currently using those functions and could break if the default behavior changes.

After browsing the code more in depth, I think we can say that all the classes and functions that are changed in this PR (NumpyPickler, NumpyArrayWrapper, _unpickle,...) are private API, that we can break. Do you agree with that ?

Also I grep'ed joblib code and can say that those changes in numpy_pickly.py do not even affect other joblib modules.

So I understand that we can break the default behavior of those internal functions. WDYT doing it a bit differently of what you're suggesting, and instead of setting default ensure_native_byte_order=False, not setting a default at all, so we have to explicitly set it wherever it's used ? since there is no harm in altering the signatures of internals, it will be even more explicit IMO.

I have the feeling that it is saner to make it explicit in the code that the case ensure_native_byte_order and mmap_mode is not None is physically not achievable in general.

Can we re-discuss that in light of the previous commit too

Still todo:

  • : document the fix in the changelog
  • : an additional test with joblib.load(filename, mmap_mode="r")

@fcharras fcharras force-pushed the FIX/memmapping_endianness branch from cd5f242 to b968f3f Compare February 27, 2025 11:59
@fcharras fcharras force-pushed the FIX/memmapping_endianness branch from b968f3f to aa04f09 Compare February 27, 2025 12:49
@lesteve lesteve changed the title Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel ENH Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel Feb 27, 2025
@fcharras fcharras changed the title ENH Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel FIX Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel Feb 27, 2025
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I am fine with the design (passing an extra explicit argument in the private API) to be explicit everywhere.

I think we should further document this behavior in joblib.load and make it possible to disable native byte ordering conversion on regular numpy arrays.

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Copy Markdown
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first pass on comment, trying to understand this so mostly rephrasing what I understood in the comments.

Co-authored-by: Thomas Moreau <thomas.moreau.2010@gmail.com>
@tomMoral
Copy link
Copy Markdown
Contributor

tomMoral commented Apr 2, 2025

Merging as the failure is present on main too.

@tomMoral tomMoral merged commit 5aeb4ca into main Apr 2, 2025
31 of 32 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in February 2025 sprint Apr 2, 2025
@tomMoral tomMoral deleted the FIX/memmapping_endianness branch April 2, 2025 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Large numpy arrays stored in big-endian format cannot be serialized, leading to errors with Parallel

4 participants