Skip to content

feat: improve merging of NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays#3773

Merged
ikrommyd merged 25 commits intoscikit-hep:mainfrom
ikrommyd:fix-almost-equal
Dec 22, 2025
Merged

feat: improve merging of NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays#3773
ikrommyd merged 25 commits intoscikit-hep:mainfrom
ikrommyd:fix-almost-equal

Conversation

@ikrommyd
Copy link
Copy Markdown
Collaborator

@ikrommyd ikrommyd commented Dec 17, 2025

Closes #3772

This PR does several improvements to when we can merge NumpyArrays together, to how we simplify UnionArrays using the simplified constructor and also as a result solves the original problem of comparing union arrays in almost_equal which is much deeper that I originally thought.

We do the following:

  1. We introduce the ability to pass in the casting argument to the nplike can_cast method. That is because we want to use something else than the previously hard-coded "same_kind" in certain scenarios.
  2. We introduce a new argument in the mergeable and _mergeable_next functions called mergecastable which can be either "same_kind" (the default), "equiv", or "family" (two of those three names are taken from numpy's can_cast method). We obviously pass around this new argument everywhere _mergeable_next is called (as that does not have default values, only the more public mergeable does). That new argument tells us when we can merge contents. Previously, we'd merge contents when they can be casted together and that remains that default (same_kind). So ints and floats could be merged together. With the two new options we can choose if we only want to merge contents of the same family (int32 and int64 for example but not ints and floats) with the "family" option" or we can only merge exact dtypes together (int32 with int32 only for example) with the "equiv" option. We upgrade the _mergeable_next implementation of NumpyArray to handle these options and keep the default behavior like it always was.
  3. We expose this argument to the UnionArray.simplified constructor which tries to merge contents of the union together. So now we can choose if we want to merge ints and floats together for example and so on.
  4. We introduce one extra argument in the UnionArray.simplified constructor and that is called dropunused. A union array may not use all of its contents. There may be contents that are not used ever in the tags and the simplification can drop those. Because a union array needs at least two contents to be instantiated, we take special care of the cases where one or zero contents are actually used only in order to avoid an __init__ error. That argument naturally has the default value False and will not try to drop unused contents unless explicitly asked for like it always has been.

All those are generally nice features to have and can be used in other contexts because there is not one single way to merge numpy arrays, nor to simplify union arrays. The default behavior hasn't been changed and all those features are opt-in of course.

Now back to the ak.almost_equal problem. The problem is the original algorithm assumed that there is a one-to-one map between the tags of the left array and the right array and that is not generally true as you can have repeated content dtypes in the unions, unused contents and so on.

We upgrade the algorithm as follows. The algorithm now for union arrays in ak.almost_equal uses this new simplified constructor and does the following:

  1. Drops all unused contents of union arrays.
  2. Simplifies the union arrays and merges ONLY identical dtypes if the dtype_exact argument is True and merges ONLY dtypes within families if it is False.
  3. At this point, if two unions do not have the same number of contents, they are not equal and therefore returns False.
  4. Proceeds with the original algorithm that assumes a one-to-one map between the tags of the left and right arrays.

over 1.5K lines of tests have been added to test all that.

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 98.52941% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 82.75%. Comparing base (54dae87) to head (2d59e8a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/awkward/contents/numpyarray.py 90.90% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/_do.py 84.14% <100.00%> (ø)
src/awkward/_nplikes/array_module.py 95.25% <100.00%> (+0.31%) ⬆️
src/awkward/_nplikes/numpy_like.py 100.00% <ø> (ø)
src/awkward/_nplikes/typetracer.py 77.31% <100.00%> (+1.72%) ⬆️
src/awkward/contents/bitmaskedarray.py 70.02% <100.00%> (ø)
src/awkward/contents/bytemaskedarray.py 88.38% <100.00%> (ø)
src/awkward/contents/content.py 77.12% <100.00%> (ø)
src/awkward/contents/emptyarray.py 78.28% <100.00%> (ø)
src/awkward/contents/indexedarray.py 85.06% <100.00%> (ø)
src/awkward/contents/indexedoptionarray.py 89.57% <100.00%> (ø)
... and 8 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3773

@ikrommyd ikrommyd marked this pull request as draft December 17, 2025 12:53
Copy link
Copy Markdown
Member

@ariostas ariostas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @ikrommyd! From the discussion on Slack, I think this is the best approach, but let's see what other people think.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

Unfortunately @ariostas, this is still not the right fix because simplified promotes dtypes. If you have a float64 and an int64 content in a union array, the simplification will get you back float64 even though the float64 content may not be used at all. Therefore if one of the two arrays that you are comparing has this, almost_equal gets you False back just because of the dtype mismatch.

@ikrommyd ikrommyd changed the title fix: ak.almost_equal raises IndexError for UnionArray with unused contents feat: improve merging of NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays Dec 19, 2025
@ikrommyd ikrommyd requested a review from ariostas December 19, 2025 17:21
@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Dec 19, 2025

Okay @ariostas that turned out to be a little more complicated than we originally thought so I turned this PR into a feat + fix basically but those are useful features that may be used other contexts too in the future.
I have updated the PR description with a big text explaining all the changes and the logic behind them. Please take a look whenever you can 😃.

P.S. Out of those 1.8K lines, the tests are 1.5 so it's not that big and many of the actual code changes are just propagating variables forward.

@ikrommyd ikrommyd marked this pull request as ready for review December 19, 2025 17:28
@ikrommyd ikrommyd requested a review from pfackeldey December 19, 2025 18:52
Copy link
Copy Markdown
Member

@ariostas ariostas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you, @ikrommyd.

I think these extra distinctions between castings are very useful. Later on it might be worth reconsidering classifying a union of floats and integers as "invalid".

The only comment I have is that it would be good to document what "family" means. I couldn't think of a better name, and I actually think "same_kind" would have been more appropriate, but that's not what it means in numpy. We'll see if someone else has a suggestion for this.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Dec 19, 2025

Thanks @ariostas! Yeah so "same_kind" and "equiv" I stole from https://numpy.org/devdocs/reference/generated/numpy.can_cast.html.
Numpy also tells you this

In [6]: np.can_cast(np.int64, np.float64, "same_kind")
Out[6]: True

In [7]: np.can_cast(np.int64, np.float64, "equiv")
Out[7]: False

And since we've been using can_cast with "same_kind" since forever and that decided to merge ints and floats together, I naturally thought that I would call this merging the same name and make it the default too.
Additionally we say this in the docstring of almost equal

        dtype_exact: whether the dtypes must be exactly the same, or just the
            same family.

And in the implementation we have this:

    def is_approx_dtype(left, right) -> bool:
        if not dtype_exact:
            for family in np.integer, np.floating:
                if np.issubdtype(left, family):
                    return np.issubdtype(right, family)
        return left == right

So is dtype_exact the dtypes needs to match exactly and if not, the docstring uses the word "same family". Which is why I called this option "family". It just means integers together and floats together and so on. I generally tried to come up with names that match existing patterns in the codebase.

Later on it might be worth reconsidering classifying a union of floats and integers as "invalid".

I also agree about that, it currently decides that on whether it can merge contents with the old default to tell you whether it can. Indeed I don't think we should generally call merging floats and ints a reasonable deafult, there may be a very good reason why someone put them separately in a union.

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - Greater! Thanks for implementing this. Please go ahead and merge it if you’re done with it. Thanks.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Dec 20, 2025

Thanks @ianna, I'll wait a bit in case Peter can take a look too or in case I randomly come up with another test case and will merge afterwards 😃

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Dec 22, 2025

I don't see anything else to do here, I've added a bit more testing but my brain can't think of something more atm. I'll just merge it :)

@ikrommyd ikrommyd merged commit 11b71d6 into scikit-hep:main Dec 22, 2025
37 of 39 checks passed
@ikrommyd ikrommyd deleted the fix-almost-equal branch December 22, 2025 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ak.almost_equal raises IndexError for UnionArray with unused contents

3 participants