feat: improve merging of `NumpyArray`s and simplification of `UnionArray`s + fix `ak.almost_equal` for `UnionArray`s by ikrommyd · Pull Request #3773 · scikit-hep/awkward

ikrommyd · 2025-12-17T11:36:12Z

This PR does several improvements to when we can merge NumpyArrays together, to how we simplify UnionArrays using the simplified constructor and also as a result solves the original problem of comparing union arrays in almost_equal which is much deeper that I originally thought.

We do the following:

We introduce the ability to pass in the casting argument to the nplike can_cast method. That is because we want to use something else than the previously hard-coded "same_kind" in certain scenarios.
We introduce a new argument in the mergeable and _mergeable_next functions called mergecastable which can be either "same_kind" (the default), "equiv", or "family" (two of those three names are taken from numpy's can_cast method). We obviously pass around this new argument everywhere _mergeable_next is called (as that does not have default values, only the more public mergeable does). That new argument tells us when we can merge contents. Previously, we'd merge contents when they can be casted together and that remains that default (same_kind). So ints and floats could be merged together. With the two new options we can choose if we only want to merge contents of the same family (int32 and int64 for example but not ints and floats) with the "family" option" or we can only merge exact dtypes together (int32 with int32 only for example) with the "equiv" option. We upgrade the _mergeable_next implementation of NumpyArray to handle these options and keep the default behavior like it always was.
We expose this argument to the UnionArray.simplified constructor which tries to merge contents of the union together. So now we can choose if we want to merge ints and floats together for example and so on.
We introduce one extra argument in the UnionArray.simplified constructor and that is called dropunused. A union array may not use all of its contents. There may be contents that are not used ever in the tags and the simplification can drop those. Because a union array needs at least two contents to be instantiated, we take special care of the cases where one or zero contents are actually used only in order to avoid an __init__ error. That argument naturally has the default value False and will not try to drop unused contents unless explicitly asked for like it always has been.

All those are generally nice features to have and can be used in other contexts because there is not one single way to merge numpy arrays, nor to simplify union arrays. The default behavior hasn't been changed and all those features are opt-in of course.

Now back to the ak.almost_equal problem. The problem is the original algorithm assumed that there is a one-to-one map between the tags of the left array and the right array and that is not generally true as you can have repeated content dtypes in the unions, unused contents and so on.

We upgrade the algorithm as follows. The algorithm now for union arrays in ak.almost_equal uses this new simplified constructor and does the following:

Drops all unused contents of union arrays.
Simplifies the union arrays and merges ONLY identical dtypes if the dtype_exact argument is True and merges ONLY dtypes within families if it is False.
At this point, if two unions do not have the same number of contents, they are not equal and therefore returns False.
Proceeds with the original algorithm that assumes a one-to-one map between the tags of the left and right arrays.

over 1.5K lines of tests have been added to test all that.

codecov · 2025-12-17T11:43:14Z

Codecov Report

❌ Patch coverage is 98.52941% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 82.75%. Comparing base (54dae87) to head (2d59e8a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/awkward/contents/numpyarray.py	90.90%	1 Missing ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
src/awkward/_do.py	`84.14% <100.00%> (ø)`
src/awkward/_nplikes/array_module.py	`95.25% <100.00%> (+0.31%)`	⬆️
src/awkward/_nplikes/numpy_like.py	`100.00% <ø> (ø)`
src/awkward/_nplikes/typetracer.py	`77.31% <100.00%> (+1.72%)`	⬆️
src/awkward/contents/bitmaskedarray.py	`70.02% <100.00%> (ø)`
src/awkward/contents/bytemaskedarray.py	`88.38% <100.00%> (ø)`
src/awkward/contents/content.py	`77.12% <100.00%> (ø)`
src/awkward/contents/emptyarray.py	`78.28% <100.00%> (ø)`
src/awkward/contents/indexedarray.py	`85.06% <100.00%> (ø)`
src/awkward/contents/indexedoptionarray.py	`89.57% <100.00%> (ø)`
... and 8 more

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-12-17T12:00:07Z

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3773

ariostas

Thank you, @ikrommyd! From the discussion on Slack, I think this is the best approach, but let's see what other people think.

ikrommyd · 2025-12-18T08:26:29Z

Unfortunately @ariostas, this is still not the right fix because simplified promotes dtypes. If you have a float64 and an int64 content in a union array, the simplification will get you back float64 even though the float64 content may not be used at all. Therefore if one of the two arrays that you are comparing has this, almost_equal gets you False back just because of the dtype mismatch.

ikrommyd · 2025-12-19T17:22:43Z

Okay @ariostas that turned out to be a little more complicated than we originally thought so I turned this PR into a feat + fix basically but those are useful features that may be used other contexts too in the future.
I have updated the PR description with a big text explaining all the changes and the logic behind them. Please take a look whenever you can 😃.

P.S. Out of those 1.8K lines, the tests are 1.5 so it's not that big and many of the actual code changes are just propagating variables forward.

ariostas

This looks great! Thank you, @ikrommyd.

I think these extra distinctions between castings are very useful. Later on it might be worth reconsidering classifying a union of floats and integers as "invalid".

The only comment I have is that it would be good to document what "family" means. I couldn't think of a better name, and I actually think "same_kind" would have been more appropriate, but that's not what it means in numpy. We'll see if someone else has a suggestion for this.

ikrommyd · 2025-12-19T23:48:06Z

Thanks @ariostas! Yeah so "same_kind" and "equiv" I stole from https://numpy.org/devdocs/reference/generated/numpy.can_cast.html.
Numpy also tells you this

In [6]: np.can_cast(np.int64, np.float64, "same_kind")
Out[6]: True

In [7]: np.can_cast(np.int64, np.float64, "equiv")
Out[7]: False

And since we've been using can_cast with "same_kind" since forever and that decided to merge ints and floats together, I naturally thought that I would call this merging the same name and make it the default too.
Additionally we say this in the docstring of almost equal

        dtype_exact: whether the dtypes must be exactly the same, or just the
            same family.

And in the implementation we have this:

    def is_approx_dtype(left, right) -> bool:
        if not dtype_exact:
            for family in np.integer, np.floating:
                if np.issubdtype(left, family):
                    return np.issubdtype(right, family)
        return left == right

So is dtype_exact the dtypes needs to match exactly and if not, the docstring uses the word "same family". Which is why I called this option "family". It just means integers together and floats together and so on. I generally tried to come up with names that match existing patterns in the codebase.

Later on it might be worth reconsidering classifying a union of floats and integers as "invalid".

I also agree about that, it currently decides that on whether it can merge contents with the old default to tell you whether it can. Indeed I don't think we should generally call merging floats and ints a reasonable deafult, there may be a very good reason why someone put them separately in a union.

ianna

@ikrommyd - Greater! Thanks for implementing this. Please go ahead and merge it if you’re done with it. Thanks.

ikrommyd · 2025-12-20T13:37:46Z

Thanks @ianna, I'll wait a bit in case Peter can take a look too or in case I randomly come up with another test case and will merge afterwards 😃

ikrommyd · 2025-12-22T23:21:47Z

I don't see anything else to do here, I've added a bit more testing but my brain can't think of something more atm. I'll just merge it :)

ikrommyd added 3 commits December 17, 2025 12:14

proper left_tag_to_right_tag map

6d0a247

add test

541e83d

rename test

9e97557

ikrommyd added 2 commits December 17, 2025 12:46

more testing

82bb354

renames

8fb3c63

ikrommyd marked this pull request as draft December 17, 2025 12:53

ikrommyd added 5 commits December 17, 2025 14:49

better testing

e792841

a couple more tests

1445bf5

maybe this?

652c217

what if we just use simplified?

0f015c0

Merge branch 'main' into fix-almost-equal

3ff19e1

ariostas approved these changes Dec 17, 2025

View reviewed changes

we need this too

82df78e

ikrommyd added 9 commits December 18, 2025 11:05

maybe we should just error if they are not both valid?

37c5a65

missing space

a20b15e

introduce mergecastable argument to _mergeable_next

bfdf99b

implement dropunused

ae29ffb

tests

7b43859

change mergecastable to string and add tests for almost_equal

87cc50b

no reason to swap variables

bb64cf1

more tests

0490bea

consider the complex family too

51c444b

ikrommyd changed the title ~~fix: ak.almost_equal raises IndexError for UnionArray with unused contents~~ feat: improve merging of NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays Dec 19, 2025

explicit dtypes here

a8c425d

ikrommyd requested a review from ariostas December 19, 2025 17:21

ikrommyd marked this pull request as ready for review December 19, 2025 17:28

ikrommyd requested a review from pfackeldey December 19, 2025 18:52

ariostas approved these changes Dec 19, 2025

View reviewed changes

ianna approved these changes Dec 20, 2025

View reviewed changes

ikrommyd added 4 commits December 20, 2025 19:31

bit more testing

1754904

Merge branch 'main' into fix-almost-equal

b24257f

hopefully finalize testing now

a5023c4

comments here

2d59e8a

ikrommyd merged commit 11b71d6 into scikit-hep:main Dec 22, 2025
37 of 39 checks passed

ikrommyd deleted the fix-almost-equal branch December 22, 2025 23:22

This was referenced Dec 22, 2025

fix: incorrect type passed down for the mergeastable argument #3779

Merged

feat: consider unions of ints and floats as valid layouts #3780

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: improve merging of `NumpyArray`s and simplification of `UnionArray`s + fix `ak.almost_equal` for `UnionArray`s#3773

feat: improve merging of `NumpyArray`s and simplification of `UnionArray`s + fix `ak.almost_equal` for `UnionArray`s#3773
ikrommyd merged 25 commits intoscikit-hep:mainfrom
ikrommyd:fix-almost-equal

ikrommyd commented Dec 17, 2025 •

edited

Loading

Uh oh!

codecov bot commented Dec 17, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 17, 2025

Uh oh!

ariostas left a comment

Uh oh!

ikrommyd commented Dec 18, 2025

Uh oh!

ikrommyd commented Dec 19, 2025 •

edited

Loading

Uh oh!

ariostas left a comment

Uh oh!

ikrommyd commented Dec 19, 2025 •

edited

Loading

Uh oh!

ianna left a comment

Uh oh!

ikrommyd commented Dec 20, 2025 •

edited

Loading

Uh oh!

ikrommyd commented Dec 22, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ikrommyd commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Dec 17, 2025

Uh oh!

ariostas left a comment

Choose a reason for hiding this comment

Uh oh!

ikrommyd commented Dec 18, 2025

Uh oh!

ikrommyd commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ariostas left a comment

Choose a reason for hiding this comment

Uh oh!

ikrommyd commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ianna left a comment

Choose a reason for hiding this comment

Uh oh!

ikrommyd commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ikrommyd commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ikrommyd commented Dec 17, 2025 •

edited

Loading

codecov bot commented Dec 17, 2025 •

edited

Loading

ikrommyd commented Dec 19, 2025 •

edited

Loading

ikrommyd commented Dec 19, 2025 •

edited

Loading

ikrommyd commented Dec 20, 2025 •

edited

Loading

ikrommyd commented Dec 22, 2025 •

edited

Loading