feat: improve merging of NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays#3773
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files
🚀 New features to boost your workflow:
|
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3773 |
|
Unfortunately @ariostas, this is still not the right fix because |
ak.almost_equal raises IndexError for UnionArray with unused contents NumpyArrays and simplification of UnionArrays + fix ak.almost_equal for UnionArrays
|
Okay @ariostas that turned out to be a little more complicated than we originally thought so I turned this PR into a feat + fix basically but those are useful features that may be used other contexts too in the future. P.S. Out of those 1.8K lines, the tests are 1.5 so it's not that big and many of the actual code changes are just propagating variables forward. |
ariostas
left a comment
There was a problem hiding this comment.
This looks great! Thank you, @ikrommyd.
I think these extra distinctions between castings are very useful. Later on it might be worth reconsidering classifying a union of floats and integers as "invalid".
The only comment I have is that it would be good to document what "family" means. I couldn't think of a better name, and I actually think "same_kind" would have been more appropriate, but that's not what it means in numpy. We'll see if someone else has a suggestion for this.
|
Thanks @ariostas! Yeah so "same_kind" and "equiv" I stole from https://numpy.org/devdocs/reference/generated/numpy.can_cast.html. And since we've been using And in the implementation we have this: def is_approx_dtype(left, right) -> bool:
if not dtype_exact:
for family in np.integer, np.floating:
if np.issubdtype(left, family):
return np.issubdtype(right, family)
return left == rightSo is
I also agree about that, it currently decides that on whether it can merge contents with the old default to tell you whether it can. Indeed I don't think we should generally call merging floats and ints a reasonable deafult, there may be a very good reason why someone put them separately in a union. |
|
Thanks @ianna, I'll wait a bit in case Peter can take a look too or in case I randomly come up with another test case and will merge afterwards 😃 |
|
I don't see anything else to do here, I've added a bit more testing but my brain can't think of something more atm. I'll just merge it :) |
Closes #3772
This PR does several improvements to when we can merge
NumpyArrays together, to how we simplifyUnionArrays using thesimplifiedconstructor and also as a result solves the original problem of comparing union arrays inalmost_equalwhich is much deeper that I originally thought.We do the following:
castingargument to the nplikecan_castmethod. That is because we want to use something else than the previously hard-coded "same_kind" in certain scenarios.mergeableand_mergeable_nextfunctions calledmergecastablewhich can be either "same_kind" (the default), "equiv", or "family" (two of those three names are taken from numpy'scan_castmethod). We obviously pass around this new argument everywhere_mergeable_nextis called (as that does not have default values, only the more publicmergeabledoes). That new argument tells us when we can merge contents. Previously, we'd merge contents when they can be casted together and that remains that default (same_kind). So ints and floats could be merged together. With the two new options we can choose if we only want to merge contents of the same family (int32 and int64 for example but not ints and floats) with the "family" option" or we can only merge exact dtypes together (int32 with int32 only for example) with the "equiv" option. We upgrade the_mergeable_nextimplementation ofNumpyArrayto handle these options and keep the default behavior like it always was.UnionArray.simplifiedconstructor which tries to merge contents of the union together. So now we can choose if we want to merge ints and floats together for example and so on.UnionArray.simplifiedconstructor and that is calleddropunused. A union array may not use all of its contents. There may be contents that are not used ever in thetagsand the simplification can drop those. Because a union array needs at least two contents to be instantiated, we take special care of the cases where one or zero contents are actually used only in order to avoid an__init__error. That argument naturally has the default valueFalseand will not try to drop unused contents unless explicitly asked for like it always has been.All those are generally nice features to have and can be used in other contexts because there is not one single way to merge numpy arrays, nor to simplify union arrays. The default behavior hasn't been changed and all those features are opt-in of course.
Now back to the
ak.almost_equalproblem. The problem is the original algorithm assumed that there is a one-to-one map between the tags of the left array and the right array and that is not generally true as you can have repeated content dtypes in the unions, unused contents and so on.We upgrade the algorithm as follows. The algorithm now for union arrays in
ak.almost_equaluses this newsimplifiedconstructor and does the following:dtype_exactargument isTrueand merges ONLY dtypes within families if it isFalse.False.over 1.5K lines of tests have been added to test all that.