Conversation
ianna
left a comment
There was a problem hiding this comment.
@jpivarski - I think, adding a few specializations to the kernels fixes the issue.
|
It would solve the issue without changing the implementation. We're still not allowing UnionArrays to have more than 127 types, right? Is the idea that we keep the |
These kernels are only used to make an Then this index is used to create a simplified "UnionArray" and return it as a RegularArray:awkward/src/awkward/operations/ak_concatenate.py Lines 278 to 288 in 4db7973 |
jpivarski
left a comment
There was a problem hiding this comment.
Okay, this is fine—the helper functions were designed for UnionArrays, but here they're being used for concatenation and not UnionArrays.
Do you really need int32, uint32, and int64 versions? If it's just an intermediate step in the ak.concatenate operation, why not make that intermediate step always int64? In general, the reason we have multiple dtypes of these nodes is to accept data with those dtypes from other sources without a copy-transformation. In fact, UnionArray's uint32 was motivated by a very early version of RNTuple—before that, Awkward didn't have any unsigned index types. 32-bit indexes in general were introduced for the sake of Arrow (along with BitMaskedArray).
Yes, initially I thought we do not need all of them. However, looking at the code again it seems, yes, we need all of them: these are specializations for the |
|
Based on this PR, I was going to remove a TODO from coffea which worked around the concatenation limit (I thought) However, this fails in the general case with I tried a higher range from the awkward test case added: This might actually be a related but different issue in a way, hard for me to follow through all the code versions: |
|
@ianna We probably just want to choose the dtype of the awkward/src/awkward/operations/ak_concatenate.py Lines 259 to 268 in e25f614 With Without it, the test case for 129 |
|
@NJManganelli - thanks for reporting it! There is a signed 8-bit integer that limits that number. I think, this limitation comes from the fact that our class UnionArray(Content):
def __init__(self, tags, index, contents):
assert isinstance(tags, Index8)
assert isinstance(index, (Index32, IndexU32, Index64))
assert isinstance(contents, list)
assert len(index) >= len(tags) # usually equal
for x in tags:
assert 0 <= x < len(contents)
for i, x in enumerate(tags):
assert 0 <= index[i] < len(contents[x])
self.tags = tags
self.index = index
self.contents = contents |
|
What is the purpose of tags for union arrays? Perhaps there's a smarter way to accomplish what's desired in the referenced function in coffea, which is more or less trying to convert a set of regular boolean masks (shape events * bool) into a jagged array of categorical integers. Where the first mask is true, a |
|
Why wouldn't this #3568 just be the right solution? |
Because it changes the |
The
Thanks for clarifying, I need to think about it. |
I did this in 1 minute so maybe I’m wrong but doesn’t tags_cls already do that since it’s an Index64 if we have more than 128 arrays? |
Ok, I checked all the kernels involved - they do have specialization for |
No description provided.