Skip to content

fix: extend limit on number of regular arrays to concatenate#3312

Merged
ianna merged 4 commits intomainfrom
ianna/3310-akconcatenate-with-axis=1-fails-for-more-than-127-regular-length-arrays
Nov 22, 2024
Merged

fix: extend limit on number of regular arrays to concatenate#3312
ianna merged 4 commits intomainfrom
ianna/3310-akconcatenate-with-axis=1-fails-for-more-than-127-regular-length-arrays

Conversation

@ianna
Copy link
Copy Markdown
Member

@ianna ianna commented Nov 21, 2024

No description provided.

@ianna ianna linked an issue Nov 21, 2024 that may be closed by this pull request
Copy link
Copy Markdown
Member Author

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski - I think, adding a few specializations to the kernels fixes the issue.

@ianna ianna requested a review from jpivarski November 21, 2024 17:04
@jpivarski
Copy link
Copy Markdown
Member

It would solve the issue without changing the implementation. We're still not allowing UnionArrays to have more than 127 types, right? Is the idea that we keep the axis != 0 RegularArray concatenation algorithm the way it is, allow bigger integer types in UnionArray.simplified, but raise an error if UnionArray.simplifed returns a UnionArray? (Which it shouldn't even be able to build because UnionArray constructors should continue to require tags to be int8.)

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Nov 21, 2024

It would solve the issue without changing the implementation. We're still not allowing UnionArrays to have more than 127 types, right? Is the idea that we keep the axis != 0 RegularArray concatenation algorithm the way it is, allow bigger integer types in UnionArray.simplified, but raise an error if UnionArray.simplifed returns a UnionArray? (Which it shouldn't even be able to build because UnionArray constructors should continue to require tags to be int8.)

These kernels are only used to make an index in:

index = ak.contents.UnionArray.regular_index(tags, backend=backend)

Then this index is used to create a simplified "UnionArray" and return it as a RegularArray:
index = ak.contents.UnionArray.regular_index(tags, backend=backend)
inner = ak.contents.UnionArray.simplified(
tags,
index,
[x._content for x in regulararrays],
mergebool=mergebool,
)
return (ak.contents.RegularArray(inner, prototype.size),)
elif all(

Copy link
Copy Markdown
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is fine—the helper functions were designed for UnionArrays, but here they're being used for concatenation and not UnionArrays.

Do you really need int32, uint32, and int64 versions? If it's just an intermediate step in the ak.concatenate operation, why not make that intermediate step always int64? In general, the reason we have multiple dtypes of these nodes is to accept data with those dtypes from other sources without a copy-transformation. In fact, UnionArray's uint32 was motivated by a very early version of RNTuple—before that, Awkward didn't have any unsigned index types. 32-bit indexes in general were introduced for the sake of Arrow (along with BitMaskedArray).

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Nov 22, 2024

Okay, this is fine—the helper functions were designed for UnionArrays, but here they're being used for concatenation and not UnionArrays.

Do you really need int32, uint32, and int64 versions? If it's just an intermediate step in the ak.concatenate operation, why not make that intermediate step always int64? In general, the reason we have multiple dtypes of these nodes is to accept data with those dtypes from other sources without a copy-transformation. In fact, UnionArray's uint32 was motivated by a very early version of RNTuple—before that, Awkward didn't have any unsigned index types. 32-bit indexes in general were introduced for the sake of Arrow (along with BitMaskedArray).

Yes, initially I thought we do not need all of them. However, looking at the code again it seems, yes, we need all of them: these are specializations for the tags defined as either int8 or int64 type.

@ianna ianna merged commit 4eebc08 into main Nov 22, 2024
@ianna ianna deleted the ianna/3310-akconcatenate-with-axis=1-fails-for-more-than-127-regular-length-arrays branch November 22, 2024 16:45
@NJManganelli
Copy link
Copy Markdown
Contributor

NJManganelli commented Jul 5, 2025

Based on this PR, I was going to remove a TODO from coffea which worked around the concatenation limit (I thought)

https://github.com/scikit-hep/coffea/blob/6dd4cc7e27aa22a3bd15a6ed9415cd676f5895fc/src/coffea/analysis_tools.py#L95-L96

However, this fails in the general case with

>>> boolean_masks_to_categorical_integers(masks)
Traceback (most recent call last):
  File "<python-input-14>", line 1, in <module>
    boolean_masks_to_categorical_integers(masks)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/coffea/src/coffea/analysis_tools.py", line 108, in boolean_masks_to_categorical_integers
    irregular_mask_sequel =  awkward.from_regular(awkward.concatenate(mask_inputs, axis=1), axis=1)
                                                  ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 41, in dispatch
    with OperationErrorContext(name, args, kwargs):
         ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 67, in dispatch
    next(gen_or_result)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 72, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 358, in _impl
    out = ak._broadcasting.broadcast_and_apply(
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        content_or_others,
        ^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        right_broadcast=False,
        ^^^^^^^^^^^^^^^^^^^^^^
    )[0]
    ^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1219, in broadcast_and_apply
    out = apply_step(
        backend,
    ...<13 lines>...
        },
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1197, in apply_step
    return continuation()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1166, in continuation
    return broadcast_any_list()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 670, in broadcast_any_list
    outcontent = apply_step(
        backend,
    ...<5 lines>...
        options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1179, in apply_step
    result = action(
        inputs,
    ...<5 lines>...
        options=options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 262, in action
    prototype[start : start + size] = tag
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 128 out of bounds for int8

This error occurred while calling

    ak.concatenate(
        [<Array [[False], [False], ..., [False]] type='20 * [1 * bool, parame...
        axis = 1
    )

I tried a higher range from the awkward test case added:

>>> for y in range(126,200):
...     print(y)
...     ak.concatenate([ak.Array([i])[:, None] for i in range(y)], axis=1)
...     
126
<Array [[0, 1, 2, 3, 4, ..., 121, 122, 123, 124, 125]] type='1 * 126 * int64'>
127
<Array [[0, 1, 2, 3, 4, ..., 122, 123, 124, 125, 126]] type='1 * 127 * int64'>
128
<Array [[0, 1, 2, 3, 4, ..., 123, 124, 125, 126, 127]] type='1 * 128 * int64'>
129
Traceback (most recent call last):
  File "<python-input-101>", line 3, in <module>
    ak.concatenate([ak.Array([i])[:, None] for i in range(y)], axis=1)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 41, in dispatch
    with OperationErrorContext(name, args, kwargs):
         ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 67, in dispatch
    next(gen_or_result)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 72, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 358, in _impl
    out = ak._broadcasting.broadcast_and_apply(
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        content_or_others,
        ^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        right_broadcast=False,
        ^^^^^^^^^^^^^^^^^^^^^^
    )[0]
    ^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1219, in broadcast_and_apply
    out = apply_step(
        backend,
    ...<13 lines>...
        },
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1197, in apply_step
    return continuation()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1166, in continuation
    return broadcast_any_list()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 670, in broadcast_any_list
    outcontent = apply_step(
        backend,
    ...<5 lines>...
        options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1179, in apply_step
    result = action(
        inputs,
    ...<5 lines>...
        options=options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 262, in action
    prototype[start : start + size] = tag
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 128 out of bounds for int8

This error occurred while calling

    ak.concatenate(
        [<Array [[0]] type='1 * 1 * int64'>, <Array [[1]] type='1 * 1 * int64...
        axis = 1
    )

This might actually be a related but different issue in a way, hard for me to follow through all the code

versions:

>>> ak.__version__
'2.8.5'
>>> np.__version__
'2.2.6'
python -V
Python 3.13.5

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Jul 5, 2025

@ianna We probably just want to choose the dtype of the prototype like we do for the tags_cls right below or am I missing something?

prototype = backend.nplike.empty(sum(sizes), dtype=np.int8)
start = 0
for tag, size in enumerate(sizes):
prototype[start : start + size] = tag
start += size
if len(regulararrays) < 2**7:
tags_cls = ak.index.Index8
else:
tags_cls = ak.index.Index64

With np.int64 prototype this is fine.

In [1]: import awkward as ak

In [2]: import numpy as np

In [3]: ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(129)], axis=1)
Out[3]: <Array [[0, 1, 2, 3, 4, ..., 124, 125, 126, 127, 128]] type='1 * 129 * int64'>

In [4]: ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(210)], axis=1)
Out[4]: <Array [[0, 1, 2, 3, 4, ..., 205, 206, 207, 208, 209]] type='1 * 210 * int64'>

Without it, the test case for 129 ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(129)], axis=1) fails.

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Jul 5, 2025

@NJManganelli - thanks for reporting it! There is a signed 8-bit integer that limits that number. I think, this limitation comes from the fact that our UnionArray layout has Index8 (8-bit integer) as its tags:

class UnionArray(Content):
    def __init__(self, tags, index, contents):
        assert isinstance(tags, Index8)
        assert isinstance(index, (Index32, IndexU32, Index64))
        assert isinstance(contents, list)
        assert len(index) >= len(tags)  # usually equal
        for x in tags:
            assert 0 <= x < len(contents)
        for i, x in enumerate(tags):
            assert 0 <= index[i] < len(contents[x])
        self.tags = tags
        self.index = index
        self.contents = contents

@NJManganelli
Copy link
Copy Markdown
Contributor

What is the purpose of tags for union arrays?

Perhaps there's a smarter way to accomplish what's desired in the referenced function in coffea, which is more or less trying to convert a set of regular boolean masks (shape events * bool) into a jagged array of categorical integers. Where the first mask is true, a 0 is desired in axis= 1; where mask 8 is True, there should be a 7, such that only integers with corresponding True elements in the so-enumerated mask is True and there is no number where False. This is used as a tool to reduce fill calls when there are hundreds of masks that would otherwise have to be applied to numerous arrays, flatten them, then fill a histogram with the masked arrays. This was devastatingly nonperformant in the dask-awkward case. Instead they can all be broadcast to this jagged array of categorical integers, and one fill call per set of broadcast-compatible arrays is made, which results in much smaller graphs, optimization time, etc

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Jul 5, 2025

Why wouldn't this #3568 just be the right solution?

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Jul 6, 2025

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Jul 6, 2025

What is the purpose of tags for union arrays?

The UnionArray is a union of the contents that are any layouts. The tags point to a content. There is no copy involved unless it’s simplified, I think.

Perhaps there's a smarter way to accomplish what's desired in the referenced function in coffea, which is more or less trying to convert a set of regular boolean masks (shape events * bool) into a jagged array of categorical integers. Where the first mask is true, a 0 is desired in axis= 1; where mask 8 is True, there should be a 7, such that only integers with corresponding True elements in the so-enumerated mask is True and there is no number where False. This is used as a tool to reduce fill calls when there are hundreds of masks that would otherwise have to be applied to numerous arrays, flatten them, then fill a histogram with the masked arrays. This was devastatingly nonperformant in the dask-awkward case. Instead they can all be broadcast to this jagged array of categorical integers, and one fill call per set of broadcast-compatible arrays is made, which results in much smaller graphs, optimization time, etc

Thanks for clarifying, I need to think about it.

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Jul 6, 2025

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

I did this in 1 minute so maybe I’m wrong but doesn’t tags_cls already do that since it’s an Index64 if we have more than 128 arrays?

@ianna
Copy link
Copy Markdown
Member Author

ianna commented Jul 6, 2025

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

I did this in 1 minute so maybe I’m wrong but doesn’t tags_cls already do that since it’s an Index64 if we have more than 128 arrays?

Ok, I checked all the kernels involved - they do have specialization for int64. Also, both UnionArray functions regular_index and simplified allow Index64 as tags with caveats:

# Note: to help merge more than 128 arrays, tags *can* have type ak.index.Index64.
# This is only supported when index is also Index64,
# and all indexed contents are also Index64.
# We still require that this reduces to no more than 128 variants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ak.concatenate with axis=1 fails for more than 127 regular-length arrays

4 participants