fix: extend limit on number of regular arrays to concatenate by ianna · Pull Request #3312 · scikit-hep/awkward

ianna · 2024-11-21T17:00:26Z

No description provided.

ianna

@jpivarski - I think, adding a few specializations to the kernels fixes the issue.

jpivarski · 2024-11-21T17:07:58Z

It would solve the issue without changing the implementation. We're still not allowing UnionArrays to have more than 127 types, right? Is the idea that we keep the axis != 0 RegularArray concatenation algorithm the way it is, allow bigger integer types in UnionArray.simplified, but raise an error if UnionArray.simplifed returns a UnionArray? (Which it shouldn't even be able to build because UnionArray constructors should continue to require tags to be int8.)

ianna · 2024-11-21T17:33:27Z

It would solve the issue without changing the implementation. We're still not allowing UnionArrays to have more than 127 types, right? Is the idea that we keep the axis != 0 RegularArray concatenation algorithm the way it is, allow bigger integer types in UnionArray.simplified, but raise an error if UnionArray.simplifed returns a UnionArray? (Which it shouldn't even be able to build because UnionArray constructors should continue to require tags to be int8.)

These kernels are only used to make an index in:

awkward/src/awkward/operations/ak_concatenate.py

Line 278 in 4db7973

index = ak.contents.UnionArray.regular_index(tags, backend=backend)

Then this index is used to create a simplified "UnionArray" and return it as a RegularArray:

awkward/src/awkward/operations/ak_concatenate.py

Lines 278 to 288 in 4db7973

    
               index = ak.contents.UnionArray.regular_index(tags, backend=backend) 
        
               inner = ak.contents.UnionArray.simplified( 
        
                   tags, 
        
                   index, 
        
                   [x._content for x in regulararrays], 
        
                   mergebool=mergebool, 
        
               ) 
        
               return (ak.contents.RegularArray(inner, prototype.size),) 
        
           elif all(

jpivarski

Okay, this is fine—the helper functions were designed for UnionArrays, but here they're being used for concatenation and not UnionArrays.

Do you really need int32, uint32, and int64 versions? If it's just an intermediate step in the ak.concatenate operation, why not make that intermediate step always int64? In general, the reason we have multiple dtypes of these nodes is to accept data with those dtypes from other sources without a copy-transformation. In fact, UnionArray's uint32 was motivated by a very early version of RNTuple—before that, Awkward didn't have any unsigned index types. 32-bit indexes in general were introduced for the sake of Arrow (along with BitMaskedArray).

ianna · 2024-11-22T16:44:53Z

Okay, this is fine—the helper functions were designed for UnionArrays, but here they're being used for concatenation and not UnionArrays.

Do you really need int32, uint32, and int64 versions? If it's just an intermediate step in the ak.concatenate operation, why not make that intermediate step always int64? In general, the reason we have multiple dtypes of these nodes is to accept data with those dtypes from other sources without a copy-transformation. In fact, UnionArray's uint32 was motivated by a very early version of RNTuple—before that, Awkward didn't have any unsigned index types. 32-bit indexes in general were introduced for the sake of Arrow (along with BitMaskedArray).

Yes, initially I thought we do not need all of them. However, looking at the code again it seems, yes, we need all of them: these are specializations for the tags defined as either int8 or int64 type.

NJManganelli · 2025-07-05T21:11:56Z

Based on this PR, I was going to remove a TODO from coffea which worked around the concatenation limit (I thought)

https://github.com/scikit-hep/coffea/blob/6dd4cc7e27aa22a3bd15a6ed9415cd676f5895fc/src/coffea/analysis_tools.py#L95-L96

However, this fails in the general case with

>>> boolean_masks_to_categorical_integers(masks)
Traceback (most recent call last):
  File "<python-input-14>", line 1, in <module>
    boolean_masks_to_categorical_integers(masks)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/coffea/src/coffea/analysis_tools.py", line 108, in boolean_masks_to_categorical_integers
    irregular_mask_sequel =  awkward.from_regular(awkward.concatenate(mask_inputs, axis=1), axis=1)
                                                  ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 41, in dispatch
    with OperationErrorContext(name, args, kwargs):
         ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 67, in dispatch
    next(gen_or_result)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 72, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 358, in _impl
    out = ak._broadcasting.broadcast_and_apply(
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        content_or_others,
        ^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        right_broadcast=False,
        ^^^^^^^^^^^^^^^^^^^^^^
    )[0]
    ^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1219, in broadcast_and_apply
    out = apply_step(
        backend,
    ...<13 lines>...
        },
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1197, in apply_step
    return continuation()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1166, in continuation
    return broadcast_any_list()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 670, in broadcast_any_list
    outcontent = apply_step(
        backend,
    ...<5 lines>...
        options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1179, in apply_step
    result = action(
        inputs,
    ...<5 lines>...
        options=options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 262, in action
    prototype[start : start + size] = tag
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 128 out of bounds for int8

This error occurred while calling

    ak.concatenate(
        [<Array [[False], [False], ..., [False]] type='20 * [1 * bool, parame...
        axis = 1
    )

I tried a higher range from the awkward test case added:

>>> for y in range(126,200):
...     print(y)
...     ak.concatenate([ak.Array([i])[:, None] for i in range(y)], axis=1)
...     
126
<Array [[0, 1, 2, 3, 4, ..., 121, 122, 123, 124, 125]] type='1 * 126 * int64'>
127
<Array [[0, 1, 2, 3, 4, ..., 122, 123, 124, 125, 126]] type='1 * 127 * int64'>
128
<Array [[0, 1, 2, 3, 4, ..., 123, 124, 125, 126, 127]] type='1 * 128 * int64'>
129
Traceback (most recent call last):
  File "<python-input-101>", line 3, in <module>
    ak.concatenate([ak.Array([i])[:, None] for i in range(y)], axis=1)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 41, in dispatch
    with OperationErrorContext(name, args, kwargs):
         ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_dispatch.py", line 67, in dispatch
    next(gen_or_result)
    ~~~~^^^^^^^^^^^^^^^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 72, in concatenate
    return _impl(arrays, axis, mergebool, highlevel, behavior, attrs)
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 358, in _impl
    out = ak._broadcasting.broadcast_and_apply(
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        content_or_others,
        ^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        right_broadcast=False,
        ^^^^^^^^^^^^^^^^^^^^^^
    )[0]
    ^
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1219, in broadcast_and_apply
    out = apply_step(
        backend,
    ...<13 lines>...
        },
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1197, in apply_step
    return continuation()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1166, in continuation
    return broadcast_any_list()
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 670, in broadcast_any_list
    outcontent = apply_step(
        backend,
    ...<5 lines>...
        options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/_broadcasting.py", line 1179, in apply_step
    result = action(
        inputs,
    ...<5 lines>...
        options=options,
    )
  File "/Users/nmangane/scikit-hep-dev-3/.pixi/envs/default/lib/python3.13/site-packages/awkward/operations/ak_concatenate.py", line 262, in action
    prototype[start : start + size] = tag
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
OverflowError: Python integer 128 out of bounds for int8

This error occurred while calling

    ak.concatenate(
        [<Array [[0]] type='1 * 1 * int64'>, <Array [[1]] type='1 * 1 * int64...
        axis = 1
    )

This might actually be a related but different issue in a way, hard for me to follow through all the code

versions:

>>> ak.__version__
'2.8.5'
>>> np.__version__
'2.2.6'
python -V
Python 3.13.5

ikrommyd · 2025-07-05T21:22:34Z

@ianna We probably just want to choose the dtype of the prototype like we do for the tags_cls right below or am I missing something?

awkward/src/awkward/operations/ak_concatenate.py

Lines 259 to 268 in e25f614

    
           prototype = backend.nplike.empty(sum(sizes), dtype=np.int8) 
        
           start = 0 
        
           for tag, size in enumerate(sizes): 
        
               prototype[start : start + size] = tag 
        
               start += size 
        
           if len(regulararrays) < 2**7: 
        
               tags_cls = ak.index.Index8 
        
           else: 
        
               tags_cls = ak.index.Index64

With np.int64 prototype this is fine.

In [1]: import awkward as ak

In [2]: import numpy as np

In [3]: ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(129)], axis=1)
Out[3]: <Array [[0, 1, 2, 3, 4, ..., 124, 125, 126, 127, 128]] type='1 * 129 * int64'>

In [4]: ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(210)], axis=1)
Out[4]: <Array [[0, 1, 2, 3, 4, ..., 205, 206, 207, 208, 209]] type='1 * 210 * int64'>

Without it, the test case for 129 ak.concatenate([ak.Array([i])[:, np.newaxis] for i in range(129)], axis=1) fails.

ianna · 2025-07-05T21:31:48Z

@NJManganelli - thanks for reporting it! There is a signed 8-bit integer that limits that number. I think, this limitation comes from the fact that our UnionArray layout has Index8 (8-bit integer) as its tags:

class UnionArray(Content):
    def __init__(self, tags, index, contents):
        assert isinstance(tags, Index8)
        assert isinstance(index, (Index32, IndexU32, Index64))
        assert isinstance(contents, list)
        assert len(index) >= len(tags)  # usually equal
        for x in tags:
            assert 0 <= x < len(contents)
        for i, x in enumerate(tags):
            assert 0 <= index[i] < len(contents[x])
        self.tags = tags
        self.index = index
        self.contents = contents

NJManganelli · 2025-07-05T22:08:08Z

What is the purpose of tags for union arrays?

Perhaps there's a smarter way to accomplish what's desired in the referenced function in coffea, which is more or less trying to convert a set of regular boolean masks (shape events * bool) into a jagged array of categorical integers. Where the first mask is true, a 0 is desired in axis= 1; where mask 8 is True, there should be a 7, such that only integers with corresponding True elements in the so-enumerated mask is True and there is no number where False. This is used as a tool to reduce fill calls when there are hundreds of masks that would otherwise have to be applied to numerous arrays, flatten them, then fill a histogram with the masked arrays. This was devastatingly nonperformant in the dask-awkward case. Instead they can all be broadcast to this jagged array of categorical integers, and one fill call per set of broadcast-compatible arrays is made, which results in much smaller graphs, optimization time, etc

ikrommyd · 2025-07-05T22:49:52Z

Why wouldn't this #3568 just be the right solution?

ianna · 2025-07-06T05:28:41Z

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

ianna · 2025-07-06T05:37:37Z

What is the purpose of tags for union arrays?

The UnionArray is a union of the contents that are any layouts. The tags point to a content. There is no copy involved unless it’s simplified, I think.

Perhaps there's a smarter way to accomplish what's desired in the referenced function in coffea, which is more or less trying to convert a set of regular boolean masks (shape events * bool) into a jagged array of categorical integers. Where the first mask is true, a 0 is desired in axis= 1; where mask 8 is True, there should be a 7, such that only integers with corresponding True elements in the so-enumerated mask is True and there is no number where False. This is used as a tool to reduce fill calls when there are hundreds of masks that would otherwise have to be applied to numerous arrays, flatten them, then fill a histogram with the masked arrays. This was devastatingly nonperformant in the dask-awkward case. Instead they can all be broadcast to this jagged array of categorical integers, and one fill call per set of broadcast-compatible arrays is made, which results in much smaller graphs, optimization time, etc

Thanks for clarifying, I need to think about it.

ikrommyd · 2025-07-06T07:25:09Z

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

I did this in 1 minute so maybe I’m wrong but doesn’t tags_cls already do that since it’s an Index64 if we have more than 128 arrays?

ianna · 2025-07-06T08:55:07Z

Why wouldn't this #3568 just be the right solution?

Because it changes the dtype of tags.

I did this in 1 minute so maybe I’m wrong but doesn’t tags_cls already do that since it’s an Index64 if we have more than 128 arrays?

Ok, I checked all the kernels involved - they do have specialization for int64. Also, both UnionArray functions regular_index and simplified allow Index64 as tags with caveats:

# Note: to help merge more than 128 arrays, tags *can* have type ak.index.Index64.
# This is only supported when index is also Index64,
# and all indexed contents are also Index64.
# We still require that this reduces to no more than 128 variants.

fix limit on number of regular arrays

f03a113

ianna linked an issue Nov 21, 2024 that may be closed by this pull request

ak.concatenate with axis=1 fails for more than 127 regular-length arrays #3310

Closed

style: pre-commit fixes

693b94b

ianna commented Nov 21, 2024

View reviewed changes

ianna requested a review from jpivarski November 21, 2024 17:04

pre-commit-ci bot temporarily deployed to docs November 21, 2024 17:13 Inactive

sort specializations

6648437

ianna temporarily deployed to docs November 21, 2024 17:26 — with GitHub Actions Inactive

sort one more

28e222a

ianna deployed to docs November 21, 2024 17:50 — with GitHub Actions View deployment

jpivarski approved these changes Nov 21, 2024

View reviewed changes

ianna merged commit 4eebc08 into main Nov 22, 2024

ianna deleted the ianna/3310-akconcatenate-with-axis=1-fails-for-more-than-127-regular-length-arrays branch November 22, 2024 16:45

NJManganelli mentioned this pull request Jul 5, 2025

chore: Remove workaround in boolean_masks_to_categorical_integers function scikit-hep/coffea#1361

Merged

Conversation

ianna commented Nov 21, 2024

Uh oh!

ianna left a comment

Choose a reason for hiding this comment

Uh oh!

jpivarski commented Nov 21, 2024

Uh oh!

ianna commented Nov 21, 2024

Uh oh!

jpivarski left a comment

Choose a reason for hiding this comment

Uh oh!

ianna commented Nov 22, 2024

Uh oh!

NJManganelli commented Jul 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ikrommyd commented Jul 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ianna commented Jul 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NJManganelli commented Jul 5, 2025

Uh oh!

ikrommyd commented Jul 5, 2025

Uh oh!

ianna commented Jul 6, 2025

Uh oh!

ianna commented Jul 6, 2025

Uh oh!

ikrommyd commented Jul 6, 2025

Uh oh!

ianna commented Jul 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

NJManganelli commented Jul 5, 2025 •

edited

Loading

ikrommyd commented Jul 5, 2025 •

edited

Loading

ianna commented Jul 5, 2025 •

edited

Loading