Question on performance #3403

@martindurant

Description

Version of Awkward Array

2.7.4

Description and code to reproduce

numpy 1.26.4
pyarrow 19.0.0

The origin of the data I will use here is not really important, but for reference, it is 1.9 GB of points in Feather (V2) format:

table = pyarrow.feather.read_table("microsoft-buildings-point.arrow")

130M points. The "geometry" column has x, y fields, both float64.

Issue 1

(the lesser issue)

Depending on how I convert the data, I get different layouts:

>>> ak.from_arrow(table)["geometry", "x"].layout
<IndexedOptionArray len='129735970'>
    <index><Index dtype='int64' len='129735970'>
        [        0         1         2 ... 129735967 129735968 129735969]
    </Index></index>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</IndexedOptionArray>

>>> ak.from_arrow(table["geometry"])["x"].layout
<UnmaskedArray len='129735970'>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</UnmaskedArray>

Here, the second variant is what you should get: we know there are no NULLs. If you don't select "x", you see UnmaskedArrays even via the first route.

Issue 2

Doing some timings:

>>> x = ak.from_arrow(table["geometry"])["x"]  # the unmasked variant
>>> %timeit np.max(x)
656 ms
>>> %timeit ak.max(x)
666 ms, OK, so dispatch does what we expect
>>> %timeit np.max(x.layout.content.data)
18 ms, well, that is quite a bit faster
>>> %timeit np.nanmax(x.layout.content.data)
20 ms, in case of NaNs (since we should have no NULLs)
>>> %timeit np.nanmax(np.where(True, x.layout.content.data, np.nan))
176 ms, maybe this is what awkward actually does?

And with a simple handwritten Numba kernel:

import numba
import numpy as np

@numba.njit(nogil=True, cache=True)
def mymax(x):
    # Running maximum over the finite values.
    best = -np.inf
    for v in x:
        if np.isfinite(v) and v > best:
            best = v
    return best

we get

>>> mymax(x)
40.3ms
>>> mymax(x.layout.content.data)
20.2ms
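For anyone reproducing these numbers outside IPython, the %timeit lines above can be approximated with the standard-library timeit module. This is a sketch on synthetic data, so the absolute times will not match the 130M-point table:

```python
import timeit
import numpy as np

# Synthetic stand-in for the real 130M-point column.
data = np.random.default_rng(0).uniform(-180.0, -60.0, 1_000_000)

def bench(fn, number=10):
    """Average seconds per call over `number` runs."""
    return timeit.timeit(lambda: fn(data), number=number) / number

for name, fn in [("np.max", np.max), ("np.nanmax", np.nanmax)]:
    print(f"{name}: {bench(fn) * 1e3:.3f} ms")
```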

So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?

Labels

performance: Works, but not fast enough or uses too much memory
