Question on performance #3403

@martindurant

Description

Version of Awkward Array

2.7.4

Description and code to reproduce

numpy 1.26.4
pyarrow 19.0.0

The origin of the data I will use here is not really important, but for reference, it is 1.9 GB of points in Feather (V2) format:

table = pyarrow.feather.read_table("microsoft-buildings-point.arrow")

130M points. The "geometry" column has x, y fields, both float64.

Issue 1

(the lesser issue)

Depending on how I convert the data, I get different layouts:

>>> ak.from_arrow(table)["geometry", "x"].layout
<IndexedOptionArray len='129735970'>
    <index><Index dtype='int64' len='129735970'>
        [        0         1         2 ... 129735967 129735968 129735969]
    </Index></index>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</IndexedOptionArray>

>>> ak.from_arrow(table["geometry"])["x"].layout
<UnmaskedArray len='129735970'>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</UnmaskedArray>

Here, the second variant is what you should get: we know there are no NULLs. If you don't select "x", you see UnmaskedArrays even via the first route.

Issue 2

Doing some timings:

>>> x = ak.from_arrow(table["geometry"])["x"]  # the unmasked variant
>>> %timeit np.max(x)
656 ms
>>> %timeit ak.max(x)
666 ms, OK, so dispatch does what we expect
>>> %timeit np.max(x.layout.content.data)
18 ms, well, that is quite a bit faster
>>> %timeit np.nanmax(x.layout.content.data)
20 ms, in case of NaNs (since we should have no NULLs)
>>> %timeit np.nanmax(np.where(True, x.layout.content.data, np.nan))
176 ms, maybe this is what awkward actually does?

And with a simple handwritten Numba kernel:

import numba
import numpy as np

@numba.njit(nogil=True, cache=True)
def mymax(x):
    # Running maximum over the finite values.
    best = -np.inf
    for v in x:
        if np.isfinite(v) and v > best:
            best = v
    return best

we get

>>> mymax(x)
40.3ms
>>> mymax(x.layout.content.data)
20.2ms
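For anyone reproducing these numbers outside IPython, the %timeit lines above can be approximated with the standard-library timeit module. This is a sketch on synthetic data, so the absolute times will not match the 130M-point table:

```python
import timeit
import numpy as np

# Synthetic stand-in for the real 130M-point column.
data = np.random.default_rng(0).uniform(-180.0, -60.0, 1_000_000)

def bench(fn, number=10):
    """Average seconds per call over `number` runs."""
    return timeit.timeit(lambda: fn(data), number=number) / number

for name, fn in [("np.max", np.max), ("np.nanmax", np.nanmax)]:
    print(f"{name}: {bench(fn) * 1e3:.3f} ms")
```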

So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?

Labels

performance: Works, but not fast enough or uses too much memory
