Open
Labels
performance: Works, but not fast enough or uses too much memory
Description
Version of Awkward Array
2.7.4
Description and code to reproduce
numpy 1.26.4
pyarrow 19.0.0
The origin of the data I will use here is not really important, but for reference, it is:
1.9GB of points in feather2 format.
table = pyarrow.feather.read_table("microsoft-buildings-point.arrow")
130M points. The "geometry" column has x, y fields, both float64.
Issue 1
(the lesser issue)
Depending on how I convert the data, I get different layouts:
>>> ak.from_arrow(table)["geometry", "x"].layout
<IndexedOptionArray len='129735970'>
<index><Index dtype='int64' len='129735970'>
[ 0 1 2 ... 129735967 129735968 129735969]
</Index></index>
<content><NumpyArray dtype='float64' len='129735970'>
[ -84.95972352 -84.95973298 -84.9599375 ... -111.04598275
-111.047405 -111.0478207 ]
</NumpyArray></content>
</IndexedOptionArray>
>>> ak.from_arrow(table["geometry"])["x"].layout
<UnmaskedArray len='129735970'>
<content><NumpyArray dtype='float64' len='129735970'>
[ -84.95972352 -84.95973298 -84.9599375 ... -111.04598275
-111.047405 -111.0478207 ]
</NumpyArray></content>
</UnmaskedArray>
Here, the second variant is what you should get, since we know there are no NULLs. If you don't select "x", you see UnmaskedArrays even for the first route.
Issue 2
Doing some timings:
>>> x = ak.from_arrow(table["geometry"])["x"] # the unmasked variant
>>> np.max(x)
656ms
>>> ak.max(x)
666ms, OK, so dispatch does what we expect
>>> %timeit np.max(x.layout.content.data)
18ms, well that is just a bit faster
>>> %timeit np.nanmax(x.layout.content.data)
20ms, in case of nan (since we should have no NULLs)
>>> np.nanmax(np.where(True, x.layout.content.data, np.nan))
176ms, maybe this is what awkward actually does?
And with a handwritten simple numba kernel:
import numba
import numpy as np

@numba.njit(nogil=True, cache=True)
def mymax(x):
    # running maximum over finite values only (skips NaN and +/-inf)
    best = -np.inf
    for v in x:
        if np.isfinite(v) and v > best:
            best = v
    return best
we get
>>> mymax(x)
40.3ms
>>> mymax(x.layout.content.data)
20.2ms
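One caveat on comparing these numbers: the kernel skips every non-finite value, so it is not strictly equivalent to np.max or np.nanmax. A minimal sketch of the intended loop in plain Python (no numba, so it runs anywhere) shows the difference on a tiny made-up sample; the name finite_max and the sample values are only for illustration.

```python
import numpy as np

def finite_max(values):
    # Plain-Python version of the loop: keep the running maximum,
    # skipping NaN and +/-inf, and return that maximum.
    best = -np.inf
    for v in values:
        if np.isfinite(v) and v > best:
            best = v
    return best

# Made-up sample containing one NaN and one +inf.
sample = np.array([-84.96, np.nan, -111.05, np.inf, -90.0])

print(finite_max(sample))  # largest finite value
print(np.max(sample))      # NaN propagates through np.max
print(np.nanmax(sample))   # NaN is ignored, +inf is not
```

For the data in this issue all three should agree, assuming the column really contains no NaN or infinite coordinates.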
So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?
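For anyone who wants to compare against the np.where timing without the 1.9GB file, here is a rough NumPy-only sketch. The array size, the value range, and the helper names (direct_max, masked_style_max, best_ms) are assumptions made up for illustration, and it says nothing about what awkward actually does internally; it only shows that materialising a full-size temporary before reducing costs noticeably more than a single pass over the existing buffer.

```python
import timeit
import numpy as np

# Synthetic stand-in for the "x" column. The real data has ~130M float64
# values; a smaller array keeps this sketch quick to run.
rng = np.random.default_rng(0)
data = rng.uniform(-125.0, -65.0, size=2_000_000)

def direct_max(a):
    # Single pass over the existing buffer, no temporaries.
    return np.max(a)

def masked_style_max(a):
    # Builds a full-size temporary with np.where before reducing,
    # mirroring the np.nanmax(np.where(True, ..., np.nan)) timing.
    return np.nanmax(np.where(True, a, np.nan))

def best_ms(fn, a, repeat=5):
    # Best-of-N wall time in milliseconds for one call.
    return min(timeit.repeat(lambda: fn(a), number=1, repeat=repeat)) * 1e3

# Both routes must give the same answer on NaN-free data.
assert direct_max(data) == masked_style_max(data)

print(f"direct np.max:     {best_ms(direct_max, data):.2f} ms")
print(f"np.where + nanmax: {best_ms(masked_style_max, data):.2f} ms")
```

Absolute numbers will differ by machine and array size, so only the ratio between the two lines is meaningful.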