numpy function very slow on DataArray compared to DataArray.values #1247

@vnoel

First I create some fake latitude and longitude points. I stash them in a dataset, and compute a 2d histogram on those.

#!/usr/bin/env python

import xarray as xr
import numpy as np

lat = np.random.rand(50000) * 180 - 90
lon = np.random.rand(50000) * 360 - 180
d = xr.Dataset({'latitude':lat, 'longitude':lon})

latbins = np.r_[-90:90:2.]
lonbins = np.r_[-180:180:2.]
h, xx, yy = np.histogram2d(d['longitude'], d['latitude'], bins=(lonbins, latbins))

When I run this I get some underwhelming performance:

> time ./test_with_xarray.py

real	0m28.152s
user	0m27.201s
sys	0m0.630s

If I change the last line to

h, xx, yy = np.histogram2d(d['longitude'].values, d['latitude'].values, bins=(lonbins, latbins))

(i.e. I pass the numpy arrays directly to the histogram2d function), things are very different:

> time ./test_with_xarray.py

real	0m0.996s
user	0m0.569s
sys	0m0.253s

It's ~28 times slower to call histogram2d on the DataArrays than on the underlying numpy arrays. I ran into this issue while histogramming quite large lon/lat vectors from multiple netCDF files. I got tired of waiting for the computation to end, added .values to the call, and it then finished very quickly.

It seems problematic that using xarray can slow down your code by 28 times with no real way for you to know about it...
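
For reference, here is a minimal sketch of the workaround as a reusable function. The helper name `histogram2d_unwrapped` is hypothetical (it is not part of numpy or xarray); it simply coerces both inputs to plain numpy arrays with `np.asarray` before calling `np.histogram2d`, which should have the same effect as adding `.values` by hand. The demo below uses plain numpy arrays so that it runs without xarray installed.

```python
import numpy as np


def histogram2d_unwrapped(x, y, bins):
    """Hypothetical helper: unwrap array-like inputs, then histogram them.

    np.asarray returns the input unchanged if it is already an ndarray,
    and converts other array-like objects to a plain ndarray.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    return np.histogram2d(x, y, bins=bins)


# Same fake data and bin edges as in the script above.
rng = np.random.default_rng(0)
lat = rng.random(50000) * 180 - 90
lon = rng.random(50000) * 360 - 180

latbins = np.r_[-90:90:2.]
lonbins = np.r_[-180:180:2.]

h, xx, yy = histogram2d_unwrapped(lon, lat, bins=(lonbins, latbins))
```

With the dataset from the script above, the call would be `histogram2d_unwrapped(d['longitude'], d['latitude'], bins=(lonbins, latbins))`. I have only timed the `.values` variant, so treat the speed of this wrapper as an assumption to check on your own data.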
