First I create some fake latitude and longitude points. I stash them in a dataset, and compute a 2d histogram on those.
```python
#!/usr/bin/env python
import xarray as xr
import numpy as np

lat = np.random.rand(50000) * 180 - 90
lon = np.random.rand(50000) * 360 - 180
d = xr.Dataset({'latitude': lat, 'longitude': lon})

latbins = np.r_[-90:90:2.]
lonbins = np.r_[-180:180:2.]
h, xx, yy = np.histogram2d(d['longitude'], d['latitude'], bins=(lonbins, latbins))
```

When I run this I get some underwhelming performance:
```
> time ./test_with_xarray.py

real    0m28.152s
user    0m27.201s
sys     0m0.630s
```
If I change the last line to

```python
h, xx, yy = np.histogram2d(d['longitude'].values, d['latitude'].values, bins=(lonbins, latbins))
```

(i.e. I pass the NumPy arrays directly to the `histogram2d` function), things are very different:
```
> time ./test_with_xarray.py

real    0m0.996s
user    0m0.569s
sys     0m0.253s
```
It's ~28 times slower to call `histogram2d` on the DataArrays than on the underlying NumPy arrays. I ran into this while histogramming quite large lon/lat vectors from multiple netCDF files: I got tired of waiting for the computation to finish, added `.values` to the call, and it went through very quickly.
It seems problematic that using xarray can slow your code down by a factor of ~28 with no real way for you to find out about it...
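For what it's worth, the `.values` workaround is a drop-in replacement: both calls produce identical bin counts, so nothing else in the script needs to change. A minimal sketch (using a seeded generator instead of `np.random.rand`, so the comparison is reproducible):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
lat = rng.random(50000) * 180 - 90
lon = rng.random(50000) * 360 - 180
d = xr.Dataset({'latitude': lat, 'longitude': lon})

latbins = np.r_[-90:90:2.]
lonbins = np.r_[-180:180:2.]

# Slow path from the report: histogram the DataArrays directly.
h_xr, _, _ = np.histogram2d(d['longitude'], d['latitude'],
                            bins=(lonbins, latbins))

# Fast path: extract the underlying NumPy arrays first.
h_np, _, _ = np.histogram2d(d['longitude'].values, d['latitude'].values,
                            bins=(lonbins, latbins))

# Identical counts, so only the runtime differs.
assert np.array_equal(h_xr, h_np)
```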