Add cython algo to get offset arrays for flat coordinates representation#93
Add cython algo to get offset arrays for flat coordinates representation#93jorisvandenbossche wants to merge 5 commits intopygeos:masterfrom
Conversation
|
cc @jonmmease |
|
I have been thinking about this serialization format. I think it would make sense if we would come up with something that covers all different geometries and not just polygons. One way might be using WKB without the actual coordinates. We could encode everything in one list of 4-byte integers:
see https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry Would this approach work for your application to connect a GeoDataFrame to datashader? |
|
To be clear, this PR only implements it for MultiPolygons, but it's not specific to polygons. All other types can be stored as (possibly) nested flat coordinate arrays as well. You only need to know which type of geometry it represents. But I assume it is this aspect (storing this information directly in the data) you are thinking about. How would you store this type information with the coordinates?
Not directly, as we are using the Arrow memory layout (for list type: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout). One big advantage of using that layout (and that's the main reason I originally proposed that) is that it is also supported out of the box by parquet file format.
Regarding efficient serialization formats, there is also https://github.com/bjornharrtell/flatgeobuf/blob/master/src/fbs/feature.fbs if we want to have something that includes geometry type information (but without inventing our own format) |
|
With the latest pygeos with the addition of def get_flat_coords_offset_arrays(arr):
# explode/flatten the MultiPolygons
arr_flat, part_indices = pygeos.get_parts(arr, return_index=True)
# the offsets into the multipolygon parts
offsets1 = np.insert(np.bincount(part_indices).cumsum(), 0, 0)
# explode/flatten the Polygons into Rings
arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True)
# the offsets into the exterior/interior rings of the multipolygon parts
offsets2 = np.insert(np.bincount(ring_indices).cumsum(), 0, 0)
# the coords and offsets into the coordinates of the rings
coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True)
offsets3 = np.insert(np.bincount(indices).cumsum(), 0, 0)
return coords, offsets1, offsets2, offsets3(it's the same as what I showed in the issue (#75 (comment)), but then with faster However, there is still a quite big difference in performance. First using a small example to test: import geopandas
import pygeos
from pygeos.flatcoords import get_offset_arrays
df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
arr = df.geometry.array.data
coords = pygeos.get_coordinates(arr)
offsets1, offsets2, offsets3 = get_offset_arrays(arr)
coordsb, offsets1b, offsets2b, offsets3b = get_flat_coords_offset_arrays(arr)
# check both are equivalent (this passes)
np.testing.assert_allclose(offsets1, offsets1b)
np.testing.assert_allclose(offsets2, offsets2b)
np.testing.assert_allclose(offsets3, offsets3b)
np.testing.assert_allclose(coords, coordsb)Comparing both: So the single-pass cython algo it is considerably faster (well, still a different pass for the coordinates right now). And note that it's mainly getting the coordinates that takes time, as the offsets itself is quite fast: This is of course only a tiny toy dataset. Relative speed-up is a bit smaller in this case, but the cython algo still gives a decent improvement. As comparison, both are faster than conversion to WKB: |
An experiment with cython to tackle #75: creating a representation with a flat array of coordinates and offset arrays. Focusing for now on the (Multi)Polygon use case.
And trying it in cython, for this use case I think it simplifies it quite a bit compared to C (for array creation)
(note: the cython code certainly still needs to be cleaned up, if we want to use it, just dumped everything in a single file now)
See https://nbviewer.jupyter.org/gist/jorisvandenbossche/398ac85012e1c0214633e6d3cdd3ae70 for example of using this to convert a geopandas GeoDataFrame to this representation, to then visualize it with datashader (which needs this representation via spatialpandas).