Context: the "spatialpandas" project is trying to define a representation of geometries based on "ragged arrays" (nested arrays where subarrays can have varying length). This can give a representation of geometries that is independent of GEOS, and can have a faster access to the coordinates (as the coordinates of all geometries are stored as a flat array). The driving use case for this is in the datashader project (https://github.com/pyviz/datashader/), that is adding support for plotting polygons and other geometry types. Datashader is especially targetting visualizing big datasets, and then fast access to the coordinates is important.
@jonmmease is implementing such a "ragged array" representation in holoviz/spatialpandas#2 (pandas ExtensionArrays (to store it in a dataframe) based on pyarrow's ListArrays). I am partly involved in that effort in giving feedback from a geopandas viewpoint (also for visualizing large geodataframes, datashader can be a very interesting approach).
PyGEOS is of course tied to GEOS, and other representations as described above are clearly out of scope.
But, I think it would be useful to have a fast conversion from a GEOS-based representation to such a "flat array + offsets"-based representation. And to have such conversion fast, it is probably easiest to have this live in PyGEOS (or at least provide all necessary building blocks to construct it).
We already have a pygeos.get_coordinates, which gives you access to the flat coordinates array, but in addition to that we would also need functionality to build up the offsets arrays.
Below follows an example of how to create such a "flat array + offsets" representation with the current pygeos API for the countries dataset of geopandas (mixture of polygon and multipolygons, but I first convert them to all multipolygons):
import geopandas
import shapely.geometry
import pygeos
def ensure_multipolygon(geom):
if isinstance(geom, shapely.geometry.MultiPolygon):
return geom
elif isinstance(geom, shapely.geometry.Polygon):
return shapely.geometry.MultiPolygon([geom])
elif geom is None:
return geom
else:
return ValueError
df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# create pygeos array with all MultiPolygons from geopandas/shapely
arr = pygeos.from_wkb([ensure_multipolygon(geom).wkb for geom in df.geometry])
# the flat array of coordinates
coords = pygeos.get_coordinates(arr)
coords_flat = coords.ravel()
# the offsets into the multipolygon parts
offsets1 = np.insert(pygeos.get_num_geometries(arr).cumsum(), 0, 0)
# the offsets into the exterio/interior rings of the multipolygon parts
def flatten_multipolygons(arr):
res = []
for geom in arr:
for i in range(pygeos.get_num_geometries(geom)):
poly = pygeos.get_geometry(geom, i)
res.append(poly)
return np.array(res, dtype=object)
arr_flat = flatten_multipolygons(arr)
offsets2 = np.insert((1 + pygeos.get_num_interior_rings(arr_flat)).cumsum(), 0, 0)
# the offsets into the coordinates of the rings
def flatten_polygons(arr):
res = []
for poly in arr:
res.append(pygeos.get_exterior_ring(poly))
for i in range(pygeos.get_num_interior_rings(poly)):
res.append(pygeos.get_interior_ring(poly, i))
return np.array(res, dtype=object)
arr_flat2 = flatten_polygons(arr_flat)
offsets3 = np.insert(np.cumsum(pygeos.get_num_coordinates(arr_flat2)*2), 0, 0)
# flat array + offsets
(coords_flat, offsets3, offsets2, offsets1)
Some general notes / remarks:
- Specifics of this representation / approach:
- This only works with a uniform geometry type (as from the offsets you can't see which geometry type is being represented)
- The number of offsets arrays (the level of nesting) depends on the geometry type. The MultiPolygons gives the most nested (3 levels: list of MultiPolygon parts -> each part has list of rings -> each ring has list of coordinates). But eg LineStrings would give 1 level of nesting.
- It is already somewhat possible with the current APIs to get the number of geometries, of inner rings, of coordinates, .. (see the above code). But, currently we need to "explode" the array of MultiPolygons to use those APIs (which is probably a useful functionality anyway).
Right now this creates python overhead (as this flattening is done in python loops), but even if there were pygeos functions for it, would give the overhead of creating a full arrays of those multipolygon parts / linearrings, while it is not needed to created full arrays for them.
- Having a dedicated function for it instead of providing the building blocks could also be done in a single pass over the geometries (or one pass to get the size of the arrays + one pass to fill in the arrays), instead of multiple passes for each of the flats coords arrays + offsets.
Context: the "spatialpandas" project is trying to define a representation of geometries based on "ragged arrays" (nested arrays where subarrays can have varying length). This can give a representation of geometries that is independent of GEOS, and can have a faster access to the coordinates (as the coordinates of all geometries are stored as a flat array). The driving use case for this is in the datashader project (https://github.com/pyviz/datashader/), that is adding support for plotting polygons and other geometry types. Datashader is especially targetting visualizing big datasets, and then fast access to the coordinates is important.
@jonmmease is implementing such a "ragged array" representation in holoviz/spatialpandas#2 (pandas ExtensionArrays (to store it in a dataframe) based on pyarrow's ListArrays). I am partly involved in that effort in giving feedback from a geopandas viewpoint (also for visualizing large geodataframes, datashader can be a very interesting approach).
PyGEOS is of course tied to GEOS, and other representations as described above are clearly out of scope.
But, I think it would be useful to have a fast conversion from a GEOS-based representation to such a "flat array + offsets"-based representation. And to have such conversion fast, it is probably easiest to have this live in PyGEOS (or at least provide all necessary building blocks to construct it).
We already have a
pygeos.get_coordinates, which gives you access to the flat coordinates array, but in addition to that we would also need functionality to build up the offsets arrays.Below follows an example of how to create such a "flat array + offsets" representation with the current pygeos API for the countries dataset of geopandas (mixture of polygon and multipolygons, but I first convert them to all multipolygons):
Some general notes / remarks:
Right now this creates python overhead (as this flattening is done in python loops), but even if there were pygeos functions for it, would give the overhead of creating a full arrays of those multipolygon parts / linearrings, while it is not needed to created full arrays for them.