
Add cython algo to get offset arrays for flat coordinates representation #93

Closed — jorisvandenbossche wants to merge 5 commits into pygeos:master from jorisvandenbossche:flatcoords

Conversation

@jorisvandenbossche (Member) commented Jan 31, 2020

An experiment with cython to tackle #75: creating a representation with a flat array of coordinates and offset arrays. Focusing for now on the (Multi)Polygon use case.
And trying it in cython: for this use case, I think it simplifies the implementation quite a bit compared to C (in particular for the array creation).

(note: the cython code certainly still needs to be cleaned up if we want to use it; everything is just dumped in a single file for now)

See https://nbviewer.jupyter.org/gist/jorisvandenbossche/398ac85012e1c0214633e6d3cdd3ae70 for an example of using this to convert a geopandas GeoDataFrame to this representation, in order to then visualize it with datashader (which needs this representation via spatialpandas).

@jorisvandenbossche (Member, Author) commented:

cc @jonmmease

@caspervdw (Member) commented:

I have been thinking about this serialization format. I think it would make sense to come up with something that covers all the different geometry types, not just polygons.

One way might be using WKB without the actual coordinates. We could encode everything in one list of 4-byte integers:

  • for point: type
  • for linestring: type, number of coordinates
  • for polygon: type, number of coordinates in the exterior ring, number of interior rings, [number of coordinates in each interior ring]
  • for collections: type, number of geometries, [wkb-without-coords of each geometry]

see https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
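As a hedged illustration of this proposal (not an implementation in this PR): the type codes below follow the standard WKB numbering (1=Point, 2=LineString, 3=Polygon, 6=MultiPolygon), and the encoder functions and the nested ring-length lists are purely hypothetical helpers for the sketch.

```python
# Sketch of the proposed "WKB without coordinates" structure encoding:
# one flat list of integers describing the structure, with the actual
# coordinates stored separately.

def encode_polygon(ring_lengths):
    # ring_lengths: number of coordinates per ring, exterior ring first
    # -> [type, n_exterior_coords, n_interior_rings, *interior_ring_lengths]
    return [3, ring_lengths[0], len(ring_lengths) - 1] + ring_lengths[1:]

def encode_multipolygon(polygons):
    # polygons: list of per-polygon ring-length lists
    # -> [type, n_geometries, *encoded_polygons]
    out = [6, len(polygons)]
    for ring_lengths in polygons:
        out += encode_polygon(ring_lengths)
    return out

# A MultiPolygon of two polygons: one with a 5-point exterior ring and a
# 4-point hole, and one with just a 4-point exterior ring:
encoded = encode_multipolygon([[5, 4], [4]])
print(encoded)  # [6, 2, 3, 5, 1, 4, 3, 4, 0]
```

One design consequence of this scheme is that the structure list has variable length per geometry, so a second offsets array would still be needed to find where each geometry's structure record starts.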

Would this approach work for your application to connect a GeoDataFrame to datashader?

@jorisvandenbossche (Member, Author) commented:

To be clear, this PR only implements it for MultiPolygons, but the approach is not specific to polygons. All other types can be stored as (possibly) nested flat coordinate arrays as well; you only need to know which type of geometry it represents. But I assume it is this aspect (storing that information directly in the data) that you are thinking about.
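For illustration (this summary is my sketch of the layout, not pygeos API): the flat-coordinates representation always stores one flat (N, 2) coordinate array, plus a number of offset arrays that depends on the geometry type's nesting depth.

```python
# Number of offset arrays needed on top of the flat coordinate array,
# per geometry type (illustrative sketch, not pygeos API).
nesting_levels = {
    "Point": 0,            # one coordinate pair per geometry
    "LineString": 1,       # offsets into the coordinates
    "Polygon": 2,          # ring offsets + coordinate offsets
    "MultiLineString": 2,  # part offsets + coordinate offsets
    "MultiPolygon": 3,     # part offsets + ring offsets + coordinate offsets
}
```

The MultiPolygon case, with its three offset arrays, is what this PR implements.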

How would you store this type information with the coordinates?
Or do you mean storing something like this per geometry in a separate array, instead of the separate array(s) of indices (indices into the flat coordinates)?

Would this approach work for your application to connect a GeoDataFrame to datashader?

Not directly, as we are using the Arrow memory layout (for list type: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
At least, that's the memory layout that datashader can now understand and render (not saying this is the only possible layout, but changes to it will require (possibly non-trivial) changes in datashader / spatialpandas).

One big advantage of using that layout (and the main reason I originally proposed it) is that it is also supported out of the box by the Parquet file format.
Of course, a generic variable-sized binary type is also natively supported by Apache Arrow / Parquet, but the disadvantage is that it will always require custom code to interpret, while the list type can be natively inspected as numpy arrays (using only pyarrow).
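The Arrow variable-size list layout referenced above can be sketched with plain numpy arrays: a flat "values" buffer plus an int32 "offsets" array of length n + 1, where list i is the slice values[offsets[i]:offsets[i + 1]] (the example data here is made up for illustration).

```python
import numpy as np

# Arrow variable-size list layout, mimicked with numpy:
# three lists [0.0, 1.0], [], [2.0, 3.0, 4.0] stored as one flat values
# array plus an offsets array with one more entry than there are lists.
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
offsets = np.array([0, 2, 2, 5], dtype=np.int32)

# Reconstruct list i as values[offsets[i]:offsets[i + 1]]
lists = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print([list(a) for a in lists])  # [[0.0, 1.0], [], [2.0, 3.0, 4.0]]
```

Because the offsets are cumulative, empty lists cost nothing extra and any sublist can be located in O(1) without scanning, which is what makes this layout convenient for datashader-style rendering.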

One way might be using WKB without the actual coordinates. We could encode everything in one list of 4-byte integers:

Regarding efficient serialization formats, there is also https://github.com/bjornharrtell/flatgeobuf/blob/master/src/fbs/feature.fbs if we want something that includes geometry type information (without inventing our own format).

@jorisvandenbossche (Member, Author) commented Apr 16, 2021

With the latest pygeos with the addition of get_parts, and potentially get_rings (#342), the functionality of this PR can also be mimicked with those functions:

import numpy as np
import pygeos

def get_flat_coords_offset_arrays(arr):
    # explode/flatten the MultiPolygons
    arr_flat, part_indices = pygeos.get_parts(arr, return_index=True)
    # the offsets into the multipolygon parts
    offsets1 = np.insert(np.bincount(part_indices).cumsum(), 0, 0)

    # explode/flatten the Polygons into Rings
    arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True)
    # the offsets into the exterior/interior rings of the multipolygon parts 
    offsets2 = np.insert(np.bincount(ring_indices).cumsum(), 0, 0)

    # the coords and offsets into the coordinates of the rings
    coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True)
    offsets3 = np.insert(np.bincount(indices).cumsum(), 0, 0)
    
    return coords, offsets1, offsets2, offsets3

(it's the same as what I showed in the issue (#75 (comment)), but now with the faster get_parts/get_rings instead of custom python loops)
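The offsets idiom used three times in the function above, `np.insert(np.bincount(indices).cumsum(), 0, 0)`, turns a sorted array of parent indices (as returned by `return_index=True`) into an Arrow-style offsets array. A minimal standalone demonstration with made-up indices:

```python
import numpy as np

# Six child elements belonging to parents 0, 0, 1, 2, 2, 2:
indices = np.array([0, 0, 1, 2, 2, 2])

# bincount gives the number of children per parent -> [2, 1, 3];
# cumsum turns counts into end positions -> [2, 3, 6];
# inserting a leading 0 yields the offsets array -> [0, 2, 3, 6],
# so parent i owns the slice offsets[i]:offsets[i + 1].
offsets = np.insert(np.bincount(indices).cumsum(), 0, 0)
print(offsets)  # [0 2 3 6]
```

Note this assumes the indices are sorted and that trailing parents without children do not occur (bincount would otherwise need an explicit `minlength`).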

However, there is still quite a big difference in performance.

First, using a small example to test:

import geopandas
import numpy as np
import pygeos
from pygeos.flatcoords import get_offset_arrays

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
arr = df.geometry.array.data

coords = pygeos.get_coordinates(arr)
offsets1, offsets2, offsets3 = get_offset_arrays(arr)

coordsb, offsets1b, offsets2b, offsets3b = get_flat_coords_offset_arrays(arr)

# check both are equivalent (this passes)
np.testing.assert_allclose(offsets1, offsets1b)
np.testing.assert_allclose(offsets2, offsets2b)
np.testing.assert_allclose(offsets3, offsets3b)
np.testing.assert_allclose(coords, coordsb)

Comparing both:

In [4]: %timeit get_flat_coords_offset_arrays(arr)
719 µs ± 24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit
   ...: coords = pygeos.get_coordinates(arr)
   ...: offsets1, offsets2, offsets3 = get_offset_arrays(arr)
220 µs ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So the single-pass cython algo is considerably faster (well, it still uses a separate pass for the coordinates right now). And note that it is mainly getting the coordinates that takes time, as computing the offsets themselves is quite fast:

In [6]: %timeit offsets1, offsets2, offsets3 = get_offset_arrays(arr)
27.6 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is of course only a tiny toy dataset.
Using a different / larger dataset:

In [3]: df_tl = geopandas.read_file("/home/joris/Downloads/tl_2019_us_zcta510.zip", rows=10000,
   ...:                             ignore_fields=['ZCTA5CE10', 'GEOID10', 'CLASSFP10', 'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10'])

In [4]: arr = df_tl.geometry.array.data

In [5]: %timeit get_flat_coords_offset_arrays(arr)
574 ms ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: coords = pygeos.get_coordinates(arr)
   ...: offsets1, offsets2, offsets3 = get_offset_arrays(arr)
223 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Relative speed-up is a bit smaller in this case, but the cython algo still gives a decent improvement.

As a comparison, both are faster than conversion to WKB:

In [7]: %timeit pygeos.to_wkb(arr)
876 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
