
Add cython algo to get offset arrays for flat coordinates representation #93

Closed — jorisvandenbossche wants to merge 5 commits into pygeos:master from jorisvandenbossche:flatcoords

Conversation

@jorisvandenbossche (Member) commented Jan 31, 2020

An experiment with cython to tackle #75: creating a representation with a flat array of coordinates and offset arrays. Focusing for now on the (Multi)Polygon use case.
And trying it in cython: for this use case, I think it simplifies the implementation quite a bit compared to C (in particular for the array creation).

(note: the cython code certainly still needs to be cleaned up if we want to use it; everything is just dumped in a single file for now)

See https://nbviewer.jupyter.org/gist/jorisvandenbossche/398ac85012e1c0214633e6d3cdd3ae70 for an example of using this to convert a geopandas GeoDataFrame to this representation, in order to then visualize it with datashader (which needs this representation via spatialpandas).

@jorisvandenbossche (Member, Author) commented:

cc @jonmmease

@caspervdw (Member) commented:

I have been thinking about this serialization format. I think it would make sense to come up with something that covers all the different geometry types, not just polygons.

One way might be using WKB without the actual coordinates. We could encode everything in one list of 4-byte integers:

  • for point: type
  • for linestring: type, number of coordinates
  • for polygon: type, number of coordinates in the exterior ring, number of interior rings, [number of coordinates in each interior ring]
  • for collections: type, number of geometries, [wkb-without-coords of each geometry]

see https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
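As a hedged illustration of this proposal (not an implementation in this PR): the type codes below follow the standard WKB numbering (1=Point, 2=LineString, 3=Polygon, 6=MultiPolygon), and the encoder functions and the nested ring-length lists are purely hypothetical helpers for the sketch.

```python
# Sketch of the proposed "WKB without coordinates" structure encoding:
# one flat list of integers describing the structure, with the actual
# coordinates stored separately.

def encode_polygon(ring_lengths):
    # ring_lengths: number of coordinates per ring, exterior ring first
    # -> [type, n_exterior_coords, n_interior_rings, *interior_ring_lengths]
    return [3, ring_lengths[0], len(ring_lengths) - 1] + ring_lengths[1:]

def encode_multipolygon(polygons):
    # polygons: list of per-polygon ring-length lists
    # -> [type, n_geometries, *encoded_polygons]
    out = [6, len(polygons)]
    for ring_lengths in polygons:
        out += encode_polygon(ring_lengths)
    return out

# A MultiPolygon of two polygons: one with a 5-point exterior ring and a
# 4-point hole, and one with just a 4-point exterior ring:
encoded = encode_multipolygon([[5, 4], [4]])
print(encoded)  # [6, 2, 3, 5, 1, 4, 3, 4, 0]
```

One design consequence of this scheme is that the structure list has variable length per geometry, so a second offsets array would still be needed to find where each geometry's structure record starts.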

Would this approach work for your application to connect a GeoDataFrame to datashader?

@jorisvandenbossche (Member, Author) commented:

To be clear, this PR only implements it for MultiPolygons, but the approach is not specific to polygons. All other types can be stored as (possibly) nested flat coordinate arrays as well; you only need to know which type of geometry it represents. But I assume it is this aspect (storing that information directly in the data) that you are thinking about.
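For illustration (this summary is my sketch of the layout, not pygeos API): the flat-coordinates representation always stores one flat (N, 2) coordinate array, plus a number of offset arrays that depends on the geometry type's nesting depth.

```python
# Number of offset arrays needed on top of the flat coordinate array,
# per geometry type (illustrative sketch, not pygeos API).
nesting_levels = {
    "Point": 0,            # one coordinate pair per geometry
    "LineString": 1,       # offsets into the coordinates
    "Polygon": 2,          # ring offsets + coordinate offsets
    "MultiLineString": 2,  # part offsets + coordinate offsets
    "MultiPolygon": 3,     # part offsets + ring offsets + coordinate offsets
}
```

The MultiPolygon case, with its three offset arrays, is what this PR implements.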

How would you store this type information with the coordinates?
Or do you mean storing something like this per geometry in a separate array, instead of the separate array(s) of indices (indices into the flat coordinates)?

Would this approach work for your application to connect a GeoDataFrame to datashader?

Not directly, as we are using the Arrow memory layout (for list type: https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
At least, that's the memory layout that datashader can now understand and render (not saying this is the only possible layout, but changes to it will require (possibly non-trivial) changes in datashader / spatialpandas).

One big advantage of using that layout (and the main reason I originally proposed it) is that it is also supported out of the box by the Parquet file format.
Of course, a generic variable-sized binary type is also natively supported by Apache Arrow / Parquet, but the disadvantage is that it will always require custom code to interpret, while the list type can be natively inspected as numpy arrays (using only pyarrow).
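The Arrow variable-size list layout referenced above can be sketched with plain numpy arrays: a flat "values" buffer plus an int32 "offsets" array of length n + 1, where list i is the slice values[offsets[i]:offsets[i + 1]] (the example data here is made up for illustration).

```python
import numpy as np

# Arrow variable-size list layout, mimicked with numpy:
# three lists [0.0, 1.0], [], [2.0, 3.0, 4.0] stored as one flat values
# array plus an offsets array with one more entry than there are lists.
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
offsets = np.array([0, 2, 2, 5], dtype=np.int32)

# Reconstruct list i as values[offsets[i]:offsets[i + 1]]
lists = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print([list(a) for a in lists])  # [[0.0, 1.0], [], [2.0, 3.0, 4.0]]
```

Because the offsets are cumulative, empty lists cost nothing extra and any sublist can be located in O(1) without scanning, which is what makes this layout convenient for datashader-style rendering.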

One way might be using WKB without the actual coordinates. We could encode everything in one list of 4-byte integers:

Regarding efficient serialization formats, there is also https://github.com/bjornharrtell/flatgeobuf/blob/master/src/fbs/feature.fbs if we want something that includes geometry type information (without inventing our own format).

@jorisvandenbossche (Member, Author) commented Apr 16, 2021

With the latest pygeos with the addition of get_parts, and potentially get_rings (#342), the functionality of this PR can also be mimicked with those functions:

import numpy as np
import pygeos

def get_flat_coords_offset_arrays(arr):
    # explode/flatten the MultiPolygons
    arr_flat, part_indices = pygeos.get_parts(arr, return_index=True)
    # the offsets into the multipolygon parts
    offsets1 = np.insert(np.bincount(part_indices).cumsum(), 0, 0)

    # explode/flatten the Polygons into Rings
    arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True)
    # the offsets into the exterior/interior rings of the multipolygon parts 
    offsets2 = np.insert(np.bincount(ring_indices).cumsum(), 0, 0)

    # the coords and offsets into the coordinates of the rings
    coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True)
    offsets3 = np.insert(np.bincount(indices).cumsum(), 0, 0)
    
    return coords, offsets1, offsets2, offsets3

(it's the same as what I showed in the issue (#75 (comment)), but now with the faster get_parts/get_rings instead of custom python loops)
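The offsets idiom used three times in the function above, `np.insert(np.bincount(indices).cumsum(), 0, 0)`, turns a sorted array of parent indices (as returned by `return_index=True`) into an Arrow-style offsets array. A minimal standalone demonstration with made-up indices:

```python
import numpy as np

# Six child elements belonging to parents 0, 0, 1, 2, 2, 2:
indices = np.array([0, 0, 1, 2, 2, 2])

# bincount gives the number of children per parent -> [2, 1, 3];
# cumsum turns counts into end positions -> [2, 3, 6];
# inserting a leading 0 yields the offsets array -> [0, 2, 3, 6],
# so parent i owns the slice offsets[i]:offsets[i + 1].
offsets = np.insert(np.bincount(indices).cumsum(), 0, 0)
print(offsets)  # [0 2 3 6]
```

Note this assumes the indices are sorted and that trailing parents without children do not occur (bincount would otherwise need an explicit `minlength`).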

However, there is still quite a big difference in performance.

First, using a small example to test:

import geopandas
import numpy as np
import pygeos
from pygeos.flatcoords import get_offset_arrays

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
arr = df.geometry.array.data

coords = pygeos.get_coordinates(arr)
offsets1, offsets2, offsets3 = get_offset_arrays(arr)

coordsb, offsets1b, offsets2b, offsets3b = get_flat_coords_offset_arrays(arr)

# check both are equivalent (this passes)
np.testing.assert_allclose(offsets1, offsets1b)
np.testing.assert_allclose(offsets2, offsets2b)
np.testing.assert_allclose(offsets3, offsets3b)
np.testing.assert_allclose(coords, coordsb)

Comparing both:

In [4]: %timeit get_flat_coords_offset_arrays(arr)
719 µs ± 24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %%timeit
   ...: coords = pygeos.get_coordinates(arr)
   ...: offsets1, offsets2, offsets3 = get_offset_arrays(arr)
220 µs ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So the single-pass cython algo is considerably faster (well, it still uses a separate pass for the coordinates right now). And note that it is mainly getting the coordinates that takes time, as computing the offsets themselves is quite fast:

In [6]: %timeit offsets1, offsets2, offsets3 = get_offset_arrays(arr)
27.6 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is of course only a tiny toy dataset.
Using a different / larger dataset:

In [3]: df_tl = geopandas.read_file("/home/joris/Downloads/tl_2019_us_zcta510.zip", rows=10000,
   ...:                             ignore_fields=['ZCTA5CE10', 'GEOID10', 'CLASSFP10', 'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10'])

In [4]: arr = df_tl.geometry.array.data

In [5]: %timeit get_flat_coords_offset_arrays(arr)
574 ms ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: coords = pygeos.get_coordinates(arr)
   ...: offsets1, offsets2, offsets3 = get_offset_arrays(arr)
223 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Relative speed-up is a bit smaller in this case, but the cython algo still gives a decent improvement.

As a comparison, both are faster than conversion to WKB:

In [7]: %timeit pygeos.to_wkb(arr)
876 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
