Skip to content

Added STRtree::nearest function#111

Closed
brendan-ward wants to merge 10 commits intopygeos:masterfrom
brendan-ward:tree_nearest
Closed

Added STRtree::nearest function#111
brendan-ward wants to merge 10 commits intopygeos:masterfrom
brendan-ward:tree_nearest

Conversation

@brendan-ward
Copy link
Copy Markdown
Contributor

This is one of a few variants of nearest neighbor functionality based on the STRtree.

This version wraps the GEOS nearest neighbor functionality and returns the index of the nearest item in the tree.

Right now this is limited to a singular geometry operation. I'm considering changing that to a bulk geometry operation (given array of source geometries, return nearest neighbor index for each), which would likely be more useful.

Other variants that return multiple neighbors for each source geometry could be handled in a later PR.

@adriangb
Copy link
Copy Markdown

adriangb commented Mar 15, 2020

A couple of questions:

  1. Are you sure that it returns the lowest index for equidistant geometries? This would be contrary to the behavior of libspatialindex used in Toblerity/rtree. This behavior is actually super important for spatial joins to be consistent. A distance-based search with 0 distance should be the same as an intersection search. Plus, it is always easy to drop the other indexes but impossible to recover them.
  2. Is the distance being calculated on the bounding boxes or the geometries themselves? From peaking at the source, it looks like the latter, which would be super useful!

@brendan-ward
Copy link
Copy Markdown
Contributor Author

@adriangb good catch - I had falsely concluded the ordering of results based on conflating the order of insertion into the tree with the ordering of the extents in the evaluation of nearest neighbors.

I have a hard time following the implementation in libspatialindex but it looks like the case there is that it will return multiple matches if there are multiple with the same distance.

GEOS returns only one even in the case of multiple matches.

It is a little hard to follow the implementation, but it looks like it recursively tests the geometry extent against the extent of each node in the tree starting from the root. Based on which ones have the minimum distance between extents, it then picks either the largest or the one with leaf nodes and continues down the tree. At a leaf node, it then iterates over the child items and calculates their minimum distance; if the distance is lower than the minimum it has seen so far, it pushes them onto the priority queue.

Here's some examples (note: these are in the middle of a refactor of this PR to operate on multiple geoms and returns source index, tree index):

tree geoms created from bottom left to upper right:

>>> tree = STRtree(pygeos.points(np.arange(10), np.arange(10)))
>>> tree.nearest(pygeos.box(1, 1, 3, 3)).tolist()
[[0], [1]]   # point at bottom left of the box: <pygeos.Geometry POINT (1 1)>

tree geoms created from upper right to bottom left:

>>> tree = STRtree(pygeos.points(np.arange(10, -1, -1), np.arange(10, -1, -1)))
>>> tree.nearest(pygeos.box(1, 1, 3, 3)).tolist()
[[0], [9]]  # point at bottom left of the box: <pygeos.Geometry POINT (1 1)> 

However, this doesn't seem particularly consistent in more complex examples.
Example grid of points from 0,0 to 4,4 in columns:

>>> grid = np.mgrid[0:5, 0:5]
>>> geoms = pygeos.points(grid[0].flatten(), grid[1].flatten())
>>> tree = STRtree(geoms)
>>> tree.nearest(pygeos.box(1, 1, 3, 3)).tolist()
[[0], [7]]  # <pygeos.Geometry POINT (1 2)>

If we reverse the x direction so that these start at 4,4 and go to 0,0 in columns:

>>> grid = np.mgrid[0:5, 0:5]
>>> geoms = pygeos.points(grid[0].flatten()[::-1], grid[1].flatten())
>>> tree = STRtree(geoms)
>>> tree.nearest(pygeos.box(1, 1, 3, 3)).tolist()
[[0], [7]]  # <pygeos.Geometry POINT (3 2)>

These are also sensitive to the leaf size of the tree:

>>> grid = np.mgrid[0:5, 0:5]
>>> geoms = pygeos.points(grid[0].flatten(), grid[1].flatten())
>>> tree = STRtree(geoms, leafsize=2)
>>> tree.nearest(pygeos.box(1, 1, 3, 3)).tolist()
[[0], [6]]  # <pygeos.Geometry POINT (1 1)>

per your 1) We don't have any ability here to specify a minimum (or even maximum) distance, so we can't ensure that these will produce same indices as an intersection if distance is 0. I think what you are looking for a is a more robust kNN search, but that isn't available in GEOS yet.

per your 2) Once individual items are tested against the input geometry, they are evaluated according to the distance function we provide, which calculates the distance between the geometries not their envelopes.

@adriangb
Copy link
Copy Markdown

adriangb commented Mar 15, 2020

Hmm. 1) is a real deal breaker. To clarify, what I mean is that if you have intersecting geometries, the results from nearest and query(geom, op="intersects") should be identical:

>>> tree = STRtree([pygeos.bbox(-1, -1, 0, 0), pygeos. bbox(-2, -2, 0, 0)])  # overlapping bboxes
>>> p = pygeos.point(0, 0)
>>> tree.nearest(p).tolist() == tree.query(p, op="intersects").tolist()
# should be True

I think this behavior makes a lot more sense than returning an almost arbitrary index. If we can't fix that I wouldn't even bother implementing it, the behavior would be too confusing to use reliably. It would be really confusing if the user doesn't carefully examine the implementation or it can at least very succinctly explained in the docs.

@brendan-ward
Copy link
Copy Markdown
Contributor Author

@adriangb more testing or investigation into GEOS behavior certainly would be appreciated, I could be interpreting the source code or results above incorrectly.

If this is a bug in GEOS behavior, we should raise the issue over there.

Is the problem more around intersections (distance == 0) or is it also present for multiple tree geometries the same distance away, e.g., 3 geoms that are exactly 5 meters away?

For the intersection case, we could theoretically try to handle this internally. We could set a flag if our distance callback function detects that it calculated a distance of 0. Then in this case, we could query the tree for any intersecting geometries and return all of them.

We could theoretically do something similar for a fixed distance away, but it would be more involved. We would need to set the flag if distance function encounters the same minimum distance multiple times, then expand the envelope of the input geometry by that amount, and query the tree for those geometries, then calculate the distance to each again ourselves. Not feeling good about this option.

@brendan-ward
Copy link
Copy Markdown
Contributor Author

For what it's worth, shapely is also using the GEOS STRtree nearest neighbor function.

@adriangb
Copy link
Copy Markdown

adriangb commented Mar 15, 2020

This applies to any equidistant geometries.

I am only using intersections as a reference because I think it is very logical that the results for that specific case should be consistent query with op="intersects". Then by the transitive property, the same should apply to any number equidistant geometries at any distance.

In terms of how it is implemented elsewhere, libspatialindex and by extension Toblerity/rtree (which is what GeoPandas uses at the moment) implement it as I described above. So for 3 tree geometries exactly 5 meters away, all 3 would be returned in no particular order.*

* I would like to clarify that Toblerity/rtree operates only on bounding boxes, but that is beside the point.

I do think this should be fixed in GEOS and not monkey patched in this repo, but what about a naive solution (could be very inefficient): the nearest function always queries the tree for intersecting geometries for any geometries it finds and returns them all. In python it would look something like:

def nearest(self, geom):
    geometry = np.asarray(geometry)
    if geometry.ndim == 0:
        geometry = np.expand_dims(geometry, 0)

    matches = self._tree.nearest(geometry)
    # do some numpy stuff
    all_matches = self._tree.bulk_query(matches[:, 1], op='intersects')
    # some more numpy stuff to expand input indexes to new matches
    # and re-compute distances to filter out false positives
    # (just because there is an intersection doesn't mean the distance to the input
    # geometry is the same)
    return result

I'm not sure if bulk_query operates on bounds or geometries (I think it is the former) but regardless, this would result in some loss of accuracy. Maybe we can have STRTree.nearest check for the distance to the bounds instead of the actual geometries (might save a bit of computation) and then check all distances in result above, and discard those that aren't equal to the minimum distance? This could also open up the possibility of returning the actual distances along with the indexes (since they were computed anyway).
This all seems somewhat inefficient, it involves nested tree queries, but at least the results would be consistent and it shouldn't be too hard to implement.

@adriangb
Copy link
Copy Markdown

adriangb commented Mar 16, 2020

@brendan-ward I implemented a workaround in strtree.c on my branch (diff with master, difff with this branch)

The algo looks something like this:

1. Find a single nearest match using GEOSSTRtree_nearest_generic_r
2. Get min_distance
3. Buffer input geometry by min_distance
4. Query the tree for all tree geometries intersecting buffered input geometry
5. Check the distance for each one of these new matches and copy equidistant indexes to results.

I also modified the tests. Instead of manually checking each expected value, I wrote a brute force search that I am using as a source of truth.

This approach has pros and cons:
Cons: this is quite a bit of added complexity and overhead on top of GEOS.
Pros: it is still probably orders of magnitude faster than doing it in Python or not using a spatial index. From the geopandas perspective, it gives a ~10x improvement over rtree with a similar algorithm (rtree operates on bounding boxes, so all of the distances need to be calculated in Python).

@brendan-ward
Copy link
Copy Markdown
Contributor Author

@adriangb thanks for experimenting with that!

Note: some of the tests for nearest will fail without being updated; some were failing anyway because multiple equidistant hits were returning non-deterministically between different CI runs (proving your point above).

I've been looking into creating a test case in GEOS to demonstrate the issue. However, given their promise of API stability, I don't know that the existing nearest neighbor functions there can be adapted to address these issues. Even if nearest neighbors can be returned more deterministically, the API is limited to returning a single result. Likewise in JTS (which drives the design for GEOS), the API for nearestNeighbor returns a single result.

It seems like we are touching on some higher-order issues here, so it would be good to get some broader input.

@caspervdw @jorisvandenbossche As described above, the existing GEOS nearest neighbor function returns a single result, nondeterministically, in the case of multiple intersecting features with the input geometry or multiple equidistant features. A k nearest neighbor function has not yet been implemented in GEOS (but has been added to JTS and could be ported over), but theoretically could help mitigate this issue (provided that k > equidistant neighbors).

Questions:

  1. Should we add this as-is, and do our best to state clearly what users should expect in case of equidistant results?
  2. Should we try to overcome this issue on the pygeos side by returning all equidistant results instead of just one? (see above for some of the considerations here)
  3. The original approach here focused on returning indexes of a similar nature as STRtree::query_bulk; should we extend that further to return an array of distances for each item? Assume that we either keep the original distances as calculated by the tree callback function or that we have to calculate them for 2), rather than calculating them purely for returning distances.
  4. Should we hold off on adding a nearest neighbor function against the tree until a kNN implementation is available from GEOS?

@adriangb
Copy link
Copy Markdown

@brendan-ward what are your thoughts on continuing this? Now that geopandas has pygeos, this would be really useful. As per the comment above, I have a working version using the workaround of checking for intersections. Do you think that is a workable solution? If so, should I open a seperate PR or PR my branch into your existing branch that is tied with this PR?

@brendan-ward
Copy link
Copy Markdown
Contributor Author

@adriangb I wanted to pause until we got broader feedback. There is a very reasonable tension around how far to go beyond what GEOS natively provides (tradeoffs re: complexity, maintainability, etc), so I want to make sure we're collectively in sync.

As such, I've held off on offering comments on your implementation. I think there are some things we could optimize there, such as having the distance callback cache indexes and distances. But for now, my recommendation would be to hold off until we have a bit more consensus.

@adriangb
Copy link
Copy Markdown

Now that we are using pygeos's STRtree pretty extensively in geopandas, I think we should revive this discussion. Implementing it here would make it trivial to add an sjoin_nearest feature to geopandas.

@brendan-ward
Copy link
Copy Markdown
Contributor Author

I think the core question is how much we want to extend beyond the core functionality available in GEOS, which is pretty limited in this case.

Specifically:

  • returning > 1 result when results are equidistant
  • sorting of equidistant results regardless of number of results returned (apparently non-deterministic across GEOS versions and number of leaves setting in STRtree)

Another option would be to move forward with this more or less as-is, but noting equidistant results and sorting as a known issue to be solved later. In the case of 1 nearest neighbor out of multiple equidistant neighbors, which one to return seems to be undefined behavior. Is there any other library that handles that in a predictable way?

@adriangb
Copy link
Copy Markdown

adriangb commented May 13, 2020

Agreed. I think in an ideal world, pygeos would not extend core functionality at all. But in practice, the implementation in C wouldn't be too bad, and it would be a lot faster than implementing in geopandas/Python.

To address your points in particular:

  1. >1 equidistant results: this is pretty intensive processing that can easily be implemented here but would be hard to do in geopandas/python. I would do this here.
  2. sorting: this can easily be implemented in geopandas/python and it would be fast with numpy, so I wouldn't worry about it here.

I personally think that returning a single nearest result in a non-deterministic fashion is could lead to a lot of confusion. It is however better than nothing. If that is the route chosen, I would suggest renaming to sample_nearest or something that makes it clear what the expectations are.

@adriangb
Copy link
Copy Markdown

Pinging @jorisvandenbossche and @caspervdw for some input on this.

@carlosg-m
Copy link
Copy Markdown

carlosg-m commented Oct 7, 2020

This is the most interesting discussion I've seen in the last months.

I've tried implementing a nearest neighbor join in python using Toblerity/rtree but failed because their implementation uses bounding boxes and only returns as nearest neighbors geometries with MBRs that intersect, even if there is a closer geometry with a MBR that does not intersect.

There are two workarounds that consist in increasing K, or number of neighbors returned or expanding the bounding box.

The first does guarantee exact result or full recall, the second does not scale.

Shapely's STRtree has other properties that can be used to achieve this goal but unfortunately it's very slow.

Let me just ask a dumb question: is this an implementation problem or it's an algorithm problem?

Are there efficient algorithms to achieve exact nearest neighbor search between geometries (Points, Lines and Polygons) without any constraint (e.g. radius) in current literature?

What do the PostGIS and Esri people do to solve this problem?

What about geohashes and fancy indexes like S2 and H3, do they solve this problem?

Should we use the vertices of geometries with a kd-tree to get an approximation, and create bounding boxes based on closest vertex distance to query the rtree for intersections, testing the distances and to get an exact solution?

Can we modify a balltree with a custom metric to calculate distances between line segments like this?

from sklearn.neighbors import BallTree
from shapely.geometry import Point, LineString

keys = np.array([[ 1.,  2.,  3.,  4.],
                 [ 3.,  4.,  5.,  5.],
                 [ 5.,  5.,  9., 11.]])

def line_distance(a1, b1): 
    a1 = LineString([Point(a1[0],a1[1]),Point(a1[2],a1[3])])
    b1 = LineString([Point(b1[0],b1[1]),Point(b1[2],b1[3])])
    return a1.distance(b1)

tree = BallTree(keys, leaf_size=5, metric=line_distance)

@adriangb
Copy link
Copy Markdown

adriangb commented Oct 7, 2020

the second does not scale

Just curious, have you tried the algo in #111 (comment) with rtree?

Basically you first query with k=1, record that distance, then use that as your bounding box. You could then do a nearest radius search or you could do what I proposed there which is to buffer the input geometry by that radius and do an interesects search. The latter I'm pretty sure works and I think you could implement today in rtree within ~10 LOC.

@brendan-ward
Copy link
Copy Markdown
Contributor Author

Superseded by #272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants