Added STRtree::nearest function#111
Added STRtree::nearest function#111brendan-ward wants to merge 10 commits intopygeos:masterfrom brendan-ward:tree_nearest
Conversation
|
A couple of questions:
|
|
@adriangb good catch - I had falsely concluded the ordering of results based on conflating the order of insertion into the tree with the ordering of the extents in the evaluation of nearest neighbors. I have a hard time following the implementation in GEOS returns only one even in the case of multiple matches. It is a little hard to follow the implementation, but it looks like it recursively tests the geometry extent against the extent of each node in the tree starting from the root. Based on which ones have the minimum distance between extents, it then picks either the largest or the one with leaf nodes and continues down the tree. At a leaf node, it then iterates over the child items and calculates their minimum distance; if the distance is lower than the minimum it has seen so far, it pushes them onto the priority queue. Here's some examples (note: these are in the middle of a refactor of this PR to operate on multiple geoms and returns source index, tree index): tree geoms created from bottom left to upper right: tree geoms created from upper right to bottom left: However, this doesn't seem particularly consistent in more complex examples. If we reverse the x direction so that these start at 4,4 and go to 0,0 in columns: These are also sensitive to the leaf size of the tree: per your 1) We don't have any ability here to specify a minimum (or even maximum) distance, so we can't ensure that these will produce same indices as an intersection if distance is 0. I think what you are looking for a is a more robust kNN search, but that isn't available in GEOS yet. per your 2) Once individual items are tested against the input geometry, they are evaluated according to the distance function we provide, which calculates the distance between the geometries not their envelopes. |
|
Hmm. 1) is a real deal breaker. To clarify, what I mean is that if you have intersecting geometries, the results from >>> tree = STRtree([pygeos.bbox(-1, -1, 0, 0), pygeos. bbox(-2, -2, 0, 0)]) # overlapping bboxes
>>> p = pygeos.point(0, 0)
>>> tree.nearest(p).tolist() == tree.query(p, op="intersects").tolist()
# should be TrueI think this behavior makes a lot more sense than returning an almost arbitrary index. If we can't fix that I wouldn't even bother implementing it, the behavior would be too confusing to use reliably. It would be really confusing if the user doesn't carefully examine the implementation or it can at least very succinctly explained in the docs. |
…ssues with query functions
|
@adriangb more testing or investigation into GEOS behavior certainly would be appreciated, I could be interpreting the source code or results above incorrectly. If this is a bug in GEOS behavior, we should raise the issue over there. Is the problem more around intersections (distance == 0) or is it also present for multiple tree geometries the same distance away, e.g., 3 geoms that are exactly 5 meters away? For the intersection case, we could theoretically try to handle this internally. We could set a flag if our distance callback function detects that it calculated a distance of 0. Then in this case, we could query the tree for any intersecting geometries and return all of them. We could theoretically do something similar for a fixed distance away, but it would be more involved. We would need to set the flag if distance function encounters the same minimum distance multiple times, then expand the envelope of the input geometry by that amount, and query the tree for those geometries, then calculate the distance to each again ourselves. Not feeling good about this option. |
|
For what it's worth, |
|
This applies to any equidistant geometries. I am only using intersections as a reference because I think it is very logical that the results for that specific case should be consistent In terms of how it is implemented elsewhere, * I would like to clarify that I do think this should be fixed in GEOS and not monkey patched in this repo, but what about a naive solution (could be very inefficient): the nearest function always queries the tree for intersecting geometries for any geometries it finds and returns them all. In python it would look something like: def nearest(self, geom):
geometry = np.asarray(geometry)
if geometry.ndim == 0:
geometry = np.expand_dims(geometry, 0)
matches = self._tree.nearest(geometry)
# do some numpy stuff
all_matches = self._tree.bulk_query(matches[:, 1], op='intersects')
# some more numpy stuff to expand input indexes to new matches
# and re-compute distances to filter out false positives
# (just because there is an intersection doesn't mean the distance to the input
# geometry is the same)
return resultI'm not sure if |
|
@brendan-ward I implemented a workaround in The algo looks something like this: I also modified the tests. Instead of manually checking each expected value, I wrote a brute force search that I am using as a source of truth. This approach has pros and cons: |
|
@adriangb thanks for experimenting with that! Note: some of the tests for I've been looking into creating a test case in GEOS to demonstrate the issue. However, given their promise of API stability, I don't know that the existing nearest neighbor functions there can be adapted to address these issues. Even if nearest neighbors can be returned more deterministically, the API is limited to returning a single result. Likewise in JTS (which drives the design for GEOS), the API for nearestNeighbor returns a single result. It seems like we are touching on some higher-order issues here, so it would be good to get some broader input. @caspervdw @jorisvandenbossche As described above, the existing GEOS nearest neighbor function returns a single result, nondeterministically, in the case of multiple intersecting features with the input geometry or multiple equidistant features. A k nearest neighbor function has not yet been implemented in GEOS (but has been added to JTS and could be ported over), but theoretically could help mitigate this issue (provided that k > equidistant neighbors). Questions:
|
|
@brendan-ward what are your thoughts on continuing this? Now that geopandas has pygeos, this would be really useful. As per the comment above, I have a working version using the workaround of checking for intersections. Do you think that is a workable solution? If so, should I open a seperate PR or PR my branch into your existing branch that is tied with this PR? |
|
@adriangb I wanted to pause until we got broader feedback. There is a very reasonable tension around how far to go beyond what GEOS natively provides (tradeoffs re: complexity, maintainability, etc), so I want to make sure we're collectively in sync. As such, I've held off on offering comments on your implementation. I think there are some things we could optimize there, such as having the distance callback cache indexes and distances. But for now, my recommendation would be to hold off until we have a bit more consensus. |
|
Now that we are using pygeos's STRtree pretty extensively in geopandas, I think we should revive this discussion. Implementing it here would make it trivial to add an |
|
I think the core question is how much we want to extend beyond the core functionality available in GEOS, which is pretty limited in this case. Specifically:
Another option would be to move forward with this more or less as-is, but noting equidistant results and sorting as a known issue to be solved later. In the case of 1 nearest neighbor out of multiple equidistant neighbors, which one to return seems to be undefined behavior. Is there any other library that handles that in a predictable way? |
|
Agreed. I think in an ideal world, pygeos would not extend core functionality at all. But in practice, the implementation in C wouldn't be too bad, and it would be a lot faster than implementing in geopandas/Python. To address your points in particular:
I personally think that returning a single nearest result in a non-deterministic fashion is could lead to a lot of confusion. It is however better than nothing. If that is the route chosen, I would suggest renaming to |
|
Pinging @jorisvandenbossche and @caspervdw for some input on this. |
|
This is the most interesting discussion I've seen in the last months. I've tried implementing a nearest neighbor join in python using Toblerity/rtree but failed because their implementation uses bounding boxes and only returns as nearest neighbors geometries with MBRs that intersect, even if there is a closer geometry with a MBR that does not intersect. There are two workarounds that consist in increasing K, or number of neighbors returned or expanding the bounding box. The first does guarantee exact result or full recall, the second does not scale. Shapely's STRtree has other properties that can be used to achieve this goal but unfortunately it's very slow. Let me just ask a dumb question: is this an implementation problem or it's an algorithm problem? Are there efficient algorithms to achieve exact nearest neighbor search between geometries (Points, Lines and Polygons) without any constraint (e.g. radius) in current literature? What do the PostGIS and Esri people do to solve this problem? What about geohashes and fancy indexes like S2 and H3, do they solve this problem? Should we use the vertices of geometries with a kd-tree to get an approximation, and create bounding boxes based on closest vertex distance to query the rtree for intersections, testing the distances and to get an exact solution? Can we modify a balltree with a custom metric to calculate distances between line segments like this? |
Just curious, have you tried the algo in #111 (comment) with rtree? Basically you first query with |
|
Superseded by #272 |
This is one of a few variants of nearest neighbor functionality based on the STRtree.
This version wraps the GEOS nearest neighbor functionality and returns the index of the nearest item in the tree.
Right now this is limited to a singular geometry operation. I'm considering changing that to a bulk geometry operation (given array of source geometries, return nearest neighbor index for each), which would likely be more useful.
Other variants that return multiple neighbors for each source geometry could be handled in a later PR.