API: update STRtree interface (merge shapely/pygeos features)#1251
API: update STRtree interface (merge shapely/pygeos features)#1251jorisvandenbossche merged 14 commits intoshapely:mainfrom
Conversation
|
About the (old pygeos) Then, if you provide a scalar, you get the same behaviour as the shapely |
|
Good point, as long as So we currently have: >>> points = shapely.points(range(5), range(5))
>>> tree = shapely.STRtree(points)
# current shapely method only works for a single geometry, returning a single index
>>> tree.nearest_item(shapely.points(1.5, 1.5))
1
# this is the pygeos.STRtree.nearest (which I called nearest_bulk here temporarily)
>>> tree.nearest_bulk(shapely.points(1.5, 1.5))
array([[0], # <-- this could also return the scalar "1" instead for scalar input
[1]])
>>> tree.nearest_bulk(shapely.points([(1.5, 1.5), (3.5, 3.5)]))
array([[0, 1], # <-- for array input, this can return the array [1, 3] as output
[1, 3]])
So currently for pygeos' nearest, those are skipped: >>> tree.nearest_bulk([shapely.points(1.5, 1.5), None, shapely.points(3.5, 3.5)])
array([[0, 2],
[1, 3]])which is a reason that we needed to return a 2D array with the indices of the query geoms as well. |
That makes sense. I think I’d opt for raising an exception for None input when returning indices. |
|
I did an attempt to do what we discussed above: let the old pygeos' For missing values as input, for now I am raising an error. But there is another case where we need to decide what to do: querying an empty tree. In pygeos, that returned an empty array of indices, but that's also no longer possible (the output needs to have the same shape as the input). Shapely currently returns None for that case. So one option is to keep that behaviour, another option is to raise an error? |
Pull Request Test Coverage Report for Build 1631872869Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
brendan-ward
left a comment
There was a problem hiding this comment.
Thanks for working on this @jorisvandenbossche !
I agree with raising an exception for None inputs.
I think raising an exception for any query against an empty tree also makes sense. If the tree is empty, there can be no useful results, and raising an error may better help the user identify that something went wrong when creating the tree if they expected it to be non-empty. In fact, if we caught this during the creation of the tree (since it is immutable), this may simplify later checks that the tree is non-empty since we'd never be able to construct an empty tree.
API-wise, it feels a bit awkward that query_items can return items from items created with the tree, whereas query_bulk only ever returns indices. I'm not sure of the use case, but it seems like query_bulk should also return items if those are provided, so that it is the vectorized equivalent of query_items. Either items should be managed by the tree (as here for query_items) or totally external to the tree and managed by the user, but not a mix of both... (so probably any of the singular or bulk functions that return indices should return items instead if present).
| indices = self._tree.nearest(np.atleast_1d(geometry_arr))[1] | ||
|
|
||
| if geometry_arr.ndim == 0: | ||
| return indices[0] |
There was a problem hiding this comment.
Will this fail if indices is []?
There was a problem hiding this comment.
A geometry always has some nearest geometry, so this should never be an empty array, except if the tree is empty. But so that gets us back to the non-inline discussion (and currently nearest_item will check for this and return None in that case, not getting to this line)
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
|
Thanks @brendan-ward!
I think I would also prefer raising an error, at least when doing the nearest query, instead of returning None.
Yeah, that seems indeed best to do to ensure consistency .. (although I also don't see any use case for this) |
Actually, that's not possible with the current API. |
Right, sorry, bad idea on my part. Thank you for exploring the idea though! Also scratch the idea about preventing construction of empty trees, for the reasons you provided. |
|
I had one proposal for “nearest” in the case of missing inputs or trees. Instead of raising directly, you could also return an index that is out of bounds (latest + 1) This happens also for instance when using “np.searchsorted”. It may be useful also for implementing the version of “nearest” that returns geometries. |
sgillies
left a comment
There was a problem hiding this comment.
I've got some comments inline, can't approve this yet.
shapely/strtree.py
Outdated
| geoms: Iterable[BaseGeometry], | ||
| items: Iterable[Any] = None, | ||
| node_capacity: int = 10, | ||
| leafsize: int = 10, |
There was a problem hiding this comment.
This breaks the 1.8 API unnecessarily.
shapely/strtree.py
Outdated
|
|
||
| The 'dwithin' predicate requires GEOS >= 3.10. | ||
|
|
||
| If geometry is None, an empty array is returned. |
There was a problem hiding this comment.
👍 returning empty result sequences is good in the way that Python slicing returns empty sequences. It designs errors out of existence.
There was a problem hiding this comment.
Yes, I think the same is true for an empty geometry (can add that)
shapely/strtree.py
Outdated
| Return the index of all geometries in the tree with extents that | ||
| intersect the envelope of the input geometry. |
There was a problem hiding this comment.
Strictly speaking, the original sentence was more correct. Shapely's r-tree is more general that pygeos was. We can store values that are not indexes of anything.
| Return the index of all geometries in the tree with extents that | |
| intersect the envelope of the input geometry. | |
| Return stored items from tree nodes that intersect the envelope of the input geometry, optionally filtering them by a spatial predicate involving the input geometry. |
There was a problem hiding this comment.
I will try to update the docstrings a bit based on your feedback, but I would prefer to keep some simplified version. For example, I am not sure that the "from tree nodes" adds value for this first sentence (that seems an implementation detail?), and rather makes it harder to understand. What matters to the user is that it returns the items that correspond to the geometries that intersect with the envelope, I think?
There was a problem hiding this comment.
Not sure if this helps, but tries to help better tease apart the distinction between index vs item / arbitrary value here:
Return items from the tree that intersect with the envelope of the input geometry. Items may be identified by an integer index (default) or arbitrary item values (optional) if those are provided when constructing the tree. Results can be optionally filtered by a spatial predicate...
shapely/strtree.py
Outdated
| If predicate is provided, a prepared version of the input geometry | ||
| is tested using the predicate function against each item whose | ||
| extent intersects the envelope of the input geometry: | ||
| predicate(geometry, tree_geometry). |
There was a problem hiding this comment.
I think that to help users who don't know the implementation we could be more clear about which thing is on the left hand side of the predicate and which is on the right side. The input geometry is on the left and the item to be tested is on the right, yes?
There was a problem hiding this comment.
The line you are commenting on (predicate(geometry, tree_geometry) basically shows that "geometry" is left and "tree geometry" is right? (but maybe can make it predicate(input_geometry, tree_geometry (i.e. added input_) to make it clearer)
Or that's not super obvious this way?
There was a problem hiding this comment.
Adding input_ prefix would help here, though it immediately follows (is implicit) from the prior statement.
| An array of items stored in the tree. | ||
|
|
||
| ndarray | ||
| An array of indexes (or stored items) of geometries in the tree |
There was a problem hiding this comment.
The previous version was more correct. The case of storing indexes is a specialization that will be obvious to users.
There was a problem hiding this comment.
Since indexes are the default and stored arbitrary values are optional, the phrasing in this PR seems more accurate.
| def query_bulk(self, geometry, predicate=None, distance=None): | ||
| """Returns all combinations of each input geometry and geometries in the tree | ||
| where the envelope of each input geometry intersects with the envelope of a | ||
| tree geometry. |
There was a problem hiding this comment.
I think query_combinations may be more descriptive than query_bulk?
There was a problem hiding this comment.
I don't have a strong opinion on the exact name (cc @brendan-ward who I think designed this initially)
The reason for the "bulk" is because it is basically a version of query that does the query for an array of geometries at once ("in bulk").
(in the end, the query method also returns the combinations but only with a single input geometry)
There was a problem hiding this comment.
I'm not at all attached to _bulk and never particularly liked the term; we came to that point to try to differentiate it from the _all functions in pygeos which behaved differently (though nearest_all is yet again different).
_combinations doesn't seem to add clarity, since both methods involve combinations (1 vs N tree geoms for query, M vs N geoms for query_bulk).
I think what we're getting at here is that this is the vectorized form of query (input is a vector instead of a scalar), but also that it isn't 1:1 vectorized to the inputs (where 1:1 vectorized returns same shape as input). So, query_vec or somesuch may work? Something to think about for some of the other functions that clearly need to differentiate themselves as only taking vector inputs (at least for the primary input param).
There was a problem hiding this comment.
Given that the differentiation is also that the one takes a single geometry as input, and the "bulk" version an array of geometries, another option for a name could be query_array.
Although, that also doesn't fully mach with the query_geoms and query_items terminology where the word after the _ is what is being returned, not what is passed as input.
There was a problem hiding this comment.
Sorry, I don't have strong opinions on the name for this. Names are hard!
| """Query the tree for the nearest geometry. | ||
|
|
||
| Returns the index (or stored item) of the nearest geometry in the tree | ||
| for each input geometry. |
There was a problem hiding this comment.
Once again, index is the special case. The previous docstring was better.
|
@jorisvandenbossche I left a bunch of comments inline. I feel like the first sentences in some docstrings regressed a bit and the predicate needs to be explained a little better. But generally, this is looking good. |
It's actually the items that is the special case, as by default (when only passing geometries to the STRtree contructor) it are indexes that get returned. |
@caspervdw I think for the "nearest" purpose, that might be annoying to work with afterwards. You could typically use the return value as an index to get the corresponding geometry (or attributes corresponding to that geometry), and then an out of bounds index will error (so you would also have to explicitly check for that). A difference with |
|
(gentle ping to get some responses to some of the comments/answers) |
|
Probably unrelated to your discussion but of interest to me. In pygeos tree = pygeos.STRtree(points_pygeos)
a, b = tree.query_bulk(poly_pygeos, predicate="contains") |
|
@mathause The query against the underlying tree always uses the envelope of the geometries. Only the optional predicate test involves the actual geometries. The new API here for the bulk case is the same as for PyGEOS. |
True, but getting an error or a |
|
@jorisvandenbossche I disagree with what you say in
I think the awkward situation we're in is only an accident. Let's not get hung up on it now, though. I'm happy with the PR and will suggest changes later. |
|
I updated this PR (addressed part of the feedback + fixed the conflicts). The docstring examples are not yet fully updated, but if we are not yet fully sure about the final API, I would prefer waiting with fixing those (all other docstrings from functions merged from pygeos still need to be updated as well anyway).
@caspervdw yes, I see your point, that could indeed be useful. My main worry is that this is a rather advanced feature and it would easily give silent errors for users not aware of it returning this index + 1 number (while an error is then the "safer", less surprising default). Although it will maybe most of the time give some loud error? (if you use it for indexing, both with numpy, python list, or pandas Series, you will get errors)
@sgillies That sounds good for this PR. Looking forward to hear your further thoughts on the API / wording. |
|
I am planning to merge this shortly (but anyway we can continue the discussion in follow-up issues where needed) |
|
I opened two follow-up issues for the items that were still being discussed above: |
This PR tries to consolidate the
STRtreeimplementations of Shapely and PyGEOS.For now, I did the following:
querytoquery_itemsto follow the API discussion / refactoring from #1064, #1112. Compared to the current shapely version, thisquery_itemsnow has the additionalpredicateanddistancekeywords.Rename pygeos'nearestmethod tonearest_bulk. I am not very of this new name, but so we have to discuss / decide how to otherwise resolve the inconsistency with shapely's currentnearestwhich only accepts a single geometry as input and returns a single index (instead of 2D array).Rename pygeos'
nearesttonearest_itemand changed it to return a 1D instead of 2D array (see discussion below) to have it be compatible with the shapely version. Compared to the current shapely version ofnearest_item, the new version is vectorized and can accept an array of geometries.query_bulkandnearest_allas is, as those are strictly new features not clashing with existing shapely methods.See also some of the discussion in pygeos/pygeos#398
Note for reviewing: I actually removed shapely's
strtree.pyand updated the pygeos version (strtree_pygeos.py) with new things from shapely in a first commit, and then renamed that tostrtree.pyin a second commit. So if you want to see the changes coming from pygeos, check the diff of only the first commit, while in the overall diff git shows the changes compared to shapely's existing strtree.py.I still need to merge the docstrings (and especially the example sections)