Skip to content

API: update STRtree interface (merge shapely/pygeos features)#1251

Merged
jorisvandenbossche merged 14 commits intoshapely:mainfrom
jorisvandenbossche:update-strtree
Apr 26, 2022
Merged

API: update STRtree interface (merge shapely/pygeos features)#1251
jorisvandenbossche merged 14 commits intoshapely:mainfrom
jorisvandenbossche:update-strtree

Conversation

@jorisvandenbossche
Copy link
Copy Markdown
Member

@jorisvandenbossche jorisvandenbossche commented Dec 5, 2021

This PR tries to consolidate the STRtree implementations of Shapely and PyGEOS.

For now, I did the following:

  • Rename pygeos' query to query_items to follow the API discussion / refactoring from #1064, #1112. Compared to the current shapely version, this query_items now has the additional predicate and distance keywords.
  • Rename pygeos' nearest method to nearest_bulk. I am not very of this new name, but so we have to discuss / decide how to otherwise resolve the inconsistency with shapely's current nearest which only accepts a single geometry as input and returns a single index (instead of 2D array).
    Rename pygeos' nearest to nearest_item and changed it to return a 1D instead of 2D array (see discussion below) to have it be compatible with the shapely version. Compared to the current shapely version of nearest_item, the new version is vectorized and can accept an array of geometries.
  • Added pygeos' query_bulk and nearest_all as is, as those are strictly new features not clashing with existing shapely methods.

See also some of the discussion in pygeos/pygeos#398


Note for reviewing: I actually removed shapely's strtree.py and updated the pygeos version (strtree_pygeos.py) with new things from shapely in a first commit, and then renamed that to strtree.py in a second commit. So if you want to see the changes coming from pygeos, check the diff of only the first commit, while in the overall diff git shows the changes compared to shapely's existing strtree.py.

I still need to merge the docstrings (and especially the example sections)

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+3.02%) to 84.151% when pulling 1586d9f on jorisvandenbossche:update-strtree into a0f8a31 on Toblerity:main.

@caspervdw
Copy link
Copy Markdown
Collaborator

About the (old pygeos) nearest method: I think that its output should be changed from an (n, 2) array to an array/scalar with the same shape as the input. The same as a ufunc actually.

Then, if you provide a scalar, you get the same behaviour as the shapely nearest.
One question is: what do we return for “None” input? -1?

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

Good point, as long as nearest only returns a single value per input geometry (which is the difference with nearest_all), we indeed don't need to return a 2D array. That makes for scalars similar as shapely's nearest_item, and then we can safely extend that function to also work on array input.

So we currently have:

>>> points = shapely.points(range(5), range(5))
>>> tree = shapely.STRtree(points)

# current shapely method only works for a single geometry, returning a single index
>>> tree.nearest_item(shapely.points(1.5, 1.5))
1  


# this is the pygeos.STRtree.nearest (which I called nearest_bulk here temporarily)
>>> tree.nearest_bulk(shapely.points(1.5, 1.5))
array([[0],    # <-- this could also return the scalar "1" instead for scalar input
       [1]])

>>> tree.nearest_bulk(shapely.points([(1.5, 1.5), (3.5, 3.5)]))
array([[0, 1],   # <-- for array input, this can return the array [1, 3] as output
       [1, 3]])

One question is: what do we return for “None” input? -1?

So currently for pygeos' nearest, those are skipped:

>>> tree.nearest_bulk([shapely.points(1.5, 1.5), None, shapely.points(3.5, 3.5)])
array([[0, 2],
       [1, 3]])

which is a reason that we needed to return a 2D array with the indices of the query geoms as well.
For integer output, I assume -1 is the only possible value to use. It is a bit annoying though, as you might typically use the result in an indexing context, and then -1 can actually be a valid value but meaning something differently. Another option could be to raise an error for missing values as input, although this also deviates from our general handling of missing values. But it might be the safer option here.

@caspervdw
Copy link
Copy Markdown
Collaborator

Another option could be to raise an error for missing values as input, although this also deviates from our general handling of missing values. But it might be the safer option here.

That makes sense. I think I’d opt for raising an exception for None input when returning indices.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

jorisvandenbossche commented Dec 27, 2021

I did an attempt to do what we discussed above: let the old pygeos' nearest return a single scalar / 1D array of indices instead of the 2D array (what I temporarily called nearest_bulk in this PR), so it can be combined into shapely's nearest_item.
(for now just did the changes on the python side; once we are happy with the API, we should probably move that to the C implementation)

For missing values as input, for now I am raising an error.

But there is another case where we need to decide what to do: querying an empty tree. In pygeos, that returned an empty array of indices, but that's also no longer possible (the output needs to have the same shape as the input). Shapely currently returns None for that case. So one option is to keep that behaviour, another option is to raise an error?

@coveralls
Copy link
Copy Markdown

coveralls commented Dec 27, 2021

Pull Request Test Coverage Report for Build 1631872869

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 71 of 72 (98.61%) changed or added relevant lines in 1 file are covered.
  • 27 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+3.1%) to 86.982%

Changes Missing Coverage Covered Lines Changed/Added Lines %
shapely/strtree.py 71 72 98.61%
Files with Coverage Reduction New Missed Lines %
shapely/strtree.py 1 98.36%
shapely/geometry/polygon.py 2 96.03%
shapely/geometry/base.py 24 88.93%
Totals Coverage Status
Change from base Build 1626379189: 3.1%
Covered Lines: 2265
Relevant Lines: 2604

💛 - Coveralls

Copy link
Copy Markdown
Collaborator

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @jorisvandenbossche !

I agree with raising an exception for None inputs.

I think raising an exception for any query against an empty tree also makes sense. If the tree is empty, there can be no useful results, and raising an error may better help the user identify that something went wrong when creating the tree if they expected it to be non-empty. In fact, if we caught this during the creation of the tree (since it is immutable), this may simplify later checks that the tree is non-empty since we'd never be able to construct an empty tree.

API-wise, it feels a bit awkward that query_items can return items from items created with the tree, whereas query_bulk only ever returns indices. I'm not sure of the use case, but it seems like query_bulk should also return items if those are provided, so that it is the vectorized equivalent of query_items. Either items should be managed by the tree (as here for query_items) or totally external to the tree and managed by the user, but not a mix of both... (so probably any of the singular or bulk functions that return indices should return items instead if present).

indices = self._tree.nearest(np.atleast_1d(geometry_arr))[1]

if geometry_arr.ndim == 0:
return indices[0]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this fail if indices is []?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A geometry always has some nearest geometry, so this should never be an empty array, except if the tree is empty. But so that gets us back to the non-inline discussion (and currently nearest_item will check for this and return None in that case, not getting to this line)

Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@jorisvandenbossche
Copy link
Copy Markdown
Member Author

jorisvandenbossche commented Dec 28, 2021

Thanks @brendan-ward!

I think raising an exception for any query against an empty tree also makes sense. ...
In fact, if we caught this during the creation of the tree (since it is immutable), this may simplify later checks that the tree is non-empty since we'd never be able to construct an empty tree.

I think I would also prefer raising an error, at least when doing the nearest query, instead of returning None.
But already raising an error upon creation of the tree would be empty, might also be annoying in some use cases. Consider for example geopandas sindex / sjoin case. If the strtree creation can error, geopandas would need to protect for this in .sindex and have it return something else (eg None) if it would error (or protect every access of .sindex). In case of sjoin, geopandas currently is happy with the empty result from query_bulk for an empty tree, as that will result in an empty dataframe as join result (note that an empty tree doesn't necessarily need to come from emtpy dataframes as input, it can be a dataframe with only missing or empty geometries).

API-wise, it feels a bit awkward that query_items can return items from items created with the tree, whereas query_bulk only ever returns indices. I'm not sure of the use case, but it seems like query_bulk should also return items if those are provided, so that it is the vectorized equivalent of query_items.

Yeah, that seems indeed best to do to ensure consistency .. (although I also don't see any use case for this)

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

API-wise, it feels a bit awkward that query_items can return items from items created with the tree, whereas query_bulk only ever returns indices. I'm not sure of the use case, but it seems like query_bulk should also return items if those are provided, so that it is the vectorized equivalent of query_items.

Yeah, that seems indeed best to do to ensure consistency .. (although I also don't see any use case for this)

Actually, that's not possible with the current API. query_bulk returns a 2D array, with indices both into the input geometries and into the tree geometries. The indices into the tree geometries can be replaced with the "items" stored in the tree. But for the input geometries, we only have actual positional indices that can be used.
And returning a mix seems even worse than only returning indices (and ignoring items)? Even if we would want to return a mix, it would need to be two separate arrays, as the items are not guaranteed to be integer.

@brendan-ward
Copy link
Copy Markdown
Collaborator

brendan-ward commented Dec 29, 2021

it would need to be two separate arrays, as the items are not guaranteed to be integer

Right, sorry, bad idea on my part. Thank you for exploring the idea though!

Also scratch the idea about preventing construction of empty trees, for the reasons you provided.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

Any more comments on this? (cc @sgillies @mwtoews)
Otherwise I will fix up the docstring examples, and would like to get this merged.

@caspervdw
Copy link
Copy Markdown
Collaborator

caspervdw commented Jan 26, 2022

I had one proposal for “nearest” in the case of missing inputs or trees. Instead of raising directly, you could also return an index that is out of bounds (latest + 1)

This happens also for instance when using “np.searchsorted”.

It may be useful also for implementing the version of “nearest” that returns geometries.

Copy link
Copy Markdown
Contributor

@sgillies sgillies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some comments inline, can't approve this yet.

geoms: Iterable[BaseGeometry],
items: Iterable[Any] = None,
node_capacity: int = 10,
leafsize: int = 10,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the 1.8 API unnecessarily.


The 'dwithin' predicate requires GEOS >= 3.10.

If geometry is None, an empty array is returned.
Copy link
Copy Markdown
Contributor

@sgillies sgillies Jan 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 returning empty result sequences is good in the way that Python slicing returns empty sequences. It designs errors out of existence.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the same is true for an empty geometry (can add that)

Comment on lines +136 to +137
Return the index of all geometries in the tree with extents that
intersect the envelope of the input geometry.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the original sentence was more correct. Shapely's r-tree is more general that pygeos was. We can store values that are not indexes of anything.

Suggested change
Return the index of all geometries in the tree with extents that
intersect the envelope of the input geometry.
Return stored items from tree nodes that intersect the envelope of the input geometry, optionally filtering them by a spatial predicate involving the input geometry.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to update the docstrings a bit based on your feedback, but I would prefer to keep some simplified version. For example, I am not sure that the "from tree nodes" adds value for this first sentence (that seems an implementation detail?), and rather makes it harder to understand. What matters to the user is that it returns the items that correspond to the geometries that intersect with the envelope, I think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this helps, but tries to help better tease apart the distinction between index vs item / arbitrary value here:

Return items from the tree that intersect with the envelope of the input geometry. Items may be identified by an integer index (default) or arbitrary item values (optional) if those are provided when constructing the tree. Results can be optionally filtered by a spatial predicate...

If predicate is provided, a prepared version of the input geometry
is tested using the predicate function against each item whose
extent intersects the envelope of the input geometry:
predicate(geometry, tree_geometry).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that to help users who don't know the implementation we could be more clear about which thing is on the left hand side of the predicate and which is on the right side. The input geometry is on the left and the item to be tested is on the right, yes?

Copy link
Copy Markdown
Member Author

@jorisvandenbossche jorisvandenbossche Jan 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line you are commenting on (predicate(geometry, tree_geometry) basically shows that "geometry" is left and "tree geometry" is right? (but maybe can make it predicate(input_geometry, tree_geometry (i.e. added input_) to make it clearer)
Or that's not super obvious this way?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding input_ prefix would help here, though it immediately follows (is implicit) from the prior statement.

An array of items stored in the tree.

ndarray
An array of indexes (or stored items) of geometries in the tree
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version was more correct. The case of storing indexes is a specialization that will be obvious to users.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since indexes are the default and stored arbitrary values are optional, the phrasing in this PR seems more accurate.

def query_bulk(self, geometry, predicate=None, distance=None):
"""Returns all combinations of each input geometry and geometries in the tree
where the envelope of each input geometry intersects with the envelope of a
tree geometry.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think query_combinations may be more descriptive than query_bulk?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on the exact name (cc @brendan-ward who I think designed this initially)

The reason for the "bulk" is because it is basically a version of query that does the query for an array of geometries at once ("in bulk").
(in the end, the query method also returns the combinations but only with a single input geometry)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not at all attached to _bulk and never particularly liked the term; we came to that point to try to differentiate it from the _all functions in pygeos which behaved differently (though nearest_all is yet again different).

_combinations doesn't seem to add clarity, since both methods involve combinations (1 vs N tree geoms for query, M vs N geoms for query_bulk).

I think what we're getting at here is that this is the vectorized form of query (input is a vector instead of a scalar), but also that it isn't 1:1 vectorized to the inputs (where 1:1 vectorized returns same shape as input). So, query_vec or somesuch may work? Something to think about for some of the other functions that clearly need to differentiate themselves as only taking vector inputs (at least for the primary input param).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the differentiation is also that the one takes a single geometry as input, and the "bulk" version an array of geometries, another option for a name could be query_array.

Although, that also doesn't fully mach with the query_geoms and query_items terminology where the word after the _ is what is being returned, not what is passed as input.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't have strong opinions on the name for this. Names are hard!

"""Query the tree for the nearest geometry.

Returns the index (or stored item) of the nearest geometry in the tree
for each input geometry.
Copy link
Copy Markdown
Contributor

@sgillies sgillies Jan 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, index is the special case. The previous docstring was better.

@sgillies
Copy link
Copy Markdown
Contributor

sgillies commented Jan 30, 2022

@jorisvandenbossche I left a bunch of comments inline. I feel like the first sentences in some docstrings regressed a bit and the predicate needs to be explained a little better. But generally, this is looking good.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

The case of storing indexes is a specialization

It's actually the items that is the special case, as by default (when only passing geometries to the STRtree contructor) it are indexes that get returned.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I had one proposal for “nearest” in the case of missing inputs or trees. Instead of raising directly, you could also return an index that is out of bounds (latest + 1)

This happens also for instance when using “np.searchsorted”.

@caspervdw I think for the "nearest" purpose, that might be annoying to work with afterwards. You could typically use the return value as an index to get the corresponding geometry (or attributes corresponding to that geometry), and then an out of bounds index will error (so you would also have to explicitly check for that).

A difference with np.searchsorted is that this functions gives an index where the element would need to be inserted to keep order, so to expand the original array. So if you search for an element that would be added at the end (or missing / np.nan would also get added at the end), you get an "out of bounds" +1 index. But here that's expected, since the original array will grow (the return value of np.searchsorted is not expected to be usable as scalar index into the original array).

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

(gentle ping to get some responses to some of the comments/answers)

@mathause
Copy link
Copy Markdown
Contributor

Probably unrelated to your discussion but of interest to me. In pygeos query_bulk works on the polygons, while in shapely it worked on the envelope. What is the plan for the new API?

tree = pygeos.STRtree(points_pygeos)
a, b = tree.query_bulk(poly_pygeos, predicate="contains")

@brendan-ward
Copy link
Copy Markdown
Collaborator

@mathause The query against the underlying tree always uses the envelope of the geometries. Only the optional predicate test involves the actual geometries. The new API here for the bulk case is the same as for PyGEOS.

@caspervdw
Copy link
Copy Markdown
Collaborator

I had one proposal for “nearest” in the case of missing inputs or trees. Instead of raising directly, you could also return an index that is out of bounds (latest + 1)
This happens also for instance when using “np.searchsorted”.

@caspervdw I think for the "nearest" purpose, that might be annoying to work with afterwards. You could typically use the return value as an index to get the corresponding geometry (or attributes corresponding to that geometry), and then an out of bounds index will error (so you would also have to explicitly check for that).

True, but getting an error or a None (unexpected return type) would also be annoying. Returning index + 1 does give the additional option to substitute those values with None / NaN when transforming indices to geometries. In that way, None input may result in None output when querying the tree for nearest geometries.

@sgillies
Copy link
Copy Markdown
Contributor

@jorisvandenbossche I disagree with what you say in

It's actually the items that is the special case, as by default (when only passing geometries to the STRtree contructor) it are indexes that get returned.

I think the awkward situation we're in is only an accident. Let's not get hung up on it now, though. I'm happy with the PR and will suggest changes later.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I updated this PR (addressed part of the feedback + fixed the conflicts).

The docstring examples are not yet fully updated, but if we are not yet fully sure about the final API, I would prefer waiting with fixing those (all other docstrings from functions merged from pygeos still need to be updated as well anyway).

True, but getting an error or a None (unexpected return type) would also be annoying. Returning index + 1 does give the additional option to substitute those values with None / NaN when transforming indices to geometries. In that way, None input may result in None output when querying the tree for nearest geometries.

@caspervdw yes, I see your point, that could indeed be useful. My main worry is that this is a rather advanced feature and it would easily give silent errors for users not aware of it returning this index + 1 number (while an error is then the "safer", less surprising default). Although it will maybe most of the time give some loud error? (if you use it for indexing, both with numpy, python list, or pandas Series, you will get errors)

I think the awkward situation we're in is only an accident. Let's not get hung up on it now, though. I'm happy with the PR and will suggest changes later.

@sgillies That sounds good for this PR. Looking forward to hear your further thoughts on the API / wording.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I am planning to merge this shortly (but anyway we can continue the discussion in follow-up issues where needed)

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

I opened two follow-up issues for the items that were still being discussed above:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants