ENH: Define and register geometry dtype#34
ENH: Define and register geometry dtype#34jorisvandenbossche wants to merge 1 commit intopygeos:masterfrom
Conversation
|
Note that this is not yet complete, we probably need to implement some more of the Also cast functions could be added, if we want (eg from object dtype -> geometry dtype). For fitting the ufuncs, the main bottleneck is that we need to specify the geometry dtype as an arg type, and this type is only available after registering it. So will need to reorganize the code somewhat (eg create ufunc register function that takes the geometry dtype as input argument). |
78da3ec to
5db0c36
Compare
|
@mattip running your script gets me a segfault on: |
|
Matti told me about this project, I tried the same and also got either segfaults (or on debug asserts). The problem is that right now user types requiring reference counting is just not expected. I.e. at the very least when deleting the (original) array that owns the data, in the best case scenario all the objects contained are simply not cleaned up. If you can allocate a struct (without any pointers/references) to store all necessary data, you should be able to create the dtype fine right now (there may be some odd corners, such as byteswapping being assumed to exist). The one hack I could a bit imagine may currently work, is if you create something like a structured dtype with only a single field, which contains an object dtype (and you ensure it is of your type). We really need to support custom object dtypes, but I do not think users only need to override some of the dtype functions for them (and even than only for speed, such as comparing/sorting), since most things are defined as ufuncs. |
|
@seberg thanks a lot for the feedback! (although it's "bad" news ;)) Below some more remarks / questions (and sorry, it got a bit long). After sprinting on it with @pierreglaser yesterday, we were also coming somewhat to the idea that a possible conclusion could be "this is not possible" (while trying to figure out segfaults ..). For example, something that we were hitting: in numpy arrays, accessing elements gives either array scalars (for all dtypes except object), or either a plain python object (for object dtype). And the array scalars then can be converted in a plain python object with
Is the problem that in the numpy codebase there are often hardcoded checks for
The goal is to work with GEOSGeometry objects, provided by the GEOS C++ library, and all the functions from GEOS expect a pointer to a GEOSGeometry object, so we need this pointer (and GEOS also does not expose the internal memory of those GEOSGeometry objects to eg be able to store that directly). So this is not possible in this case.
How can you ensure the values in a field are of a given type? (eg
Do you mean that numpy could grow the ability of an object dtype with a fixed class (and so numpy would check that all elements are of the same class, eg on setitem, as we were trying to do here?) For me, releasing the GIL is still important (and having this dtype could have helped to reduce a little bit the overhead of checking the types in the ufuncs, but in the end not essentially related to the question of whether releasing the GIL is possible/safe). @seberg could you give your opinion on doing two passes over the data in the ufuncs as shown in #7 (comment) ? (first pass checking the data, second pass actual operation without the GIL). I suppose this is still potentially unsafe if someone sets a value in the array while the ufunc in running. Would it help making setting the writable flag to False before calling the ufunc? |
|
I doubt you can hack such a "safe" datatype, numpy will always at least allow you to work around it, I am sure (plus it will expose as a field, etc.), it was more of a thought. We can definitely grow the ability to have typed pyobject arrays, I think that should be one of the things higher up the list to do. Unfortunately, right now we have to put down quite a bit of ground work first. As to the dual pass approach. It can have poor performance in some cases, but should work fine (we could grow support for that, but that is a bit further off maybe). Although I think the current comment is missing the reference count increasing of the buffer, to ensure the objects do not get collected. |
My experiments related to https://github.com/caspervdw/pygeos/issues/11 (only look at second commit to see the changes related to the dtype, the first commit is from https://github.com/caspervdw/pygeos/pull/33)
Basically it is a custimized
objectdtype where in thesetitemfunction we check that what is set is an actual Geometry object.This works:
But it is not yet integrated with the ufunc machinery.