Skip to content

ENH: Define and register geometry dtype#34

Closed
jorisvandenbossche wants to merge 1 commit intopygeos:masterfrom
jorisvandenbossche:register-dtype
Closed

ENH: Define and register geometry dtype#34
jorisvandenbossche wants to merge 1 commit intopygeos:masterfrom
jorisvandenbossche:register-dtype

Conversation

@jorisvandenbossche
Copy link
Copy Markdown
Member

My experiments related to https://github.com/caspervdw/pygeos/issues/11 (only look at second commit to see the changes related to the dtype, the first commit is from https://github.com/caspervdw/pygeos/pull/33)

Basically it is a custimized object dtype where in the setitem function we check that what is set is an actual Geometry object.

This works:

In [1]: import pygeos 

In [2]: a = np.array([pygeos.Geometry('POINT ( 1 1 )'), pygeos.Geometry('POINT ( 2 2 )')])

In [3]: a 
Out[3]:
array([<pygeos.Geometry POINT (1 1)>, <pygeos.Geometry POINT (2 2)>],
      dtype=GEOSGeometry)

In [4]: a[0] = 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-c8548dbc0a7e> in <module>
----> 1 a[0] = 1

TypeError: Unknown input to GEOMETRY_setitem

But it is not yet integrated with the ufunc machinery.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

Note that this is not yet complete, we probably need to implement some more of the PyArray_ArrFuncs. Eg currently np.empty still creates an array of Nones.

Also cast functions could be added, if we want (eg from object dtype -> geometry dtype).

For fitting the ufuncs, the main bottleneck is that we need to specify the geometry dtype as an arg type, and this type is only available after registering it. So will need to reorganize the code somewhat (eg create ufunc register function that takes the geometry dtype as input argument).

@pierreglaser
Copy link
Copy Markdown

pierreglaser commented Sep 6, 2019

@mattip running your script gets me a segfault on:
python3.6.9,
ubuntu18.04
geos 3.6.2 (apt)
numpy 1.17.1
pygeos installed with CFLAGS=-O0 pip install -e .

@seberg
Copy link
Copy Markdown

seberg commented Sep 6, 2019

Matti told me about this project, I tried the same and also got either segfaults (or on debug asserts). The problem is that right now user types requiring reference counting is just not expected. I.e. at the very least when deleting the (original) array that owns the data, in the best case scenario all the objects contained are simply not cleaned up. If you can allocate a struct (without any pointers/references) to store all necessary data, you should be able to create the dtype fine right now (there may be some odd corners, such as byteswapping being assumed to exist).

The one hack I could a bit imagine may currently work, is if you create something like a structured dtype with only a single field, which contains an object dtype (and you ensure it is of your type).

We really need to support custom object dtypes, but I do not think users only need to override some of the dtype functions for them (and even than only for speed, such as comparing/sorting), since most things are defined as ufuncs.

@jorisvandenbossche
Copy link
Copy Markdown
Member Author

jorisvandenbossche commented Sep 7, 2019

@seberg thanks a lot for the feedback! (although it's "bad" news ;)) Below some more remarks / questions (and sorry, it got a bit long).

After sprinting on it with @pierreglaser yesterday, we were also coming somewhat to the idea that a possible conclusion could be "this is not possible" (while trying to figure out segfaults ..).

For example, something that we were hitting: in numpy arrays, accessing elements gives either array scalars (for all dtypes except object), or either a plain python object (for object dtype). And the array scalars then can be converted in a plain python object with .item() (so something like "array -> getitem -> array scalar -> item -> plain python object" or "array -> getitem -> plain python object" for object).
But in our case, our python object was also registered as the array scalar, giving an unclear situation of "array -> getitem -> our Geometry array scalar / python object -> item -> itself??" (but currently going through getitem again and giving a segfault ..)

The problem is that right now user types requiring reference counting is just not expected.

Is the problem that in the numpy codebase there are often hardcoded checks for NPY_OBJECT instead of checking the flag of the descriptor (eg NPY_ITEM_IS_POINTER or NPY_ITEM_REFCOUNT) in cases for special object code paths? Or is there a more fundamental reason this is not possible? (not that such a less fundamental reason might not be difficult / a lot of work to solve)
(something I also thought of is that code paths using PyArray_IsScalar might also not expect python objects to be scalars)

If you can allocate a struct (without any pointers/references) to store all necessary data, you should be able to create the dtype fine right now (there may be some odd corners, such as byteswapping being assumed to exist).

The goal is to work with GEOSGeometry objects, provided by the GEOS C++ library, and all the functions from GEOS expect a pointer to a GEOSGeometry object, so we need this pointer (and GEOS also does not expose the internal memory of those GEOSGeometry objects to eg be able to store that directly). So this is not possible in this case.

The one hack I could a bit imagine may currently work, is if you create something like a structured dtype with only a single field, which contains an object dtype (and you ensure it is of your type).

How can you ensure the values in a field are of a given type? (eg np.dtype([('geom', pygeos.Geometry)]) gives dtype([('geom', 'O')]), so that field can contain any python object? Or is that possible on the C level?

We really need to support custom object dtypes, but I do not think users only need to override some of the dtype functions for them

Do you mean that numpy could grow the ability of an object dtype with a fixed class (and so numpy would check that all elements are of the same class, eg on setitem, as we were trying to do here?)
That would be very nice.


For me, releasing the GIL is still important (and having this dtype could have helped to reduce a little bit the overhead of checking the types in the ufuncs, but in the end not essentially related to the question of whether releasing the GIL is possible/safe). @seberg could you give your opinion on doing two passes over the data in the ufuncs as shown in #7 (comment) ? (first pass checking the data, second pass actual operation without the GIL). I suppose this is still potentially unsafe if someone sets a value in the array while the ufunc in running. Would it help making setting the writable flag to False before calling the ufunc?

@seberg
Copy link
Copy Markdown

seberg commented Sep 7, 2019

I doubt you can hack such a "safe" datatype, numpy will always at least allow you to work around it, I am sure (plus it will expose as a field, etc.), it was more of a thought.

We can definitely grow the ability to have typed pyobject arrays, I think that should be one of the things higher up the list to do. Unfortunately, right now we have to put down quite a bit of ground work first.

As to the dual pass approach. It can have poor performance in some cases, but should work fine (we could grow support for that, but that is a bit further off maybe). Although I think the current comment is missing the reference count increasing of the buffer, to ensure the objects do not get collected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants