Skip to content

ENH: include dt64/td64 isinstance checks in __init__.pxd#16266

Merged
mattip merged 9 commits intonumpy:masterfrom
jbrockmendel:cyapi
May 23, 2020
Merged

ENH: include dt64/td64 isinstance checks in __init__.pxd#16266
mattip merged 9 commits intonumpy:masterfrom
jbrockmendel:cyapi

Conversation

@jbrockmendel
Copy link
Contributor

There are a handful of checks related to datetime64/timedelta64 objects that pandas implements but would make more sense to have directly in numpy. This PR ports those checks over and puts them in the __init__.pxd file. If this PR is accepted, I'll make the same PR on the cython version of this file.

Suggestions on where tests for these belong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this can be nogil. It is defined as

#define PyObject_TypeCheck(ob, tp) \
    (Py_IS_TYPE(ob, tp) || PyType_IsSubtype(Py_TYPE(ob), (tp)))

Where PyType_IsSubtype is a function call

PyAPI_FUNC(int) PyType_IsSubtype(PyTypeObject *, PyTypeObject *);

and I think that function needs to hold the GIL.since it traverses the type object's mro. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I think that function needs to hold the GIL.since it traverses the type object's mro. Thoughts?

I can't speak to most of this, but pandas has used this verbatim for a couple years now without any trouble.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike IsSubclass, the IsSubtype code looks safe. But its not documented as such, so unless pandas uses it without the GIL, maybe its better to just remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so unless pandas uses it without the GIL, maybe its better to just remove it?

I just tried removing the "nogil" from these declarations in pandas (and for the 4 other isinstance-like functions not included in this PR) and didn't get any cythonize-time build errors, so it looks like we don't call it without the GIL. That said, ill do some %timeits to check if this change affects perf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can affect performance, since Cython does not release the GIL unless you ask it to AFAIK. In fact, if it did release the GIL here, that would be a performance bug, since releasing the GIL is unnecessary overhead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I looked at the CPython code, and at this time it looks absolutely safe to me. Maybe we can open a bug at python asking to document it as being safe (or the opposite). I am not sure how smart cython is, does it re-grab the GIL when calling a non-gil function from within a GIL context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I wonder if thats just a fluke in timing, but not sure. In any case, I don't have a huge problem with keeping it around, although nogil should as far as I know make no difference at all, except causing errors when the function is used in a with nogil: block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ill double-check with some more timing results.

Suggestions on how/where to implement tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ill double-check with some more timing results.

Looking at these again, the performance looks indistinguishable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. I would prefer without the nogil since this is a function call and not a macro, who knows what mischief the implementation could do when it scans the type's mro.

As for tests: that is a good question. We should be testing all of the content of this file. Perhaps a new test file numpy/core/tests/test_cython.py which would call cython similar to what numpy/random/tests/test_extending.test_cython does.

@mattip
Copy link
Member

mattip commented May 17, 2020

Thanks. I agree this code should not be maintained by pandas.

I am not a big fan of exposing numpy internals to downstream libraries. I understand this code already is in use in Pandas (likewise get_timedelta64). It seems these uses are

  • determine if v == NPY_NAT
  • expose the units of the datetime dtype

Additionally, there is code in pandas to read/write a npy_datetimestruct

I wonder if we already have appropriate and fast APIs for handling these cases that we could add to our __init__.pxd without requiring exposing internals

@jbrockmendel
Copy link
Contributor Author

expose the units of the datetime dtype

The main use case for this is to implement an analogue to astype("datetime64[ns]") that raises on overflows. https://github.com/pandas-dev/pandas/blob/1c4114d2f7bb97c0eb3f3010f3537a9050730d63/pandas/_libs/tslibs/conversion.pyx#L132

Additionally, there is code in pandas to read/write a npy_datetimestruct

pandas used to use a copy/pasted version that called it pandas_datetimestruct but I changed it to use the upstream version to de-duplicate. Reverting that would not be that troublesome (though I'd rather share more rather than less).

@jbrockmendel
Copy link
Contributor Author

One other related issue that might be worth addressing in this PR: a while back I determined that cython's numpy/__init__.pxd file had an incorrect specification of numpy.dtype (see cython/cython#2022, #10063 (comment)). As a result, pandas has an ugly override. Can we confirm that the version in numpy's __init__.pxd file is correct?

@mattip
Copy link
Member

mattip commented May 19, 2020

incorrect specification of numpy.dtype

Indeed. There is confusion there. It should be mapped with aliases (like for itemsize and "elsize")

python C
type typeobj
char type

@jbrockmendel
Copy link
Contributor Author

It should be mapped with aliases (like for itemsize and "elsize")

huh, looking at the pandas version, the main difference seems to be that we have "object fields" instead of "dict fields"

@jbrockmendel
Copy link
Contributor Author

updated to 1) remove nogil and 2) take a stab at tests. the tests do not work locally, which I hope is a sys.path issue because the dummy setup.py is finding the wrong np.get_include, in which case the CI should be OK.

Copy link
Member

@mattip mattip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting the testing started. I took the liberty of fixing a few things, there are still some failing tests


config.add_subpackage('tests')
config.add_data_dir('tests/data')
config.add_data_dir('tests/examples')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the missing piece to get tests going. I also checked that the new files are included in the sdist, we do not need to change MANIFEST.in

@jbrockmendel
Copy link
Contributor Author

@mattip thanks for helping out with the tests; this is a really non-standard test setup.

Looks like the remaining CI failure is an unrelated HTTP error

@mattip
Copy link
Member

mattip commented May 23, 2020

LGTM. I think we should put this in:

  • it provides a nice framework for adding more tests of the cython file in the future
  • it does not expose anything that is outside the public headers and API.

CI test failure is not related to this PR.

Thanks @jbrockmendel

@mattip mattip merged commit a597fe9 into numpy:master May 23, 2020
@mattip
Copy link
Member

mattip commented May 23, 2020

If the tests slow down CI we might want to mark them as @slow. We could also refactor the common parts of the new tests and the cython test in numpy/random/tests/test_extending.py, especially around checking cython version.

@jbrockmendel
Copy link
Contributor Author

awesome, thanks for walking me through the process

Comment on lines +1021 to +1048
cdef inline bint is_timedelta64_object(object obj):
"""
Cython equivalent of `isinstance(obj, np.timedelta64)`

Parameters
----------
obj : object

Returns
-------
bool
"""
return PyObject_TypeCheck(obj, &PyTimedeltaArrType_Type)


cdef inline bint is_datetime64_object(object obj):
"""
Cython equivalent of `isinstance(obj, np.datetime64)`

Parameters
----------
obj : object

Returns
-------
bool
"""
return PyObject_TypeCheck(obj, &PyDatetimeArrType_Type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #16363, which continues from this change.

Comment on lines +1068 to +1072
cdef inline NPY_DATETIMEUNIT get_datetime64_unit(object obj) nogil:
"""
returns the unit part of the dtype for a numpy datetime64 object.
"""
return <NPY_DATETIMEUNIT>(<PyDatetimeScalarObject*>obj).obmeta.base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is risky - I can't think of any way of using this function that doesn't produce dangerously incorrect results for np.datetime64("now", "20s") as input.

@charris charris mentioned this pull request Oct 10, 2020
20 tasks
@charris charris changed the title ENH: include dt64/td64 isinstance checks in __init__.pxd ENH: include dt64/td64 isinstance checks in __init__.pxd Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants