Skip to content

ENH: Prepare umath module for the C++#19753

Open
seiko2plus wants to merge 9 commits intonumpy:mainfrom
seiko2plus:cpp_and_umath
Open

ENH: Prepare umath module for the C++#19753
seiko2plus wants to merge 9 commits intonumpy:mainfrom
seiko2plus:cpp_and_umath

Conversation

@seiko2plus
Copy link
Member

@seiko2plus seiko2plus commented Aug 25, 2021

merge after #23298

This work aims to greatly reduce reliance on the C API, in order to facilitate the complete elimination of C code in the future and count on modern C++.

This request includes multiple patches, the main purpose of which is to impose a framework that facilitates the creation of universal functions using the C++ language without the need of use python code as generator.

Two examples have been added to show how to use it. You can explore them by reviewing ufunc_operations.cpp

TODO:

  • wrapping PyObject and PyArray Adding Array to manage the array data only
  • tweak the C++ wrapper for universal functions(add Doxygen comment blocks, add static assertation)
    will be included by ENH, SIMD: Add C++ wrapper for universal intrinsics #21057
  • add C++ types for float16, datetime, timedelta for type safety and overloading.
  • add testing unit for the C++ wrapper
  • add more examples

@seiko2plus seiko2plus force-pushed the cpp_and_umath branch 8 times, most recently from b314ac3 to 3edb342 Compare August 25, 2021 21:44
@rgommers rgommers added the C++ Related to introduction and use of C++ in the NumPy code base label Oct 27, 2021
@seiko2plus seiko2plus marked this pull request as draft November 10, 2021 18:46
@seiko2plus
Copy link
Member Author

seiko2plus commented Nov 10, 2021

There's a necessity to provide a light C++ wrapper for PyObject and PyArray, similar to Boost.Python and pybind11 but without supporting C++ exceptions to fit our approach(google style). I'm going to push the local changes once I get done with these wrappers.

The work has been scaled down and focused only on universal functions, so please disregard this comment.

@seiko2plus seiko2plus marked this pull request as ready for review May 2, 2023 18:58
@seiko2plus
Copy link
Member Author

@mattip, @seberg, I'm done here and its ready for reviewing.

@seberg
Copy link
Member

seberg commented May 2, 2023

I am not sure I will make the meeting tomorrow but maybe you can organize a meeting together with @mattip some time at the end of the week if not? I need to let this sink, on first sight it looks like quite a bit new things compared to last time I had a brief look (introducing Slice and unwrapping the old API into your new one, and especially since we currently have yet another layer since that old API is not sustainable forever, that seems all not ideal to me TBH).

I guess you want that C++ style for loop. Not sure if I think it is worth a lot (but I am not big into C++), but maybe 1-3 lines of boilerplate, rather than another indirection and complicated wrapping layer would do the trick as well?

Anyway, have to look at it closer and discuss a bit and then think a bit more before I can make a deeper comments.

)...
);
std::apply(reinterpret_cast<TFunc>(data), params);
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think my main worry is this part. We already have two loop versions:

  • The legacy that is unwrapped here.
  • A new loop signature, which is largely the same (but adds more info and an error return). This is not accessible through PyUFunc_FromFuncAndDataAndSignatureAndIdentity, but we should aim to replace it sooner rather than later internally.
    • This signature is very much open to change, but I would like to be sure that updating existing code is easy eventually.
    • We don't need a "legacy" loop, we only have to support it when downstream gives us one.
    • This new-loop can be specialized for specific strides (in most ufunc calls) for speed, so that selection of SIMD can happen outside of the inner-loop itself.

Since you use the static void *data, the compiler cannot possibly optimize this away either. The actual (new) ufunc signature currently doesn't even include static data (but we can add that back in some form, it is likely useful also for generic heavy functions, where we currently just pass a function pointer also).

Note that one thing I would like mid-term, is to move the current if (stride[0] == 0), etc. branches out of the inner-loop.
The machinery exists, but we have to start using it. This refactor should help with that since it should make our ufunc generation just a bit less of a mess hopefully, and to get there we should update most/all ufunc signatures to the new-style and adapt the ufunc generation a fair bit.


I guess you want C++ iterators and I expect they are really helpful and maybe necessary for later SIMD machinery improvements since those make iteration more painful.

But I do need to ask what the alternatives are? Would it be really be terrible to build an iterator (or multiple iterators) inside the inner-loop with a one-line per-argument boilerplate? That way you dog-feed the C-signature explicitly but still iterate in whatever way we like. Also, I do think it would make it much clearer to have both code styles side-by-side without getting super confused. And we will always have both side-by-side, because the C-signature is still our public API (unless we have the guts to change that, which may be a path, but probably a more difficult and slower one than I would like right now).


One more question I have but not for Sayed explicitly. I suspected we would aim to just remove the Python generation completely, even if we end up with a rather long C++ file (that could be generated via an adapted old file once for reference).

But to me, that would be another advantage of starting of with not modifying the signature at all, because then we can do the two steps separately. (Yes, this requires a bit of double work, but on the plus side you can do it in chunks.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new loop signature, which is largely the same (but adds more info and an error return).

PyArrayMethod_Context, and int rather then void as return, well not sure how useful PyArrayMethod_Context is but it can be supported through C++ also new C++ signature can support either bool or int for handling errors even if I think that testing PyErr_Occurred() is quite enough or maybe we can enable C++ exception in some stages.

This is not accessible through PyUFunc_FromFuncAndDataAndSignatureAndIdentity, but we should aim to replace it sooner rather than later internally.

If we are moving to C++ then what the point of continue enhancing C interface? Building a new interface in C++ to count more on static optimization should significantly speed up the internal calls.

This signature is very much open to change, but I would like to be sure that updating existing code is easy eventually.

The current approach provides the best start for the highest layer but it costs extra later for sure we can enhance it later at least by cutting off the legacy layer.

This new-loop can be specialized for specific strides (in most ufunc calls) for speed, so that selection of SIMD can happen outside of the inner-loop itself.

I'm not sure how it's going to be useful for SIMD kernels, we usually support both contiguous and non-contiguous memory access as long as there's no memory overlapping. Specialize within the inner loop is much easier and can save binary size too, For example, we can easily implement a C++ meta class or function that takes two operators one for scalar operation and another for SIMD intrinsic then it's generating all routined branches including SIMD vector memory access, unrolling, reduction and scalar fallback.

Since you use the static void *data, the compiler cannot possibly optimize this away either.

I don't think it's necessary to support it for C++, there're better patterns.

Note that one thing I would like mid-term, is to move the current if (stride[0] == 0), etc. branches out of the inner-loop.

If counting on meta functions to generate these branches wouldn't be robust then we can discuss moving to specialize these branches through UFunc API, after this pr gets merged I will provide real examples in this matter.

I guess you want C++ iterators and I expect they are really helpful and maybe necessary for later SIMD machinery improvements since those make iteration more painful.

It is also important for type safety.

But I do need to ask what the alternatives are?
That way you dog-feed the C-signature explicitly but still iterate in whatever way we like.

Avoid using C++ functions pointers and wraps the C UFunc API through meta programming, this solution is zero-cost overhead but unfortunately not that flexible and increases the complexity of the code especially when exporting internal symbols for CPU runtime dispatching.

And we will always have both side-by-side, because the C-signature is still our public API (unless we have the guts to change that, which may be a path, but probably a more difficult and slower one than I would like right now).

I agree but nothing is perfect, maybe provides inline functions that can serve both directions when it's needs until C++ dominates.

suspected we would aim to just remove the Python generation

Yes, but we also have to get rid of C template sources too. all at once seems more convenient to me. The new signature works only internally and can be improved later at any time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made changes to the code, to remove the runtime overhead and pass the inner loops as objects rather than function pointers (fully meta) please take a look at commit 1969494 to realize the differences:

However, once we deal with runtime dispatch or any exported symbols the overhead will come back again. I still prefer the previous method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another suggestion to C++ signature is to change the output indicator from reference to non-const:

-(Slice<T> in0, Slice<T> in1, Slice<T> &out)
+(const Slice<T> in0, const Slice<T> in1, Slice<T> out) 

The number of inputs and outputs are deduced from the signature rather than specifying it manually in case we decided later to support non-static signatures similar to function overloading in C++.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't have time to look at this for a few weeks. But since Marten commented: Yes, we should revive this.

FWIW, I am still a bit unsure about the right choice of inner-loop. Being more used to C++ these days helps, though.

My main question I guess remains a bit: I think you wrote this under the assumption to make the full thing C++. I lack a bit of clarity that this is a good choice, since we must keep a first class public C-API.

I suppose I can rephrase this into: Please assume that you will never be able to remove the step to translate the current C call into a C++ internal call. Would you reconsider this API choice? If not, then probably fine. But if yes, what would it look like?
(Even if it is just something creating a MultiIter<...>, whether more complicated or not to do the C/C++ layer conversion.)

(I would have been very happy, and still would be happy, to tweak the C signature. But moving where the C/C++ interface remains a big unclear thing to me that I don't know if it could or should be done. If others are more clear on it, that would be fine...)

using Bool = std::conditional_t<sizeof(bool) == sizeof(unsigned char),
bool, Holder<unsigned char, kBool>>;
/// single-precision floating-point type, compatible with IEEE-754.
using Float = float;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Holder used for long but not float?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, only the following data types are required to be distinguished:

using UByte = Holder<unsigned char, kUByte>;

using Bool = std::conditional_t<sizeof(bool) == sizeof(unsigned char),
    bool, Holder<unsigned char, kBool>>;

using LongDouble = std::conditional_t<kLongDoubleIsDouble,
      Holder<double, kLongDouble>, long double>;

using CLongDouble = std::conditional_t<kLongDoubleIsDouble,
      Holder<complex<double>, kCLongDouble>, complex<long double>>;

To forbid the use of built-in datatypes directly, using long instead of Long should break the build when kTypeID is involved:

https://github.com/numpy/numpy/blob/2beb25fa4e90923e7b260408bf45b81fe238cf44/numpy/core/src/common/npstd.hpp#L174C4-L176

Allowing raw integer built-in data types isn't type-safe when it comes to determining the data type IDs (template specialization), since they are commonly used as alias almost for all the generic std alais including fixed width data types.

* This type will be set to `void` when `npy_longdouble` is not defined
* as `long double`.
* Note: this class doesn't handel reference counting
* via RAII or raise any kind of exceptions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It seems a bit of a trap at the moment, because Object(1); leaks a reference, but Object(some_obj) doesn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting RAII should clean up the code, I'm going to support it. My reason at the moment is that we disable C++ exceptions by default which means there's no support for STD runtime including memory operators e.g. new/delete.

@mattip
Copy link
Member

mattip commented May 18, 2023

This work aims to greatly reduce reliance on the C API, in order to facilitate the complete elimination of C code in the future and count on modern C++.

Is this a goal we have agreed upon and appears in a NEP somewhere? I think some of the design principles of NumPy rely on C and the Python C-API, but we should have a discussion around this. We need to maintain much of the C-API, which parts are we going to remove?

@seiko2plus
Copy link
Member Author

Is this a goal we have agreed upon and appears in a NEP somewhere?

No, but this is a defacto once you count on C++ you will keep reducing the reliance on C code, over time C API will become just a wrapper for the C++ code.

We need to maintain much of the C-API

what needs is just avoid breaking the compatibility.

which parts are we going to remove?

In general, any new code should be written in C++ and should run semi-standalone. The rule is simple, just avoid exporting C symbols from C++ code as far as we can.

We need to maintain much of the C-API, which parts are we going to remove?

The current step is to reimplement the current inner loops of umath in fully C++ without exporting any symbols to C.

And the step after (not necessary in the short term) would be re-implement the current UFunc API in C++ and provides a wrapper for C without breaking computability. The purpose mainly provides static optimizations, but the C API wouldn't get benefits from this change only from the C++ side.

@Mousius Mousius self-assigned this Feb 12, 2024
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through all the commits mostly out of curiosity, and because with a first taste of C++ in the FFT routines, I thought I might as well try to see how things could be done better. As a results, my comments are probably mostly rather useless (and most are about spelling of comments anyway).

Anyway, mostly just wanted to comment that this looks really nice!

But maybe my general impression is not: overall, this looked really nice! I think I would have understood what was happening here far faster than the time it took to understand the .c.src files. And of course nice not to rely on pre-processing. It also seemed the real tricky parts would be nicely hidden from someone just trying to code up a new ufunc.

I guess the main surprise was to see the inner loop functions replaced by a template with a operator(). Is that just to show how that would look? Is there a big advantage to one or the other? FWIW, to me, the indirection made things less clear, though I think in the end all that is really required is a nice worked example of how to do things.

* }
* }
*
* template<typename T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but then I changed it to meta objects rather than function pointers due to @seberg concerns, I kept the doc as-is since we still haven't decided to favor one another.

Function pointers:

template<typename T>
void Add(np::Slice<T> in0, np::Slice<T> in1, np::Slice<T> &out)
{...}

UFuncObject add_obj(
    "add", "this doc of add function",
    Add<Float>, Add<Double>
);

Meta objects:

template <typename T>
struct AddOP {
    void operator ()(Slice<T> in0, Slice<T> in1, Slice<T> &out)
    {...}
};
UFuncObject<AddOP<Float>, AddOP<Double>> add_obj(
    "add", "this doc of add function"
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that lambda can be supported too and also both styles above at the same time, however for simplicity of code I left only one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would seem best to pick just one. I tried to read the discussion with Stefan more on top, but it left me unclear what the advantage of function pointer vs meta object might be. Is it that one could quite easily use meta objects to wrap old-style ufunc loops?

Anyway, I think both styles will work, and I really like how either allows one to focus on what the loop does rather than on how the iteration works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point is the meta object should be able to optimize things away (so that the pointer arrays don't need to be copied around on every call). But I am not sure if I understood it right in the call.
This was indeed my main concern: That we for-ever keep translating different function signature styles. If this is all optimized away, then that doesn't matter much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the advantage of function pointer vs meta object might be

Passing the function pointer as Sebastian explained will add an extra call for each time the inner loop is executed since the C++ function pointer will be stored on C ufunc data arbitrarily while the current approach (Meta-object function) is statically wrapping the C legacy function which allows the compiler to optimize it out.

Choosing function pointers provides more flexibility when it comes to sharing functions among different translation units which I would personally prefer also it improves the readability a bit, later on, we can cut extra calls by letting C++ take the lead over C code and initialize its own PyType_Slot.

Anyway, I'm still happy with meta-objects if this going to resolve this discussion and we can later discuss moving to C++ function pointers once we get rid of template sources and the umath python generator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, both, that really helps!

constexpr size_t idx = Index_<TFunc>(
std::make_index_sequence<kNumOverloads>{}
);
printf("%d\n", idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug statement left in by mistake?

);
static_assert(
((ufunc_helper::FuncDeduce<TFuncOverloads>::kNumOut == kNumOut_) && ...),
"Overloaded functions must have the same number of output,"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output -> outputs

);
static_assert(
(ufunc_helper::FuncDeduce<TFuncOverloads>::kIsSlice && ...),
"Only np::Slice<T> is supported as parmater type at current moment"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at current moment -> at the moment (or at present)

*/
class Object {
public:
/// Constract from PyObject.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Construct (also below).

PyDict_SetItemString(mdict, "_ones_like", ones_like_obj);
Py_DECREF(ones_like_obj);

static UFuncObject copysign_obj(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really does prove the point how nice this is!

if (InitOperators(d) < 0) {
return -1;
}
if (InitOperations(d) < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to InitOperatorsCpp or so - that will help people looking at this understand why there are two similarly named functions (and find the relevant file where they are defined more easily).

((std::is_function_v<std::remove_pointer_t<TFuncOverloads>>) && ...),
"Expected function pointers"
);
// static_assert(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out because not sure what the best way is?

}
else {
a = 1;
struct OnesLikeOP {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit unclear, at least just following the commits, why this indirection was made. I liked the first iteration better in that one is just writing an inner loop function, which seems more natural. But I guess this where documentation would come in...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also curious 😸 a meta object of templated functions doesn't seem functionally different from the struct variant, which probably means I'm missing something obvious.

@mhvk
Copy link
Contributor

mhvk commented Feb 14, 2024

p.s. Over in the FFT routines, we had to deal with exceptions because they could be raised in pocketfft and having seen that, it did make me wonder whether it isn't better and just do that, rather than stick to how things are handled in C. But almost certainly much better not gotten into for this PR!

EDIT: now read a bit of the google style guide and the arguments against exceptions are fairly persuasive... Anyway, definitely ignore this comment for this PR!

  This another method for defining the inner loops of UFunc,
  Fully meta and doesn't add any overheads as long as there's
  no exported symbols involved e.g. runtime dispatching
@jorenham
Copy link
Member

jorenham commented Jan 8, 2026

@seiko2plus any chance this could be revived? Otherwise we should probably close it.

@jorenham jorenham added the 57 - Close? Issues which may be closable unless discussion continued label Jan 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

01 - Enhancement 57 - Close? Issues which may be closable unless discussion continued C++ Related to introduction and use of C++ in the NumPy code base

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants