Optimize Object::cast_to by assuming no virtual and multiple inheritance, gaining 7x throughput over dynamic_cast.#103708
Conversation
Object::cast_to 7x by assuming no virtual and multiple inheritance, gaining 8x throughput over dynamic_cast.Object::cast_to by assuming no virtual and multiple inheritance, gaining 7x throughput over dynamic_cast.
|
We have some multiple inheritance in the openxr module (which I've argued for removing!), but I'm not exactly sure how it would be affected by this change. Anytime we do
Hm, maybe. GDExtension classes are always "attached" to an instance of their nearest parent class that exists within the engine, and any In godot-cpp, we'd maybe want to change the way we do This would require some experimentation. I could also see a potential affect on one GDExtension using classes from another GDExtension, but that's not a use-case we officially support at the moment anyway - we'd just need to keep this in mind when adding support for it. |
I think as long as the multiple inheritance doesn't lead to
Interesting thought! Yeah, with some luck, it could actually be backwards compatible 😄 Speaking of this, this PR would definitely need to expose this function to GDExtension for overriding. Not sure how to do that. |
7c2be69 to
94b4eb5
Compare
|
I searched the issues for On this branch, after exchanging It's a somewhat wonky comparison, because it's not a stress test (although the CPU use is, as the author correctly claimed, unexpectedly high). But I think it can be concluded that the optimization works as expected in this case. |
|
This is great! Maybe we can test on different compilers .. Having said that, I'm not sure why a compiler wouldn't inline the inherited calls (at least in release as you say) as they should be able to be evaluated at compile time. I'm pretty sure we must rely on this elsewhere for inheritance macros (I haven't studied them). I just did a test in a particular intensive scene tree traversal (for #103685) I was doing by just implementing a bitfield in The unoptimized version has to traverse the entire (Obviously your approach involves no storage on the node as it works through virtual system, so is better for general use, but this was just a simple bottleneck test.) I'd like to also try this in 3.x .. we have no GDExtension there so no binary compatibility to worry about (I'm assuming that's part of the issue with GDExtension?). For recompiled modules I think they would work fine with it. |
I had a look... unexpectedly there's still a ton of optimization potential: Looks like even retrieving the
This was my first thought as well, but this would of course only optimize specific classes (if I'm understanding your approach right).
Agreed. |
|
Reminds me of the discussion here #80612 (comment) which mentions that this dynamic cast prevents compiling without RTTI |
This PR notably still uses RTTI, though it does make a step in a direction of Object working without it. I also have been considering the possibility of getting rid of RTTI too, but it seems Godot makes use of one or two libraries that do use it. I'm not sure how safe it would be to try to enable RTTI just for those. But thats probably a discussion for another time. |
71f4d0c to
af222d5
Compare
af222d5 to
d70e6ff
Compare
FYI, we recently got rid of the multiple inheritance in the openxr module in PR #104087 There's still a |
Hm, would it crash then when calling |
No, it should merely return |
No it won't crash, but cast will always fail for this case. e.g, I have added |
|
I found a good way to assert the |
d70e6ff to
47b7e00
Compare
…tance, gaining 8x throughput over `dynamic_cast`. Add `-Wvirtual-inheritance` to compiler warnings as a sanity check.
47b7e00 to
dd9dc75
Compare
|
Ok, I just added a fallback to The alternative would have been to use |
|
Thanks! |
| "-Wshadow", | ||
| "-Wno-misleading-indentation", | ||
| # For optimized Object::cast_to / object.inherits_from() | ||
| "-Wvirtual-inheritance", |
There was a problem hiding this comment.
Does that mean we can't use any thirdparty header using this? I'm using FastNoise2 in my module, which uses virtual inheritance in a header. It triggers this warning and fails compilation with Werror.
There was a problem hiding this comment.
It just means the headers will need to be wrapped with pragmas disabling the warning. A few places in the repo are already doing this, though the D3D12 module is probably the most obvious example
|
I'm getting a warning about the flag for every c file compiled:
|
|
Was just looking at backporting this to 3.x, and it turns out that as far as I can see, the backup path in 3.x (when The only slight difference seems to be it compares a pointer to a static per class, instead of using UPDATE also finding the same perf advantages on 3.x on Linux desktop, the scene tree traversal in the FTI is 2-3x faster using the existing static approach (rather than |
|
From disassembling |
|
It might have been mentioned here (or in the linked talk) already, and I realize we're not disabling RTTI despite this change, but note that it's possible to hack together a custom It's something along these lines: template <typename T>
const void* my_custom_type_id() {
#ifdef _MSC_VER
return static_cast<const void*>(__FUNCSIG__);
#else
return static_cast<const void*>(__PRETTY_FUNCTION__);
#endif
}Not sure how it holds up outside of GCC/Clang/MSVC though, if that's a concern. |
|
That's an interesting approach! |
Object::cast_tois used to check if an object is of a certain type. Its current implementation usesdynamic_castto accomplish this.dynamic_castis notoriously slow, and can be optimized under certain assumptions.Object::cast_tois called ~2500 times across the codebase, so this should affect many areas of the engine, including scene tree traversal.(This implementation is almost as fast as #103693, but applies to all Object classes, not just final classes.)
Explanation
To find out why
dynamic_castis so slow, I watched this cppcon talk by Arthur O'Dwyer, but TL;DR is:For
Object, we can assume simple, linear inheritance. In essence, we should only need to check 2 or 3 values for everything we pass in.To implement this simplification, I used the
virtualmethod trick as in the talk, but simplified it further to just ignore all the problematic parts that could slow it down. The compiler will inline calls and simplify to a few int comparisons, making the implementation extremely fast.Benchmarks
I benchmarked a performance difference of about
7xfor both hits and misses:Code
(notably out-of-project for quick iteration)
This printed:
Edit: I now also tested in-project differences in speed, it's comparable to the example above:
Code
This printed:
539ms(test_old)126ms(test_new)Caveats
This optimizations bars us from virtual inheritance for
Objectderived classes.I don't think this is bad. For one, the current
GDCLASSsystem doesn't support it anyway. In addition, it would arguably be bad code to do so. In any case, I added-Wvirtual-inheritanceto compiler warnings as a sanity check.I think this change would break compatibility with GDExtension, as they would now be required to override
derives_frominGDCLASSas well. (Edit: Or it may not practically, see #103708 (comment)).It's possible that other compilers fare worse than clang, especially without optimization, because parent calls to
derives_frommay not be inlined. I'm not sure if it's possible toFORCE_INLINEa virtual function call (or at least, a static call to a virtual function).This PR also seems to increase binary size by ~800kb for me. I think it's worth it, and I might be able to cut that down again in a follow up PR.