Skip to content

Optimize Object::cast_to by assuming no virtual and multiple inheritance, gaining 7x throughput over dynamic_cast.#103708

Merged
akien-mga merged 1 commit intogodotengine:masterfrom
Ivorforce:linear-casts-fast-virtual-casts-bad
Mar 28, 2025
Merged

Optimize Object::cast_to by assuming no virtual and multiple inheritance, gaining 7x throughput over dynamic_cast.#103708
akien-mga merged 1 commit intogodotengine:masterfrom
Ivorforce:linear-casts-fast-virtual-casts-bad

Conversation

@Ivorforce
Copy link
Copy Markdown
Member

@Ivorforce Ivorforce commented Mar 6, 2025

Object::cast_to is used to check if an object is of a certain type. Its current implementation uses dynamic_cast to accomplish this. dynamic_cast is notoriously slow, and can be optimized under certain assumptions.

Object::cast_to is called ~2500 times across the codebase, so this should affect many areas of the engine, including scene tree traversal.

(This implementation is almost as fast as #103693, but applies to all Object classes, not just final classes.)

Explanation

To find out why dynamic_cast is so slow, I watched this cppcon talk by Arthur O'Dwyer, but TL;DR is:

  • It needs to account for virtual subclasses
  • It needs to account for multiple inheritance
  • It needs to account for access modifiers
  • Implementations are suboptimal

For Object, we can assume simple, linear inheritance. In essence, we should only need to check 2 or 3 values for everything we pass in.
To implement this simplification, I used the virtual method trick as in the talk, but simplified it further to just ignore all the problematic parts that could slow it down. The compiler will inline calls and simplify to a few int comparisons, making the implementation extremely fast.

Benchmarks

I benchmarked a performance difference of about 7x for both hits and misses:

Code

(notably out-of-project for quick iteration)

struct Object {
	virtual ~Object() = default;
	virtual void a() {}
	virtual bool derives_from(const std::type_info &type_info) {
		return typeid(decltype(this)) == type_info;
	}
};

struct B : public Object {
	void a() override {}
	virtual bool derives_from(const std::type_info &type_info) {
		return typeid(decltype(this)) == type_info || Object::derives_from(type_info);
	}
};

struct C : public B {
	void a() override {}
	__attribute__ ((noinline)) void test() {}
	virtual bool derives_from(const std::type_info &type_info) {
		return typeid(decltype(this)) == type_info || B::derives_from(type_info);
	}
};

template <typename T>
static T *cast_to_old(Object *p_object) {
	return dynamic_cast<T *>(p_object);
}

template <typename T>
__attribute__ ((noinline)) void test_old(Object *a) {
	for (int i = 0; i < 100000000; ++i) {
		T *c = cast_to_old<T>(a);
		if (c) {
			c->test();
		}
	}
}

template <typename T>
static T *cast_to_virtual(Object *p_object) {
	return p_object->derives_from(typeid(T)) ? static_cast<T *>(p_object) : nullptr;
}

template <typename T>
__attribute__ ((noinline)) void test_virtual(Object *a) {
	for (int i = 0; i < 100000000; ++i) {
		T *c = cast_to_virtual<T>(a);
		if (c) {
			c->test();
		}
	}
}

int main()
{
	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_old<C>(new C());
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}
	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_old<C>(new B());
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}


	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_virtual<C>(new C());
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}
	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_virtual<C>(new B());
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}
}

This printed:

(old implementation)
973ms (hit: cast C to C)
760ms (miss: cast B to C)

(new implementation)
110ms (hit: cast C to C)
90ms (miss: cast B to C)

Edit: I now also tested in-project differences in speed, it's comparable to the example above:

Code
__attribute__ ((noinline)) void test(Object *obj) {

}

template <typename T>
static T *cast_to_old(Object *p_object) {
	return dynamic_cast<T *>(p_object);
}

template <typename T>
__attribute__ ((noinline)) void test_old(Object *a) {
	for (int i = 0; i < 100000000; ++i) {
		T *c = cast_to_old<T>(a);
		if (c) {
			test(c);
		}
	}
}

template <typename T>
__attribute__ ((noinline)) void test_new(Object *a) {
	for (int i = 0; i < 100000000; ++i) {
		T *c = Object::cast_to<T>(a);
		if (c) {
			test(c);
		}
	}
}

int test_main(int argc, char *argv[]) {
	Object *obj = memnew(Node3D);

	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_old<Node3D>(obj);
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}
	{
		auto t0 = std::chrono::high_resolution_clock::now();
		test_new<Node3D>(obj);
		auto t1 = std::chrono::high_resolution_clock::now();
		std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << "ms\n";
	}

This printed:

  • 539ms (test_old)
  • 126ms (test_new)

Caveats

This optimizations bars us from virtual inheritance for Object derived classes.
I don't think this is bad. For one, the current GDCLASS system doesn't support it anyway. In addition, it would arguably be bad code to do so. In any case, I added -Wvirtual-inheritance to compiler warnings as a sanity check.

I think this change would break compatibility with GDExtension, as they would now be required to override derives_from in GDCLASS as well. (Edit: Or it may not practically, see #103708 (comment)).

It's possible that other compilers fare worse than clang, especially without optimization, because parent calls to derives_from may not be inlined. I'm not sure if it's possible to FORCE_INLINE a virtual function call (or at least, a static call to a virtual function).

This PR also seems to increase binary size by ~800kb for me. I think it's worth it, and I might be able to cut that down again in a follow up PR.

@Ivorforce Ivorforce requested review from a team as code owners March 6, 2025 16:36
@Ivorforce Ivorforce changed the title Optimize Object::cast_to 7x by assuming no virtual and multiple inheritance, gaining 8x throughput over dynamic_cast. Optimize Object::cast_to by assuming no virtual and multiple inheritance, gaining 7x throughput over dynamic_cast. Mar 6, 2025
@dsnopek
Copy link
Copy Markdown
Contributor

dsnopek commented Mar 6, 2025

We have some multiple inheritance in the openxr module (which I've argued for removing!), but I'm not exactly sure how it would be affected by this change. Anytime we do Object::cast_to<T>() where T is an Object type, it seems like it'd be fine. Then we could maybe do a couple of dynamic_cast<T>()s for the non-Object parent class? That would maybe allow the multiple inheritance to keep working. Or, we could stop doing that :-)

I think this change would break compatibility with GDExtension, as they would now be required to override derives_from in GDCLASS as well.

Hm, maybe. GDExtension classes are always "attached" to an instance of their nearest parent class that exists within the engine, and any Object::cast_to<T>() within the engine would only address classes within the engine, so it might be fine?

In godot-cpp, we'd maybe want to change the way we do Object::cast_to<T>() to match the engine, but what we have now (using dynamic_cast<T>()) might continue working as-is?

This would require some experimentation.

I could also see a potential affect on one GDExtension using classes from another GDExtension, but that's not a use-case we officially support at the moment anyway - we'd just need to keep this in mind when adding support for it.

@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Mar 6, 2025

We have some multiple inheritance in the openxr module (which #101951 (comment)!), but I'm not exactly sure how it would be affected by this change.

I think as long as the multiple inheritance doesn't lead to Object twice, i.e. the other inheritors are secondary, it should be fine. Which it should, given GDCLASS only supports one superclass parameter.

Hm, maybe. GDExtension classes are always "attached" to an instance of their nearest parent class that exists within the engine, and any Object::cast_to() within the engine would only address classes within the engine, so it might be fine?
In godot-cpp, we'd maybe want to change the way we do Object::cast_to() to match the engine, but what we have now (using dynamic_cast()) might continue working as-is?

Interesting thought! Yeah, with some luck, it could actually be backwards compatible 😄

Speaking of this, this PR would definitely need to expose this function to GDExtension for overriding. Not sure how to do that.

@Ivorforce Ivorforce force-pushed the linear-casts-fast-virtual-casts-bad branch 2 times, most recently from 7c2be69 to 94b4eb5 Compare March 6, 2025 18:58
@Ivorforce
Copy link
Copy Markdown
Member Author

I searched the issues for dynamic_cast and randomly selected #64398 as a candidate to test real-world impact.
The reproducer profiled 10% impact of dynamic_cast for me:

1.14 Gc   13.5%	1.14 Gc	 	 Node3D::get_global_transform() const
1.11 Gc   13.1%	1.11 Gc	 	 Node::get_child_count(bool) const
908.59 Mc   10.7%	908.59 Mc	 	 Basis::invert()
892.64 Mc   10.6%	892.64 Mc	 	 __dynamic_cast
715.91 Mc    8.5%	715.91 Mc	 	 Node3DEditorViewport::_calculate_spatial_bounds(Node3D const*, bool, Transform3D const*)
521.04 Mc    6.1%	521.04 Mc	 	 __cxxabiv1::__si_class_type_info::search_below_dst(__cxxabiv1::__dynamic_cast_info*, void const*, int, bool) const
493.45 Mc    5.8%	493.45 Mc	 	 Transform3D::affine_inverse() const
461.81 Mc    5.4%	461.81 Mc	 	 Transform3D::operator*=(Transform3D const&)
447.50 Mc    5.3%	447.50 Mc	 	 AABB::merge_with(AABB const&)
183.30 Mc    2.1%	183.30 Mc	 	 Node::can_process() const

On this branch, after exchanging Object::cast_to, the entry disappears:

1.10 Gc   11.4%	1.10 Gc	 	 Node::get_child_count(bool) const
995.57 Mc   10.3%	995.57 Mc	 	 Node3DEditorViewport::_calculate_spatial_bounds(Node3D const*, bool, Transform3D const*)
796.07 Mc    8.2%	796.07 Mc	 	 Basis::invert()
541.20 Mc    5.6%	541.20 Mc	 	 Node3D::get_global_transform() const
465.41 Mc    4.8%	465.41 Mc	 	 Transform3D::affine_inverse() const
435.89 Mc    4.5%	435.89 Mc	 	 Transform3D::operator*=(Transform3D const&)
344.45 Mc    3.5%	344.45 Mc	 	 AABB::merge_with(AABB const&)
230.18 Mc    2.3%	230.18 Mc	 	 DynamicBVH::optimize_incremental(int)
188.47 Mc    1.9%	188.47 Mc	 	 Node::can_process() const

It's a somewhat wonky comparison, because it's not a stress test (although the CPU use is, as the author correctly claimed, unexpectedly high).

But I think it can be concluded that the optimization works as expected in this case.

@lawnjelly
Copy link
Copy Markdown
Member

lawnjelly commented Mar 9, 2025

This is great! Maybe we can test on different compilers ..
can https://godbolt.org/ help here maybe to see the generated code?

Having said that, I'm not sure why a compiler wouldn't inline the inherited calls (at least in release as you say) as they should be able to be evaluated at compile time. I'm pretty sure we must rely on this elsewhere for inheritance macros (I haven't studied them).

I just did a test in a particular intensive scene tree traversal (for #103685) I was doing by just implementing a bitfield in Node (as I was only interested in checking for Spatial and VisualInstance) and this was a boatload faster than cast_to(). Increased my overall traversal speed of the SceneTree by 1/3 (and I suspect some of the rest might be cache related), so that confirms there is a lot of low hanging fruit here.

The unoptimized version has to traverse the entire SceneTree twice per frame, which currently is taking 700 or so usecs, down to around 400 with the bitfield changes. That's a lot in the 16ms allowance, so I'd like to optimize this further, but any reduction in the cast_to costs is welcome.

(Obviously your approach involves no storage on the node as it works through virtual system, so is better for general use, but this was just a simple bottleneck test.)

I'd like to also try this in 3.x .. we have no GDExtension there so no binary compatibility to worry about (I'm assuming that's part of the issue with GDExtension?). For recompiled modules I think they would work fine with it.

@Ivorforce
Copy link
Copy Markdown
Member Author

This is great! Maybe we can test on different compilers ..
can https://godbolt.org/ help here maybe to see the generated code?

Having said that, I'm not sure why a compiler wouldn't inline the inherited calls (at least in release as you say) as they should be able to be evaluated at compile time. I'm pretty sure we must rely on this elsewhere for inheritance macros (I haven't studied them).

I had a look... unexpectedly there's still a ton of optimization potential:
https://godbolt.org/z/b4W3jcqPv

Looks like even retrieving the typeid for comparison is complicated.
Someone suggested comparing pointers first to shortcut the test if the compiler allows, but still doing the 'safe' comparison otherwise.
This all seems unnecessarily complicated to me; it seems like there should be a typeid that's just an integer value or something.
This seems out of scope for this PR; while this is not optimal, it's still far better than dynamic_cast. We could think about further improvements in a future PR.

I was doing by just implementing a bitfield in Node

This was my first thought as well, but this would of course only optimize specific classes (if I'm understanding your approach right).
Still, both could be used if bitfields are needed to optimize specific cases!

I'd like to also try this in 3.x .. we have no GDExtension there so no binary compatibility to worry about (I'm assuming that's part of the issue with GDExtension?). For recompiled modules I think they would work fine with it.

Agreed.

@Jordyfel
Copy link
Copy Markdown
Contributor

Reminds me of the discussion here #80612 (comment) which mentions that this dynamic cast prevents compiling without RTTI

@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Mar 17, 2025

Reminds me of the discussion here #80612 (comment) which mentions that this dynamic cast prevents compiling without RTTI

This PR notably still uses RTTI, though it does make a step in a direction of Object working without it.

I also have been considering the possibility of getting rid of RTTI too, but it seems Godot makes use of one or two libraries that do use it. I'm not sure how safe it would be to try to enable RTTI just for those. But thats probably a discussion for another time.

@Ivorforce Ivorforce force-pushed the linear-casts-fast-virtual-casts-bad branch 3 times, most recently from 71f4d0c to af222d5 Compare March 27, 2025 10:02
@Ivorforce Ivorforce requested a review from a team as a code owner March 27, 2025 10:02
@Ivorforce Ivorforce removed request for a team March 27, 2025 10:29
@Ivorforce Ivorforce force-pushed the linear-casts-fast-virtual-casts-bad branch from af222d5 to d70e6ff Compare March 27, 2025 11:07
@dsnopek
Copy link
Copy Markdown
Contributor

dsnopek commented Mar 27, 2025

We have some multiple inheritance in the openxr module (which I've argued for removing!), but I'm not exactly sure how it would be affected by this change

FYI, we recently got rid of the multiple inheritance in the openxr module in PR #104087

There's still a dynamic_cast<T>(..) in a bit of the deprecated code, but I think we could actually remove that, now that the multiple inheritance is gone. Not super important unless we get to a point where we want to remove RTTI from all the non-thirdparty Godot code, though

@akien-mga
Copy link
Copy Markdown
Member

akien-mga commented Mar 27, 2025

I assume the call to Object::cast_to would now fail compiling in cases where cast_to is used on Objects which aren't GDCLASS? (like https://github.com/godotengine/godot/pull/104660/files#r2016547193)

It will fail compiling only if base class missing GDCLASS, but not if derived class missing it (which is the case for DisplayServer derived classes for example).

Hm, would it crash then when calling p_object->_derives_from? If so we should likely guard against that (with some static check if we can).

@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Mar 27, 2025

Hm, would it crash then when calling p_object->_derives_from? If so we should likely guard against that (with some static check if we can).

No, it should merely return false for any class in the inheritance chain that doesn't use GDCLASS.

@bruvzg
Copy link
Copy Markdown
Member

bruvzg commented Mar 27, 2025

Hm, would it crash then when calling p_object->_derives_from?

No it won't crash, but cast will always fail for this case. e.g, I have added GDCLASS in #104660, since otherwise cast to DisplayServerMacOS was always false.

@Ivorforce Ivorforce marked this pull request as draft March 27, 2025 14:14
@Ivorforce
Copy link
Copy Markdown
Member Author

I found a good way to assert the GDCLASS assumption is true. This uncovered several potential regressions. I'll quickly fix those up.

@Ivorforce Ivorforce force-pushed the linear-casts-fast-virtual-casts-bad branch from d70e6ff to 47b7e00 Compare March 27, 2025 14:38
@Ivorforce Ivorforce marked this pull request as ready for review March 27, 2025 14:38
…tance, gaining 8x throughput over `dynamic_cast`.

Add `-Wvirtual-inheritance` to compiler warnings as a sanity check.
@Ivorforce Ivorforce force-pushed the linear-casts-fast-virtual-casts-bad branch from 47b7e00 to dd9dc75 Compare March 27, 2025 14:39
@Ivorforce
Copy link
Copy Markdown
Member Author

Ivorforce commented Mar 27, 2025

Ok, I just added a fallback to dynamic_cast for now if GDCLASS isn't used for T.

The alternative would have been to use GDCLASS for all the affected objects, but I'd rather not decide that now.
I'm fairly confident now that all assumptions have been sufficiently asserted, and there isn't a big risk of regression left. Thanks for raising concerns in time!

Copy link
Copy Markdown
Member

@AThousandShips AThousandShips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new additions with is_same_v make sense, as it must (as per the static_assert) derive from Object, and therefore self_type must be valid

I think not enforcing GDCLASS unnecessarily is the better way to go

@akien-mga akien-mga merged commit d01ffe4 into godotengine:master Mar 28, 2025
20 checks passed
@Ivorforce Ivorforce deleted the linear-casts-fast-virtual-casts-bad branch March 28, 2025 13:41
@akien-mga
Copy link
Copy Markdown
Member

Thanks!

"-Wshadow",
"-Wno-misleading-indentation",
# For optimized Object::cast_to / object.inherits_from()
"-Wvirtual-inheritance",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we can't use any thirdparty header using this? I'm using FastNoise2 in my module, which uses virtual inheritance in a header. It triggers this warning and fails compilation with Werror.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just means the headers will need to be wrapped with pragmas disabling the warning. A few places in the repo are already doing this, though the D3D12 module is probably the most obvious example

@ibrahn
Copy link
Copy Markdown
Contributor

ibrahn commented Mar 30, 2025

I'm getting a warning about the flag for every c file compiled:

[  3%] Compiling thirdparty/glad/glx.c ...
cc1: warning: command-line option '-Wvirtual-inheritance' is valid for C++/ObjC++ but not for C
[  3%] Compiling platform/linuxbsd/wayland/protocol/wayland.gen.c ...
cc1: warning: command-line option '-Wvirtual-inheritance' is valid for C++/ObjC++ but not for C
[  3%] Compiling platform/linuxbsd/wayland/protocol/viewporter.gen.c ...
cc1: warning: command-line option '-Wvirtual-inheritance' is valid for C++/ObjC++ but not for C
[  3%] Compiling platform/linuxbsd/wayland/protocol/fractional_scale.gen.c ...
cc1: warning: command-line option '-Wvirtual-inheritance' is valid for C++/ObjC++ but not for C

gcc --version is gcc (Debian 12.2.0-14) 12.2.0

@lawnjelly
Copy link
Copy Markdown
Member

lawnjelly commented Mar 31, 2025

Was just looking at backporting this to 3.x, and it turns out that as far as I can see, the backup path in 3.x (when NO_SAFE_CAST is defined, which is on Android and web), already uses an analogous system.

The only slight difference seems to be it compares a pointer to a static per class, instead of using typeid keyword, but I suspect the results will be pretty much the same (will test this). I'll do some perf testing but if so we can either use the existing system on all platforms, or convert the typeid approach from here (if it e.g. optimizes better).

UPDATE also finding the same perf advantages on 3.x on Linux desktop, the scene tree traversal in the FTI is 2-3x faster using the existing static approach (rather than dynamic_cast).

@Ivorforce
Copy link
Copy Markdown
Member Author

From disassembling typeid code I'm guessing that the static approach will work better.
The reason I haven't proposed this yet is because I've been trying to find a way to use the vtable for this, which would avoid needing to dynamic call for same-class cases. But it seems so far that c++ doesn't like my idea, so we can probably move forward with statics soon.

@mihe
Copy link
Copy Markdown
Contributor

mihe commented Mar 31, 2025

It might have been mentioned here (or in the linked talk) already, and I realize we're not disabling RTTI despite this change, but note that it's possible to hack together a custom typeid by exploiting __PRETTY_FUNCTION__/__FUNCSIG__, as they're guaranteed to be unique for every instantiation of a template function.

It's something along these lines:

template <typename T>
const void* my_custom_type_id() {
#ifdef _MSC_VER
    return static_cast<const void*>(__FUNCSIG__);
#else
    return static_cast<const void*>(__PRETTY_FUNCTION__);
#endif
}

Not sure how it holds up outside of GCC/Clang/MSVC though, if that's a concern.

@Ivorforce
Copy link
Copy Markdown
Member Author

That's an interesting approach!
I'm not sure it's needed though; I've created a PR today that just uses the existing class_ptr to avoid RTTI as well. Given what both (class_ptr and your suggestion) do, they'll probably compile to the same binary code (a uint64_t equality test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.