Enhance SpinLock#85167
Conversation
b29354a to
8e1132e
Compare
221abf0 to
8bc78ab
Compare
Is the AMD Zen Note that if this is implemented, this will need to be tested by someone else as I don't have the hardware to test it. |
I didn't know about it, to be honest. However, judging by the spec, it seems that Now I'm wondering if having to branch to the |
8bc78ab to
14f66bc
Compare
|
I'd like to add my two cents saying that I'm eager to see this class being improved as the current version can really be dangerous for performance. Note that despite the improvements, Mutex is still a better choice on modern hardware and systems in most cases, even for short locks. |
14f66bc to
c5a3b6b
Compare
c5a3b6b to
2652397
Compare
|
|
2652397 to
f545d40
Compare
8addc1f to
d0fb2fc
Compare
|
I'm bumping this because I've added benchmark results to the description. |
d0fb2fc to
d5a24f9
Compare
|
Going back to the CPU pause approach, once realized thread yield is not appropriate for this. I've also applied @bruvzg's suggestions. In the future we may want to get rid of spinlocks altogether, but for now let's at least have a better one. Other CPU-specific instructions such as the |
8bc703d to
03069b4
Compare
There was a problem hiding this comment.
Tested locally (rebased on top of master b00e1cb), it works as expected.
Benchmark
PC specifications
- CPU: Intel Core i9-13900K
- GPU: NVIDIA GeForce RTX 4090
- RAM: 64 GB (2×32 GB DDR5-5800 C30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 41)
Using an optimized editor build for testing (production=yes lto=full).
A is master, B is this PR rebased on master.
{
"category": "Core > Object Db",
"name": "12 Threads Full Contention",
"results": {
"time": {
"a": 2297.0,
"b": 1186.0,
"a_div_b": 1.936762225969646
}
}
}
{
"category": "Core > Object Db",
"name": "12 Threads Half Contention",
"results": {
"time": {
"a": 1155.0,
"b": 605.3,
"a_div_b": 1.9081447216256404
}
}
}
{
"category": "Core > Object Db",
"name": "12 Threads Little Contention",
"results": {
"time": {
"a": 233.0,
"b": 128.8,
"a_div_b": 1.8090062111801242
}
}
}
{
"category": "Core > Object Db",
"name": "12 Threads Slope Contention",
"results": {
"time": {
"a": 1064.0,
"b": 599.5,
"a_div_b": 1.774812343619683
}
}
}
{
"category": "Core > Object Db",
"name": "2 Threads Full Contention",
"results": {
"time": {
"a": 83.19,
"b": 76.33,
"a_div_b": 1.0898729202148565
}
}
}
{
"category": "Core > Object Db",
"name": "2 Threads Half Contention",
"results": {
"time": {
"a": 50.19,
"b": 45.45,
"a_div_b": 1.104290429042904
}
}
}
{
"category": "Core > Object Db",
"name": "2 Threads Little Contention",
"results": {
"time": {
"a": 26.51,
"b": 28.54,
"a_div_b": 0.9288717589348284
}
}
}
{
"category": "Core > Object Db",
"name": "4 Threads Full Contention",
"results": {
"time": {
"a": 275.8,
"b": 214.9,
"a_div_b": 1.2833876221498373
}
}
}
{
"category": "Core > Object Db",
"name": "4 Threads Half Contention",
"results": {
"time": {
"a": 148.8,
"b": 119.2,
"a_div_b": 1.2483221476510067
}
}
}
{
"category": "Core > Object Db",
"name": "4 Threads Little Contention",
"results": {
"time": {
"a": 39.28,
"b": 38.06,
"a_div_b": 1.0320546505517603
}
}
}
{
"category": "Core > Object Db",
"name": "4 Threads Slope Contention",
"results": {
"time": {
"a": 199.8,
"b": 163.7,
"a_div_b": 1.220525351252291
}
}
}
{
"category": "Core > Object Db",
"name": "8 Threads Full Contention",
"results": {
"time": {
"a": 1018.0,
"b": 601.0,
"a_div_b": 1.6938435940099834
}
}
}
{
"category": "Core > Object Db",
"name": "8 Threads Half Contention",
"results": {
"time": {
"a": 520.0,
"b": 319.8,
"a_div_b": 1.6260162601626016
}
}
}
{
"category": "Core > Object Db",
"name": "8 Threads Little Contention",
"results": {
"time": {
"a": 118.2,
"b": 78.52,
"a_div_b": 1.5053489556800816
}
}
}
{
"category": "Core > Object Db",
"name": "8 Threads Slope Contention",
"results": {
"time": {
"a": 551.2,
"b": 356.9,
"a_div_b": 1.5444101989352763
}
}
}03069b4 to
360ed2b
Compare
clayjohn
left a comment
There was a problem hiding this comment.
Looks good to me. The comments from bruvzg and mrsaturnsan have both been resolved. So I think we can go ahead with this.
|
Thanks! |
|
This changeset causes Godot to crash on startup on Linux when using builds compiled with AVX2 ISAs enabled (AMD Excavator, Intel Haswell or newer) git checkout 01ad56da38fa10949ff7878d4d6a5d6c195eff39
scons target=editor ccflags="-march=x86-64-v3"
./bin/godot.linuxbsd.editor.x86_64This occurred on an AMD Ryzen 9 7900X. Edit 2: This seems to be a cacheline issue. Adding Stacktrace 1 (stock) Stacktrace 2 (inline everywhere) The segfault occurs when zeroing the atomic lock data which it does by moving a zero from the ymm register. |
|
@RandomShaper AMD stated in their optimization guide that the compiler can automatically compile std_mutex into the mwaitx hardware instructions. |
|
As a note, that's not guaranteed. On POSIX it will be delegated to pthread_mutex_{lock,unlock}, which calls Edit: Looking over the assembly, this seems to do |
|
Removing the
Either this attribute is ignored or alignas does not behave as expected. Edit: Reading up on the alignas keyword, it apparently cannot exceed Edit 2: 😅 now it's segfaulting in AVX-512 CPU targets, resolved by bumping the alignment to 128. The particular sensitivity to cache line size is frustrating. Edit 3: It still crashes, just at random points now. This is very unstable on x86-64-v3 and v4. |
|
@yretenai. thank you for your findings. Can you give me stack traces of the crashes? I'm wondering if there are cases where the |
|
Built with Built with Reminder: the code works without alignment. The compiler will attempt to ensure std::atomic is aligned to the required alignment by introducing padding structures (which it seems to fail to do when forced alignment is introduced,) but this may not per-se be aligned to the cache line as demonstrated here: Also: Enforcing alignment with |
As advised by https://rigtorp.se/spinlock/ and other resources, a good spinlock should do the following at least (and the following points are what this PR adds):
2024-10-23:
As a disclaimer, I haven't benchmarked this for either performance nor power usage. Maybe the Production team can lend a hand.Using a benchmark I've written for this (godotengine/godot-benchmarks#95), these are the results on a Windows 10 PC, before (
a) and after (b) this PR (unit is time, lower is better; note how this PR highly improves the contended cases):2024-11-04: Benchmarked after going back to the CPU pause approach: