Skip to content

Parker interface & std.Mutex performance improvements#3585

Merged
andrewrk merged 17 commits intoziglang:masterfrom
kprotty:adaptive_lock
Nov 8, 2019
Merged

Parker interface & std.Mutex performance improvements#3585
andrewrk merged 17 commits intoziglang:masterfrom
kprotty:adaptive_lock

Conversation

@kprotty
Copy link
Contributor

@kprotty kprotty commented Nov 4, 2019

std.Mutex uses a simple locking scheme for linux, relies on CriticalSection for windows and falls back to spinlocking on other platforms. This PR acts as two parts towards improving it:

1) Adaptive Locking

For high contention cases, eager blocking mutexes incur a penalty of a syscall when they may not need to. In order to address this, the mutex can spin for a little bit trying to acquire the lock similar to a spinlock before deciding to block. This improves performance when the time spent in the critical section is minimal and acquiring/releasing is done frequently. The implementation chosen for this was that of lock_futex.go from Golang 1.13 as it provides a nice balance between spinning and deciding to block (another possibility could be rust/webkit parking_lot). Because this implementation only needs a futex interface, it can be reused:

2) Parker API

Most synchronization primitives such as Mutexes, RwLocks, Condvars, Events and Semaphores can be built upon atomic instructions and futexes for handling blocking. Another point of this PR was to setup a cross-platform futex (Parker) interface in which other primitives as listed above could be built off of. The default one provided is ThreadParker which differs from the current blocking implementation scheme:

  • On windows, it detects at runtime whether to use WaitOnAddress (supported since Win8+ and most similar to linux futex) or NT Keyed Events (supported since WinXP+ and is the inner backing of CriticalSection). This allows the distinction between std.Mutex and std.StaticallyInitializedMutex to dissapear as it can now be initialized statically
  • On posix platforms, it uses pthread_cond_t for synchronization which also supports static initialization. This fairs better for longer blocking critical sections compared to the spinlock default of the current std.Mutex implementation.
  • On linux, it still uses linux_futex so not many improvements besides adaptive spinning there

Results & Future Implications

Because the Parker now has a standardized interface, one could replace ThreadParker with something like AsyncParker and reuse the synchronization primitive code for std.event synchronization objects. In order to demonstrate this, I've provided some example code for AsyncParker as well as a naive benchmark to test the performance of std.Mutex in comparison to this new adaptive mutex: https://github.com/kprotty/zig-adaptive-lock/ The results for high contended, small critical section cases are promising:

  • Windows 10: 7-8x speedup
  • MacOSX: 19-22x speedup
  • Linux (Pthread): 2x speedup
  • Linux (Futex): 2x speedup

@kprotty
Copy link
Contributor Author

kprotty commented Nov 4, 2019

Theres currently a bug on this where on arm & i386 windows, the address being passed into ThreadParker mysteriously changes to another address and I can't figure out why. Could this be related to #3433 ?

@LemonBoy
Copy link
Contributor

LemonBoy commented Nov 5, 2019

Theres currently a bug on this where on arm & i386 windows, the address being passed into ThreadParker mysteriously changes to another address and I can't figure out why.

Because you've declared WakeByAddressSingle (and WaitOnAddress, together with the declarations in KeyedEvent) to be extern meaning that means that the C calling convention is required. But it's Win32 API we're talking about here and those use the stdcall calling convention! Replacing extern with stdcallcc makes the test pass :)

The AArch64 failure you see in the CI is a bug in Qemu that's been fixed ~4years ago (EXCP_YIELD is indeed defined as 0x10004 in the source code).

@kprotty
Copy link
Contributor Author

kprotty commented Nov 5, 2019

@LemonBoy No wonder! Tried doing extern stdcallcc for the fields but it was a syntax error. The arm bug was really throwing me off too. Thanks

@andrewrk
Copy link
Member

andrewrk commented Nov 5, 2019

I'll look into using a newer Ubuntu for the CI in order to get a newer qemu

@kprotty kprotty marked this pull request as ready for review November 7, 2019 13:54
Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this significant contribution!

This is close to being mergeable; I just have a few things for you to consider.

@Rocknest Rocknest mentioned this pull request Nov 7, 2019
pub fn sched_yield() void {
switch (builtin.os) {
.windows => _ = windows.kernel32.SwitchToThread(),
else => assert(system.sched_yield() == 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of asserting just ignore the result?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sched_yield is defined to always succeed, which it is on Linux, then assert is correct here. If it can possibly fail, then we need to have that reflected in the error set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because sched_yield error'ing out means it will always error out like in FreeBSD's case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, OK so we have to treat this as a fallible API now. That's fine. I propose to make the error set:

pub const SchedYieldError = error{
    /// The system is not configured to allow yielding
    SystemCannotYield,
};

pub fn sched_yield() SchedYieldError!void {
    if (builtin.os == .windows) {
        // The return value has to do with how many other threads there are; it is not
        // an error condition on Windows.
        _ = windows.kernel32.SwitchToThread();
        return;
    }
    switch (errno(system.sched_yield()) {
        0 => return,
        ENOSYS => return error.SystemCannotYield,
        else => return error.SystemCannotYield,
    }
}

Normally we have Unexpected as another possibility, but in this case it's unclear why this distinguishing would be beneficial.

Having the ENOSYS and else on different lines here is useful when looking at an error return trace.

The callsite of os.sched_yield can now choose how to deal with this possible error. Probably by ignoring it. But that's why we have these 2 layers of abstraction; the os.zig file is pretty low level, and exposes all the functionality of the underlying syscall, including errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants