
sync: Mutex performance collapses with high concurrency #33747

@rhysh

Description


Some of the Go services I work with run on large machines, often with runtime.NumCPU reporting around 20–40 logical CPUs. They typically leave GOMAXPROCS at its default (high) value, so a single server process can use the entire machine.

The services handle inbound HTTP requests. For each inbound request, the services make some number of outbound calls, wait for the responses, do calculations, and possibly make additional outbound calls. In these services, GOMAXPROCS is constant, the arrival rate of requests is subject to external forces, and the resulting concurrency (how many requests the process is handling at a given moment) is governed by how quickly the process responds to each request. (This sets them apart from worker-style programs that might poll a queue for new jobs.)

Starting with the Go 1.9 release, I've seen this type of service occasionally encounter sudden and severe lock contention.

I suspect this is related to the "starvation" mode added in Go 1.9 and described in the release notes as "Mutex is now more fair". Just as that change traded raw performance for more consistent behavior (fixing cases of pathological tail latency), our services need more consistent performance here. The applications sometimes use third-party packages that rely on sync.Mutex to protect internal state, and our options for responding to sudden contention there are limited.

This looks a bit different from what's described in "Non-scalable locks are dangerous", since sync.Mutex only briefly spins before registering with a runtime semaphore. Those semaphores appear to be sharded, and to be protected by MCS locks.

When the microbenchmark's performance collapses, the execution trace shows that handing off the lock (from when the call to Unlock starts through to when the next goroutine's call to Lock is about to return) takes about 60–70µs.

CC @aclements

What version of Go are you using (go version)?

I've run the benchmark with go1.8.7, go1.9.7, go1.12.7, and go1.13beta1 on linux/amd64.

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

The benchmark data is from an amd64 / x86_64 machine running Linux 4.14.

What did you do?

I built a micro-benchmark that is parallelizable except for a specific critical section. The default parameters of this benchmark are for each task to take about 350µs and to hold the shared lock for 1% of that time, about 3.5µs.

I ran the benchmark on a machine with 96 logical cores. I left GOMAXPROCS at the default of 96, and I varied the number of worker goroutines from 1 to 96. This mirrors the behavior of our RPC services, where the program is allowed to use all available hardware but the effective concurrency varies as the program runs.

What did you expect to see?

Since the critical section is short (1% of task cost), I expect that two goroutines will be able to process work at nearly double the speed of a single worker, and that 10 goroutines will have nearly 10x performance. (A speedup of 9x or 8x would also be acceptable.) I expect that adding worker goroutines will have diminishing returns, particularly once there's a goroutine per physical core (48 in this case) or when the rate at which the whole process completes tasks gets close to the length of the critical section. I expect that adding a worker goroutine won't significantly decrease the aggregate processing speed.
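These expectations line up with Amdahl's law for a 1% serial fraction. The following is a back-of-the-envelope sketch of the ideal ceiling, ignoring lock-handoff and scheduling costs, so the measured speedups should fall somewhat below these numbers:

```go
package main

import "fmt"

// amdahl gives the ideal speedup for n workers when a fraction s of
// each task is serialized (here s = 0.01: the lock is held for about
// 1% of the ~350µs task).
func amdahl(n int, s float64) float64 {
	return 1 / (s + (1-s)/float64(n))
}

func main() {
	for _, n := range []int{2, 10, 48, 96} {
		fmt.Printf("%2d workers: %.1fx\n", n, amdahl(n, 0.01))
	}
	// 2 workers: 2.0x
	// 10 workers: 9.2x
	// 48 workers: 32.7x
	// 96 workers: 49.2x
}
```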

What did you see instead?

The following chart shows the results of running the test with a few versions of Go, each normalized so the performance of the single-goroutine case is defined as 1.

Go 1.8 has nearly linear performance up to 41 workers, finishing work at about 35x the rate of a single goroutine. After that, performance ranges from 28x to 37x.

The behavior of Go 1.9 is very different. It has nearly linear performance up to 45 workers (38x baseline), but with 46 workers its performance plummets to 12x baseline. At 51 workers it's at about 6x baseline, and after that its performance is about 5x baseline.

For Go 1.12 and Go 1.13 (beta), the benchmark includes user annotations via the runtime/trace package. Their performance is nearly linear until 28 workers (24x baseline), and falls back more gradually, reaching 5x baseline at 44 workers.

The behavior of Go 1.8 is what I expect. With Go 1.9 and later, this sort of performance collapse means that a temporary increase in concurrency to a server can lead to a significant decrease in its performance, which in turn keeps concurrency elevated, making the collapse self-sustaining.

(The parameters I used for the benchmark demonstrated collapse around concurrency of 32 to 48, but they can be tuned to show collapse at other points.)

sync.Mutex collapse

This next image shows additional results.

Using a channel as a lock and building a scalable / non-collapsing MCS lock both have poor but consistent performance. They show linear increase until concurrency and speedup are both at 5 or 6, and then continue to have speedup of 5 up through concurrency of 96.

However, adding an explicit call to runtime.Gosched when releasing the lock results in consistent and good performance (similar to Go 1.8's behavior described above) for all lock types: sync.Mutex, channel, and MCS.

alternative locks

$ cat ./lock_test.go
package repro

import (
	"flag"
	"fmt"
	"runtime"
	"strconv"
	"strings"
	"sync"
	"sync/atomic"
	"testing"
	"unsafe"
)

var (
	cost        = flag.Int("cost", 1000, "loop cost")
	ratio       = flag.Float64("ratio", 0.01, "fraction of loops with lock held")
	concurrency = flag.String("concurrency", "", "number of concurrent workers")

	sink interface{}
)

func BenchmarkLock(b *testing.B) {

	const loopCount = 1000

	testcase := func(workers int, mu sync.Locker, loopCost, withLock int, sem semaphore) func(b *testing.B) {
		return func(b *testing.B) {
			var (
				wg sync.WaitGroup

				buf = make([]byte, loopCost)
				out byte

				start = make(chan struct{})
			)

			for i := 0; i < workers; i++ {
				work := b.N / workers
				if i < b.N-(b.N/workers)*workers {
					// division rounded down, now's the time to round up
					work++
				}

				wg.Add(1)
				go func() {
					defer wg.Done()

					<-start

					sem.Acquire()
					defer sem.Release()

					for i := 0; i < work; i++ {
						var v byte
						for k := 0; k < loopCount-withLock; k++ {
							for l := range buf {
								v += buf[l]
							}
						}
						mu.Lock()
						for k := 0; k < withLock; k++ {
							for l := range buf {
								v += buf[l]
							}
						}
						out = v
						mu.Unlock()
					}
				}()
			}

			b.ResetTimer()
			close(start)
			wg.Wait()

			sink = out
		}
	}

	suite := func(newLock func() sync.Locker) func(b *testing.B) {
		return func(b *testing.B) {
			workers := 4 * runtime.GOMAXPROCS(0)
			cost, ratio := *cost, *ratio
			var concurrencies []int
			for _, v := range strings.Split(*concurrency, ",") {
				n, err := strconv.Atoi(v)
				if err == nil && n > 0 {
					concurrencies = append(concurrencies, n)
				}
			}
			if len(concurrencies) == 0 {
				concurrencies = []int{workers}
			}
			withLock := int(ratio * loopCount)
			for _, concurrency := range concurrencies {
				b.Run(fmt.Sprintf("cost=%d,ratio=%0.3f,concurrency=%d", cost, float64(withLock)/loopCount, concurrency),
					func(b *testing.B) {
						b.Run("basic", testcase(workers, func() sync.Locker {
							mu := newLock()
							mu = &lockTracer{Locker: mu}
							return mu
						}(), cost, withLock, make(semaphore, concurrency)))
						b.Run("gosched", testcase(workers, func() sync.Locker {
							mu := newLock()
							mu = &lockTracer{Locker: mu}
							mu = &lockYielder{Locker: mu}
							return mu
						}(), cost, withLock, make(semaphore, concurrency)))
					})
			}
		}
	}

	b.Run("sync.Mutex", suite(func() sync.Locker { return new(sync.Mutex) }))
	b.Run("mcs", suite(func() sync.Locker { return new(mcs) }))
	b.Run("chan", suite(func() sync.Locker { return newChannelLock() }))
}

type semaphore chan struct{}

func (s semaphore) Acquire() {
	if s != nil {
		s <- struct{}{}
	}
}

func (s semaphore) Release() {
	if s != nil {
		<-s
	}
}

type lockYielder struct {
	sync.Locker
}

func (ly *lockYielder) Lock()   { ly.Locker.Lock() }
func (ly *lockYielder) Unlock() { ly.Locker.Unlock(); runtime.Gosched() }

type lockTracer struct {
	sync.Locker
	handoffEnd func()
}

type channelLock chan struct{}

func newChannelLock() channelLock { return make(chan struct{}, 1) }
func (mu channelLock) Lock()      { mu <- struct{}{} }
func (mu channelLock) Unlock()    { <-mu }

type mcs struct {
	tail *qnode

	holder *qnode // protected by the lock itself
}

type qnode struct {
	next *qnode

	wg sync.WaitGroup
}

func (mu *mcs) Lock() {
	node := mu.lock()
	mu.holder = node
}

func (mu *mcs) Unlock() {
	node := mu.holder
	mu.holder = nil
	mu.unlock(node)
}

func (mu *mcs) lock() *qnode {
	self := new(qnode)
	self.wg.Add(1)
	prev := (*qnode)(atomic.SwapPointer((*unsafe.Pointer)(unsafe.Pointer(&mu.tail)), unsafe.Pointer(self)))
	if prev != nil {
		atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&prev.next)), unsafe.Pointer(self)) // prev.next = self
		// wait for prev node to unlock us
		self.wg.Wait()
	}
	return self
}

func (mu *mcs) unlock(node *qnode) {
	for {
		next := (*qnode)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&node.next))))
		if next != nil {
			// known successor, unblock their call to Lock
			next.wg.Done()
			return
		}

		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&mu.tail)), unsafe.Pointer(node), nil) {
			// no known successor, and in fact there's no successor at all
			return
		}

		// successor hasn't finished adding themselves to the queue
	}
}
$ cat ./lock_trace_go111_test.go
// +build go1.11

package repro

import (
	"context"
	"runtime/trace"
)

func (lt *lockTracer) Lock() {
	wait := trace.StartRegion(context.Background(), "wait-lock")
	lt.Locker.Lock()
	wait.End()
	if lt.handoffEnd != nil {
		lt.handoffEnd()
		lt.handoffEnd = nil
	}
}

func (lt *lockTracer) Unlock() {
	_, handoff := trace.NewTask(context.Background(), "handoff")
	lt.handoffEnd = handoff.End
	wait := trace.StartRegion(context.Background(), "wait-unlock")
	lt.Locker.Unlock()
	wait.End()
}
