jemalloc: deadlocks under go1.12 (at least under darwin) #35620

@tbg

The following script wedges a single-node cluster most of the time (if not immediately, then after a dozen or so seconds). I don't think the specific script matters; the problem shows up with Go 1.12 and not with earlier versions.

#!/usr/bin/env bash
set -euo pipefail

export COCKROACH_ENGINE_MAX_SYNC_DURATION=60s

# Start from a clean slate: kill any running nodes and wipe old store directories.
killall -9 cockroach || true
rm -rf cockroach-data* || true
# Single node by default; swap in the commented-out loop header for four nodes.
#for i in 0 1 2 3; do
for i in 0; do
  ./cockroach start --vmodule=rocksdb=5 --max-offset 10ms --insecure --host 127.0.0.1 --port $((26257+i)) --http-port $((8080+i)) --background --store "cockroach-data${i}" --join 127.0.0.1:26257
  # Only the first node initializes the cluster.
  if [ "$i" -eq 0 ]; then ./cockroach init --insecure; fi
done
# Disable range merges, then split the table at 1000 keys.
echo "
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
CREATE TABLE IF NOT EXISTS data (id INT PRIMARY KEY);
ALTER TABLE data SPLIT AT SELECT i FROM generate_series(1, 1000) AS g(i);
" | ./cockroach sql --insecure
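
Since the wedge sometimes only shows up after a dozen or so seconds, it can help to probe the node with a deadline rather than watch it by hand. The following is a minimal sketch, not part of the original repro: `run_with_deadline` is a hypothetical helper name, and the exit code 124 for "gave up" is an arbitrary choice borrowed from GNU `timeout` (which macOS does not ship by default).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): run a command in the
# background and kill it if it has not finished after DEADLINE_SECS seconds.
# Returns the command's own exit status, or 124 if it had to be killed.
run_with_deadline() {
  local deadline_secs=$1
  shift
  "$@" &
  local cmd_pid=$!
  local waited=0
  # kill -0 only checks whether the process still exists.
  while kill -0 "$cmd_pid" 2>/dev/null; do
    if [ "$waited" -ge "$deadline_secs" ]; then
      kill -9 "$cmd_pid" 2>/dev/null || true
      wait "$cmd_pid" 2>/dev/null || true
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # The command finished on its own; report its exit status.
  wait "$cmd_pid"
}
```

One possible use after the script above has run would be `run_with_deadline 10 ./cockroach sql --insecure -e 'SELECT 1'`; a 124 result would suggest the node is no longer answering queries, though a slow machine could produce the same result.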

I've looked at the deadlocks in lldb and they look like the following (the stuck threads are in various C calls, but they all look similar once they hit jemalloc):

(lldb) bt
* thread #29
  * frame #0: 0x00007fff75e2c36a libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff75ed8c57 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 140
    frame #2: 0x0000000005fd3bbb cockroach`je_arena_choose_hard [inlined] je_malloc_mutex_lock at mutex.h:97 [opt]
    frame #3: 0x0000000005fd3baf cockroach`je_arena_choose_hard at jemalloc.c:614 [opt]
    frame #4: 0x000000000601d106 cockroach`je_tcache_get_hard [inlined] je_arena_choose_impl at jemalloc_internal.h:909 [opt]
    frame #5: 0x000000000601d0fc cockroach`je_tcache_get_hard [inlined] je_arena_choose at jemalloc_internal.h:918 [opt]
    frame #6: 0x000000000601d0fc cockroach`je_tcache_get_hard at tcache.c:314 [opt]
    frame #7: 0x0000000005fd4cf1 cockroach`je_malloc [inlined] je_tcache_get at tcache.h:237 [opt]
    frame #8: 0x0000000005fd4cdd cockroach`je_malloc [inlined] je_ialloc at jemalloc_internal.h:1079 [opt]
    frame #9: 0x0000000005fd4cdd cockroach`je_malloc [inlined] ialloc_body at jemalloc.c:1605 [opt]
    frame #10: 0x0000000005fd4c68 cockroach`je_malloc at jemalloc.c:1644 [opt]
    frame #11: 0x00007fff75e9e807 libsystem_malloc.dylib`malloc_zone_malloc + 103
    frame #12: 0x00007fff75e9e783 libsystem_malloc.dylib`malloc + 24
    frame #13: 0x00007fff73418f48 libc++abi.dylib`operator new(unsigned long) + 40
    frame #14: 0x0000000006070d12 cockroach`::DBNewBatch() at db.cc:479 [opt]
    frame #15: 0x0000000005f3d9e1 cockroach`_cgo_399a283c451a_Cfunc_DBNewBatch + 33
    frame #16: 0x000000000405e180 cockroach`runtime.asmcgocall + 112
    frame #17: 0x0000000004033930 cockroach`runtime.startTheWorldWithSema + 624
    frame #18: 0x0000000004caa3fe cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBNewBatch + 78
    frame #19: 0x0000000004cc10e9 cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).ensureBatch.func1 + 121
    frame #20: 0x0000000004cb35cc cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).ensureBatch + 60
    frame #21: 0x0000000004cb4576 cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).NewIterator + 534
    frame #22: 0x0000000004c96648 cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCGet + 104
    frame #23: 0x0000000004c96232 cockroach`github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCGetProto + 226
    frame #24: 0x0000000004e0bd6c cockroach`github.com/cockroachdb/cockroach/pkg/storage/abortspan.(*AbortSpan).Get + 220
    frame #25: 0x000000000530c4e5 cockroach`github.com/cockroachdb/cockroach/pkg/storage.checkIfTxnAborted + 197
    frame #26: 0x000000000531cd6f cockroach`github.com/cockroachdb/cockroach/pkg/storage.evaluateBatch + 4015
    frame #27: 0x0000000005352a3b cockroach`github.com/cockroachdb/cockroach/pkg/storage.(*Replica).evaluateWriteBatchWithLocalRetries + 507

What sticks out to my greenhorn eye in the release notes for Go 1.12 is that

libSystem is now used when making syscalls on Darwin, ensuring forward-compatibility with future versions of macOS and iOS. The switch to libSystem triggered additional App Store checks for private API usage. Since it is considered private, syscall.Getdirentries now always fails with ENOSYS on iOS.

We're using a fairly ancient version of jemalloc (#17013), so if this is a bug in that version, it's been around forever and is only being tickled reliably now.

Either way, we need to figure this out before we move to Go 1.12. A first step should be to check whether this also applies to Linux. My expectation is that Linux binaries work just fine, but I haven't verified this yet.
