Skip to content

c-deps: bump jemalloc to pick up Darwin deadlock fix#35789

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
petermattis:pmattis/bump-jemalloc
Mar 17, 2019
Merged

c-deps: bump jemalloc to pick up Darwin deadlock fix#35789
craig[bot] merged 1 commit intocockroachdb:masterfrom
petermattis:pmattis/bump-jemalloc

Conversation

@petermattis
Copy link
Collaborator

Under go1.12, cockroach can deadlock deep within jemalloc. An upstream
change in jemalloc fixed this in the 5.1.0 release, but we're hesitant
to bump the jemalloc version as the last time we tried we ran into slow
memory leaks. It is unclear why this deadlock started to appear in
go1.12 but does not manifest with earlier versions of go. Perhaps the
switch to perform system calls via libSystem is to blame. Perhaps
there was a change to how fork is handled in go1.12.

Fixes #35620

Release note: None

@petermattis petermattis requested review from a team, bdarnell and tbg March 15, 2019 15:16
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@petermattis
Copy link
Collaborator Author

Depends on cockroachdb/jemalloc#1. Note that cockroachdb/jemalloc#1 only changes code that is used on Darwin, so this is relatively safe (unless we suspect Darwin builds are being used in production somewhere). The only concern would be if this disrupts development on Mac's somehow.

@petermattis
Copy link
Collaborator Author

The alternative to this change would be to disable jemalloc on Darwin.

@tbg
Copy link
Member

tbg commented Mar 15, 2019

LGTM. Thanks for figuring this out.

@petermattis
Copy link
Collaborator Author

@bdarnell Let me know your thoughts on this.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm surprised this makes a difference because we don't use fork() (and IIRC go is generally not compatible with fork-without-exec), but if we've empirically verified that it fixes the problem I'm happier backporting this than upgrading jemalloc at this stage.

@petermattis
Copy link
Collaborator Author

I'm surprised this makes a difference because we don't use fork()

We fork when using --background (don't we?), though an exec happens almost immediately. I'll note that I can reproduce the deadlock without this PR even without --background. I can't reproduce a deadlock at all with this patch. Hmm, looking at dtruss ./cockroach I see there are several fork system calls. Note sure where those are coming from.

@petermattis
Copy link
Collaborator Author

dtruss -t fork -s ./cockroach ... prints the stacktrace when the fork system call is invoked:

              libsystem_kernel.dylib`__fork+0xb
              cockroach`runtime.syscall+0x1f
              cockroach`runtime.asmcgocall+0x70
              cockroach`syscall.rawSyscall+0x4e
              cockroach`syscall.forkAndExecInChild+0xe2
              cockroach`syscall.forkExec+0x360
              cockroach`os.startProcess+0x262
              cockroach`os.StartProcess+0x7c
              cockroach`os/exec.(*Cmd).Start+0x4a1
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/internal/common.Invoke.CommandWithContext+0xf1
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/internal/common.(*Invoke).CommandWithContext+0x96
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/net.IOCountersWithContext+0x112
              cockroach`github.com/cockroachdb/cockroach/pkg/server/status.getSummedNetStats+0x69
              cockroach`github.com/cockroachdb/cockroach/pkg/server/status.(*RuntimeStatSampler).SampleEnvironment+0x3e7
              cockroach`github.com/cockroachdb/cockroach/pkg/server.(*Server).startSampleEnvironment.func2+0x19c
              cockroach`github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1+0xfb
              cockroach`runtime.goexit+0x1

Note that with go1.11 system calls on Darwin didn't use libSystem:

              cockroach`syscall.RawSyscall+0x31
              cockroach`syscall.forkExec+0x365
              cockroach`syscall.StartProcess+0x64
              cockroach`os.startProcess+0x213
              cockroach`os.StartProcess+0x7c
              cockroach`os/exec.(*Cmd).Start+0x4bc
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/internal/common.Invoke.CommandWithContext+0xd7
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/internal/common.(*Invoke).CommandWithContext+0x96
              cockroach`github.com/cockroachdb/cockroach/vendor/github.com/shirou/gopsutil/net.IOCountersWithContext+0x112
              cockroach`github.com/cockroachdb/cockroach/pkg/server/status.getSummedNetStats+0x74
              cockroach`github.com/cockroachdb/cockroach/pkg/server/status.NewRuntimeStatSampler+0x5be
              cockroach`github.com/cockroachdb/cockroach/pkg/server.NewServer+0x1f34
              cockroach`github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2+0x64
              cockroach`github.com/cockroachdb/cockroach/pkg/cli.runStart.func3+0x11c
              cockroach`runtime.goexit+0x1

I wonder if that makes a difference.

Under go1.12, `cockroach` can deadlock deep within jemalloc. An upstream
change in jemalloc fixed this in the 5.1.0 release, but we're hesitant
to bump the jemalloc version as the last time we tried we ran into slow
memory leaks. It is unclear why this deadlock started to appear in
go1.12 but does not manifest with earlier versions of go. Perhaps the
switch to perform system calls via `libSystem` is to blame. Perhaps
there was a change to how fork is handled in go1.12.

Fixes cockroachdb#35620

Release note: None
@petermattis petermattis force-pushed the pmattis/bump-jemalloc branch from d67feb4 to e252684 Compare March 17, 2019 00:24
@bdarnell
Copy link
Contributor

Oh, I think I get it now. I thought fork+exec was safe because no memory is allocated in the child process before the exec system call. But thanks to a post-fork hook used here, there is activity in the malloc system even if the code between fork+exec is restricted to non-allocating signal-safe functions.

I also forgot about how some of our metric-collection libraries work by shelling out to subprocesses on macos, so we do fork+exec all the time even without --background.

@petermattis
Copy link
Collaborator Author

bors r=tbg,bdarnell

craig bot pushed a commit that referenced this pull request Mar 17, 2019
35789: c-deps: bump jemalloc to pick up Darwin deadlock fix r=tbg,bdarnell a=petermattis

Under go1.12, `cockroach` can deadlock deep within jemalloc. An upstream
change in jemalloc fixed this in the 5.1.0 release, but we're hesitant
to bump the jemalloc version as the last time we tried we ran into slow
memory leaks. It is unclear why this deadlock started to appear in
go1.12 but does not manifest with earlier versions of go. Perhaps the
switch to perform system calls via `libSystem` is to blame. Perhaps
there was a change to how fork is handled in go1.12.

Fixes #35620

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig
Copy link
Contributor

craig bot commented Mar 17, 2019

Build succeeded

@craig craig bot merged commit e252684 into cockroachdb:master Mar 17, 2019
@petermattis petermattis deleted the pmattis/bump-jemalloc branch March 17, 2019 16:52
@petermattis
Copy link
Collaborator Author

Should this be backported to 2.1, or are we close enough to the 19.1 release to not bother?

@bdarnell
Copy link
Contributor

I think this should be backported (and also the PRs linked in #35197). The next release from the 2.1 branch may not come until after 19.1.0 (so they'll never go into homebrew), but there will still be more 2.1 releases and we'll want them to be able to build with newer versions of go (we've done this for past releases)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

jemalloc: deadlocks under go1.12 (at least under darwin)

4 participants