
stability: OOM on delta #10427

@bdarnell

Description


One node on delta (running 9ce8833) ran out of memory, with a large spike in goroutine count right before the end. This node had also experienced problems with the previous build, which we had blamed on prevote being enabled.

[graph: goroutine count spiking shortly before the OOM]

The logs show a lot of intent resolutions being cancelled in the command queue:

W161103 01:21:41.071928 22449127 storage/replica.go:1240  [n8,s16,r2341/7:/Table/55/1/197{22994…-34433…}] context canceled while in command queue: ResolveIntent [/Table/55/1/1973297837063008645/"986e18e6-d1fc-48b9-8153-27e0148c025d"/5778676/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min)

and raftMu being held for over a second at a time (across multiple ranges).

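The general shape of that warning is a command waiting on overlapping commands ahead of it in the queue while its caller's context gets canceled. A minimal sketch of that pattern (the waitForPrereqs helper is hypothetical, not the actual beginCmds code):

    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    // waitForPrereqs is a hypothetical sketch of a command waiting for
    // overlapping commands ahead of it while still honoring cancellation of
    // its own context. It only illustrates the general pattern behind a
    // "context canceled while in command queue" style warning.
    func waitForPrereqs(ctx context.Context, prereqs []<-chan struct{}) error {
    	for _, done := range prereqs {
    		select {
    		case <-done:
    			// The earlier overlapping command finished; keep waiting on the rest.
    		case <-ctx.Done():
    			// The caller gave up (e.g. an intent resolution whose parent
    			// request already returned), so bail out of the queue.
    			return fmt.Errorf("context canceled while in command queue: %v", ctx.Err())
    		}
    	}
    	return nil
    }

    func main() {
    	// A prerequisite that never finishes, and a caller that cancels after 100ms.
    	blocked := make(chan struct{})
    	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    	defer cancel()

    	if err := waitForPrereqs(ctx, []<-chan struct{}{blocked}); err != nil {
    		fmt.Println(err)
    	}
    }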

Many goroutines are blocked at replica.go:1221 (waiting for cmdQMu) for very long periods (there are 10k+ of these, but panicparse isn't grouping them all together for some reason):

1173: semacquire [15~264 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    sync                sema.go:47                  runtime_Semacquire(#42751)
    sync                mutex.go:85                 (*Mutex).Lock(#42750)
    syncutil            timedmutex.go:84            (*TimedMutex).Lock(#32315)
    storage             replica.go:1221             (*Replica).beginCmds(#32314, #254751, *, *, *, *, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#1687, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, *, 0, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, *, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, *, *, *, *, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, *, *, 0, 0xac1, 0xac1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, *, *, 0, 0, 0, 0, *)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, *, *, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, *, *, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(*, #691, #110, *, *)
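Each of those blocked goroutines pins its stack and whatever request state it captured, so a long-held cmdQMu translates directly into memory growth. A minimal demonstration of the pileup, using a plain sync.Mutex as a stand-in for cmdQMu (not CockroachDB code):

    package main

    import (
    	"fmt"
    	"runtime"
    	"sync"
    	"time"
    )

    func main() {
    	var cmdQMu sync.Mutex

    	cmdQMu.Lock() // simulate a holder stuck in a slow critical section

    	// Each incoming "request" blocks here, like the goroutines parked at
    	// replica.go:1221 in the dump above.
    	for i := 0; i < 10000; i++ {
    		go func() {
    			cmdQMu.Lock()
    			cmdQMu.Unlock()
    		}()
    	}

    	time.Sleep(time.Second)
    	fmt.Println("goroutines blocked behind the mutex:", runtime.NumGoroutine())

    	cmdQMu.Unlock() // releasing the lock lets the backlog drain
    	time.Sleep(time.Second)
    	fmt.Println("goroutines after drain:", runtime.NumGoroutine())
    }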

It looks like the command queue has gotten large:

1: runnable [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    interval            interval.go:678             (*Node).doMatch(#54501, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34715)
    interval            interval.go:692             (*Node).doMatch(#54500, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34714)
    interval            interval.go:680             (*Node).doMatch(#54502, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34716)
    interval            interval.go:680             (*Node).doMatch(#54503, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34717)
    interval            interval.go:692             (*Node).doMatch(#54499, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34713)
    interval            interval.go:692             (*Node).doMatch(#54498, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34712)
    interval            interval.go:680             (*Node).doMatch(#54504, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34718)
    interval            interval.go:692             (*Node).doMatch(#44087, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34711)
    interval            interval.go:692             (*Node).doMatch(#76407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34710)
    interval            interval.go:692             (*Node).doMatch(#64129, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #4909)
    interval            interval.go:692             (*Node).doMatch(#41627, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #28365)
    interval            interval.go:680             (*Node).doMatch(#19169, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34719)
    interval            interval.go:692             (*Node).doMatch(#40511, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #35269)
    interval            interval.go:680             (*Node).doMatch(#47831, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #30706)
    interval            interval.go:680             (*Node).doMatch(#58539, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #13753)
    interval            interval.go:692             (*Node).doMatch(#10407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #58350)
    interval            interval.go:680             (*Node).doMatch(#46061, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #9663)
    interval            interval.go:673             (*Tree).DoMatching(#18943, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, 0x1)
    storage             command_queue.go:338        (*CommandQueue).getOverlaps(#18943, #64177, 0, 0x1, #64177, 0x1, 0x1, #244168, 0x287c7, 0x2e400)
    storage             command_queue.go:177        (*CommandQueue).getWait(#18943, #104238, #57186, 0x1f7, 0x1f7, 0, 0, 0)
    storage             replica.go:1222             (*Replica).beginCmds(#32008, #254751, #81445, #72445, #22867, #49198, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#873, #254751, #81444, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, #81443, 0, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, #22868, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, #81441, #72444, #81441, #46953, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, #81441, #72444, 0, 0xad1, 0xad1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, #81440, #69183, 0, 0, 0, 0, #255031)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, #1601, #45897, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, #1601, #45897, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(#1519, #691, #110, #1601, #45897)

This is similar to what I saw with race builds in #10388.
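Every command entering the queue is matched against the commands already in it, so a queue that keeps accumulating canceled-but-not-yet-removed commands makes each new arrival's overlap scan more expensive. A naive sketch of that cost, using a linear scan over a slice of spans as a stand-in for the real interval tree behind getOverlaps:

    package main

    import "fmt"

    // span is a simple [start, end) key interval.
    type span struct{ start, end string }

    func overlaps(a, b span) bool {
    	return a.start < b.end && b.start < a.end
    }

    // getWait is a naive sketch (not the interval-tree-backed command queue)
    // of the work implied by the stacks above: every incoming command is
    // matched against the commands already queued, so the larger the queue,
    // the more each new arrival pays just to find its prerequisites.
    func getWait(queue []span, incoming span) (prereqs int) {
    	for _, existing := range queue {
    		if overlaps(existing, incoming) {
    			prereqs++
    		}
    	}
    	return prereqs
    }

    func main() {
    	queue := make([]span, 0, 100000)
    	for i := 0; i < 100000; i++ {
    		queue = append(queue, span{start: "/Table/55", end: "/Table/56"})
    	}
    	fmt.Println("prerequisites for one new write:",
    		getWait(queue, span{start: "/Table/55/1", end: "/Table/55/2"}))
    }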

Metadata

Labels

S-1-stability (severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)
