
stability: OOM on delta #10427

@bdarnell

Description


One node on delta (running 9ce8833) ran out of memory, with a large spike in goroutine count right before the end. This node had also experienced problems with the previous build, which we had blamed on prevote being enabled.

[graph: goroutine count spiking shortly before the OOM]

The logs show a lot of intent resolutions being cancelled in the command queue:

W161103 01:21:41.071928 22449127 storage/replica.go:1240  [n8,s16,r2341/7:/Table/55/1/197{22994…-34433…}] context canceled while in command queue: ResolveIntent [/Table/55/1/1973297837063008645/"986e18e6-d1fc-48b9-8153-27e0148c025d"/5778676/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min)

and raftMu being held for over a second at a time (across multiple ranges).

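The general shape of that warning is a command waiting on overlapping commands ahead of it in the queue while its caller's context gets canceled. A minimal sketch of that pattern (the waitForPrereqs helper is hypothetical, not the actual beginCmds code):

    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    // waitForPrereqs is a hypothetical sketch of a command waiting for
    // overlapping commands ahead of it while still honoring cancellation of
    // its own context. It only illustrates the general pattern behind a
    // "context canceled while in command queue" style warning.
    func waitForPrereqs(ctx context.Context, prereqs []<-chan struct{}) error {
    	for _, done := range prereqs {
    		select {
    		case <-done:
    			// The earlier overlapping command finished; keep waiting on the rest.
    		case <-ctx.Done():
    			// The caller gave up (e.g. an intent resolution whose parent
    			// request already returned), so bail out of the queue.
    			return fmt.Errorf("context canceled while in command queue: %v", ctx.Err())
    		}
    	}
    	return nil
    }

    func main() {
    	// A prerequisite that never finishes, and a caller that cancels after 100ms.
    	blocked := make(chan struct{})
    	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    	defer cancel()

    	if err := waitForPrereqs(ctx, []<-chan struct{}{blocked}); err != nil {
    		fmt.Println(err)
    	}
    }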

Many goroutines are blocked at replica.go:1221 (waiting for cmdQMu) for very long periods (there are 10k+ of these, but panicparse isn't grouping them all together for some reason):

1173: semacquire [15~264 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    sync                sema.go:47                  runtime_Semacquire(#42751)
    sync                mutex.go:85                 (*Mutex).Lock(#42750)
    syncutil            timedmutex.go:84            (*TimedMutex).Lock(#32315)
    storage             replica.go:1221             (*Replica).beginCmds(#32314, #254751, *, *, *, *, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#1687, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, *, 0, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, *, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, *, *, *, *, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, *, *, 0, 0xac1, 0xac1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, *, *, 0, 0, 0, 0, *)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, *, *, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, *, *, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(*, #691, #110, *, *)
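Each of those blocked goroutines pins its stack and whatever request state it captured, so a long-held cmdQMu translates directly into memory growth. A minimal demonstration of the pileup, using a plain sync.Mutex as a stand-in for cmdQMu (not CockroachDB code):

    package main

    import (
    	"fmt"
    	"runtime"
    	"sync"
    	"time"
    )

    func main() {
    	var cmdQMu sync.Mutex

    	cmdQMu.Lock() // simulate a holder stuck in a slow critical section

    	// Each incoming "request" blocks here, like the goroutines parked at
    	// replica.go:1221 in the dump above.
    	for i := 0; i < 10000; i++ {
    		go func() {
    			cmdQMu.Lock()
    			cmdQMu.Unlock()
    		}()
    	}

    	time.Sleep(time.Second)
    	fmt.Println("goroutines blocked behind the mutex:", runtime.NumGoroutine())

    	cmdQMu.Unlock() // releasing the lock lets the backlog drain
    	time.Sleep(time.Second)
    	fmt.Println("goroutines after drain:", runtime.NumGoroutine())
    }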

It looks like the command queue has gotten large:

1: runnable [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    interval            interval.go:678             (*Node).doMatch(#54501, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34715)
    interval            interval.go:692             (*Node).doMatch(#54500, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34714)
    interval            interval.go:680             (*Node).doMatch(#54502, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34716)
    interval            interval.go:680             (*Node).doMatch(#54503, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34717)
    interval            interval.go:692             (*Node).doMatch(#54499, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34713)
    interval            interval.go:692             (*Node).doMatch(#54498, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34712)
    interval            interval.go:680             (*Node).doMatch(#54504, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34718)
    interval            interval.go:692             (*Node).doMatch(#44087, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34711)
    interval            interval.go:692             (*Node).doMatch(#76407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34710)
    interval            interval.go:692             (*Node).doMatch(#64129, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #4909)
    interval            interval.go:692             (*Node).doMatch(#41627, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #28365)
    interval            interval.go:680             (*Node).doMatch(#19169, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34719)
    interval            interval.go:692             (*Node).doMatch(#40511, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #35269)
    interval            interval.go:680             (*Node).doMatch(#47831, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #30706)
    interval            interval.go:680             (*Node).doMatch(#58539, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #13753)
    interval            interval.go:692             (*Node).doMatch(#10407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #58350)
    interval            interval.go:680             (*Node).doMatch(#46061, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #9663)
    interval            interval.go:673             (*Tree).DoMatching(#18943, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, 0x1)
    storage             command_queue.go:338        (*CommandQueue).getOverlaps(#18943, #64177, 0, 0x1, #64177, 0x1, 0x1, #244168, 0x287c7, 0x2e400)
    storage             command_queue.go:177        (*CommandQueue).getWait(#18943, #104238, #57186, 0x1f7, 0x1f7, 0, 0, 0)
    storage             replica.go:1222             (*Replica).beginCmds(#32008, #254751, #81445, #72445, #22867, #49198, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#873, #254751, #81444, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, #81443, 0, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, #22868, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, #81441, #72444, #81441, #46953, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, #81441, #72444, 0, 0xad1, 0xad1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, #81440, #69183, 0, 0, 0, 0, #255031)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, #1601, #45897, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, #1601, #45897, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(#1519, #691, #110, #1601, #45897)

This is similar to what I saw with race builds in #10388.
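Every command entering the queue is matched against the commands already in it, so a queue that keeps accumulating canceled-but-not-yet-removed commands makes each new arrival's overlap scan more expensive. A naive sketch of that cost, using a linear scan over a slice of spans as a stand-in for the real interval tree behind getOverlaps:

    package main

    import "fmt"

    // span is a simple [start, end) key interval.
    type span struct{ start, end string }

    func overlaps(a, b span) bool {
    	return a.start < b.end && b.start < a.end
    }

    // getWait is a naive sketch (not the interval-tree-backed command queue)
    // of the work implied by the stacks above: every incoming command is
    // matched against the commands already queued, so the larger the queue,
    // the more each new arrival pays just to find its prerequisites.
    func getWait(queue []span, incoming span) (prereqs int) {
    	for _, existing := range queue {
    		if overlaps(existing, incoming) {
    			prereqs++
    		}
    	}
    	return prereqs
    }

    func main() {
    	queue := make([]span, 0, 100000)
    	for i := 0; i < 100000; i++ {
    		queue = append(queue, span{start: "/Table/55", end: "/Table/56"})
    	}
    	fmt.Println("prerequisites for one new write:",
    		getWait(queue, span{start: "/Table/55/1", end: "/Table/55/2"}))
    }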

Metadata

Labels

S-1-stability (severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)
