stability: OOM on delta #10427
Closed
Labels
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting
Description
One node on delta (running 9ce8833) ran out of memory, with a large spike in goroutine count right before the end. This node had also experienced problems with the previous build (when prevote was enabled).
The logs show a lot of intent resolutions being cancelled in the command queue:
W161103 01:21:41.071928 22449127 storage/replica.go:1240 [n8,s16,r2341/7:/Table/55/1/197{22994…-34433…}] context canceled while in command queue: ResolveIntent [/Table/55/1/1973297837063008645/"986e18e6-d1fc-48b9-8153-27e0148c025d"/5778676/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min)
and raftMu being held for over a second at a time (across multiple ranges).
Many goroutines are blocked at replica.go:1221 (waiting for cmdQMu) for very long times. There are 10k+ of these, but panicparse isn't grouping them all together for some reason:
1173: semacquire [15~264 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
sync sema.go:47 runtime_Semacquire(#42751)
sync mutex.go:85 (*Mutex).Lock(#42750)
syncutil timedmutex.go:84 (*TimedMutex).Lock(#32315)
storage replica.go:1221 (*Replica).beginCmds(#32314, #254751, *, *, *, *, 0)
storage replica.go:1628 (*Replica).tryAddWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
storage replica.go:1603 (*Replica).addWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
storage replica.go:1104 (*Replica).Send(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
storage store.go:2333 (*Store).Send(#1687, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
storage stores.go:187 (*Stores).Send(#804, #254751, *, 0, 0, #246, 0x7, 0x8ec, 0, 0, ...)
server node.go:843 (*Node).batchInternal.func1(0, 0)
stop stopper.go:245 (*Stopper).RunTaskWithErr(#751, *, 0, 0)
server node.go:854 (*Node).batchInternal(#519, #254751, *, *, *, *, 0x603fa7)
server node.go:874 (*Node).Batch(#519, #254751, *, *, 0, 0xac1, 0xac1)
roachpb api.pb.go:1512 _Internal_Batch_Handler(#41, #519, #254751, *, *, 0, 0, 0, 0, *)
grpc server.go:611 (*Server).processUnaryRPC(#691, #110, *, *, #726, #75, 0, 0, 0)
grpc server.go:769 (*Server).handleStream(#691, #110, *, *, 0)
grpc server.go:422 (*Server).serveStreams.func1.1(*, #691, #110, *, *)
It looks like the command queue has gotten large:
1: runnable [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
interval interval.go:678 (*Node).doMatch(#54501, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34715)
interval interval.go:692 (*Node).doMatch(#54500, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34714)
interval interval.go:680 (*Node).doMatch(#54502, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34716)
interval interval.go:680 (*Node).doMatch(#54503, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34717)
interval interval.go:692 (*Node).doMatch(#54499, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34713)
interval interval.go:692 (*Node).doMatch(#54498, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34712)
interval interval.go:680 (*Node).doMatch(#54504, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34718)
interval interval.go:692 (*Node).doMatch(#44087, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34711)
interval interval.go:692 (*Node).doMatch(#76407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34710)
interval interval.go:692 (*Node).doMatch(#64129, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #4909)
interval interval.go:692 (*Node).doMatch(#41627, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #28365)
interval interval.go:680 (*Node).doMatch(#19169, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34719)
interval interval.go:692 (*Node).doMatch(#40511, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #35269)
interval interval.go:680 (*Node).doMatch(#47831, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #30706)
interval interval.go:680 (*Node).doMatch(#58539, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #13753)
interval interval.go:692 (*Node).doMatch(#10407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #58350)
interval interval.go:680 (*Node).doMatch(#46061, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #9663)
interval interval.go:673 (*Tree).DoMatching(#18943, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, 0x1)
storage command_queue.go:338 (*CommandQueue).getOverlaps(#18943, #64177, 0, 0x1, #64177, 0x1, 0x1, #244168, 0x287c7, 0x2e400)
storage command_queue.go:177 (*CommandQueue).getWait(#18943, #104238, #57186, 0x1f7, 0x1f7, 0, 0, 0)
storage replica.go:1222 (*Replica).beginCmds(#32008, #254751, #81445, #72445, #22867, #49198, 0)
storage replica.go:1628 (*Replica).tryAddWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
storage replica.go:1603 (*Replica).addWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
storage replica.go:1104 (*Replica).Send(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
storage store.go:2333 (*Store).Send(#873, #254751, #81444, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
storage stores.go:187 (*Stores).Send(#804, #254751, #81443, 0, 0, #245, 0x5, 0x8d4, 0, 0, ...)
server node.go:843 (*Node).batchInternal.func1(0, 0)
stop stopper.go:245 (*Stopper).RunTaskWithErr(#751, #22868, 0, 0)
server node.go:854 (*Node).batchInternal(#519, #254751, #81441, #72444, #81441, #46953, 0x603fa7)
server node.go:874 (*Node).Batch(#519, #254751, #81441, #72444, 0, 0xad1, 0xad1)
roachpb api.pb.go:1512 _Internal_Batch_Handler(#41, #519, #254751, #81440, #69183, 0, 0, 0, 0, #255031)
grpc server.go:611 (*Server).processUnaryRPC(#691, #110, #1601, #45897, #726, #75, 0, 0, 0)
grpc server.go:769 (*Server).handleStream(#691, #110, #1601, #45897, 0)
grpc server.go:422 (*Server).serveStreams.func1.1(#1519, #691, #110, #1601, #45897)
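The deep `doMatch` recursion above is the command queue's interval tree scanning for overlapping pending commands. The cost structure can be illustrated with a simplified, hypothetical model (brute force instead of a tree, and made-up names — the real code is `CommandQueue.getOverlaps` over `util/interval`): every new command pays for each pending command it overlaps, so a queue with thousands of entries makes each arrival expensive even when lookups themselves are fast.

```go
package main

import "fmt"

// span is a simplified half-open [start, end) key interval, standing in
// for a command's key span in the command queue.
type span struct{ start, end int }

// overlaps reports whether two half-open spans intersect.
func (a span) overlaps(b span) bool { return a.start < b.end && b.start < a.end }

// getOverlaps counts pending commands that a new command must wait on.
// With an interval tree the search is O(log n + k), but k (the number of
// actual overlaps) is irreducible: a wide command over a full queue
// still touches every entry.
func getOverlaps(pending []span, cmd span) int {
	n := 0
	for _, p := range pending {
		if p.overlaps(cmd) {
			n++
		}
	}
	return n
}

func main() {
	// 10,000 pending single-key writes on one range.
	pending := make([]span, 0, 10000)
	for i := 0; i < 10000; i++ {
		pending = append(pending, span{i, i + 1})
	}
	// A range-wide command overlaps all of them.
	fmt.Println(getOverlaps(pending, span{0, 10000})) // → 10000
}
```

That matches the picture in this issue: canceled intent resolutions keep churning the queue, each new command's overlap scan gets slower, waiters pile up on cmdQMu, and goroutine count climbs until the node runs out of memory.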
This is similar to what I saw with race builds in #10388.