I190814 13:16:32.434790 1641 storage/store_snapshot.go:786 [replicate,s1,r6/1:/Table/{SystemCon…-11}] sending LEARNER snapshot 0243daa0 at applied index 16
I190814 13:16:32.435061 1641 storage/store_snapshot.go:829 [replicate,s1,r6/1:/Table/{SystemCon…-11}] streamed snapshot to (n2,s2):2: kv pairs: 44, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190814 13:16:32.435720 2327 storage/replica_raftstorage.go:823 [s2,r6/2:{-}] applying LEARNER snapshot at index 16 (id=0243daa0, encoded size=8206, 1 rocksdb batches, 0 log entries)
I190814 13:16:32.436232 2327 storage/replica_raftstorage.go:829 [s2,r6/2:/Table/{SystemCon…-11}] applied LEARNER snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I190814 13:16:32.437799 1641 storage/replica_command.go:1198 [replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (ADD_REPLICA (n2,s2):2): existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=1]
I190814 14:29:54.295686 69 gossip/gossip.go:396 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:32816" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0
W190814 14:29:54.303664 1805 storage/store.go:1532 [s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W190814 14:29:54.374987 69 gossip/gossip.go:1500 [n2] no incoming or outgoing connections
I190814 14:29:54.375132 69 gossip/gossip.go:396 [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:46215" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0
I190814 14:29:54.375260 2393 gossip/client.go:126 [n2] started gossip client to 127.0.0.1:32816
W190814 14:29:54.489461 69 gossip/gossip.go:1500 [n3] no incoming or outgoing connections
I190814 14:29:54.489523 69 gossip/gossip.go:396 [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:36787" > attrs:<> locality:<> ServerVersion:<major_val:0 minor_val:0 patch:0 unstable:0 > build_tag:"" started_at:0
I190814 14:29:54.489717 2886 gossip/client.go:126 [n3] started gossip client to 127.0.0.1:32816
I190814 14:29:54.492072 69 storage/client_test.go:493 gossip network initialized
I190814 14:29:54.493488 2235 storage/store_snapshot.go:784 [replicate,s1,r6/1:/Table/{SystemCon…-11}] sending PREEMPTIVE snapshot b207e0e8 at applied index 13
I190814 14:29:54.493643 2235 storage/store_snapshot.go:827 [replicate,s1,r6/1:/Table/{SystemCon…-11}] streamed snapshot to (n2,s2):?: kv pairs: 43, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190814 14:29:54.494146 2736 storage/replica_raftstorage.go:810 [s2,r6/?:{-}] applying PREEMPTIVE snapshot at index 13 (id=b207e0e8, encoded size=7969, 1 rocksdb batches, 0 log entries)
I190814 14:29:54.494378 69 util/stop/stopper.go:544 quiescing; tasks left:
1 [async] intent_resolver_ir_batcher
1 [async] intent_resolver_gc_batcher
I190814 14:29:54.494400 69 util/stop/stopper.go:544 quiescing; tasks left:
1 [async] intent_resolver_gc_batcher
I190814 14:29:54.518555 3241 storage/node_liveness.go:476 [liveness-hb] heartbeat failed on epoch increment; retrying
I190814 14:29:54.519511 2736 storage/replica_raftstorage.go:816 [s2,r6/?:/Table/{SystemCon…-11}] applied PREEMPTIVE snapshot in 25ms [clear=0ms batch=0ms entries=0ms commit=24ms]
I190814 14:29:54.520617 2235 storage/replica_command.go:882 [replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, next=2, gen=0]
I190814 14:29:54.523179 2235 storage/replica_raft.go:397 [replicate,s1,r6/1:/Table/{SystemCon…-11},txn=d4bd9269] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I190814 14:29:54.524870 2235 storage/store_snapshot.go:784 [replicate,s1,r6/1:/Table/{SystemCon…-11}] sending PREEMPTIVE snapshot 9f6cfeed at applied index 15
I190814 14:29:54.525063 2235 storage/store_snapshot.go:827 [replicate,s1,r6/1:/Table/{SystemCon…-11}] streamed snapshot to (n3,s3):?: kv pairs: 45, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190814 14:29:54.526385 2792 storage/replica_raftstorage.go:810 [s3,r6/?:{-}] applying PREEMPTIVE snapshot at index 15 (id=9f6cfeed, encoded size=8151, 1 rocksdb batches, 0 log entries)
I190814 14:29:54.527153 2792 storage/replica_raftstorage.go:816 [s3,r6/?:/Table/{SystemCon…-11}] applied PREEMPTIVE snapshot in 1ms [clear=0ms batch=0ms entries=0ms commit=1ms]
I190814 14:29:54.528129 2235 storage/replica_command.go:882 [replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (ADD_REPLICA (n3,s3):3): read existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, (n2,s2):2, next=3, gen=0]
F190814 14:29:54.529882 3204 storage/replica_raft.go:698 [s2,r6/2:{-}] setting non-zero HardState.Commit on uninitialized replica [n2,s2,r6/2:{-}]. HS={Term:6 Vote:0 Commit:15 XXX_unrecognized:[]}
While experimenting with shortening the closed timestamp target_duration in #39643, I encountered failures running TestNodeLivenessSetDecommissioning under stress or stressrace: the test hits a fatal error. The full logs are attached below. I've reproduced this failure as far back as 9c8e44e, so it is not new. It reproduces reliably, but generally takes thousands of runs under stress.
Off of 1a48f10 (first log excerpt above):
Off of 9c8e44e (second log excerpt above):
TestNodeLivenessSetDecommissioning-logs.txt