stability: entire delta cluster stuck, not serving any SQL traffic #10602
Description
Delta has been failing to serve requests for most of the last 19 hours, having only 3 good hours from 8-11 UTC this morning.
The logs are full of "context deadline exceeded" errors.
There are a ton (> 1000) of repeated logs like this in a row, all for the same range/replica, spammed so quickly that each comes less than a hundred microseconds after the last:
W161110 16:39:04.701610 506 storage/gc_queue.go:218 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] unable to resolve intents of committed txn on gc: context deadline exceeded
Those are followed by a ton of errors about an inability to push a transaction, with the context being the same range/replica. These are spammed even faster, coming tens of microseconds apart:
W161110 16:39:04.727940 4935162 storage/gc_queue.go:628 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn id=cf175f6e key=/Table/55/1/2699716940960131706/"bd72518a-c1ae-4f98-a1d3-8c40d4f6fe43"/7751851/0 rw=false pri=0.00868472 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478513462.086455407,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false failed: context deadline exceeded
In the one case I looked at most closely, a different error was mixed in every couple hundred lines:
W161110 16:39:04.791403 4935416 storage/gc_queue.go:628 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn "sql/executor.go:546 sql txn implicit" id=be657d16 key=/Table/55/1/8718809091224052977/"8a16506f-3592-4c78-a50d-b1b83f015480"/6148471/0 rw=true pri=0.01430796 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478241838.382343181,0 orig=1478241838.382343181,0 max=1478241838.456109681,0 wto=false rop=false failed: context deadline exceeded
Once that stops, there are a bunch of "transferring raft leadership" messages about different ranges before the pattern starts over again for a different range/replica.
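For anyone who wants to quantify the spam rather than eyeball it, here is a rough sketch of a script that counts repeated warnings per source location and range/replica and reports the smallest gap between consecutive occurrences. The regex is an assumption based only on the log prefix visible in the lines above (severity and date, time with microseconds, a numeric ID, `file:line`, bracketed tags); `parse_line`, `summarize`, and `SAMPLE` are hypothetical names, and the sample messages are abbreviated copies of the lines quoted in this issue.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed prefix layout, inferred from the lines quoted above:
#   W161110 16:39:04.701610 506 storage/gc_queue.go:218 [n10,gc,s19,r5444/7:...] msg
LOG_RE = re.compile(
    r"^[IWEF](?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\d+ (?P<src>\S+:\d+) \[(?P<tags>[^\]]*)\]"
)
# Range/replica tag such as "r5444/7" at the start of a comma-separated tag.
RANGE_RE = re.compile(r"(?:^|,)(r\d+/\d+)")


def parse_line(line):
    """Return (timestamp, source location, range/replica) or None if no match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["date"] + " " + m["time"], "%y%m%d %H:%M:%S.%f")
    rng = RANGE_RE.search(m["tags"])
    return ts, m["src"], rng.group(1) if rng else None


def summarize(lines):
    """Map (source, range) -> (line count, smallest gap in microseconds or None)."""
    counts = Counter()
    last_seen = {}
    min_gap = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not look like log entries
        ts, src, rng = parsed
        key = (src, rng)
        counts[key] += 1
        if key in last_seen:
            gap = (ts - last_seen[key]) // timedelta(microseconds=1)
            min_gap[key] = min(gap, min_gap.get(key, gap))
        last_seen[key] = ts
    return {key: (n, min_gap.get(key)) for key, n in counts.items()}


# Abbreviated copies of the three lines quoted in this issue (messages shortened).
SAMPLE = [
    "W161110 16:39:04.701610 506 storage/gc_queue.go:218 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] unable to resolve intents of committed txn on gc: context deadline exceeded",
    "W161110 16:39:04.727940 4935162 storage/gc_queue.go:628 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn ... failed: context deadline exceeded",
    "W161110 16:39:04.791403 4935416 storage/gc_queue.go:628 [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn ... failed: context deadline exceeded",
]

if __name__ == "__main__":
    for (src, rng), (count, gap_us) in sorted(summarize(SAMPLE).items()):
        print(src, rng, count, gap_us)
```

The three sample lines are not adjacent in the real log, so the gap computed from them says nothing about the actual spam rate. Feeding the full log file in (for example `summarize(open(path))`, with `path` pointing at the node's log) would be needed to confirm the sub-100-microsecond spacing described above.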
I'll check out a profile of the node next.