stability: entire delta cluster stuck, not serving any SQL traffic #10602

@a-robinson

Delta has been failing to serve requests for most of the last 19 hours, with only 3 good hours, from 8-11 UTC this morning.

The logs are full of "context deadline exceeded" errors.

There are a ton (> 1000) of repeated logs like this in a row, all for the same range/replica, spammed so fast that each came less than a hundred microseconds after the last:

W161110 16:39:04.701610 506 storage/gc_queue.go:218  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] unable to resolve intents of committed txn on gc: context deadline exceeded

Those are followed by a ton of errors about an inability to push a transaction, with the context being the same range/replica. These are spammed even faster, coming tens of microseconds apart:

W161110 16:39:04.727940 4935162 storage/gc_queue.go:628  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn id=cf175f6e key=/Table/55/1/2699716940960131706/"bd72518a-c1ae-4f98-a1d3-8c40d4f6fe43"/7751851/0 rw=false pri=0.00868472 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478513462.086455407,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false failed: context deadline exceeded

In the one case I looked at most closely, there was a different error mixed in every couple hundred lines:

W161110 16:39:04.791403 4935416 storage/gc_queue.go:628  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn "sql/executor.go:546 sql txn implicit" id=be657d16 key=/Table/55/1/8718809091224052977/"8a16506f-3592-4c78-a50d-b1b83f015480"/6148471/0 rw=true pri=0.01430796 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478241838.382343181,0 orig=1478241838.382343181,0 max=1478241838.456109681,0 wto=false rop=false failed: context deadline exceeded

Once that stops, there are a bunch of "transferring raft leadership" messages about different ranges before the pattern starts over again for a different range/replica.

I'll check out a profile of the node next.

Labels

S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
