user cluster has three wedged timeseries ranges #17524
Description
Forked from #17478 (comment).
We have access to this cluster via a Cisco WebEx session (you'll need a Chrome plugin) through a meeting link that Gitter user @HeikoOnnebrink can activate (he'll supervise while you are connected).
It's a 9-node cluster running inside Docker on CoreOS. I've only looked closely at r104 (see Archive.zip below), which stopped working on 7/31 and then saw some more activity on 8/3. The problematic member here is node1. Grepping the logs, I saw that node1 briefly held the lease on 7/31. The next activity is on 8/3, when it receives 3-4 snapshots that contain almost no log entries.
There are two other ranges that are perhaps not comparable. In particular, one of the two has a 120 MB raft log.
I don't have the bandwidth to investigate this further. It's fairly tedious due to the remote connection and the fact that this is a 9-node cluster. Still, we should follow through and gather what we can. I tried to enable lower-level raft logging via /debug/vmodule/raft=8, but somehow it didn't work. That, plus grepping for r104/, should turn up something.
Inlined my initial investigation comment below:
I should also mention that the reason node 5 is unwilling to become raft leader is that it doesn't have an entry for itself in its own progress map (prs map[uint64]*Progress), which makes it un-"promotable". This seems pretty surprising, but I'm not familiar with the expectations around it.
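For reference, here is a minimal sketch of the promotability check as I understand it (simplified names, not the exact etcd/raft code): a node only campaigns if its own ID appears in its prs map, so a replica that is missing its self-entry will never try to become leader.

```go
// Illustrative sketch only: the kind of "promotable" check raft performs
// before allowing a node to campaign. Names are simplified assumptions.
package main

import "fmt"

// Progress mirrors the per-replica replication state a raft leader tracks.
type Progress struct {
	Match, Next uint64
}

type raftNode struct {
	id  uint64
	prs map[uint64]*Progress
}

// promotable reports whether the node may campaign for leadership: its own
// ID must appear in its progress map.
func (r *raftNode) promotable() bool {
	_, ok := r.prs[r.id]
	return ok
}

func main() {
	// Node 5's progress map is missing its own entry, as described above,
	// so it refuses to become raft leader.
	n := &raftNode{id: 5, prs: map[uint64]*Progress{1: {}, 2: {}, 3: {}}}
	fmt.Println(n.promotable()) // false
}
```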
I thought the Progress is only populated on the Raft leader. Which piece of code are you talking about?

I'm currently looking at a user's cluster who also lost his timeseries ranges. Data attached.
There's definitely a problem with the quota pool on that cluster. @irfansharif, any thoughts on the below? It might be an artifact of the ranges being horked, which the attached range pages can hopefully illustrate.
Archive.zip
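For context, the goroutine below has been stuck in quotaPool.acquire for roughly 6999 minutes. A rough sketch of how a proposal-quota pool can wedge like that, assuming a simplified channel-based design (this is not the actual pkg/storage/quota_pool.go and is not robust under all concurrent interleavings): if proposals on the range never apply, quota is never released, so every later acquire blocks until its context is canceled, or forever if it has no deadline.

```go
// Illustrative sketch of a proposal-quota pool that wedges when quota is
// never released. All names here are hypothetical.
package main

import (
	"context"
	"fmt"
	"time"
)

type quotaPool struct {
	quota chan int64 // holds the currently available quota, if any
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{quota: make(chan int64, 1)}
	qp.quota <- capacity
	return qp
}

// acquire blocks until at least n bytes of quota are available or the
// context is canceled.
func (qp *quotaPool) acquire(ctx context.Context, n int64) error {
	for {
		select {
		case avail := <-qp.quota:
			if avail >= n {
				if rem := avail - n; rem > 0 {
					qp.quota <- rem
				}
				return nil
			}
			// Not enough quota yet; put it back and try again later.
			qp.quota <- avail
		case <-ctx.Done():
			return ctx.Err()
		}
		// Crude polling; a real pool would wake up on releases instead.
		time.Sleep(time.Millisecond)
	}
}

// release returns n bytes of quota to the pool, e.g. when a proposal applies.
func (qp *quotaPool) release(n int64) {
	select {
	case avail := <-qp.quota:
		qp.quota <- avail + n
	default:
		qp.quota <- n
	}
}

func main() {
	qp := newQuotaPool(1000)
	_ = qp.acquire(context.Background(), 600) // succeeds

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	// The wedged range never releases its quota, so this acquire can only
	// give up via the context; without a deadline it would block forever,
	// like the goroutine in the trace below.
	fmt.Println(qp.acquire(ctx, 600)) // context deadline exceeded
}
```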
goroutine 35976949 [chan receive, 6999 minutes]:
github.com/cockroachdb/cockroach/pkg/util/timeutil.(*Timer).Reset(0xc422777740, 0xdf8475800)
/go/src/github.com/cockroachdb/cockroach/pkg/util/timeutil/timer.go:89 +0x9f
github.com/cockroachdb/cockroach/pkg/storage.(*quotaPool).acquire(0xc420e88c30, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/quota_pool.go:198 +0x69a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeAcquireProposalQuota(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x23b, 0x8, 0x14d7740974383c7f)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:899 +0xd9
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose(0xc420441180, 0x7fbd9cbebff8, 0xc420e5c3f0, 0x14d76b9159239640, 0x0, 0x0, 0x0, 0x700000007, 0xc, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2817 +0x69c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4204