storage: unexpected IO overload in admission/follower-overload/presplit-control #82109
Description
The admission/follower-overload/presplit-control test was introduced in #81516.
Despite the name, no overload should occur in it. We run two workloads:
- 4mb/s of kv0 spread out evenly across leaseholders n1 and n2 (n3 pure follower)
- a trickle workload of kv50 hitting n3 as leaseholder (n1 n2 pure followers)
- when run on AWS (which is what I am doing), the nodes use standard 125mb/s 3000iops gp3 volumes
The expectation is that this "just runs fine until out of disk". The reality is that a couple of hours in, n1 and n2 end up in admission control due to their LSM inverting.
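To make the "no overload expected" expectation concrete, here is a back-of-envelope headroom check. The 4mb/s payload rate and the 125mb/s gp3 ceiling come from the description above; the write amplification factor is an illustrative assumption, not a measured value:

```python
# Rough headroom estimate for the kv0 write path. Payload rate and volume
# limit are from the test description; the LSM write amplification factor
# is a hypothetical round number for illustration only.
payload_mb_per_s = 4          # kv0 payload, spread across n1 and n2
volume_limit_mb_per_s = 125   # standard gp3 throughput ceiling
assumed_write_amp = 10        # assumed LSM write amplification

disk_write_mb_per_s = payload_mb_per_s * assumed_write_amp
print(disk_write_mb_per_s)                          # 40
print(disk_write_mb_per_s < volume_limit_mb_per_s)  # True
```

Even with a generous write-amplification guess, the implied disk write rate sits well under the volume's throughput limit, which is why the inversion is surprising.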
I've dug into this quite a bit and think I have ruled out (or at least seen no evidence of):
- throughput and IOPS overload (the disk on n3 gets way more read mb/s and write mb/s despite n1 and n2 having been bumped to 6k iops)
- CPU overload (CPU is mostly idle on all machines, despite a somewhat higher baseline on n1 and n2, which is expected due to kv0)
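For context on what "the LSM inverting" means here: admission control throttles writes when L0 gets too deep. A minimal sketch of that style of check, using the default values of the `admission.l0_sub_level_count_overload_threshold` and `admission.l0_file_count_overload_threshold` cluster settings (a simplified illustration, not the actual implementation):

```python
# Simplified sketch of an L0-overload check in the style of CockroachDB's
# admission control. The defaults mirror the cluster settings named above;
# the function itself is illustrative, not the real code.
def l0_overloaded(sub_levels: int, l0_files: int,
                  sub_level_threshold: int = 20,
                  file_threshold: int = 1000) -> bool:
    """An inverted LSM shows up as too many L0 sub-levels or L0 files."""
    return sub_levels >= sub_level_threshold or l0_files >= file_threshold

print(l0_overloaded(5, 200))   # healthy L0 shape
print(l0_overloaded(30, 400))  # inverted: sub-level count over threshold
```

Once either threshold is crossed, admission control starts rationing IO tokens, which is the queueing n1 and n2 are experiencing.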
I recorded my detailed analyses as Looms; they're a couple of minutes each. There's also a Slack thread that might pick up some more minute details as the day goes by, but anything major will be recorded here.
part 1: finding out that the cluster is in bad health and looking at graphs https://www.loom.com/share/5d594ac594f64dd3bd6b50b8b653ca33
part 2: looking at the LSM visualizer for https://gist.github.com/tbg/7d47cc3b6bfc5d9721579822a372447e, https://gist.github.com/tbg/c1552f8d92583c91f9996323608c647e, and https://gist.github.com/tbg/83d49ce2c205121b17b32948de1720b8: https://www.loom.com/share/2e236668c1bb4a67b52df5b64e0c231f
part 3: with the IOPS upped and n1 and n2 restarted with a compaction concurrency of 8, both of them still seem overloaded: https://www.loom.com/share/8ac8b33d082645ce9aff780eeedd00cb
It's too early to really tell, though, since the leases still have to balance out. I will comment in a few hours when a new steady state has been reached.
Related to #79215
Jira issue: CRDB-16213
Epic CRDB-15069