storage: unexpected IO overload in admission/follower-overload/presplit-control #82109
Description
The admission/follower-overload/presplit-control test was introduced in #81516.
Despite the name, no overload should occur in it. We run two workloads:
- 4mb/s of kv0 spread out evenly across leaseholders n1 and n2 (n3 pure follower)
- a trickle workload of kv50 hitting n3 as leaseholder (n1 n2 pure followers)
- when run on AWS (which is what I am doing), the nodes use standard 125mb/s 3000iops gp3 volumes
The expectation is that this "just runs fine until out of disk". The reality is that a couple of hours in, n1 and n2 end up in admission control due to their LSM inverting.
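To make the "no overload expected" expectation concrete, here is a back-of-envelope headroom check. The 4mb/s payload rate and the 125mb/s gp3 ceiling come from the description above; the write amplification factor is an illustrative assumption, not a measured value:

```python
# Rough headroom estimate for the kv0 write path. Payload rate and volume
# limit are from the test description; the LSM write amplification factor
# is a hypothetical round number for illustration only.
payload_mb_per_s = 4          # kv0 payload, spread across n1 and n2
volume_limit_mb_per_s = 125   # standard gp3 throughput ceiling
assumed_write_amp = 10        # assumed LSM write amplification

disk_write_mb_per_s = payload_mb_per_s * assumed_write_amp
print(disk_write_mb_per_s)                          # 40
print(disk_write_mb_per_s < volume_limit_mb_per_s)  # True
```

Even with a generous write-amplification guess, the implied disk write rate sits well under the volume's throughput limit, which is why the inversion is surprising.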
I've dug into this quite a bit and think I have ruled out (or at least seen no evidence of):
- throughput and IOPS overload (the disk on n3 gets way more read mb/s and write mb/s despite n1 and n2 having been bumped to 6k iops)
- CPU overload (CPU is mostly idle on all machines, despite a somewhat higher baseline on n1 and n2, which is expected due to kv0)
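For context on what "the LSM inverting" means here: admission control throttles writes when L0 gets too deep. A minimal sketch of that style of check, using the default values of the `admission.l0_sub_level_count_overload_threshold` and `admission.l0_file_count_overload_threshold` cluster settings (a simplified illustration, not the actual implementation):

```python
# Simplified sketch of an L0-overload check in the style of CockroachDB's
# admission control. The defaults mirror the cluster settings named above;
# the function itself is illustrative, not the real code.
def l0_overloaded(sub_levels: int, l0_files: int,
                  sub_level_threshold: int = 20,
                  file_threshold: int = 1000) -> bool:
    """An inverted LSM shows up as too many L0 sub-levels or L0 files."""
    return sub_levels >= sub_level_threshold or l0_files >= file_threshold

print(l0_overloaded(5, 200))   # healthy L0 shape
print(l0_overloaded(30, 400))  # inverted: sub-level count over threshold
```

Once either threshold is crossed, admission control starts rationing IO tokens, which is the queueing n1 and n2 are experiencing.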
I recorded my detailed analyses as Looms; they're a couple of minutes each. There's also a Slack thread that might pick up some more minute details as the day goes by, but anything major will be recorded here.
part 1: finding out that the cluster is in bad health and looking at graphs https://www.loom.com/share/5d594ac594f64dd3bd6b50b8b653ca33
part 2: looking at the LSM visualizer for https://gist.github.com/tbg/7d47cc3b6bfc5d9721579822a372447e, https://gist.github.com/tbg/c1552f8d92583c91f9996323608c647e, and https://gist.github.com/tbg/83d49ce2c205121b17b32948de1720b8: https://www.loom.com/share/2e236668c1bb4a67b52df5b64e0c231f
part 3: with the IOPS upped and n1 and n2 restarted with a compaction concurrency of 8, both of them still seem overloaded: https://www.loom.com/share/8ac8b33d082645ce9aff780eeedd00cb
It's too early to really tell, though, since the leases still have to balance out. I will comment in a few hours when a new steady state has been reached.
Related to #79215
Jira issue: CRDB-16213
Epic CRDB-15069