roachtest: lack of crdb process isolation makes for noisy tests #62010
Description
Describe the problem
In our tpccbench tests (most recently in #59424), we're seeing that the instances the test runs on have a "cliff point" beyond which they're completely hosed. tpccbench pushes the instances to their limits to see what our efficiency is. Because we're possibly overloading the machines, we're subject to OOMs, CPU starvation, etc. tpccbench wants to reach that point so it can figure out where to back off from.
Sometimes when the roachtest does get into overload territory, it also breaks roachtest itself. Because we go right up to the edge (and sometimes over, as we're probing for the right warehouse count!), we make the nodes inoperable for diagnostic purposes. The CPU is completely starved, there's no working memory, and possibly even disk space is affected. The SSH connection could get dropped, and the entire VM might disappear. There isn't good isolation between the thing being tested (i.e. the crdb node running at the limit) and the testing harness itself (i.e. roachprod, which doesn't play well with an unhealthy GCE VM, unhealthy because the crdb process is hogging all the resources).
All that creates a ton of noise from a benchmark that really should only tell us what our efficiency is. As we're rethinking roachtests, we suggest the infrastructure itself should sandbox the test execution using cgroups, containers, or similar. That would reduce the amount of noise these tests generate, as they're designed to operate in "overload" territory. This isolation might also come in handy as we work on general admission control and want to make sure that the KV layer protects itself. Presumably the workload thrown against it would be the kind to push the crdb process into overload territory. We'd want to distinguish between failures of the crdb process and failures of the test itself.
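To make the sandboxing suggestion concrete, here is a rough sketch of one way it could look: wrap the crdb start command in a transient systemd scope so the process lives in its own cgroup with memory and CPU caps, leaving some headroom for SSH and the harness. This is not how roachprod works today. The helper name `sandboxed_command`, the 10% reserve, and the choice of `systemd-run` are all assumptions for illustration, and `MemoryMax` assumes the VM uses cgroup v2.

```python
def sandboxed_command(cmd, total_mem_bytes, total_cpus, reserve_pct=10):
    """Prefix cmd with a systemd-run invocation that caps its resources.

    Hypothetical helper: reserve_pct is the share of the machine kept
    free for sshd and the test harness (an assumed default, not a
    measured value).
    """
    if not 0 < reserve_pct < 100:
        raise ValueError("reserve_pct must be between 0 and 100")

    # Integer arithmetic keeps the computed limits exact.
    usable_pct = 100 - reserve_pct
    mem_max = total_mem_bytes * usable_pct // 100
    # CPUQuota is expressed relative to one CPU, so 4 CPUs at 90% is 360%.
    cpu_quota = total_cpus * usable_pct

    return [
        "systemd-run", "--scope",          # run in a transient scope unit
        "-p", f"MemoryMax={mem_max}",      # hard memory cap in bytes (cgroup v2)
        "-p", f"CPUQuota={cpu_quota}%",    # CPU time cap across all cores
    ] + list(cmd)


if __name__ == "__main__":
    # Example: a 4 vCPU, 16 GiB VM; the crdb arguments are placeholders.
    print(" ".join(sandboxed_command(["./cockroach", "start"], 16 * 2**30, 4)))
```

With limits like these, an overloaded crdb process would be throttled or OOM-killed inside its own cgroup rather than taking the whole VM down, which is what would let the harness tell a crdb failure apart from a test failure.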
There's also some discussion over at #61901.
Jira issue: CRDB-2697