admission: lack of intra-tenant prioritization for IO work #95678
Describe the problem
For write-heavy workloads we don’t have throughput isolation even with replication admission control (#95563). In the experiment below, the two red annotations mark an index backfill starting and ending (it fails); while it runs, the throughput of the write-heavy workload drops and p99 latency exceeds 10s. This begins once IO tokens start being exhausted on one of the follower nodes, after which we see a large build-up of regular requests waiting for admission. The wait times are also large (p75 normal-pri = 10s, mean normal-pri = 12+s).
Ignore the foreground throughput continuing to collapse after the index backfill ends; that’s due to a SQL bug (#95324: our memory monitor ends up leaking reservations, which is also what ends up stopping the index backfill). There is some discussion in this internal Slack thread.
To Reproduce
The write-heavy workload:
$ roachprod run $CLUSTER:10 -- ./cockroach workload init kv --splits 5000
$ roachprod run $CLUSTER:10 -- ./cockroach workload run kv --min-block-bytes=262144 --max-block-bytes=262144 --concurrency=256 --splits=5000 --read-percent=0 --max-rate=250 --ramp 1m --duration 1h --tolerate-errors $(roachprod pgurl $CLUSTER:1-9)
Run against a cluster loaded up with 100k customers for TPC-E; we're using a 9-node CRDB cluster here. To kick off the backfill:
$ roachprod sql $CLUSTER:1 -- -e "CREATE INDEX idx_$(date -u +%Y%m%d_T%H%M%S) ON tpce.cash_transaction (ct_dts);"
Expected behavior
For the foreground workload's throughput not to be affected by the index backfill.
Additional context
Theory: the logical admission rate of normal-pri requests is affected in aggregate. Whenever bulk-pri requests get logically admitted (because there are no normal-pri requests in the wait queues at that instant), they deduct a large number of IO tokens (these are large AddSSTables), so when a normal-pri request does arrive soon after, it has to wait. The question is whether we frequently get unlucky due to the burstiness of regular traffic: if the token bucket has 1 byte of tokens and we admit a 2MB AddSSTable, we now have -(2MB-1) tokens, and it may be a while before the token count becomes positive again. Perhaps here too we should use different IO tokens for {regular,elastic} work, with deductions from the latter not affecting the former.
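To make the theory concrete, here is a minimal sketch (not CockroachDB's actual kvStoreTokenGranter; `bucket`, `tryAdmit`, and `replenish` are hypothetical names) of an IO token bucket that admits work whenever the balance is positive and lets the balance go negative after deduction, which is how one large AddSSTable can starve subsequent normal-pri writes:

```go
package main

import "fmt"

// bucket models IO tokens in bytes; the balance may go negative
// after admitting work larger than the remaining tokens.
type bucket struct {
	tokens int64
}

// tryAdmit admits work when any tokens are available, deducting the
// full size even if that drives the balance (deeply) negative.
func (b *bucket) tryAdmit(size int64) bool {
	if b.tokens <= 0 {
		return false // must wait for replenishment
	}
	b.tokens -= size
	return true
}

// replenish adds tokens, capped at a burst limit.
func (b *bucket) replenish(n, burst int64) {
	b.tokens += n
	if b.tokens > burst {
		b.tokens = burst
	}
}

func main() {
	b := &bucket{tokens: 1}          // 1 byte of tokens left
	fmt.Println(b.tryAdmit(2 << 20)) // 2MB AddSSTable: admitted (true)
	fmt.Println(b.tokens)            // -(2MB - 1) = -2097151
	fmt.Println(b.tryAdmit(4096))    // normal-pri write must now wait (false)
}
```

The sketch shows why splitting tokens into separate {regular,elastic} buckets would help: the bulk-pri deduction would drive only the elastic balance negative, leaving the regular bucket positive for normal-pri work.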
Internal notes:
- It's unlikely to be due to admission: token bucket in kvStoreTokenGranter should be replenished every 1ms #91509, since the latencies observed here are much higher. I wonder whether this is because each AddSSTable is large in number of bytes.
- When we do rerun these experiments it will be worth looking at what the token replenishment rate is (over 15s) and how that compares to the size of a single AddSSTable.
- It could also be because, when we start write shaping, the regular admission rate drops to 50%. Relates to admission: ioLoadListener compaction token calculation is too abrupt #91519.
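As a back-of-the-envelope version of the second note above, we can estimate how many 1ms replenishment slices it takes to pay back the deduction from a single AddSSTable. The per-interval token rate below (30MB per 15s) is an illustrative assumption, not a measured value, and `slicesToRecover` is a hypothetical helper:

```go
package main

import "fmt"

// slicesToRecover returns how many 1ms replenishment slices must elapse
// before a debt of sstBytes is paid back, given ratePer15s bytes of IO
// tokens handed out per 15s interval (15000 slices per interval).
func slicesToRecover(ratePer15s, sstBytes int64) int64 {
	perSlice := ratePer15s / 15000 // bytes replenished per 1ms slice
	if perSlice == 0 {
		return -1 // rate too low to ever recover at this granularity
	}
	return (sstBytes + perSlice - 1) / perSlice // ceiling division
}

func main() {
	const mb = 1 << 20
	// Assume 30MB of tokens per 15s interval, i.e. ~2KB per 1ms slice;
	// a single 2MB AddSSTable then blocks admission for ~1000 slices,
	// which is roughly a full second of waiting for normal-pri work.
	fmt.Println(slicesToRecover(30*mb, 2*mb)) // prints 1001
}
```

This is why comparing the measured replenishment rate (over 15s) against the size of one AddSSTable, as the note suggests, is worth doing on the rerun: if a single SSTable is a meaningful fraction of an interval's tokens, second-scale stalls fall out directly from the arithmetic.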
Jira issue: CRDB-23672
Epic: CRDB-25469

