admission: lack of intra-tenant prioritization for IO work #95678
Describe the problem
For write-heavy workloads we don’t have throughput isolation even with replication admission control (#95563). In the experiment below, the two red annotations mark an index backfill starting and ending (it fails); while it runs, the throughput of the write-heavy workload drops and p99 latency exceeds 10s. This begins once IO tokens start being exhausted on one of the follower nodes, after which we see a large build-up of regular requests waiting for admission. The wait times are also large (p75 normal-pri = 10s, mean normal-pri = 12+s).
Ignore the foreground throughput continuing to collapse after the index backfill ends; that’s due to a SQL bug (#95324: our memory monitor ends up leaking reservations, which is also what ends up stopping the index backfill). There is some discussion in this internal Slack thread.
To Reproduce
The write-heavy workload:
$ roachprod run $CLUSTER:10 -- ./cockroach workload init kv --splits 5000
$ roachprod run $CLUSTER:10 -- ./cockroach workload run kv --min-block-bytes=262144 --max-block-bytes=262144 --concurrency=256 --splits=5000 --read-percent=0 --max-rate=250 --ramp 1m --duration 1h --tolerate-errors $(roachprod pgurl $CLUSTER:1-9)
Run against a cluster loaded up with 100k customers for TPC-E; we're using a 9-node CRDB cluster here. To kick off the backfill:
$ roachprod sql $CLUSTER:1 -- -e "CREATE INDEX idx_$(date -u +%Y%m%d_T%H%M%S) ON tpce.cash_transaction (ct_dts);"
Expected behavior
For the foreground workload's throughput not to be affected by the index backfill.
Additional context
Theory: the logical admission rate of normal-pri requests is affected in aggregate. Whenever bulk-pri requests get logically admitted (because there are no normal-pri requests in the wait queues at that instant), they deduct a large number of IO tokens (these are large AddSSTables), so when a normal-pri request does arrive soon after, it has to wait. The question is whether we frequently get unlucky due to the burstiness of regular traffic: if the token bucket has 1 byte of tokens and we admit a 2MB AddSSTable, we now have -(2MB-1) tokens, and it may be a while before the token count becomes positive again. Perhaps here too we should use different IO tokens for {regular,elastic} work, with deductions from the latter not affecting the former.
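To make the theory concrete, here is a minimal sketch (not CockroachDB's actual kvStoreTokenGranter; `bucket`, `tryAdmit`, and `replenish` are hypothetical names) of an IO token bucket that admits work whenever the balance is positive and lets the balance go negative after deduction, which is how one large AddSSTable can starve subsequent normal-pri writes:

```go
package main

import "fmt"

// bucket models IO tokens in bytes; the balance may go negative
// after admitting work larger than the remaining tokens.
type bucket struct {
	tokens int64
}

// tryAdmit admits work when any tokens are available, deducting the
// full size even if that drives the balance (deeply) negative.
func (b *bucket) tryAdmit(size int64) bool {
	if b.tokens <= 0 {
		return false // must wait for replenishment
	}
	b.tokens -= size
	return true
}

// replenish adds tokens, capped at a burst limit.
func (b *bucket) replenish(n, burst int64) {
	b.tokens += n
	if b.tokens > burst {
		b.tokens = burst
	}
}

func main() {
	b := &bucket{tokens: 1}          // 1 byte of tokens left
	fmt.Println(b.tryAdmit(2 << 20)) // 2MB AddSSTable: admitted (true)
	fmt.Println(b.tokens)            // -(2MB - 1) = -2097151
	fmt.Println(b.tryAdmit(4096))    // normal-pri write must now wait (false)
}
```

The sketch shows why splitting tokens into separate {regular,elastic} buckets would help: the bulk-pri deduction would drive only the elastic balance negative, leaving the regular bucket positive for normal-pri work.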
Internal notes:
- It's unlikely to be due to admission: token bucket in kvStoreTokenGranter should be replenished every 1ms #91509, since the latencies observed here are much higher. I wonder whether this is because each AddSSTable is large in number of bytes.
- When we do rerun these experiments it will be worth looking at what the token replenishment rate is (over 15s) and how that compares to the size of a single AddSSTable.
- It could also be because, when we start write shaping, the regular admission rate drops to 50%. Relates to admission: ioLoadListener compaction token calculation is too abrupt #91519.
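As a back-of-the-envelope version of the second note above, we can estimate how many 1ms replenishment slices it takes to pay back the deduction from a single AddSSTable. The per-interval token rate below (30MB per 15s) is an illustrative assumption, not a measured value, and `slicesToRecover` is a hypothetical helper:

```go
package main

import "fmt"

// slicesToRecover returns how many 1ms replenishment slices must elapse
// before a debt of sstBytes is paid back, given ratePer15s bytes of IO
// tokens handed out per 15s interval (15000 slices per interval).
func slicesToRecover(ratePer15s, sstBytes int64) int64 {
	perSlice := ratePer15s / 15000 // bytes replenished per 1ms slice
	if perSlice == 0 {
		return -1 // rate too low to ever recover at this granularity
	}
	return (sstBytes + perSlice - 1) / perSlice // ceiling division
}

func main() {
	const mb = 1 << 20
	// Assume 30MB of tokens per 15s interval, i.e. ~2KB per 1ms slice;
	// a single 2MB AddSSTable then blocks admission for ~1000 slices,
	// which is roughly a full second of waiting for normal-pri work.
	fmt.Println(slicesToRecover(30*mb, 2*mb)) // prints 1001
}
```

This is why comparing the measured replenishment rate (over 15s) against the size of one AddSSTable, as the note suggests, is worth doing on the rerun: if a single SSTable is a meaningful fraction of an interval's tokens, second-scale stalls fall out directly from the arithmetic.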
Jira issue: CRDB-23672
Epic: CRDB-25469

