
admission: lack of intra-tenant prioritization for IO work #95678

@irfansharif

Description


Describe the problem

For write-heavy workloads we don’t have throughput isolation even with replication admission control (#95563). In the experiment below the two red annotations mark an index backfill starting and ending (it fails); while it runs, the throughput of the write-heavy workload drops and its p99 latency exceeds 10s. The degradation begins once IO tokens start being exhausted on one of the follower nodes; from that point on we see a large build-up of regular requests waiting for admission, with long wait times (p75-normal-pri = 10s, mean-normal-pri = 12+s).

[images: throughput/latency and admission wait-queue graphs from the experiment]

Ignore the foreground throughput continuing to collapse after the index backfill ends. It’s due to a SQL bug: #95324 (our memory monitor ends up leaking reservations, which is also what ends up stopping the index backfill). Some discussion in this internal slack thread.

To Reproduce

The write-heavy workload:

$ roachprod run $CLUSTER:10 -- ./cockroach workload init kv --splits 5000
$ roachprod run $CLUSTER:10 -- ./cockroach workload run kv --min-block-bytes=262144 --max-block-bytes=262144 --concurrency=256 --splits=5000 --read-percent=0 --max-rate=250 --ramp 1m --duration 1h --tolerate-errors $(roachprod pgurl $CLUSTER:1-9)

Run against a 9-node CRDB cluster loaded up with 100k customers for TPC-E. To kick off the backfill:

$ roachprod sql $CLUSTER:1 -- -e "CREATE INDEX idx_$(date -u +%Y%m%d_T%H%M%S) ON tpce.cash_transaction (ct_dts);"

Expected behavior

For foreground throughput not to be affected by the index backfill.

Additional context

Theory: the logical admission rate of normal-pri requests is affected in aggregate. Whenever bulk-pri requests get logically admitted (because there are no normal-pri requests in the wait queues at that instant), they deduct a large number of IO tokens (these are large AddSSTables), so a normal-pri request arriving soon after has to wait. The question is whether we frequently get unlucky due to the burstiness of regular traffic: if the token bucket holds 1 byte of tokens and we admit a 2MB AddSSTable, we’re left with -(2MB-1) tokens, and it may be a while before that token count becomes positive again. Perhaps here too we should use separate IO tokens for {regular,elastic} work, with deductions from the latter not affecting the former.

Internal notes:

Jira issue: CRDB-23672

Epic CRDB-25469
