Investigation into #44501 and #44504 has uncovered that the core of the problem seems to be creating the index that stores all the columns.
To reproduce:
roachprod create $CLUSTER -n 4 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod stage $CLUSTER:1-3 cockroach
roachprod stage $CLUSTER:4 workload
roachprod start $CLUSTER:1-3
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2500 --db=tpcc --checks=false"
roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=2500 --active-warehouses=2000 --split --scatter {pgurl:1-3}"
After the ramp period, run the following in another shell:
roachprod sql $CLUSTER:3
> use tpcc;
> create unique index on customer (c_w_id, c_d_id, c_id) storing (c_first, c_middle, c_last, c_street_1, c_street_2, c_city, c_state, c_zip, c_phone, c_since, c_credit, c_credit_lim, c_discount, c_balance, c_ytd_payment, c_payment_cnt, c_delivery_cnt, c_data);
After some time, large p99 latency spikes appear, sometimes reaching multiple seconds.
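To put a number on the spikes outside of the admin UI, one option is to compute the p99 from raw latency samples. The helper below is only a sketch: it assumes the samples have already been extracted into plain text, one latency value per line (for example in milliseconds), which is not something the steps above produce directly. The function name `p99` is hypothetical, and it uses the nearest-rank definition of a percentile.

```shell
# p99: read newline-separated numeric latencies on stdin and print the
# 99th percentile using the nearest-rank method (rank = ceil(0.99 * N)).
# Assumption: one sample per line, all in the same unit.
p99() {
  sort -n | awk '
    { v[NR] = $1 }                       # collect the sorted samples
    END {
      if (NR == 0) exit 1                # no samples: print nothing, fail
      idx = int((NR * 99 + 99) / 100)    # integer form of ceil(0.99 * NR)
      print v[idx]
    }'
}
```

Comparing `p99 < before.txt` with `p99 < during.txt` (both file names hypothetical) would show how far the tail latency moves while the index is being created.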
Epic CRDB-8816
Jira issue: CRDB-5120