roachtest: add admission-control/database-drop#104051
craig[bot] merged 4 commits into cockroachdb:master
Conversation
This author forgot that struct copies weren't deep enough. Release note: None
Adding a bit more commentary, throwing in another concurrent index backfill for free, and fighting some open CRDB bugs unrelated to the test. Release note: None
Like the API intended. Only noticed in the subsequent commit when introducing a second use of disk snapshots that happened to use "tpcc-100k" in its name, somewhere. Release note: None
force-pushed 2eaec31 to c6af62f
force-pushed c6af62f to 2aa2046
Data dump for run: prometheus-snapshot.tar.gz. Visualize with:

```sh
#!/bin/sh
set -eu
rm -rf data/snapshots
tar xf prometheus-snapshot.tar.gz
snapdir=$(find data/snapshots -mindepth 1 -maxdepth 1 -type d)
promyml=$(mktemp)
chmod -R o+rw "${snapdir}" "${promyml}"
cat <<EOF > "${promyml}"
global:
  scrape_interval: 10s
  scrape_timeout: 5s
EOF
set -x
# Worked as of v2.33.1, so you can hard-code that tag if necessary.
docker run --privileged -p 9090:9090 \
  -v "${promyml}:/etc/prometheus/prometheus.yml" -v "${PWD}/${snapdir}:/prometheus" \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --web.enable-admin-api
```
sumeerbhola
left a comment
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, and @smg260)
pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 130 at r2 (raw file):
```go
promCfg := &prometheus.Config{}
```
has the promCfg block moved down here because we don't care about prometheus during the snapshot generation phase?
pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 170 at r2 (raw file):
```go
defer db.Close()

// Gently wind up AddSST concurrency, to increase the
```
nit: "Gently wind up" suggested to me that there would be a loop that would be slowly increasing the value of this cluster setting as time elapses.
Is this comment trying to say that we could have set it to something very high, like 100, but are only increasing it by a small amount, from 1 to 3, since that is sufficient for causing follower overload? Is there any harm in increasing it further -- I would imagine we would want a test that would fail unless we have replication admission control, and wouldn't start passing simply because disks or machines became slightly faster.
Part of cockroachdb#89208. We've seen incidents where a large index being dropped caused a moderate latency/throughput impact on foreground load. Index/table/database drops all share the same underlying code: we make metadata-only changes on the descriptor, wait out the GC TTL, and issue range deletion tombstones over the now-deleted keyspace. Storage space is reclaimed on subsequent Pebble compactions. See pkg/sql/schema_changer.go for where all this work originates. This particular test makes two copies of the 4TiB replicated TPC-E 100k dataset, runs foreground workload against one, and drops the other. It makes use of disk snapshots, of course, because we're impatient. Release note: None
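The drop-then-GC flow described above can be exercised by hand. A hedged sketch (the database name and TTL value here are illustrative, not what the test uses):

```sql
-- Shorten the GC TTL so the dropped data becomes eligible for range
-- deletion tombstones sooner (the default TTL is much longer).
ALTER DATABASE victim CONFIGURE ZONE USING gc.ttlseconds = 300;

-- The drop itself is a metadata-only change on the descriptor; actual
-- space is reclaimed later, by Pebble compactions after the tombstones land.
DROP DATABASE victim CASCADE;
```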
2aa2046 to
9d7c9fb
Compare
irfansharif
left a comment
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @sumeerbhola)
pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 130 at r2 (raw file):
Previously, sumeerbhola wrote…
has the promCfg block moved down here because we don't care about prometheus during the snapshot generation phase?
Yes.
pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 170 at r2 (raw file):
Reworded.
Is this comment trying to say that we could have set it to something very high, like 100, but are only increasing it by a small amount, from 1 to 3, since that is sufficient for causing follower overload?
Yes. Note that this is a concurrency limit, not a rate limit, and 3 is quite sufficient for us. This test could be rewritten to use fewer nodes, concentrating follower writes further; I'll do that in a later pass.
Is there any harm in increasing it further -- I would imagine we would want a test that would fail unless we have replication admission control, and wouldn't start passing simply because disks or machines became slightly faster.
I observed some uninteresting bulk adder timeouts when cranking up the concurrency, and didn't investigate. Just noting that this and all other admission-control tests aren't pass/fail ones, somewhat intentionally. I still haven't plugged meaningful numbers into roachperf, where regressions would be observed, but data for all these runs is present in the test-eng Grafana for the last 90 days, which is quite good for finding regressions. As for a test specific to replication admission control, I think something more targeted would be better; I have #81516 on my TODO list to polish and merge.
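For illustration only: the exchange above never names the cluster setting being ramped from 1 to 3, and `kv.bulk_io_write.concurrent_addsstable_requests` is my assumption about which AddSST concurrency knob is meant, so treat the name as hypothetical. The ramp itself is a single statement:

```sql
-- Hypothetical setting name: bump AddSST concurrency to 3, enough to
-- overload followers without tripping unrelated bulk adder timeouts.
SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 3;
```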
Build succeeded.
roachtest: add admission-control/database-drop
Part of #89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drops all share the same underlying code: we make
metadata-only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.
This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.
roachprod/gce: use prefix search in snapshot listing API
Like the API intended. Only noticed in the subsequent commit when
introducing a second use of disk snapshots that happened to use
"tpcc-100k" in its name, somewhere.
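The failure mode this commit fixes can be sketched without the cloud API: exact or substring matching over all snapshot names can collide with a similarly named dataset, while a prefix filter (the server-side `name=prefix*` style of filter the listing API supports) keeps them apart. The helper below is a local stand-in, not the roachprod code:

```go
package main

import (
	"fmt"
	"strings"
)

// listSnapshots stands in for the cloud listing API: given a prefix, it
// returns only matching names, the way a server-side prefix filter would.
func listSnapshots(all []string, prefix string) []string {
	var out []string
	for _, name := range all {
		if strings.HasPrefix(name, prefix) {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// "tpcc-100k" is a different workload's snapshot that happens to
	// share most of its name with "tpce-100k".
	all := []string{"tpce-100k-0001", "tpce-100k-0002", "tpcc-100k-0001"}
	fmt.Println(listSnapshots(all, "tpce-100k"))
	// prints "[tpce-100k-0001 tpce-100k-0002]"
}
```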
roachtest: update index-backfill roachtest
Adding a bit more commentary, throwing in another concurrent index
backfill for free, and fighting some open CRDB bugs unrelated to the
test.
roachtest: fix race with concurrent map writes
This author forgot that struct copies weren't deep enough.
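The bug class is worth spelling out: in Go, copying a struct copies map fields only by header, so both copies write to the same underlying map, and concurrent writes through the two copies race. A minimal sketch (the `Config`/`Labels` names are mine, not the roachtest code's):

```go
package main

import "fmt"

// Config holds a map; assigning a Config copies only the map header,
// so both structs share the same underlying map.
type Config struct {
	Labels map[string]string
}

func main() {
	orig := Config{Labels: map[string]string{"a": "1"}}
	shallow := orig // struct copy: Labels still points at the same map

	shallow.Labels["b"] = "2" // visible through orig too

	// A deep copy must clone the map explicitly.
	deep := Config{Labels: make(map[string]string, len(orig.Labels))}
	for k, v := range orig.Labels {
		deep.Labels[k] = v
	}
	deep.Labels["c"] = "3" // not visible through orig

	fmt.Println(len(orig.Labels), len(deep.Labels))
	// prints "2 3"
}
```

If two goroutines write through `orig` and `shallow` at once, the runtime's map race detection reports "concurrent map writes"; the deep copy avoids the sharing entirely.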