Skip to content

roachtest: add admission-control/database-drop#104051

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
irfansharif:230527.ac-idx-backfill-roachtest
May 31, 2023
Merged

roachtest: add admission-control/database-drop#104051
craig[bot] merged 4 commits intocockroachdb:masterfrom
irfansharif:230527.ac-idx-backfill-roachtest

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

roachtest: add admission-control/database-drop

Part of #89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drop all share the same underlying code, we make
metadata only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

roachprod/gce: use prefix search in snapshot listing API

Like the API intended. Only noticed in the subsequent commit when
introducing a second use of disk snapshots that happened to use
"tpcc-100k" in its name, somewhere.

roachtest: update index-backfill roachtest

Adding a bit more commentary, throwing in another concurrent index
backfill for free, and fighting some open CRDB bugs unrelated to the
test.

roachtest: fix race with concurrent map writes

This author forgot that struct copies weren't deep enough.

This author forgot that struct copies weren't deep enough.

Release note: None
Adding a bit more commentary, throwing in another concurrent index
backfill for free, and fighting some open CRDB bugs unrelated to the
test.

Release note: None
Like the API intended. Only noticed in the subsequent commit when
introducing a second use of disk snapshots that happened to use
"tpcc-100k" in its name, somewhere.

Release note: None
@irfansharif irfansharif requested a review from a team May 29, 2023 16:44
@irfansharif irfansharif requested a review from a team as a code owner May 29, 2023 16:44
@irfansharif irfansharif requested review from herkolategan and smg260 and removed request for a team May 29, 2023 16:44
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 230527.ac-idx-backfill-roachtest branch from 2eaec31 to c6af62f Compare May 29, 2023 16:53
@irfansharif irfansharif force-pushed the 230527.ac-idx-backfill-roachtest branch from c6af62f to 2aa2046 Compare May 29, 2023 17:26
@irfansharif
Copy link
Copy Markdown
Contributor Author

Data dump for run: prometheus-snapshot.tar.gz

Visualize with:

#!/bin/sh
set -eu

rm -rf data/snapshots
tar xf prometheus-snapshot.tar.gz
snapdir=$(find data/snapshots -mindepth 1 -maxdepth 1 -type d)
promyml=$(mktemp)
chmod -R o+rw "${snapdir}" "${promyml}"

cat <<EOF > "${promyml}"
global:
  scrape_interval: 10s
  scrape_timeout: 5s
EOF

set -x
# Worked as of v2.33.1 so you can hard-code that if necessary.
docker run --privileged -p 9090:9090 \
    -v "${promyml}:/etc/prometheus/prometheus.yml" -v "${PWD}/${snapdir}:/prometheus" \
    prom/prometheus:latest \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.enable-admin-api

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 6 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, and @smg260)


pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 130 at r2 (raw file):

			}

			promCfg := &prometheus.Config{}

has the promCfg block moved down here because we don't care about prometheus during the snapshot generation phase?


pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 170 at r2 (raw file):

					defer db.Close()

					// Gently wind up AddSST concurrency, to increase the

nit: "Gently wind up" suggested to me that there would be a loop that would be slowly increasing the value of this cluster setting as time elapses.
Is this comment trying to say that we could have set it to something very high, like 100, but are only increasing it by a small amount, from 1 to 3, since that is sufficient for causing follower overload? Is there any harm in increasing it further -- I would imagine we would want a test that would fail unless we have replication admission control, and wouldn't start passing simply because disks or machines became slightly faster.

Part of cockroachdb#89208. We've seen incidents where a large index being dropped
caused a moderate latency/throughput impact on foreground load.
Index/table/database drop all share the same underlying code, we make
metadata only changes on the descriptor, wait out the GC TTL, and issue
range deletion tombstones over the now-deleted keyspace. On subsequent
pebble compactions, storage space is reclaimed. See
pkg/sql/schema_changer.go for where all this work originates.

This particular test makes two copies of the 4TiB replicated TPC-E 100k
dataset, runs foreground workload against one and drops the other. It
makes use of disk snapshots of course, because we're impatient.

Release note: None
@irfansharif irfansharif force-pushed the 230527.ac-idx-backfill-roachtest branch from 2aa2046 to 9d7c9fb Compare May 31, 2023 15:51
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @sumeerbhola)


pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 130 at r2 (raw file):

Previously, sumeerbhola wrote…

has the promCfg block moved down here because we don't care about prometheus during the snapshot generation phase?

Yes.


pkg/cmd/roachtest/tests/admission_control_index_backfill.go line 170 at r2 (raw file):
Reworded.

Is this comment trying to say that we could have set it to something very high, like 100, but are only increasing it by a small amount, from 1 to 3, since that is sufficient for causing follower overload?

Yes. Note that this is a concurrency limit, not a rate limit. And 3 is quite sufficient for us. This test could be rewritten to use fewer nodes so more concentration in terms of follower writes, which I'll do in a later pass.

Is there any harm in increasing it further -- I would imagine we would want a test that would fail unless we have replication admission control, and wouldn't start passing simply because disks or machines became slightly faster.

I observed some uninteresting bulk adder timeouts when cranking up the concurrency, and I didn't investigate. Just noting that this and all other admission-control tests aren't pass/fail ones somewhat intentionally. I still haven't plugged meaningful numbers into roachperf so regressions are observed but data for all these runs are present in the test-eng grafana for the last 90 days, which is quite good to find regressions. As for the replication admission control specific test, I think something more targeted would be better. I have #81516 on my TODO list to polish and merge.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 31, 2023

Build succeeded:

@craig craig bot merged commit d39caae into cockroachdb:master May 31, 2023
@irfansharif irfansharif deleted the 230527.ac-idx-backfill-roachtest branch May 31, 2023 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants