[WIP] workload/schemachange: concurrent randomized schema change workload#46402

Closed
petermattis wants to merge 1 commit into cockroachdb:master from petermattis:pmattis/workload-schemachange

Conversation

@petermattis
Collaborator

Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.

Release note: None

Release justification: non-production code changes. This is only test
code.

@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator Author

This is the concurrent schema change workload I've been fiddling with. I got it to the point where it should do something useful. Perhaps it has, as it very quickly hits a state where a query hangs that shouldn't be hanging. To reproduce, create a local single-node roachprod cluster:

~ roachprod create local -n 1
~ roachprod start local

Then run this new workload against the cluster:

~ bin/workload run schemachange --init --concurrency=1 --max-ops=5
I200321 19:57:32.952206 1 workload/cli/run.go:326  DEPRECATION: the --init flag on "workload run" will no longer be supported after 19.2

  SELECT table_name
    FROM [SHOW TABLES]
   WHERE table_name LIKE 'table%'
ORDER BY random()
   LIMIT 1;

NOOP: setColumnType -> no rows in result set

  SELECT table_name
    FROM [SHOW TABLES]
   WHERE table_name LIKE 'table%'
ORDER BY random()
   LIMIT 1;

The number of queries it takes before wedging varies. Note that in this run, it happened after two queries. And we never even performed a schema change! We can see the hung query via show queries:

~ roachprod sql local -- -e 'show queries'
              query_id             | node_id |            session_id            | user_name |              start               |                                             query                                             | client_address  | application_name | distributed |   phase
-----------------------------------+---------+----------------------------------+-----------+----------------------------------+-----------------------------------------------------------------------------------------------+-----------------+------------------+-------------+------------
  15fe69fcb654e7b00000000000000001 |       1 | 15fe69fcb5e2a5b00000000000000001 | root      | 2020-03-21 19:57:32.99266+00:00  | SELECT table_name FROM [SHOW TABLES] WHERE table_name LIKE 'table%' ORDER BY random() LIMIT 1 | 127.0.0.1:64796 |                  |    false    | executing

We can also experience a hung show tables ourselves by running roachprod sql local -- -e "show tables".

Goroutines indicate something blocked way down in kv land:

goroutine 2189 [select]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait.(*Queue).MaybeWaitForPush(0xc0003f6e10, 0x82e3680, 0xc0008c6360, 0xc003b3c3c0, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait/queue.go:513 +0xf95
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).maybeInterceptReq(0xc0006a6cc0, 0x82e3680, 0xc0008c6360, 0x0, 0x15fe69fcb70841c0, 0x0, 0x0, 0x0, 0xc000a55780, 0x1, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:189 +0xa9
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc0006a6cc0, 0x82e3680, 0xc0008c6360, 0xc00064f5e0, 0x0, 0x15fe69fcb70841c0, 0x0, 0x0, 0x0, 0xc000a55780, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:142 +0xb8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc0006a6cc0, 0x82e3680, 0xc0008c6360, 0xc00064f5e0, 0x0, 0x15fe69fcb70841c0, 0x0, 0x0, 0x0, 0xc000a55780, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:121 +0xfb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc0006f6e00, 0x82e3680, 0xc0008c6360, 0xc000a55880, 0x7bdafe0, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:192 +0x270
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc0006f6e00, 0x82e3680, 0xc0008c6330, 0x6, 0xc000a55880, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:94 +0x68a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc0006f6e00, 0x82e3680, 0xc0008c6330, 0x15fe69fcb70841c0, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:36 +0x91
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc0006f7500, 0x82e3680, 0xc0008c62d0, 0x15fe69fcb70841c0, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:204 +0x6c0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000b939a0, 0x82e3680, 0xc0008c62d0, 0x15fe69fcb70841c0, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:188 +0xed
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x82e3680, 0xc0008c62d0, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:925 +0x201
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc00073a000, 0x82e3680, 0xc0008c62d0, 0x7a42e14, 0x10, 0xc003191f40, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:302 +0x140
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc00001f900, 0x82e3680, 0xc0008c62d0, 0xc000a55800, 0xc0008c62d0, 0xc0008c6090, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:913 +0x194
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc00001f900, 0x82e3680, 0xc0008c62a0, 0xc000a55800, 0xc0008c6210, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:951 +0x9f

Just realized I haven't pulled in a day or two. Perhaps this is already fixed.

@petermattis
Collaborator Author

Just realized I haven't pulled in a day or two. Perhaps this is already fixed.

Just rebased on top of 94bef65 and the hung show tables query is still happening. Goroutine dump shows a goroutine blocked in MaybeWaitForPush similar to the above. I don't see any other goroutines doing anything interesting.

@nvb
Contributor

nvb commented Mar 21, 2020

Good idea! We've been in need of this form of randomized testing around schema changes for a while. Do you picture extending this to include non-empty tables in order to stress the backfill machinery as well?

I haven't dug into this at all, but I did try to get the load gen running and saw the same stall that you are reporting. I can also easily reproduce on v19.2.4. It seems like this has already found something interesting to explore.

@petermattis
Collaborator Author

petermattis commented Mar 21, 2020 via email

@ajwerner
Contributor

I'd expect #46170 to fix the hangs for the operations in this test. #46384 seems important for operations that create and drop databases, but it doesn't seem like this test does that. Let me try to repro and see if we missed some other resolution cases.

@ajwerner
Contributor

I'd totally expect this to happen in 19.2 and earlier. We should backport the above fixes.

@ajwerner
Contributor

Alright, I went and read the code and I'm not convinced this is a bug. The hangs always occur during an explicit transaction, when looking up the next operation (L280). That randOp() call attempts to read the schema from a separate transaction in order to pick the next operation, and that read blocks on the transaction the workload holds open.

There are two options:

  1. Use BEGIN PRIORITY HIGH; ... ; COMMIT; for randOp()
  2. Thread the tx through to randOp()

@ajwerner
Contributor

I'd note that option 2 is probably better than option 1 for a couple of reasons: it would also test interacting with changes made during the transaction, and it would mean that we don't always push all of these transactions.

@petermattis petermattis force-pushed the pmattis/workload-schemachange branch from 1ecb0a0 to 9b8ecfe Compare March 22, 2020 15:40
@petermattis
Collaborator Author

Ah, thanks for the eagle-eyes, Andrew. I was suspicious I was doing something wrong given how easy this was to reproduce. I've gone ahead and plumbed tx everywhere and can now successfully run 1000 randomized schema changes. I haven't yet tested with concurrency, and there are likely other bugs in this test code (views never seem to be created).

@petermattis
Collaborator Author

Ok, --concurrency=1 seems to be working ok modulo deficiencies in the schema changes, but I can get things wedged up with --concurrency=2:

~ bin/workload run schemachange --init --concurrency=2 --verbose=false --max-ops=1000

The wedging happens very quickly (within a few seconds).

~ roachprod sql local -- -d schemachange -e 'show queries'
              query_id             | node_id |            session_id            | user_name |              start               |        query         | client_address  | application_name | distributed |   phase
-----------------------------------+---------+----------------------------------+-----------+----------------------------------+----------------------+-----------------+------------------+-------------+------------
  15feabc84c3d16500000000000000001 |       1 | 15feabc84c35ee480000000000000001 | root      | 2020-03-22 16:03:15.641555+00:00 | SHOW CLUSTER QUERIES | 127.0.0.1:54971 | $ cockroach sql  |    false    | executing
  15feabc74bb964a00000000000000001 |       1 | 15feabc73c55ecb80000000000000001 | root      | 2020-03-22 16:03:11.338257+00:00 | DROP TABLE table17   | 127.0.0.1:54963 |                  |    false    | executing
(2 rows)
~ roachprod sql local -- -d schemachange -e 'select active_queries,last_active_query from [show sessions]'
                             active_queries                             |                                       last_active_query
------------------------------------------------------------------------+------------------------------------------------------------------------------------------------
                                                                        | CREATE DATABASE IF NOT EXISTS schemachange
                                                                        | COMMIT TRANSACTION
                                                                        | COMMIT TRANSACTION
  SELECT active_queries, last_active_query FROM [SHOW CLUSTER SESSIONS] |
                                                                        | SELECT table_name FROM [SHOW TABLES] WHERE table_name LIKE 'table%' ORDER BY random() LIMIT 1
  DROP TABLE table17                                                    | SELECT table_name FROM [SHOW TABLES] WHERE table_name LIKE 'table%' ORDER BY random() LIMIT 1
(6 rows)

It's possible my code is still doing something wrong. I'll poke around this some more this afternoon.

@ajwerner
Contributor

I suspect now you're hitting the issue fixed by #46384. Let me pull your branch now and see.

@petermattis
Collaborator Author

I haven't tried on top of #46384 yet, but what I'm seeing is multiple goroutines that look like:

github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait.(*Queue).MaybeWaitForPush(0xc0031523c0, 0x82ec140, 0xc0046b1290, 0xc0047da000, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait/queue.go:513 +0xf95
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).maybeInterceptReq(0xc003144180, 0x82ec140, 0xc0046b1290, 0x0, 0x15feb0705a1b5a28, 0x0, 0x0, 0x0, 0xc004e52a80, 0x1, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:189 +0xa9
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc003144180, 0x82ec140, 0xc0046b1290, 0xc004897dc0, 0x0, 0x15feb0705a1b5a28, 0x0, 0x0, 0x0, 0xc004e52a80, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:142 +0xb8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc003144180, 0x82ec140, 0xc0046b1290, 0x0, 0x0, 0x15feb0705a1b5a28, 0x0, 0x0, 0x0, 0xc004e52a80, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:121 +0xfb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc000694e00, 0x82ec140, 0xc0046b1290, 0xc004e52b80, 0x7be1c88, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:213 +0x36f
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc000694e00, 0x82ec140, 0xc0046b1260, 0x6, 0xc004e52b80, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:94 +0x68a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(0xc000694e00, 0x82ec140, 0xc0046b1260, 0x15feb0705a1b5a28, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:36 +0x91
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc0010b6000, 0x82ec140, 0xc0046b1200, 0x15feb0705a1b5a28, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:204 +0x6c0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc0008740a0, 0x82ec140, 0xc0046b1200, 0x15feb0705a1b5a28, 0x0, 0x100000001, 0x1, 0x0, 0x6, 0x0, ...)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:188 +0xed
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x82ec140, 0xc0046b1200, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:925 +0x201
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000766000, 0x82ec140, 0xc0046b1200, 0x7a4949c, 0x10, 0xc007480490, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:302 +0x140
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000024000, 0x82ec140, 0xc0046b1200, 0xc004e52b00, 0xc0046b1200, 0xc0052b7f80, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:913 +0x194
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000024000, 0x82ec140, 0xc0046b11d0, 0xc004e52b00, 0xc0053fc0c0, 0x0, 0x0)
	/Users/pmattis/Development/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:951 +0x9f

@petermattis
Collaborator Author

I suspect now you're hitting the issue fixed by #46384. Let me pull your branch now and see.

No success running on top of #46384. The same wedging still occurs.

@ajwerner
Contributor

ajwerner commented Mar 22, 2020

Nice catch. If you wait 5 minutes you'll find that it unblocks. Seems like somewhere we're not releasing a schema lease in the right place. Digging deeper.

@petermattis
Collaborator Author

Nice catch. If you wait 5 minutes you'll find that it unblocks. Seems like somewhere we're not releasing a schema lease in the right place. Digging deeper.

Heh, well that was the point of writing this test: finding bugs. Fingers crossed that this is a real one and not too difficult to fix. FYI, I'm not going to have much time to work on this in the early part of this week.

@petermattis petermattis force-pushed the pmattis/workload-schemachange branch from 9b8ecfe to 59924e9 Compare March 24, 2020 19:09
@spaskob
Contributor

spaskob commented Mar 25, 2020

At the schema change meeting yesterday we discussed that, if you don't mind Peter, I should try to merge the workload part of this PR to unblock further work on roachtests for schema changes.

@petermattis
Collaborator Author

Superseded by #46632

@petermattis petermattis deleted the pmattis/workload-schemachange branch March 27, 2020 13:07
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Apr 9, 2020
…nced

This commit is to help @spaskob avoid a hang observed while working on
turning cockroachdb#46632 (workload/schemachange) into a roachtest. The idea is
that as a stopgap the roachtest can issue:

```
SET CLUSTER SETTING sql.lease_manager.remove_lease_once_deferenced = true;
```

The hang was discovered while looking at cockroachdb#46402.

Release note: None
