Bug Report: VReplicationStreamState falls out of sync with --workflow status on resume #15337
Description
Overview of the Issue
In a VReplication workflow (e.g., MoveTables), the --workflow status and VReplicationStreamState statuses fall out of sync when resuming the workflow after an interruption:
| Timeline | --workflow status | VReplicationStreamState |
|---|---|---|
| 1. Initial --workflow create | Copying | Copying |
| 2. Interruption with --workflow stop | Stopped | Stopped |
| 3. Resumption with --workflow start | Copying | Running |
| 4. After we are done copying but still running | Running | Running |
This Copying vs Running mismatch in VReplicationStreamState when restarting a workflow can lead to faulty assumptions in monitoring and reporting (e.g., thinking copying is done when it really isn't).
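To make the monitoring risk concrete, here is a minimal Python sketch of two hypothetical monitor rules. The function names are illustrative and not part of Vitess; the state strings are the ones from the table above. A rule that trusts `VReplicationStreamState` alone reports the copy as finished in row 3, while a rule that cross-checks the workflow status does not.

```python
def copy_done_naive(stream_state: str) -> bool:
    """Hypothetical monitor rule: trust VReplicationStreamState alone."""
    return stream_state == "Running"


def copy_done_cross_checked(stream_state: str, workflow_status: str) -> bool:
    """Hypothetical safer rule: require both sources to report Running."""
    return stream_state == "Running" and workflow_status == "Running"


# Row 3 of the table: the workflow status is Copying, the metric says Running.
print(copy_done_naive("Running"))                     # True (wrong: still copying)
print(copy_done_cross_checked("Running", "Copying"))  # False
```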
Reproduction Steps
1. Spin up a new cluster, e.g., `./examples/local/101_initial_cluster.sh`
2. Insert enough data such that `VReplication` takes long enough to capture stats, e.g., `mysql < examples/common/insert_commerce_data.sql` and `mysql -e "insert into customer (email) select email from customer"`. ☝️ doubles the rows on every run.[1] 20,971,520 rows is enough for our purposes.
3. Spin up additional tablets in preparation for `VReplication`, e.g., `./examples/local/201_customer_tablets.sh`
4. Begin the `VReplication`, e.g., `vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer create --source-keyspace commerce --tables 'customer,corder'`
5. Observe the following statuses:
   a. `--workflow status`:
      `$ vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer status --format=json`
      `… "shard_streams": { "customer/0": { "streams": [ { "id": 1, "tablet": { "cell": "zone1", "uid": 200 }, "source_shard": "commerce/0", "position": "64425eca-d1e1-11ee-a4e7-ddc645075491:1-64", "status": "Copying", "info": "VStream Lag: 0s" }`
   b. `VReplicationStreamState`:
      `$ curl -s http://localhost:15200/debug/vars | grep "VReplicationStreamState"`
      `"VReplicationStreamState": {"commerce2customer.1": "Copying"},`
   Expected: The states match (`Copying` and `Copying`).
6. Stop the workflow, e.g., `vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer stop --format=json`
7. Resume the workflow, e.g., `vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer start --format=json`
8. Observe the statuses again:
   a. `--workflow status`:
      `$ vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer status --format=json`
      `… "customer/0": { "streams": [ { "id": 1, "tablet": { "cell": "zone1", "uid": 200 }, "source_shard": "commerce/0", "position": "64425eca-d1e1-11ee-a4e7-ddc645075491:1-65", "status": "Copying", "info": "VStream Lag: -1s; ; Tx time: Thu Feb 22 16:51:59 2024." }`
   b. `VReplicationStreamState`:
      `$ curl -s http://localhost:15200/debug/vars | grep "VReplicationStreamState"`
      `"VReplicationStreamState": {"commerce2customer.1": "Running"},`
   ⚠️ Unexpected: `VReplicationStreamState` says `Running` when `workflow status` is still `Copying`.
9. (Optional) Allow the copy to finish (i.e., wait a few minutes while the workflow is running). Observe both statuses are `Running` as expected.
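The manual comparison in the steps above could be automated along these lines. This is only a sketch: the helper names are hypothetical, it assumes `shard_streams` sits at the top level of the `status --format=json` output (the output above is elided, so that is a guess), and it assumes the `VReplicationStreamState` keys follow the `<workflow>.<stream id>` pattern seen in `commerce2customer.1`.

```python
import json


def states_from_workflow_status(status_json: str, workflow: str) -> dict:
    """Map "<workflow>.<stream id>" to the status reported by `status --format=json`.

    Assumes "shard_streams" is a top-level key (not confirmed by the elided output).
    """
    doc = json.loads(status_json)
    states = {}
    for shard in doc.get("shard_streams", {}).values():
        for stream in shard.get("streams", []):
            states[f"{workflow}.{stream['id']}"] = stream["status"]
    return states


def states_from_debug_vars(vars_json: str) -> dict:
    """Extract the VReplicationStreamState map from a /debug/vars JSON document."""
    return dict(json.loads(vars_json).get("VReplicationStreamState", {}))


def find_mismatches(status_states: dict, metric_states: dict) -> dict:
    """Return {stream key: (workflow status, metric state)} for streams that disagree."""
    return {
        key: (status, metric_states.get(key))
        for key, status in status_states.items()
        if metric_states.get(key) != status
    }


if __name__ == "__main__":
    # Trimmed-down sample inputs modeled on the output observed after resuming.
    status_out = '{"shard_streams": {"customer/0": {"streams": [{"id": 1, "status": "Copying"}]}}}'
    vars_out = '{"VReplicationStreamState": {"commerce2customer.1": "Running"}}'
    print(find_mismatches(
        states_from_workflow_status(status_out, "commerce2customer"),
        states_from_debug_vars(vars_out),
    ))
```

In practice the two input strings would come from the `vtctldclient ... status --format=json` and `curl -s http://localhost:15200/debug/vars` commands shown in the steps; an empty result means the two sources agree.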
Binary Version
`vtgate version`
`Version: 20.0.0-SNAPSHOT (Git revision 27be9166e1ace2708a158e9faf220cf156569e50 branch 'main') built on Thu Feb 22 17:00:55 PST 2024 by tyler@local using go1.22.0 darwin/amd64`
Operating System and Environment details
- macOS 14.3.1 (23D60)
- Darwin 23.3.0
- arm64
Log Fragments
See above.
Footnotes
1. Thanks @maxenglander!