Bug Report: VReplicationStreamState falls out of sync with --workflow status on resume #15337

@tycol7

Overview of the Issue

In a VReplication workflow (e.g., MoveTables), the --workflow status and VReplicationStreamState statuses fall out of sync when resuming the workflow after an interruption:

| Timeline | --workflow status | VReplicationStreamState |
|---|---|---|
| 1. Initial --workflow create | Copying | Copying |
| 2. Interruption with --workflow stop | Stopped | Stopped |
| 3. Resumption with --workflow start | Copying | Running ⚠️ |
| 4. After we are done copying but still running | Running | Running |

After a restart, VReplicationStreamState reports Running while --workflow status still reports Copying. Monitoring and reporting built on VReplicationStreamState can then draw faulty conclusions (e.g., that copying is done when it really isn't).
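To illustrate the kind of guard a monitoring script could apply until this is fixed, here is a minimal sketch that treats the copy phase as finished only when both signals report Running. The function name `copy_finished` is hypothetical and not part of Vitess; the two arguments are assumed to be the state strings shown in the table above.

```shell
# Hypothetical helper (not part of Vitess): succeeds only when BOTH signals say Running.
# $1: stream status from `--workflow ... status`
# $2: value from VReplicationStreamState
copy_finished() {
  [ "$1" = "Running" ] && [ "$2" = "Running" ]
}

# Row 3 of the timeline: workflow status is Copying, VReplicationStreamState is Running.
if copy_finished "Copying" "Running"; then
  echo "copy finished"
else
  echo "still copying"
fi
```

With the row 3 values this takes the `else` branch, so a check written this way would not mistake a resumed copy for a completed one.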

Reproduction Steps

  1. Spin up a new cluster, e.g.,

    ./examples/local/101_initial_cluster.sh
    
  2. Insert enough data such that VReplication takes long enough to capture stats, e.g.,

    mysql < examples/common/insert_commerce_data.sql
    

    and

    mysql -e "insert into customer (email) select email from customer"
    

    ☝️ doubles the rows on every run.[1] 20,971,520 rows is enough for our purposes.

  3. Spin up additional tablets in preparation for VReplication, e.g.,

    ./examples/local/201_customer_tablets.sh
    
  4. Begin the VReplication, e.g.,

    vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer create --source-keyspace commerce --tables 'customer,corder'
    
  5. Observe the following statuses:
    a. --workflow status

    $ vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer status --format=json
    …
     "shard_streams": {
      "customer/0": {
        "streams": [
          {
            "id": 1,
            "tablet": {
              "cell": "zone1",
              "uid": 200
            },
            "source_shard": "commerce/0",
            "position": "64425eca-d1e1-11ee-a4e7-ddc645075491:1-64",
            "status": "Copying",
            "info": "VStream Lag: 0s"
          }
    

    b. VReplicationStreamState:

    $ curl -s http://localhost:15200/debug/vars | grep "VReplicationStreamState"
    "VReplicationStreamState": {"commerce2customer.1": "Copying"},
    

    Expected: The states match (Copying and Copying).

  6. Stop the workflow, e.g.,

    vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer stop --format=json
    
  7. Resume the workflow, e.g.,

    vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer start --format=json
    
  8. Observe the statuses again:
    a. --workflow status

    $ vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer status --format=json
    …
    "customer/0": {
    "streams": [
      {
        "id": 1,
        "tablet": {
          "cell": "zone1",
          "uid": 200
        },
        "source_shard": "commerce/0",
        "position": "64425eca-d1e1-11ee-a4e7-ddc645075491:1-65",
        "status": "Copying",
        "info": "VStream Lag: -1s; ; Tx time: Thu Feb 22 16:51:59 2024."
      }
    

    b. VReplicationStreamState

    $ curl -s http://localhost:15200/debug/vars | grep "VReplicationStreamState"
    "VReplicationStreamState": {"commerce2customer.1": "Running"},
    

    ⚠️ Unexpected: VReplicationStreamState says Running when workflow status is still Copying.

  9. (Optional) Allow the copy to finish (i.e., wait a few minutes while the workflow is running). Observe both statuses are Running as expected.
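The two observations in steps 5 and 8 can be collected with a small script. The following is a rough sketch, assuming the output shapes shown above (a `"status": "…"` field per stream, and `VReplicationStreamState` printed on a single line of `/debug/vars`). The function names `workflow_statuses` and `stream_state` are hypothetical helpers, not part of Vitess.

```shell
# Hypothetical helpers (not part of Vitess). Both read command output on stdin.

# Print every "status" value found in `MoveTables ... status --format=json` output.
workflow_statuses() {
  grep -o '"status": *"[^"]*"' | sed 's/.*: *"\(.*\)"/\1/'
}

# Print the VReplicationStreamState value for one stream key, e.g. commerce2customer.1.
# Assumes the whole VReplicationStreamState map sits on one line, as in the output above.
stream_state() {
  grep '"VReplicationStreamState"' \
    | grep -o "\"$1\": *\"[^\"]*\"" \
    | sed 's/.*: *"\(.*\)"/\1/'
}

# Usage against the local example cluster from the steps above
# (commented out because it needs a running cluster):
# wf=$(vtctldclient --server localhost:15999 MoveTables --target-keyspace customer --workflow commerce2customer status --format=json | workflow_statuses)
# vr=$(curl -s http://localhost:15200/debug/vars | stream_state commerce2customer.1)
# echo "workflow status: $wf, VReplicationStreamState: $vr"
```

Text matching with `grep` and `sed` keeps the sketch dependency-free; a JSON-aware tool would be more robust if the output layout differs from what is shown in this report.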

Binary Version

vtgate version Version: 20.0.0-SNAPSHOT (Git revision 27be9166e1ace2708a158e9faf220cf156569e50 branch 'main') built on Thu Feb 22 17:00:55 PST 2024 by tyler@local using go1.22.0 darwin/amd64

Operating System and Environment details

- macOS 14.3.1 (23D60)
- Darwin 23.3.0
- arm64

Log Fragments

See above.

Footnotes

  1. Thanks @maxenglander!
