Bug Report: VReplication workflows don't auto recover when source tablet's mysqld fails #13519

@mattlord

Overview of the Issue

When a source tablet's mysqld fails, the VReplication workflow never recovers: it keeps trying to replicate from the unhealthy tablet.

Secondly, when the tablet selection preference always chooses REPLICA tablets when available (the default value of --tablet_types="in_order:REPLICA,PRIMARY" does this) and none of the REPLICA tablets in the shard are healthy (each one has a down mysqld), we never attempt to use the next tablet type in the list (in this case PRIMARY).

Reproduction Steps

Test case:

git checkout main && make build

cd examples/local

./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh

# 10# forces base 10 so the zero-padded UID is not parsed as octal in bash
tablet_uid=$((10#$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2))); mysqlctl --tablet_uid=${tablet_uid} shutdown

# see it never recover
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

Note that stopping and starting the workflow in this test case also does not help:

vtctlclient Workflow -- customer.commerce2customer stop; vtctlclient Workflow -- customer.commerce2customer start

# see it still never recover
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done

This is a TabletPicker issue. Because the tablet types are set to the default of --tablet_types="in_order:REPLICA,PRIMARY", we weed out all non-REPLICA tablets before we look at tablet health. In addition, the only thing we test when considering tablet health is whether we can make a gRPC call to the tablet, NOT whether the tablet actually reports itself as healthy and serving.
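To make the expected behavior concrete, here is a minimal sketch of an in-order selection that checks health before settling on a tablet type. It is not the actual TabletPicker code: the `Tablet` struct, the `Serving` field, and the `pickInOrder` function are hypothetical names used only for illustration, and the sketch returns the first match rather than choosing among candidates the way the real picker may.

```go
package main

import "fmt"

// TabletType and Tablet are simplified, hypothetical stand-ins for
// illustration; they are not the Vitess topodata types.
type TabletType string

const (
	Primary TabletType = "PRIMARY"
	Replica TabletType = "REPLICA"
)

type Tablet struct {
	Alias   string
	Type    TabletType
	Serving bool // assumed: the tablet reports itself as healthy and serving
}

// pickInOrder walks the preferred types in order and returns the first
// serving tablet of the earliest type that has one. Because the health
// check sits inside the per-type loop, a type whose tablets are all down
// is skipped and the next type in the list gets a chance.
func pickInOrder(tablets []Tablet, order []TabletType) (Tablet, bool) {
	for _, want := range order {
		for _, t := range tablets {
			if t.Type == want && t.Serving {
				return t, true
			}
		}
	}
	return Tablet{}, false
}

func main() {
	tablets := []Tablet{
		{Alias: "zone1-0000000100", Type: Primary, Serving: true},
		{Alias: "zone1-0000000101", Type: Replica, Serving: false}, // mysqld is down
	}
	picked, ok := pickInOrder(tablets, []TabletType{Replica, Primary})
	fmt.Println(picked.Alias, ok) // prints "zone1-0000000100 true"
}
```

With the equivalent of in_order:REPLICA,PRIMARY and the only REPLICA not serving, the sketch falls back to the PRIMARY instead of repeatedly selecting the unhealthy REPLICA, which is the behavior this report expects.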

Binary Version

Version: 18.0.0-SNAPSHOT (Git revision 98918326587815d8e934711b817fd10630643772 branch 'main') built on Mon Jul 17 16:02:39 EDT 2023 by matt@pslord.local using go1.20.5 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

N/A
