Overview of the Issue
When a source tablet's mysqld fails, the VReplication workflow never recovers: it keeps trying to replicate from the unhealthy tablet.
Secondly, when the tablet selection preference always chooses e.g. REPLICA tablets when available (the default value of --tablet_types="in_order:REPLICA,PRIMARY" does this) and none of the REPLICA tablets in the shard are healthy (each one has a down mysqld), we never attempt to fall back to the next tablet type in the list (in this case PRIMARY).
Reproduction Steps
Test case:
git checkout main && make build
cd examples/local
./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh
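# Shut down mysqld on the commerce replica. The 10# prefix forces base 10 so the zero-padded UID is not parsed as octal by bash.
tablet_uid=$(vtctldclient GetTablets --keyspace commerce --tablet-type replica | awk '{print $1}' | cut -d- -f2)
mysqlctl --tablet_uid=$((10#${tablet_uid})) shutdown
# Watch the workflow status: it never recovers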
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done
Note that stopping and starting the workflow in this test case also does not help:
vtctlclient Workflow -- customer.commerce2customer stop; vtctlclient Workflow -- customer.commerce2customer start
# Watch the workflow status again: it still never recovers
for _ in {1..500}; do
  vtctlclient Workflow -- customer.commerce2customer show | jq .ShardStatuses
  sleep 1
done
This is a TabletPicker issue with two parts. First, because the tablet types are set to the default of --tablet_types="in_order:REPLICA,PRIMARY", we weed out all non-REPLICA tablets before we look at tablet health. Second, the only thing we test when considering tablet health is whether we can make a gRPC call to the tablet, NOT whether the tablet actually reports itself as healthy and serving.
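To make the expected behavior concrete, here is a rough sketch of the selection order I would expect for in_order preferences. It is not the actual TabletPicker code: the Tablet struct, the Serving field, and pickInOrder are hypothetical names made up for illustration, and it simply takes the first match where a real picker might choose among candidates differently. The point is only that health is checked per type, and that an unhealthy preferred type falls through to the next one.

```go
package main

import (
	"errors"
	"fmt"
)

// TabletType mirrors the type names accepted by --tablet_types.
type TabletType string

const (
	Primary TabletType = "PRIMARY"
	Replica TabletType = "REPLICA"
)

// Tablet is a hypothetical, simplified stand-in for a tablet record plus
// the health it reports. It is NOT the real Vitess type.
type Tablet struct {
	Alias   string
	Type    TabletType
	Serving bool // true only if the tablet reports itself healthy and serving
}

var errNoHealthyTablet = errors.New("no healthy serving tablet for any requested type")

// pickInOrder (hypothetical) walks the preferred types in order and returns
// the first tablet of the current type that is serving. If no tablet of a
// preferred type is serving, it falls through to the next type instead of
// giving up or returning an unhealthy tablet.
func pickInOrder(tablets []Tablet, order []TabletType) (Tablet, error) {
	for _, want := range order {
		for _, t := range tablets {
			if t.Type == want && t.Serving {
				return t, nil
			}
		}
	}
	return Tablet{}, errNoHealthyTablet
}

func main() {
	// Example values only: one healthy PRIMARY and one REPLICA whose
	// mysqld is down, as in the test case above.
	tablets := []Tablet{
		{Alias: "zone1-0000000100", Type: Primary, Serving: true},
		{Alias: "zone1-0000000101", Type: Replica, Serving: false},
	}
	picked, err := pickInOrder(tablets, []TabletType{Replica, Primary})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("picked", picked.Alias)
}
```

With the REPLICA marked as not serving, this sketch picks the PRIMARY, which is the fallback that the workflow currently never attempts.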
Binary Version
Version: 18.0.0-SNAPSHOT (Git revision 98918326587815d8e934711b817fd10630643772 branch 'main') built on Mon Jul 17 16:02:39 EDT 2023 by matt@pslord.local using go1.20.5 darwin/arm64
Operating System and Environment details
Log Fragments