
Bug Report: Pre-existing Tablet Controls breaks MoveTables SwitchTraffic #13999

@FancyFane

Overview of the Issue

When the target keyspace has pre-existing tablet controls, MoveTables SwitchTraffic fails with an error that requires manual cleanup before reads and writes can resume. This occurs when the TabletControls contain a denied_tables list that does not match the tables of the currently running workflow. If the workflow's tables do not match the TabletControls one-for-one, SwitchTraffic returns an error.

Any traffic the application sends after this point continues to fail until the TabletControls are removed and the shard state is refreshed.
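
To illustrate the failure mode, here is a minimal sketch of a strict denylist removal. It is a simplified model written for this report, not the Vitess source: the function name `remove_denied_tables` and the `DenylistError` class are hypothetical, and only the error text is taken from the SwitchTraffic output in this issue.

```python
class DenylistError(ValueError):
    """Raised when a removal names tables that are not in the denylist."""


def remove_denied_tables(denylist, tables):
    """Return the denylist with `tables` removed.

    Hypothetical model of a strict check: the whole removal is rejected
    if any requested table is absent from the denylist.
    """
    requested = set(tables)
    if requested - set(denylist):
        raise DenylistError(
            "cannot remove tables since one or more do not exist in the denylist"
        )
    return [t for t in denylist if t not in requested]


# Denylist left behind by the cancelled workflow (from the GetShard output).
stale = ["sbtest1", "sbtest2", "sbtest3", "sbtest4", "sbtest5", "sbtest6", "testing"]
# Tables in the new workflow: sbtest1-8 plus testing.
workflow = [f"sbtest{i}" for i in range(1, 9)] + ["testing"]

try:
    remove_denied_tables(stale, workflow)
    outcome = "removed"
except DenylistError as exc:
    outcome = str(exc)
print(outcome)
```

Under this model, removing exactly the stale list succeeds, while removing the larger workflow list is rejected, which matches the symptom reported here.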

Related Issue: #13998

Reproduction Steps

  1. Run a MoveTables workflow with 6 sbtest tables, then SwitchTraffic, ReverseTraffic, and cancel the workflow. This leaves Tablet Controls in place on the target with no running workflow.

See Issue: #13998

$ vtctlclient --server :15999 GetShard fane_import_sharded/-80
{
...
  "tablet_controls": [
    {
      "tablet_type": 1,
      "cells": [],
      "denied_tables": [
        "sbtest1",
        "sbtest2",
        "sbtest3",
        "sbtest4",
        "sbtest5",
        "sbtest6",
        "testing"
      ],
...
}
  2. Add two new sbtest tables on the source and start a new workflow. NOTE: the workflow's matching rules cover tables sbtest1-8, but the tablet controls only cover sbtest1-6.
$ vtctlclient --server :15999 Workflow fane_import_sharded.import-shard-80 show
{
	"Workflow": "import-shard-80",
	"SourceLocation": {
		"Keyspace": "fane_import_sharded_source",
		"Shards": [
			"-80"
		]
	},
	"TargetLocation": {
		"Keyspace": "fane_import_sharded",
		"Shards": [
			"-80"
		]
	},
	"MaxVReplicationLag": 1,
	"MaxVReplicationTransactionLag": 1,
	"Frozen": false,
	"ShardStatuses": {
		"-80/aws_useast1a_6-3337899395": {
			"PrimaryReplicationStatuses": [
				{
					"Shard": "-80",
					"Tablet": "aws_useast1a_6-3337899395",
					"ID": 6,
					"Bls": {
						"keyspace": "fane_import_sharded_source",
						"shard": "-80",
						"filter": {
							"rules": [
								{
									"match": "sbtest1",
									"filter": "select * from sbtest1 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest2",
									"filter": "select * from sbtest2 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest3",
									"filter": "select * from sbtest3 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest4",
									"filter": "select * from sbtest4 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest5",
									"filter": "select * from sbtest5 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest6",
									"filter": "select * from sbtest6 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest7",
									"filter": "select * from sbtest7 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "sbtest8",
									"filter": "select * from sbtest8 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
								},
								{
									"match": "testing",
									"filter": "select * from testing"
								}
							]
						}
					},
					"Pos": "7c3368f8-5412-11ee-8179-0a26551b1c25:1-1584,7c434390-5412-11ee-8c60-0a26551b1c25:1",
					"StopPos": "",
					"State": "Running",
					"DBName": "fane_import_sharded",
					"TransactionTimestamp": 0,
					"TimeUpdated": 1694815788,
					"TimeHeartbeat": 1694815788,
					"TimeThrottled": 0,
					"ComponentThrottled": "",
					"Message": "",
					"Tags": "",
					"WorkflowType": "MoveTables",
					"WorkflowSubType": "Partial",
					"CopyState": null,
					"RowsCopied": 0
				}
			],
			"TabletControls": [
				{
					"tablet_type": 1,
					"denied_tables": [
						"sbtest1",
						"sbtest2",
						"sbtest3",
						"sbtest4",
						"sbtest5",
						"testing"
					]
				}
			],
			"PrimaryIsServing": true
		}
	},
	"SourceTimeZone": "",
	"TargetTimeZone": ""
}
  3. Performing a SwitchTraffic fails:
$ vtctlclient --server :15999 MoveTables SwitchTraffic fane_import_sharded.import-shard-80     
E0915 22:10:10.097662     696 main.go:96] E0915 22:10:10.097104 traffic_switcher.go:625] allowTargetWrites failed: Code: INVALID_ARGUMENT
cannot remove tables since one or more do not exist in the denylist
E0915 22:10:10.114269     696 main.go:96] E0915 22:10:10.113676 vtctl.go:2215] 
cannot remove tables since one or more do not exist in the denylist

The following vreplication streams exist for workflow fane_import_sharded.import-shard-80:

id=6 on -80/aws_useast1a_6-3337899395: Status: Stopped. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist
E0915 22:10:10.216399     696 main.go:105] remote error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist
  4. Any writes the application sends to the keyspace during this time result in an error:
$ sysbench --db-driver=mysql --threads=1 --events=0 --time=0 --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=fane_import_sharded /usr/share/sysbench/oltp_insert.lua --tables=5 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Initializing worker threads...

Threads started!

FATAL: mysql_drv_query() returned error 1105 (target: fane_import_sharded_source.-80.primary: vttablet: rpc error: code = FailedPrecondition desc = disallowed due to rule: enforce denied tables (CallerID: admin)) for query 'INSERT INTO sbtest4 (id, k, c, pad) VALUES (0, 4098, '09169823527-14773847787-63328771402-43563606289-98835554319-17838113855-09276254645-46412092895-40264640011-92712584350', '67793249909-86081288100-12979568721-26815841297-77951231372')'
FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_insert.lua:61: SQL error, errno = 1105, state = 'HY000': target: fane_import_sharded_source.-80.primary: vttablet: rpc error: code = FailedPrecondition desc = disallowed due to rule: enforce denied tables (CallerID: admin)
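
The mismatch is visible in the `Workflow ... show` output before SwitchTraffic is attempted, so it could be checked up front. The sketch below assumes the JSON layout shown in that output (`ShardStatuses`, `PrimaryReplicationStatuses`, `Bls.filter.rules[].match`, `TabletControls[].denied_tables`); the function name `denylist_mismatches` is hypothetical and the sample is trimmed to the fields it reads.

```python
def denylist_mismatches(workflow):
    """Map each shard status key to the tables that differ between the
    workflow's filter rules and the shard's denied_tables.

    Shards with no denied tables, or with an exact match, are omitted.
    """
    report = {}
    for key, status in workflow.get("ShardStatuses", {}).items():
        wanted = {
            rule["match"]
            for stream in status.get("PrimaryReplicationStatuses") or []
            for rule in stream["Bls"]["filter"]["rules"]
        }
        denied = {
            table
            for control in status.get("TabletControls") or []
            for table in control.get("denied_tables") or []
        }
        if denied and wanted != denied:
            report[key] = {
                "not_in_denylist": sorted(wanted - denied),
                "not_in_workflow": sorted(denied - wanted),
            }
    return report


# Trimmed copy of the workflow output from this report.
tables = [f"sbtest{i}" for i in range(1, 9)] + ["testing"]
sample = {
    "ShardStatuses": {
        "-80/aws_useast1a_6-3337899395": {
            "PrimaryReplicationStatuses": [
                {"Bls": {"filter": {"rules": [{"match": t} for t in tables]}}}
            ],
            "TabletControls": [
                {
                    "tablet_type": 1,
                    "denied_tables": [
                        "sbtest1", "sbtest2", "sbtest3",
                        "sbtest4", "sbtest5", "testing",
                    ],
                }
            ],
        }
    }
}
print(denylist_mismatches(sample))
```

A non-empty result would suggest clearing the stale tablet controls before running SwitchTraffic.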

Recovery Steps

  1. (recovery step) To recover, remove the tablet controls and refresh the shard state on the SOURCE:
vtctldclient --server localhost:15999 SetShardTabletControl --remove fane_import_sharded_source/-80 primary; 
vtctldclient --server localhost:15999 RefreshStateByShard fane_import_sharded_source/-80;
  2. (recovery step) Writes from the application now succeed:
$ sysbench --db-driver=mysql --threads=1 --events=0 --time=0 --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=fane_import_sharded /usr/share/sysbench/oltp_insert.lua --tables=5 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Initializing worker threads...

Threads started!
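
When several shards are affected, the two recovery commands above can be generated per shard. This is a small sketch that only builds the command lines and does not execute them; the helper name `recovery_commands` is hypothetical, and the subcommands, flags, and default server address are copied from the recovery step in this report.

```python
import shlex


def recovery_commands(keyspace, shard, server="localhost:15999", tablet_type="primary"):
    """Build the two vtctldclient invocations used in the recovery step.

    Returns a list of argv lists: one to remove the tablet control,
    one to refresh the shard state.
    """
    target = f"{keyspace}/{shard}"
    base = ["vtctldclient", "--server", server]
    return [
        base + ["SetShardTabletControl", "--remove", target, tablet_type],
        base + ["RefreshStateByShard", target],
    ]


# Print the commands for the source shard used in this report.
for argv in recovery_commands("fane_import_sharded_source", "-80"):
    print(shlex.join(argv))
```

Returning argv lists rather than strings keeps the commands safe to pass to `subprocess.run` without a shell, should they be executed later.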

Binary Version

Vitess 16.0.3

Operating System and Environment details

n/a

Log Fragments

n/a
