Overview of the Issue
When the throttler on vttablet encounters a topo watch error (e.g. zookeeper closing the connection) it doesn’t recreate the watch on the SrvKeyspace, so it never gets any more throttler config updates.
In the vttablet logs we see this error message:
E0425 10:45:01.036900 40711 throttler.go:381] WatchSrvKeyspaceCallback error: ResilientWatch stream failed for zone1.commerce: zk: zookeeper is closing
That comes from here (vitess/go/vt/vttablet/tabletserver/throttle/throttler.go, lines 379 to 384 in 68242a6):

    if err != nil {
        if !topo.IsErrType(err, topo.Interrupted) && !errors.Is(err, context.Canceled) {
            log.Errorf("WatchSrvKeyspaceCallback error: %v", err)
        }
        return false
    }

The problem is that the callback function returns false when called with an error, and it looks like a callback that returns false is effectively removed from the list of listeners here (vitess/go/vt/srvtopo/watch.go, lines 164 to 171 in 68242a6):

    listeners := entry.listeners
    entry.listeners = entry.listeners[:0]

    for _, callback := range listeners {
        if callback(entry.value, entry.lastError) {
            entry.listeners = append(entry.listeners, callback)
        }
    }

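To make the suspected mechanism concrete, here is a minimal standalone sketch of the listener bookkeeping. It is not the Vitess code: `watchEntry`, `notify`, and `simulate` are hypothetical names invented for illustration, and only the filtering loop is copied from the snippet above. It compares a callback that returns false on error (as the throttler's does) with one that returns true. Whether returning true on a transient error is the right fix in Vitess is an assumption, not something confirmed here.

```go
package main

import (
	"errors"
	"fmt"
)

// listener has the same shape as the callbacks in the loop above: it gets the
// latest value and error, and returns true to stay subscribed.
type listener func(value string, err error) bool

// watchEntry is a hypothetical stand-in for the srvtopo watch entry.
type watchEntry struct {
	value     string
	lastError error
	listeners []listener
}

// notify records the new state and then runs the same filtering loop as the
// snippet above: only callbacks that return true are kept.
func (e *watchEntry) notify(value string, err error) {
	e.value, e.lastError = value, err

	listeners := e.listeners
	e.listeners = e.listeners[:0]

	for _, callback := range listeners {
		if callback(e.value, e.lastError) {
			e.listeners = append(e.listeners, callback)
		}
	}
}

// simulate delivers a config, then a watch error, then a second config, and
// reports how many configs each style of callback received.
func simulate() (dropsOnError, staysOnError int) {
	entry := &watchEntry{}
	entry.listeners = append(entry.listeners,
		// Same behavior as the throttler callback: give up on any error.
		func(_ string, err error) bool {
			if err != nil {
				return false
			}
			dropsOnError++
			return true
		},
		// Assumed alternative: treat the error as transient and stay subscribed.
		func(_ string, err error) bool {
			if err != nil {
				return true
			}
			staysOnError++
			return true
		},
	)

	entry.notify("enabled:true threshold:1", nil)
	entry.notify("", errors.New("zk: zookeeper is closing"))
	entry.notify("enabled:true threshold:2", nil)
	return dropsOnError, staysOnError
}

func main() {
	dropped, kept := simulate()
	fmt.Printf("returns false on error: saw %d config update(s)\n", dropped)
	fmt.Printf("returns true on error: saw %d config update(s)\n", kept)
}
```

In this model the first callback sees only the initial config, while the second also sees the config delivered after the error. That matches the symptom described in this issue. The sketch covers only the listener list, not whether the underlying topo watch is re-established after the error.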
Reproduction Steps
I can reproduce this easily using zookeeper as the topo server because connection errors get surfaced as watch errors. I'm not sure how to force a watch error with etcd.
- Start a cluster using zookeeper as the topo server
TOPO=zk2 ./101_initial_cluster.sh
- Tail the vttablet logs
- Enable the throttler:
vtctldclient --server localhost:15999 UpdateThrottlerConfig --enable --threshold 1.0 commerce
- After a few seconds you should see something like this in the vttablet logs:
I0425 10:43:45.794495 40711 throttler.go:425] Throttler: applying topo config: enabled:true threshold:1
I0425 10:43:45.794520 40711 throttler.go:531] Throttler: enabling
- Stop zookeeper
TOPO=zk2 CELL=zone1 ../common/scripts/zk-down.sh
- You should see something like this in the vttablet logs:
E0425 10:45:01.033478 40711 watch.go:211] ResilientWatch stream failed for zone1.commerce: zk: zookeeper is closing
received a non-OK event for /vitess/zone1/keyspaces/commerce/SrvKeyspace
E0425 10:45:01.036900 40711 throttler.go:381] WatchSrvKeyspaceCallback error: ResilientWatch stream failed for zone1.commerce: zk: zookeeper is closing
- Start zookeeper
TOPO=zk2 CELL=zone1 ../common/scripts/zk-up.sh
- Modify the throttler config:
vtctldclient --server localhost:15999 UpdateThrottlerConfig --enable --threshold 2.0 commerce
- The change will never be seen by the vttablet
Binary Version
Impacts main and earlier versions
Operating System and Environment details
Log Fragments