Bug Report: Connection Pool Full error locks up vttablet #15745
Description
Overview of the Issue
We're randomly seeing vttablet lock up with errors like this:
PoolFull: skipped 60 log messages
Once it gets into that state, health checks start failing and query serving never recovers until I restart the whole tablet pod.
This is a complete guess, but I wonder whether this happens on k8s because an unhealthy vttablet gets reported to the k8s service, which then stops routing traffic to it, while the health check is waiting for new queries before it reports healthy again.
We haven't seen any connection pool errors in ~5 years (it's possible we missed them in the logs, but there was never an outage). The only flag we removed when upgrading from v18 to v19 was --queryserver-config-query-cache-size 100. @deepthi pointed to @vmg's PR #14034, a major refactor of the connection pools, as the first place to look.
Maybe we're in an edge case because of our usage of message tables. Until v15, there was a separate flag/pool --queryserver-config-message-conn-pool-size for messaging connections, so maybe those aren't accounted for?
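To make the guess above concrete, here is a toy model in Python of how a fixed-size pool behaves when some borrowers never return their connections. It is only a sketch and not Vitess's pool implementation: `FixedPool`, `PoolTimeout`, `serve_query`, the pool size of 2, and the idea that the long-lived holders are messaging connections are all assumptions made for illustration.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the wait timeout."""


class FixedPool:
    """Hypothetical fixed-size pool; stands in for a real connection pool."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def get(self, timeout):
        # Block until a connection is free or the timeout elapses.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolTimeout("resource pool timed out") from None

    def put(self, conn):
        self._free.put(conn)


def serve_query(pool, timeout=0.05):
    """Borrow a connection, 'run' a query, and always return the connection."""
    try:
        conn = pool.get(timeout)
    except PoolTimeout:
        return "RESOURCE_EXHAUSTED"
    try:
        return "OK"
    finally:
        pool.put(conn)


pool = FixedPool(size=2)

# Assumption: long-lived holders (for example messaging pollers) borrow from
# the same pool and are not accounted for, so they never give connections back.
held = [pool.get(timeout=0.05) for _ in range(2)]

# Every ordinary query now times out waiting for a connection.
results = [serve_query(pool) for _ in range(3)]

# Only once the held connections are returned does serving recover.
for conn in held:
    pool.put(conn)
recovered = serve_query(pool)
```

In this model the pool never recovers on its own: queries keep failing until the held connections are released, which would look the same from the outside as a tablet that only recovers after a restart.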
Reproduction Steps
vttablet flags
vttablet
--topo_implementation="etcd2"
--topo_global_server_address="etcd-global-client.vitess:2379"
--topo_global_root=/vitess/global
--logtostderr
--port 15002
--grpc_port 16002
--service_map "grpc-queryservice,grpc-tabletmanager,grpc-updatestream"
--grpc_prometheus
--tablet_dir "tabletdata"
--tablet-path "uscentral1-$(cat /vtdataroot/tabletdata/tablet-uid)"
--tablet_hostname "$(hostname).vttablet"
--init_keyspace "companies"
--init_shard "0"
--init_tablet_type "replica"
--health_check_interval "5s"
--mysqlctl_socket "/vtdataroot/mysqlctl.sock"
--enable_replication_reporter
--vreplication_max_time_to_retry_on_error 8760h
--init_db_name_override "companies"
--grpc_max_message_size 100000000
--restore_from_backup
--backup_storage_implementation=$VT_BACKUP_SERVICE
--gcs_backup_storage_bucket=$VT_GCS_BACKUP_STORAGE_BUCKET
--gcs_backup_storage_root=$VT_GCS_BACKUP_STORAGE_ROOT
--app_pool_size="400"
--dba_pool_size="10"
--grpc_keepalive_time="10s"
--grpc_server_keepalive_enforcement_policy_permit_without_stream="true"
--queryserver-config-max-result-size="10000"
--queryserver-config-message-postpone-cap="25"
--queryserver-config-passthrough-dmls="true"
--queryserver-config-pool-size="50"
--queryserver-config-query-timeout=600s
--queryserver-config-transaction-cap="300"
Binary Version
all components v19.0.3
Percona Server v8.0.36
Operating System and Environment details
GKE 1.29.3
Debian Bookworm
Log Fragments
E0417 17:26:06.069538 3335298 throttled.go:77] PoolFull: skipped 60 log messages
E0417 17:26:07.027101 3335298 tabletserver.go:1665] PoolFull: Code: RESOURCE_EXHAUSTED
resource pool timed out (CallerID: unsecure_grpc_client)
E0417 17:27:07.027359 3335298 throttled.go:77] PoolFull: skipped 29 log messages
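Because the throttled logger collapses most of these errors into "skipped N log messages" lines, the real error volume is easy to underestimate. The following Python sketch tallies both the logged and the skipped PoolFull messages from lines in the format shown above. The regular expressions and the `summarize_pool_full` helper are assumptions based only on this fragment, not on any Vitess tooling.

```python
import re

# glog-style prefix as seen in the fragment above:
# severity + MMDD, time, thread id, source file:line, then the message.
LINE_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)
SKIPPED_RE = re.compile(r"PoolFull: skipped (\d+) log messages")


def summarize_pool_full(lines):
    """Count PoolFull errors that were logged and those that were throttled."""
    logged = 0
    skipped = 0
    for line in lines:
        match = LINE_RE.match(line)
        # Continuation lines (no glog prefix) and unrelated messages are ignored.
        if not match or not match.group("msg").startswith("PoolFull:"):
            continue
        throttled = SKIPPED_RE.search(match.group("msg"))
        if throttled:
            skipped += int(throttled.group(1))
        else:
            logged += 1
    return {"logged": logged, "skipped": skipped, "total": logged + skipped}


SAMPLE = [
    "E0417 17:26:06.069538 3335298 throttled.go:77] PoolFull: skipped 60 log messages",
    "E0417 17:26:07.027101 3335298 tabletserver.go:1665] PoolFull: Code: RESOURCE_EXHAUSTED",
    "resource pool timed out (CallerID: unsecure_grpc_client)",
    "E0417 17:27:07.027359 3335298 throttled.go:77] PoolFull: skipped 29 log messages",
]

summary = summarize_pool_full(SAMPLE)
print(summary)
```

For the four lines above this counts 1 logged error and 89 skipped ones, so roughly 90 PoolFull errors in about a minute.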