Describe the bug
TL;DR: Request forwarding errors occur during Active Node rotation in a Vault HA configuration.
We often see these forwarded-request errors in the logs during our EC2 instance rotation process, which sends a SIGTERM signal to the running Docker container hosting the Active Node. This seems to indicate that a request was sent to the Vault cluster, routed to a standby node, and failed because there was no Active Node address to forward to. The re-election takes less than a second in most cases we have seen, and depending on incoming traffic volume we see 1-3 forwarded-request errors while it happens.
{
@level | error
@message | error during forwarded RPC request
@module | core
@timestamp | 2022-01-11T20:26:34.495207Z
error | rpc error: code = Unavailable desc = transport is closing
}
To Reproduce
Steps to reproduce the behavior:
- Send SIGTERM to the Docker container running the Active Node
- Depending on traffic volume, see the error while the Active Node is being rotated
Expected behavior
We expect no forwarded-request errors during Active Node rotation, i.e. a zero-downtime rotation when the SIGTERM signal is sent to the Active Node container. Our current understanding is that the Active Node shuts down gracefully and the cluster elects a new Active Node. This re-election leaves a short window in which there is no Active Node address (1-3 seconds). During that window we expect a request to be held until there is an Active Node address to route to. It's our understanding that a write-concurrency limitation is why there can be at most one Active Node address; however, we don't expect requests to fail during this window.
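To make the expected "hold until there is an active node" behavior concrete, here is a minimal sketch in Python. It is not Vault's implementation; `forward_when_active`, `get_active_addr`, `send`, and `NoActiveNodeError` are hypothetical names used only for illustration, and the timeout and poll interval are assumed values.

```python
import time
from typing import Callable, Optional


class NoActiveNodeError(Exception):
    """Raised when no active node address appears before the deadline."""


def forward_when_active(
    get_active_addr: Callable[[], Optional[str]],  # hypothetical lookup of the current active address
    send: Callable[[str], str],                    # hypothetical function that forwards the request
    timeout: float = 5.0,                          # assumed upper bound on how long to hold a request
    poll_interval: float = 0.05,                   # assumed delay between lookups
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Hold a request until an active node address is known, then forward it."""
    deadline = clock() + timeout
    while True:
        addr = get_active_addr()
        if addr is not None:
            # An active node exists again: forward the held request.
            return send(addr)
        if clock() >= deadline:
            # Give up only after the whole hold window has elapsed.
            raise NoActiveNodeError(f"no active node within {timeout}s")
        sleep(poll_interval)
```

With a re-election window of a few seconds at most, a hold of this kind would turn the failed requests into slightly slower successful ones, which is the behavior we were expecting.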
Environment:
- Vault Server Version (retrieve with vault status):
❯ vault_staging_us status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                2
Version                  1.7.8
Storage Type             postgresql
Cluster Name             vault-cluster-b4f624d3
Cluster ID               63ec32e7-b82b-596a-8fb5-f1a88c497ac5
HA Enabled               true
HA Cluster               https://10.1.1.101:32770
HA Mode                  active
Active Since             2022-01-11T20:26:34.731928591Z
- Vault CLI Version (retrieve with vault version):
❯ vault version
Vault v1.7.1 ('917142287996a005cb1ed9d96d00d06a0590e44e+CHANGES')
- Server Operating System/Architecture:
EC2 Hosted solution using AWS ECS with Vault HA configuration using 3 nodes across 2 AZs.
Vault server configuration file(s):
cluster_addr and api_addr are resolved at container deployment time because the private IP address is dynamic. Example values can be seen above in the status output.
storage "postgresql" {
  connection_url = "postgres://"
  table          = "vault_kv_store"
  ha_enabled     = "true"
  ha_table       = "vault_ha_locks"
}
listener "tcp" {
  address       = "0.0.0.0:8200"
  cluster       = "0.0.0.0:8201"
  tls_cert_file = "/vault/tls/vault.crt.pem"
  tls_key_file  = "/vault/tls/vault.key.pem"
}
telemetry {
  dogstatsd_addr            = "HostPrivateIPv4Address:8125"
  prometheus_retention_time = "30s"
  disable_hostname          = true
}
cluster_addr = "https://HostPrivateIPv4Address:ClusterPort"
api_addr = "https://HostPrivateIPv4Address:ApiPort"
Additional context
Attached are the container logs covering the error and the Active Node rotation. The time between the Active Node entering shutdown and a new node being elected is less than 1 second.
extract-2022-01-11T22_30_16.697Z.csv