Request Forward errors during Active Node rotation #13639

@jlestrada

Describe the bug
TL;DR: Request forwarding errors occur during Active Node rotation in a Vault HA configuration.

We often see these request forwarding error logs during our EC2 instance rotation process, which sends a SIGTERM signal to the running Docker container hosting the Active Node. This seems to indicate that a request was sent to the Vault cluster, routed to a standby node, and failed because there was no Active Node address to forward to. The re-election takes less than a second in most cases we have seen, and depending on the incoming traffic volume we see 1-3 forward request errors during this process.

{
  @level | error
  @message | error during forwarded RPC request
  @module | core
  @timestamp | 2022-01-11T20:26:34.495207Z
  error | rpc error: code = Unavailable desc = transport is closing
}

To Reproduce
Steps to reproduce the behavior:

  1. Send SIGTERM to the Docker container running the Active Node.
  2. Depending on traffic volume, observe the error while the Active Node is being rotated.
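To put a number on the gap while reproducing, one option is to poll a standby node's `/v1/sys/leader` endpoint and time how long `leader_address` stays empty. The following is a minimal sketch, not part of our tooling: the function names, polling interval, and poll limit are assumptions, and `leader_address()` will also report "no leader" if the TLS certificate is not trusted by the Python process.

```python
import json
import time
import urllib.request


def leader_address(vault_addr, timeout=1.0):
    """Return leader_address from /v1/sys/leader, or "" if empty or unreachable."""
    try:
        url = f"{vault_addr}/v1/sys/leader"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("leader_address", "")
    except (OSError, ValueError):
        # Connection errors, HTTP errors, and bad JSON all count as "no leader".
        return ""


def leaderless_window(probe, interval=0.05, max_polls=1200,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll probe() and time the span during which it reports no leader.

    Returns the seconds between the first empty answer and the next
    non-empty one, or None if no complete gap is seen within max_polls.
    """
    gap_start = None
    for _ in range(max_polls):
        addr = probe()
        now = clock()
        if not addr and gap_start is None:
            gap_start = now          # leader just disappeared
        elif addr and gap_start is not None:
            return now - gap_start   # a leader is visible again
        sleep(interval)
    return None
```

For example, `leaderless_window(lambda: leader_address("https://<standby-ip>:8200"))` could be started on a standby just before sending SIGTERM to the Active Node. The measured value is only as precise as the polling interval.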

Expected behavior
The expectation here is to avoid client forward request errors during Active Node rotation. We expect a zero-downtime Active Node rotation when the SIGTERM signal is sent to the Active Node container. Our current understanding is that the Active Node will shut down gracefully and the cluster will elect a new Active Node. This re-election causes a short window in which there is no Active Node address (1-3 seconds). We expect that during this window the request will be held until there is an Active Node address to route to. It's our understanding that a write concurrency limitation is the reason there can be at most one Active Node address; however, we don't expect requests to fail during this period.
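Until requests are held server-side as described above, a client-side workaround would be to retry failed calls with a short backoff that spans the leaderless window. The sketch below is illustrative only and is not Vault behavior: `call_with_retry`, its parameters, and the choice of which errors count as retryable are all assumptions left to the caller.

```python
import time


def call_with_retry(request, is_retryable, attempts=5, base_delay=0.2,
                    sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    Exceptions for which is_retryable(exc) is true are retried after an
    exponentially growing delay; any other exception, and the failure on
    the final attempt, is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return request()
        except Exception as exc:
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults shown, the delays are 0.2, 0.4, 0.8, and 1.6 seconds, so a request is retried for about 3 seconds in total, which covers the 1-3 second window mentioned above. Retrying is only safe for requests that can be repeated without side effects.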

Environment:

  • Vault Server Version (retrieve with vault status):
❯ vault_staging_us status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                2
Version                  1.7.8
Storage Type             postgresql
Cluster Name             vault-cluster-b4f624d3
Cluster ID               63ec32e7-b82b-596a-8fb5-f1a88c497ac5
HA Enabled               true
HA Cluster               https://10.1.1.101:32770
HA Mode                  active
Active Since             2022-01-11T20:26:34.731928591Z

  • Vault CLI Version (retrieve with vault version):
❯ vault version
Vault v1.7.1 ('917142287996a005cb1ed9d96d00d06a0590e44e+CHANGES')

  • Server Operating System/Architecture:
    EC2 Hosted solution using AWS ECS with Vault HA configuration using 3 nodes across 2 AZs.

Vault server configuration file(s):

cluster_addr and api_addr are set at container deployment time because the private IP address is dynamic. Example values can be seen in the status output above.

storage "postgresql" {
  connection_url = "postgres://"
  table          = "vault_kv_store"
  ha_enabled     = "true"
  ha_table       = "vault_ha_locks"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  cluster       = "0.0.0.0:8201"
  tls_cert_file = "/vault/tls/vault.crt.pem"
  tls_key_file  = "/vault/tls/vault.key.pem"
}

telemetry {
  dogstatsd_addr = "HostPrivateIPv4Address:8125"
  prometheus_retention_time = "30s"
  disable_hostname = true
}

cluster_addr = "https://HostPrivateIPv4Address:ClusterPort"
api_addr = "https://HostPrivateIPv4Address:ApiPort"
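To illustrate how the `cluster_addr` and `api_addr` lines above are filled in at deploy time, here is a minimal sketch of rendering them from the host's private IP and the mapped ports. The function name and the way the IP and ports are obtained are assumptions; our actual deployment tooling is not shown here.

```python
def render_addrs(private_ip, api_port, cluster_port):
    """Render the two top-level address settings for the Vault config file."""
    return (
        f'cluster_addr = "https://{private_ip}:{cluster_port}"\n'
        f'api_addr = "https://{private_ip}:{api_port}"\n'
    )
```

Calling `render_addrs("10.1.1.101", 32769, 32770)` would produce a `cluster_addr` matching the HA Cluster value in the status output; the API port `32769` is a made-up value for the example.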

Additional context
Attached are container logs covering the time of the error and the Active Node rotation. The time between the Active Node entering shutdown and a new node being elected is less than 1 second.

extract-2022-01-11T22_30_16.697Z.csv
