Request Forward errors during Active Node rotation

**Describe the bug**
TL;DR: Request forwarding errors occur during Active Node rotation in a Vault HA configuration.

We often see these request-forwarding errors in the logs during our EC2 instance rotation process, which sends a SIGTERM signal to the running Docker container hosting the Active Node. They seem to indicate that a request was sent to the Vault cluster, routed to a standby node, and failed because there was no Active Node address to forward to. The re-election takes less than a second in most cases we have observed, and depending on incoming traffic volume we see 1-3 forwarded-request errors each time.

```
{
  @level | error
  @message | error during forwarded RPC request
  @module | core
  @timestamp | 2022-01-11T20:26:34.495207Z
  error | rpc error: code = Unavailable desc = transport is closing
}

```

**To Reproduce**
Steps to reproduce the behavior:

1. Send SIGTERM to the Active Node's Docker container.
2. Depending on traffic volume, observe forwarded-request errors while the Active Node is being rotated.
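
The steps above only produce errors when requests arrive during the rotation window, so a steady stream of traffic helps. The following is a minimal sketch of such a traffic generator, not part of the original setup. The function names are hypothetical, and the URL passed to `probe` is assumed to be a path that a standby node would forward to the Active Node (for example a secret read). The `X-Vault-Token` header is Vault's standard token header.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, token, timeout=2.0):
    """Send one GET and return the HTTP status, or None on a transport-level failure."""
    request = urllib.request.Request(url, headers={"X-Vault-Token": token})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (OSError, http.client.HTTPException):
        # Connection refused/reset, timeout, TLS failure, or malformed response.
        return None


def count_failures(send, attempts, interval=0.0):
    """Call send() `attempts` times; count results that are None or a 5xx status."""
    failures = 0
    for _ in range(attempts):
        status = send()
        if status is None or status >= 500:
            failures += 1
        time.sleep(interval)
    return failures
```

To use it, start something like `count_failures(lambda: probe(url, token), attempts=200, interval=0.05)` against the cluster address, then send SIGTERM to the Active Node container while it runs. Treating every 5xx as a failure is an assumption; the exact status returned for a failed forward is not stated in this report.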

**Expected behavior**
The expectation is to avoid client request-forwarding errors during Active Node rotation. We expect a zero-downtime rotation when SIGTERM is sent to the Active Node container. Our current understanding is that the Active Node shuts down gracefully and the cluster elects a new Active Node. This re-election leaves a short window (1-3 seconds) in which there is no Active Node address. During that window we expect requests to be held until there is an Active Node address to route to. It's our understanding that a write-concurrency limitation is why there can be at most one Active Node address; however, we don't expect requests to fail during this period.
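
To make the expected "hold until an Active Node address exists" behavior concrete, here is a minimal sketch. It is an illustration only, not Vault's implementation. The function `forward_when_active` and its parameters are hypothetical; `get_leader_address` is assumed to return an empty string while no leader is known (for example, by reading `leader_address` from Vault's `sys/leader` endpoint).

```python
import time


def forward_when_active(get_leader_address, send, timeout=5.0, poll_interval=0.1,
                        clock=time.monotonic, sleep=time.sleep):
    """Hold a request until an active node address is known, then send it.

    get_leader_address: callable returning the active node address, or "" if none.
    send: callable that takes the address and performs the request.
    Raises TimeoutError if no active node appears within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        address = get_leader_address()
        if address:
            # A leader is known: forward the held request and return its result.
            return send(address)
        if clock() >= deadline:
            raise TimeoutError("no active node address within timeout")
        # No leader yet: keep holding the request and poll again shortly.
        sleep(poll_interval)
```

With a timeout longer than the observed 1-3 second election window, a request arriving mid-rotation would be delayed rather than failed. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.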

**Environment:**

- Vault Server Version (retrieve with `vault status`):

```
❯ vault_staging_us status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                2
Version                  1.7.8
Storage Type             postgresql
Cluster Name             vault-cluster-b4f624d3
Cluster ID               63ec32e7-b82b-596a-8fb5-f1a88c497ac5
HA Enabled               true
HA Cluster               https://10.1.1.101:32770
HA Mode                  active
Active Since             2022-01-11T20:26:34.731928591Z

```

- Vault CLI Version (retrieve with `vault version`):

```
❯ vault version
Vault v1.7.1 ('917142287996a005cb1ed9d96d00d06a0590e44e+CHANGES')

```

- Server Operating System/Architecture:
EC2 Hosted solution using AWS ECS with Vault HA configuration using 3 nodes across 2 AZs.

Vault server configuration file(s):

`cluster_addr` and `api_addr` are resolved at container deployment time because the private IP address is dynamic. Example values can be seen in the status output above.

```
storage "postgresql" {
  connection_url = "postgres://"
  table          = "vault_kv_store"
  ha_enabled     = "true"
  ha_table       = "vault_ha_locks"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  cluster       =  "0.0.0.0:8201"
  tls_cert_file = "/vault/tls/vault.crt.pem"
  tls_key_file  = "/vault/tls/vault.key.pem"
}

telemetry {
  dogstatsd_addr = "HostPrivateIPv4Address:8125"
  prometheus_retention_time = "30s"
  disable_hostname = true
}

cluster_addr = "https://HostPrivateIPv4Address:ClusterPort"
api_addr = "https://HostPrivateIPv4Address:ApiPort"

```

**Additional context**
Attached are container logs from the time of the error and the Active Node rotation. The time between the Active Node entering shutdown and a new node being elected is less than 1 second.

[extract-2022-01-11T22_30_16.697Z.csv](https://github.com/hashicorp/vault/files/7850265/extract-2022-01-11T22_30_16.697Z.csv)
