[HA] PeerAddressAllowlistFilter rejects legitimate peers during k8s DNS-resolution race (incomplete allowlist on startup/restart) #4471

## Summary

`PeerAddressAllowlistFilter` can build an **incomplete allowlist** when cluster peer hostnames do not resolve at the moment `resolveNow()` runs, and it then **actively rejects legitimate peers** (throwing `SecurityException`) instead of failing open during the DNS-resolution window. In Kubernetes StatefulSet deployments this happens routinely on startup and on pod restarts, because headless-service A records are published only once a pod is `Ready` and pod IPs change across restarts.

Observed in a customer 3-node k8s cluster (26.6.1-SNAPSHOT). It destabilizes initial cluster bootstrap and can block a restarted or wiped follower from rejoining and receiving its snapshot.

## Evidence (from customer logs)

At startup, peer hostnames fail to resolve, so they are silently dropped from the allowlist:

```
WARNI [PeerAddressAllowlistFilter] Cannot resolve cluster peer host 'splitter-2.splitter.sbu.svc.cluster.local' for Raft gRPC allowlist: ... Name does not resolve
WARNI [PeerAddressAllowlistFilter] Cannot resolve cluster peer host 'splitter-1.splitter.sbu.svc.cluster.local' for Raft gRPC allowlist: ... Name does not resolve
```

Then valid peers are rejected because their IP is not in the (incomplete) allowlist:

```
WARNI [PeerAddressAllowlistFilter] Rejecting Raft gRPC connection from non-peer address: 10.1.13.9  (allowed=[::1, 10.1.13.8, ...])
WARNI [PeerAddressAllowlistFilter] Rejecting Raft gRPC connection from non-peer address: 10.1.13.10 (allowed=[::1, 10.1.13.8, ...])
WARNI [PeerAddressAllowlistFilter] Rejecting Raft gRPC connection from non-peer address: 10.1.13.10 (allowed=[::1, 10.1.13.9, 10.1.13.8, ...])
```

## Root cause

In `ha-raft/.../PeerAddressAllowlistFilter.java`:

1. `resolveNow()` (called from the constructor and on a miss) iterates `peerHosts` and, on `UnknownHostException`, only logs a WARNING and **drops the host** from the resolved set. There is no record that the host failed, no retry state, and no distinction between "resolved to no addresses" and "DNS not ready yet".
2. `transportReady()` re-resolves on a miss, but the re-resolve is **rate-limited by `refreshIntervalMs`** (`arcadedb.ha.grpcAllowlistRefreshMs`, default 30s). Because the constructor has just resolved, the first wave of inbound connections after startup falls inside that window, so the re-resolve is skipped and the peer is rejected immediately.
3. If a peer's DNS is still unpublished on the next allowed re-resolve (pod not yet `Ready`), it stays rejected until both the refresh interval elapses *and* DNS resolves, with gRPC DNS caching adding further lag.

The net effect is a self-inflicted partition during the exact window (bootstrap / rolling restart) when the cluster most needs peers to connect.
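
The three points above can be replayed without a network using a small model. This is a simplified sketch, not the actual `PeerAddressAllowlistFilter` source: the class name `NaiveAllowlistModel`, the `Resolver` interface, the injected clock, and the `accepts` method are all assumptions made for illustration. `accepts` stands in for the check in `transportReady()` and returns `false` where the real filter throws `SecurityException`.

```java
import java.net.UnknownHostException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongSupplier;

/** Simplified model of the behaviour described above; NOT the actual ArcadeDB source. */
final class NaiveAllowlistModel {
  /** Hypothetical stand-in for DNS so the race can be replayed without a network. */
  interface Resolver {
    Set<String> resolve(String host) throws UnknownHostException;
  }

  private final List<String> peerHosts;
  private final Resolver resolver;
  private final LongSupplier clockMs;
  private final long refreshIntervalMs;
  private Set<String> allowed = new HashSet<>();
  private long lastResolveMs;

  NaiveAllowlistModel(List<String> peerHosts, Resolver resolver, LongSupplier clockMs,
      long refreshIntervalMs) {
    this.peerHosts = peerHosts;
    this.resolver = resolver;
    this.clockMs = clockMs;
    this.refreshIntervalMs = refreshIntervalMs;
    resolveNow(); // resolved once at construction, which also starts the rate-limit window
  }

  synchronized void resolveNow() {
    Set<String> next = new HashSet<>();
    for (String host : peerHosts) {
      try {
        next.addAll(resolver.resolve(host));
      } catch (UnknownHostException e) {
        // Point 1: the host is dropped and nothing records that it is still pending.
      }
    }
    allowed = next;
    lastResolveMs = clockMs.getAsLong();
  }

  /** Returns true if the connection would be accepted, false where the filter would reject. */
  synchronized boolean accepts(String remoteIp) {
    if (allowed.contains(remoteIp))
      return true;
    if (clockMs.getAsLong() - lastResolveMs >= refreshIntervalMs) {
      resolveNow();
      return allowed.contains(remoteIp); // point 3: still rejected if DNS is not ready yet
    }
    return false; // point 2: inside the rate-limit window, rejected without re-resolving
  }
}
```

With two configured hosts of which only one resolves at construction, a connection from the second peer's IP one second later is refused even if DNS has published the record in the meantime. It is accepted only once the clock passes the refresh interval and a miss triggers a new resolution.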

## Proposed hardening

Some combination of:

- **Force a re-resolve on a miss whenever the allowlist is known-incomplete** (i.e. fewer hosts resolved than configured), bypassing the `refreshIntervalMs` rate limit. Keep the rate limit only for the steady state where all peers already resolved at least once.
- **Track per-host last-known-good IPs** and keep them in the allowlist (with a TTL) so a transient DNS blip does not evict a peer that resolved fine moments ago (sticky entries). This directly addresses pod-IP churn + caching.
- **Startup grace window:** until the first *complete* resolution of all peer hosts succeeds, prefer logging + allowing (fail-open) over rejecting, or at minimum retry resolution aggressively. The filter is explicitly documented as "NOT a substitute for mTLS" (#3890), so a short fail-open bootstrap window is an acceptable trade-off and far safer operationally than locking the cluster out of itself.
- Optionally **proactively retry unresolved hosts** on a background tick rather than only on inbound misses, so the allowlist converges even while the cluster is quiet.
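
A minimal sketch of how the first three options might combine, under stated assumptions: the class `HardenedAllowlistSketch`, the `Resolver` interface, the injected clock, the `stickyTtlMs` parameter, and the `accepts` method are all hypothetical and do not exist in ArcadeDB. The background retry tick is left out, and the logging that should accompany every fail-open decision is omitted for brevity.

```java
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;

/** Illustrative sketch only; every name here is hypothetical, not ArcadeDB API. */
final class HardenedAllowlistSketch {
  /** Stand-in for DNS so the behaviour can be exercised without a network. */
  interface Resolver {
    Set<String> resolve(String host) throws UnknownHostException;
  }

  private final List<String> peerHosts;
  private final Resolver resolver;
  private final LongSupplier clockMs;
  private final long refreshIntervalMs;
  private final long stickyTtlMs;
  private final Map<String, Long> allowedUntil = new HashMap<>(); // IP -> expiry time (ms)
  private final Set<String> unresolved = new HashSet<>();         // hosts that failed in the last pass
  private boolean everComplete;                                   // some pass resolved every host
  private long lastResolveMs;

  HardenedAllowlistSketch(List<String> peerHosts, Resolver resolver, LongSupplier clockMs,
      long refreshIntervalMs, long stickyTtlMs) {
    // Entries are only refreshed by a resolution pass, so they must outlive the rate limit.
    if (stickyTtlMs < refreshIntervalMs)
      throw new IllegalArgumentException("stickyTtlMs must be >= refreshIntervalMs");
    this.peerHosts = peerHosts;
    this.resolver = resolver;
    this.clockMs = clockMs;
    this.refreshIntervalMs = refreshIntervalMs;
    this.stickyTtlMs = stickyTtlMs;
    resolveNow();
  }

  synchronized void resolveNow() {
    long now = clockMs.getAsLong();
    unresolved.clear();
    for (String host : peerHosts) {
      try {
        for (String ip : resolver.resolve(host))
          allowedUntil.put(ip, now + stickyTtlMs); // add or refresh a sticky entry
      } catch (UnknownHostException e) {
        unresolved.add(host); // remember the failure instead of silently dropping the host
      }
    }
    // Last-known-good IPs survive a failed lookup until their TTL runs out.
    allowedUntil.values().removeIf(expiry -> expiry <= now);
    if (unresolved.isEmpty())
      everComplete = true;
    lastResolveMs = now;
  }

  private boolean isAllowed(String ip, long now) {
    Long expiry = allowedUntil.get(ip);
    return expiry != null && expiry > now;
  }

  /** Returns true if the connection would be accepted. */
  synchronized boolean accepts(String remoteIp) {
    long now = clockMs.getAsLong();
    if (isAllowed(remoteIp, now))
      return true;
    // Bypass the rate limit while the allowlist is known to be incomplete.
    boolean incomplete = !unresolved.isEmpty();
    if (incomplete || now - lastResolveMs >= refreshIntervalMs) {
      resolveNow();
      if (isAllowed(remoteIp, now))
        return true;
    }
    // Startup grace: fail open until one pass has resolved every configured host.
    return !everComplete;
  }
}
```

Two trade-offs are visible in this sketch. Bypassing the rate limit while the allowlist is incomplete means every miss triggers DNS lookups, so a real implementation would probably still want a short floor between attempts. Sticky entries also keep a pod's previous IP allowed until the TTL expires, which widens the allowlist slightly during pod-IP churn.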

## Impact / workaround

- Contributes to leader-election churn and follower divergence during bulk load on k8s.
- Can block a restarted or wiped follower from rejoining and re-acquiring a snapshot from the leader.
- Current workaround: temporarily set `arcadedb.ha.peerAllowlist.enabled=false` during bootstrap/recovery, then re-enable.

## Related

- #3890 (mTLS - the allowlist is the interim non-cert mitigation this filter implements)

## Files

- `ha-raft/src/main/java/com/arcadedb/server/ha/raft/PeerAddressAllowlistFilter.java`
- Config: `arcadedb.ha.grpcAllowlistRefreshMs`, `arcadedb.ha.peerAllowlist.enabled`

