internal hardening for availability #2414

@davepacheco

There are some basic things we'll want to check everywhere (Nexus, Sled Agent, DNS servers, etc.) for availability:

  • TCP KeepAlive: we want to enable this on all network connections (in both directions) to identify failed systems. External and internal connections should probably use different values.
  • HTTP KeepAlive: we probably want to just pick an idle timeout like 60 seconds. Consider having clients make dummy requests to keep the connections open. This avoids the problem where a client picks a connection that has been idle for just under 60 seconds, sends a request, and has the server slam the door in its face. We ran into this with Manta, admittedly only at very large scale, since it's fairly improbable.
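
To make the HTTP KeepAlive race concrete, here is a minimal sketch of a client-side check that refuses to reuse a pooled connection once it has been idle for close to the server's timeout. This complements the dummy-request idea rather than implementing it. The names `PooledConn`, `ReusePolicy`, and `safety_margin` are hypothetical and not taken from any existing client; the 60-second value is just the example figure above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical bookkeeping for one pooled HTTP connection.
struct PooledConn {
    /// When the last request on this connection completed.
    last_used: Instant,
}

/// Hypothetical client-side reuse policy (an assumption, not an existing API).
struct ReusePolicy {
    /// Idle timeout the server is assumed to enforce (e.g., 60 seconds).
    server_idle_timeout: Duration,
    /// How far ahead of the server's timeout the client stops reusing.
    safety_margin: Duration,
}

impl ReusePolicy {
    /// Returns true if the connection has been idle for less than
    /// (server_idle_timeout - safety_margin), so a new request is unlikely
    /// to race with the server closing the connection.
    fn is_reusable(&self, conn: &PooledConn, now: Instant) -> bool {
        let idle = now.saturating_duration_since(conn.last_used);
        let cutoff = self.server_idle_timeout.saturating_sub(self.safety_margin);
        idle < cutoff
    }
}
```

A connection that fails this check would be closed and replaced (or refreshed with a dummy request before it gets that old). The margin has to cover request latency plus clock skew in how each side measures idleness, so it trades a few extra reconnects for fewer failed requests.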

We'll want to review these, too. They might be more security-related (see #2184):

  • limits for bad client behavior:
    • maximum time waiting for a client to send request headers (whether on a new connection or between requests)
    • minimum flow rate for request bodies (can be fairly low -- just want to avoid clients dribbling data in as a DoS vector to keep connections open)
    • maximum number of open connections (ideally limited separately for different APIs -- e.g., external vs. internal)
    • TCP listen socket backlog
    • maximum rate of new connections created [ideally per-client]
    • maximum rate of incoming requests [per authenticated user or per IP, as well as overall]
    • maximum number of connect-in-progress sockets
    • maximum number of TLS-session-establishment-in-progress sockets
  • size of tokio worker thread pool, blocking thread pool
  • maximum length of time that graceful server shutdown can take
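
As one illustration of the rate limits above (new connections or incoming requests, per client), here is a minimal token-bucket sketch using only the standard library. It is one common way to express such a limit, not a proposal for how these servers must do it. `TokenBucket` and `PerClientLimiter` are hypothetical names, and keying by `IpAddr` is an assumption; keying by authenticated user would work the same way.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

/// Token bucket: allows bursts up to `capacity` and a sustained rate of
/// `refill_per_sec` events per second.
struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        // Start full so a new client can burst immediately.
        TokenBucket { capacity, refill_per_sec, tokens: capacity, last_refill: now }
    }

    /// Refills based on elapsed time, then takes one token if available.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Hypothetical per-client limiter: one bucket per source IP.
/// Note: buckets are never evicted here; a real server would need to
/// bound this map, or the limiter itself becomes a DoS vector.
struct PerClientLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<IpAddr, TokenBucket>,
}

impl PerClientLimiter {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        PerClientLimiter { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if `client` may open a connection (or make a request) now.
    fn allow(&mut self, client: IpAddr, now: Instant) -> bool {
        let (capacity, rate) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(client)
            .or_insert_with(|| TokenBucket::new(capacity, rate, now))
            .try_acquire(now)
    }
}
```

An overall limit could be a single shared `TokenBucket` checked before the per-client one. Passing `now` in explicitly keeps the logic testable without sleeping.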
