Improve Docker socket connection resiliency when using socket proxy #535

Description

@mg-dev25

Issue Description

When using Sablier with a Docker socket proxy (rather than direct socket access), the application fails to properly handle connection interruptions. Specifically, if the socket proxy terminates a connection due to timeouts or other reasons, Sablier logs an error but continues to report itself as healthy while being unable to execute Docker operations.

Current Behavior

  1. Sablier establishes a connection to Docker via a socket proxy
  2. If the socket proxy times out or terminates the connection, Sablier logs:
    ERR docker/docker.go:155 event stream error provider=docker error="unexpected EOF"
    ERR docker/docker.go:148 event stream closed provider=docker
    
  3. Sablier's health check (/health endpoint) continues to return a 200 status code with "OK"
  4. Subsequent Docker operations fail silently, and users must manually restart the Sablier container

Expected Behavior

  1. Sablier should attempt to reconnect to the Docker daemon when the connection is lost
  2. The health check should verify Docker connectivity, not just API responsiveness
  3. If reconnection fails after several attempts, Sablier should do at least one of the following:
    • Update its health status to unhealthy
    • Log a clear error that the Docker connection is permanently lost
    • Automatically restart itself (if possible)

Reproduction Steps

  1. Configure Sablier to use a socket proxy with a timeout value (e.g., PROXY_READ_TIMEOUT=8000)
  2. Wait for the timeout to occur
  3. Observe Sablier logs showing "event stream error" and "event stream closed"
  4. Verify that /health still returns 200 OK
  5. Attempt to start a container through Sablier, which will fail

Technical Details

This issue stems from how Sablier handles the Docker client connection in the provider implementation. In the Docker provider, there appear to be two key issues:

  1. Event stream resilience: The NotifyInstanceStopped method establishes a connection to the Docker events API, but doesn't automatically reconnect if this connection is lost.
  2. Docker client reuse: The Docker client is created once during provider initialization and reused, but there's no mechanism to verify its connectivity or recreate it if it becomes invalid.

Proposed Solutions

  1. Implement automatic reconnection to the Docker daemon:
    • Add a reconnection loop in the event stream handler
    • Periodically verify Docker connectivity and recreate the client if needed
  2. Enhance the health check:
    • Update the /health endpoint to verify Docker connectivity
    • Add a new /health/docker endpoint specifically for Docker connectivity
  3. Connection tracking:
    • Track the most recent successful Docker operation
    • If operations fail or too much time passes since the last successful operation, attempt to recreate the connection

Environment Information

  • Sablier version: 1.8.4
  • Docker version: 27.5.1
  • Docker API version: 1.47
  • Socket proxy: Used with PROXY_READ_TIMEOUT=8000

Additional Context

This issue becomes particularly problematic in production environments where Sablier is a critical component for managing containers, as it requires manual intervention to recover from connection issues.
