[RESILIENCE] Implement Circuit Breaker pattern for network clients #403

@kcenon

Summary

Implement the Circuit Breaker pattern to prevent cascade failures and provide fast-fail behavior when backend services are unavailable.

Current State

  • resilient_client provides exponential backoff reconnection
  • No mechanism to stop retrying after persistent failures
  • No fast-fail for known-bad endpoints

Current resilient_client:

class resilient_client {
    // Only has retry with backoff
    // No circuit breaker state
};
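To make the cost of retrying without a breaker concrete, here is a minimal sketch of how much time one failed send spends waiting. It assumes the backoff delay doubles after every failed attempt; the issue only says "exponential", so the factor of 2 and the helper name total_backoff are assumptions, not the actual resilient_client behavior.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: total time spent waiting if every retry fails,
// assuming the delay starts at `initial` and doubles after each attempt.
inline std::chrono::milliseconds total_backoff(std::chrono::milliseconds initial,
                                               std::size_t max_retries) {
    std::chrono::milliseconds total{0};
    for (std::size_t i = 0; i < max_retries; ++i) {
        total += initial;  // wait before the next attempt
        initial *= 2;      // assumed doubling factor
    }
    return total;
}
```

Under these assumptions, a 1 s initial backoff with 3 retries means each send against a dead endpoint waits 1 + 2 + 4 = 7 s before giving up, and every caller pays that cost again.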

Circuit Breaker Pattern

States

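  1. Closed: Normal operation; requests pass through and consecutive failures are counted
  2. Open: Failure threshold reached; requests fail immediately without a network call
  3. Half-Open: After the open duration elapses, a limited number of trial requests test whether the service has recovered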

State Diagram

   success
  ┌───┐
  │   │
  ▼   │
┌───────┐  failure  ┌──────┐
│Closed │──────────►│ Open │
└───────┘threshold  └──────┘
    ▲                  │
    │                  │ timeout
    │    success       ▼
    │            ┌──────────┐
    └────────────│Half-Open │
                 └──────────┘
                      │ failure
                      └───► Open
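The diagram above can also be read as a pure transition function. The sketch below is only an illustration: the event enum and the next_state name are hypothetical and not part of the proposed API, and the counting that decides when a threshold is reached is left out.

```cpp
// Hypothetical sketch: the state diagram as a pure transition function.
enum class state { closed, open, half_open };

// Events are assumed names; thresholds are evaluated by the caller.
enum class event {
    failure_threshold_reached,  // closed: too many consecutive failures
    open_timeout_elapsed,       // open: open_duration has passed
    trials_succeeded,           // half-open: enough trial calls succeeded
    trial_failed                // half-open: a trial call failed
};

constexpr state next_state(state s, event e) {
    switch (s) {
    case state::closed:
        return e == event::failure_threshold_reached ? state::open : state::closed;
    case state::open:
        return e == event::open_timeout_elapsed ? state::half_open : state::open;
    case state::half_open:
        if (e == event::trials_succeeded) return state::closed;
        if (e == event::trial_failed) return state::open;
        return state::half_open;
    }
    return s;  // not reached for valid enum values
}
```

Keeping the transition table separate from the counters makes it straightforward to unit test every edge of the diagram without timers or threads.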

Proposed Implementation

Circuit Breaker Class

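#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <mutex>

class circuit_breaker {
public:
    struct config {
        size_t failure_threshold = 5;           // Consecutive failures before opening
        std::chrono::seconds open_duration{30}; // Time spent open before half-open
        size_t half_open_successes = 2;         // Successes needed to close
        size_t half_open_max_calls = 3;         // Max calls allowed in half-open
    };
    
    enum class state { closed, open, half_open };
    
    // Two overloads rather than `config cfg = {}`: GCC and Clang reject a
    // default argument that depends on a nested struct's default member
    // initializers while the enclosing class is still being defined.
    circuit_breaker();
    explicit circuit_breaker(config cfg);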
    
    // Check if call should be allowed
    [[nodiscard]] bool allow_call();
    
    // Record call result
    void record_success();
    void record_failure();
    
    // State inspection
    [[nodiscard]] state current_state() const;
    [[nodiscard]] size_t failure_count() const;
    [[nodiscard]] std::chrono::steady_clock::time_point 
        next_attempt_time() const;
    
    // State change callbacks
    using state_change_callback = std::function<void(state, state)>;
    void set_state_change_callback(state_change_callback cb);
    
    // Reset circuit
    void reset();

private:
    config config_;
    std::atomic<state> state_{state::closed};
    std::atomic<size_t> failure_count_{0};
    std::atomic<size_t> success_count_{0};
    std::chrono::steady_clock::time_point open_time_; // Guarded by mutex_
    mutable std::mutex mutex_;                        // Serializes state transitions
    state_change_callback callback_;                  // Guarded by mutex_
};
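A minimal sketch of how the transition logic behind allow_call, record_success and record_failure might look. It is deliberately simpler than the declaration above and is not the proposed implementation: a single mutex guards all state instead of atomics, callbacks are omitted, the config lives at namespace scope, and the current time is passed in so the logic can be exercised without sleeping. The names simple_breaker and breaker_config are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <mutex>

// Hypothetical config mirroring the proposed defaults.
struct breaker_config {
    std::size_t failure_threshold = 5;
    std::chrono::seconds open_duration{30};
    std::size_t half_open_successes = 2;
    std::size_t half_open_max_calls = 3;
};

// Hypothetical, simplified breaker: one mutex, no callbacks.
class simple_breaker {
public:
    using clock = std::chrono::steady_clock;
    enum class state { closed, open, half_open };

    explicit simple_breaker(breaker_config cfg = {}) : cfg_(cfg) {}

    // Returns true if a call may proceed at time `now`.
    bool allow_call(clock::time_point now = clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ == state::open) {
            if (now - opened_at_ < cfg_.open_duration) return false;  // fail fast
            enter(state::half_open);
        }
        if (state_ == state::half_open) {
            if (trial_calls_ >= cfg_.half_open_max_calls) return false;
            ++trial_calls_;
        }
        return true;
    }

    void record_success() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ == state::half_open) {
            if (++successes_ >= cfg_.half_open_successes) enter(state::closed);
        } else if (state_ == state::closed) {
            failures_ = 0;  // only consecutive failures count
        }
    }

    void record_failure(clock::time_point now = clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ == state::half_open ||
            (state_ == state::closed && ++failures_ >= cfg_.failure_threshold)) {
            opened_at_ = now;
            enter(state::open);
        }
        // Results that arrive while already open are ignored.
    }

    state current_state() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;
    }

private:
    // Caller must hold mutex_. Resets all counters on every transition.
    void enter(state next) {
        state_ = next;
        failures_ = successes_ = trial_calls_ = 0;
    }

    breaker_config cfg_;
    mutable std::mutex mutex_;
    state state_ = state::closed;
    std::size_t failures_ = 0;
    std::size_t successes_ = 0;
    std::size_t trial_calls_ = 0;
    clock::time_point opened_at_{};
};
```

Passing the time point explicitly is a testing convenience; production code would normally rely on the defaulted clock::now(). Whether the real class should keep the atomics from the declaration or fall back to a plain mutex like this is an open design choice.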

Integration with resilient_client

class resilient_client {
public:
    resilient_client(
        const std::string& client_id,
        const std::string& host,
        unsigned short port,
        size_t max_retries = 3,
        std::chrono::milliseconds initial_backoff = std::chrono::seconds(1),
        circuit_breaker::config cb_config = {}  // NEW
    );
    
    auto send_with_retry(std::vector<uint8_t>&& data) -> VoidResult {
        // Check circuit breaker first
        if (!circuit_breaker_.allow_call()) {
            return error_void(error_codes::circuit_open,
                "Circuit breaker is open");
        }
        
        auto result = do_send(std::move(data));
        if (result.is_ok()) {
            circuit_breaker_.record_success();
        } else {
            circuit_breaker_.record_failure();
        }
        return result;
    }
    
    // Circuit breaker state access
    [[nodiscard]] circuit_breaker::state circuit_state() const;
    void reset_circuit();

private:
    circuit_breaker circuit_breaker_;
};
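The snippet above calls do_send once per breaker check. If the existing retry loop stays, one option is to consult the breaker before every attempt, so that a circuit that opens mid-loop cuts the remaining retries short. The sketch below illustrates that idea only; breaker_stub, send_status and send_with_breaker are hypothetical names, the stub never recovers (no timeout), and the doubling backoff is an assumption.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical stand-in exposing the three calls the retry loop needs.
struct breaker_stub {
    std::size_t failure_threshold = 3;
    std::size_t failures = 0;
    bool allow_call() const { return failures < failure_threshold; }
    void record_success() { failures = 0; }
    void record_failure() { ++failures; }
};

enum class send_status { ok, failed, circuit_open };

// Tries `attempt` once plus up to `max_retries` retries, checking the
// breaker before each try and waiting with doubling backoff in between.
template <typename Breaker>
send_status send_with_breaker(
    Breaker& breaker,
    const std::function<bool()>& attempt,
    std::size_t max_retries,
    const std::function<void(std::chrono::milliseconds)>& wait,
    std::chrono::milliseconds initial_backoff) {
    auto backoff = initial_backoff;
    for (std::size_t i = 0; i <= max_retries; ++i) {
        if (!breaker.allow_call()) return send_status::circuit_open;
        if (attempt()) {
            breaker.record_success();
            return send_status::ok;
        }
        breaker.record_failure();
        if (i < max_retries) {
            wait(backoff);
            backoff *= 2;  // assumed doubling factor
        }
    }
    return send_status::failed;
}
```

Injecting `wait` keeps the loop testable; in the client it would presumably be a sleep or a timer. Returning a distinct circuit_open status lets callers tell "gave up after retries" apart from "never tried".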

Metrics Integration

// Report circuit state changes
circuit_breaker_.set_state_change_callback(
    [](circuit_breaker::state from, circuit_breaker::state to) {
        metric_reporter::report_circuit_state_change(from, to);
    });
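For logging or tests, a callback with the same void(state, state) shape can simply record transitions as text. This is a sketch under assumptions: to_string and transition_log are hypothetical helpers, and metric_reporter from the snippet above is not modeled here.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class state { closed, open, half_open };

// Hypothetical helper for readable metric labels and log lines.
constexpr const char* to_string(state s) {
    switch (s) {
    case state::closed:    return "closed";
    case state::open:      return "open";
    case state::half_open: return "half_open";
    }
    return "unknown";
}

using state_change_callback = std::function<void(state, state)>;

// Hypothetical sink that stores each transition as "from -> to".
// The log must outlive any callback created from it.
struct transition_log {
    std::vector<std::string> entries;

    state_change_callback make_callback() {
        return [this](state from, state to) {
            entries.push_back(std::string(to_string(from)) + " -> " + to_string(to));
        };
    }
};
```

One design point worth deciding early: if the breaker invokes the callback while holding its mutex, a callback that calls back into the breaker (for example current_state()) would deadlock. Invoking the callback after releasing the lock avoids that.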

Configuration Options

| Parameter           | Default | Description                          |
|---------------------|---------|--------------------------------------|
| failure_threshold   | 5       | Consecutive failures to open circuit |
| open_duration       | 30s     | Duration before attempting half-open |
| half_open_successes | 2       | Successes needed to close            |
| half_open_max_calls | 3       | Max calls during half-open           |
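The "Add configuration validation" task could start from checks like the ones below. This is a sketch, not a settled rule set: breaker_config mirrors the table's defaults, validate is a hypothetical name, and the specific constraints (in particular half_open_max_calls >= half_open_successes) are assumptions about what counts as a sane configuration.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical config mirroring the defaults in the table above.
struct breaker_config {
    std::size_t failure_threshold = 5;
    std::chrono::seconds open_duration{30};
    std::size_t half_open_successes = 2;
    std::size_t half_open_max_calls = 3;
};

// Returns one message per violated constraint; empty means valid.
inline std::vector<std::string> validate(const breaker_config& c) {
    std::vector<std::string> errors;
    if (c.failure_threshold == 0)
        errors.push_back("failure_threshold must be at least 1");
    if (c.open_duration <= std::chrono::seconds::zero())
        errors.push_back("open_duration must be positive");
    if (c.half_open_successes == 0)
        errors.push_back("half_open_successes must be at least 1");
    // Otherwise the circuit could never collect enough successes to close.
    if (c.half_open_max_calls < c.half_open_successes)
        errors.push_back("half_open_max_calls must be >= half_open_successes");
    return errors;
}
```

Collecting every violation instead of stopping at the first makes misconfiguration easier to fix in one pass; whether the constructor should throw, clamp, or return an error on invalid input is left open here.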

Tasks

  • Implement circuit_breaker class
  • Add state machine with thread-safe transitions
  • Implement state change callbacks
  • Integrate with resilient_client
  • Add circuit breaker metrics
  • Unit tests for all state transitions
  • Integration tests with simulated failures
  • Add configuration validation
  • Documentation and examples

Acceptance Criteria

  • Circuit opens after failure_threshold consecutive failures
  • Open circuit fails fast (no network calls)
  • Circuit transitions to half-open once open_duration has elapsed
  • half_open_successes successful calls in half-open close the circuit
  • A single failed call in half-open re-opens the circuit immediately
  • State changes are logged/reported
  • Thread-safe for concurrent access
  • All tests pass

Files to Create/Modify

  • New: include/kcenon/network/utils/circuit_breaker.h
  • New: src/utils/circuit_breaker.cpp
  • Modify: include/kcenon/network/utils/resilient_client.h
  • Modify: src/utils/resilient_client.cpp
  • New: tests/unit/test_circuit_breaker.cpp

Labels

async, enhancement, priority:medium