[METRICS] Add histogram metrics support for latency distributions #409

@kcenon

Summary

Add histogram metric support to capture latency distributions with percentiles (p50, p95, p99) for network operations.

Current State

  • metric_reporter provides gauge and counter metrics
  • Latency reported as single value (report_latency(double ms))
  • No percentile calculations
  • No distribution visibility

Current limitation:

// Only reports latest value, no distribution
static void report_latency(double ms);

Why Histograms?

  • Average latency hides outliers
  • P99 latency is critical for SLA compliance
  • Distributions reveal performance patterns
  • Essential for capacity planning

Proposed Implementation

1. Histogram Class

class histogram {
public:
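    struct config {
        // Option 1: explicit bucket upper bounds, used when non-empty
        std::vector<double> bucket_boundaries;
        // Option 2: generated exponential buckets, used when
        // bucket_boundaries is empty
        double min_value = 0.0;      // Note: exponential spacing needs a bound > 0
        double max_value = 10000.0;  // 10 seconds, in milliseconds
        size_t bucket_count = 20;    // Number of generated buckets
    };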
    
    explicit histogram(config cfg);
    
    // Record a value
    void record(double value);
    
    // Get statistics
    [[nodiscard]] auto count() const -> uint64_t;
    [[nodiscard]] auto sum() const -> double;
    [[nodiscard]] auto min() const -> double;
    [[nodiscard]] auto max() const -> double;
    [[nodiscard]] auto mean() const -> double;
    
    // Percentiles
    [[nodiscard]] auto percentile(double p) const -> double;
    [[nodiscard]] auto p50() const -> double { return percentile(0.50); }
    [[nodiscard]] auto p95() const -> double { return percentile(0.95); }
    [[nodiscard]] auto p99() const -> double { return percentile(0.99); }
    [[nodiscard]] auto p999() const -> double { return percentile(0.999); }
    
    // Get all bucket counts
    [[nodiscard]] auto buckets() const 
        -> std::vector<std::pair<double, uint64_t>>;
    
    // Reset statistics
    void reset();

private:
    std::vector<double> boundaries_;
    std::vector<std::atomic<uint64_t>> bucket_counts_;
    std::atomic<uint64_t> count_{0};
    std::atomic<double> sum_{0.0};
    std::atomic<double> min_{std::numeric_limits<double>::max()};
    std::atomic<double> max_{std::numeric_limits<double>::lowest()};
    mutable std::mutex mutex_;
};
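
The declaration above does not say how record() and percentile() work internally. A minimal single-threaded sketch follows. It assumes upper-bound-inclusive buckets (the same convention as the Prometheus le label), one extra overflow bucket, a lower edge of 0 for the first bucket, and linear interpolation inside the matching bucket. The name simple_histogram is hypothetical and is not part of the proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, single-threaded illustration; not the proposed histogram class.
class simple_histogram {
public:
    // boundaries: sorted bucket upper bounds. One overflow bucket is added
    // for values above the last boundary.
    explicit simple_histogram(std::vector<double> boundaries)
        : boundaries_(std::move(boundaries)),
          counts_(boundaries_.size() + 1, 0) {
        assert(std::is_sorted(boundaries_.begin(), boundaries_.end()));
    }

    void record(double value) {
        // First boundary >= value, so bucket i holds (b[i-1], b[i]].
        const auto it =
            std::lower_bound(boundaries_.begin(), boundaries_.end(), value);
        ++counts_[static_cast<std::size_t>(it - boundaries_.begin())];
        ++count_;
    }

    // Estimate the p-quantile (0 <= p <= 1) from the bucket counts.
    double percentile(double p) const {
        assert(p >= 0.0 && p <= 1.0);
        if (count_ == 0 || boundaries_.empty()) return 0.0;
        const double target = p * static_cast<double>(count_);
        std::uint64_t cumulative = 0;
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            if (counts_[i] == 0) continue;
            const std::uint64_t previous = cumulative;
            cumulative += counts_[i];
            if (static_cast<double>(cumulative) < target) continue;
            // The overflow bucket has no upper edge: report the last boundary.
            if (i == boundaries_.size()) return boundaries_.back();
            // Assumption: the first bucket starts at 0 (latencies are >= 0).
            const double lower = (i == 0) ? 0.0 : boundaries_[i - 1];
            const double upper = boundaries_[i];
            const double fraction = (target - static_cast<double>(previous)) /
                                    static_cast<double>(counts_[i]);
            return lower + fraction * (upper - lower);
        }
        return boundaries_.back();
    }

    std::uint64_t count() const { return count_; }

private:
    std::vector<double> boundaries_;
    std::vector<std::uint64_t> counts_;
    std::uint64_t count_ = 0;
};
```

Because only bucket counts are stored, the estimate can be off by up to the width of the bucket that contains the requested percentile, so bucket choice directly bounds the accuracy.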

2. Sliding Window Histogram

class sliding_histogram {
public:
    struct config {
        histogram::config hist_config;
        std::chrono::seconds window_duration{60};
        size_t bucket_count_per_window = 6;  // 10-second buckets
    };
    
    explicit sliding_histogram(config cfg);
    
    void record(double value);
    
    // Get statistics for current window
    [[nodiscard]] auto p50() const -> double;
    [[nodiscard]] auto p95() const -> double;
    [[nodiscard]] auto p99() const -> double;
    
private:
    struct time_bucket {
        histogram hist;
        std::chrono::steady_clock::time_point start_time;
    };
    
    std::deque<time_bucket> buckets_;
    config config_;
    mutable std::mutex mutex_;  // mutable so the const accessors can lock it
};
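
One possible way to rotate the time buckets is sketched below. It is an illustration only: the class name sliding_counts is hypothetical, the current time is passed in explicitly so the behaviour can be tested without sleeping, and each time slice stores plain per-bucket counts rather than a full histogram object. Locking is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical illustration of window rotation; not the proposed class.
class sliding_counts {
public:
    using clock = std::chrono::steady_clock;

    sliding_counts(std::vector<double> boundaries,
                   std::chrono::seconds window, std::size_t slices)
        : boundaries_(std::move(boundaries)),
          slice_length_(window / static_cast<std::chrono::seconds::rep>(slices)),
          max_slices_(slices) {
        assert(slices > 0 && slice_length_.count() > 0);
    }

    void record(double value, clock::time_point now) {
        rotate(now);
        const auto it =
            std::lower_bound(boundaries_.begin(), boundaries_.end(), value);
        ++slices_.back().counts[static_cast<std::size_t>(it - boundaries_.begin())];
    }

    // Per-bucket counts summed over every slice still inside the window.
    std::vector<std::uint64_t> merged(clock::time_point now) {
        rotate(now);
        std::vector<std::uint64_t> total(boundaries_.size() + 1, 0);
        for (const auto& s : slices_) {
            for (std::size_t i = 0; i < total.size(); ++i) total[i] += s.counts[i];
        }
        return total;
    }

private:
    struct slice {
        clock::time_point start;
        std::vector<std::uint64_t> counts;
    };

    void rotate(clock::time_point now) {
        // Open a new slice when none exists or the newest one has expired.
        if (slices_.empty() || now - slices_.back().start >= slice_length_) {
            slices_.push_back(
                {now, std::vector<std::uint64_t>(boundaries_.size() + 1, 0)});
        }
        // Drop slices that started a full window or more ago.
        const auto window =
            slice_length_ * static_cast<std::chrono::seconds::rep>(max_slices_);
        while (!slices_.empty() && now - slices_.front().start >= window) {
            slices_.pop_front();
        }
    }

    std::vector<double> boundaries_;
    std::chrono::seconds slice_length_;
    std::size_t max_slices_;
    std::deque<slice> slices_;
};
```

With a 60-second window split into 6 slices, data expires in 10-second steps, so the effective window covers between 50 and 60 seconds of samples. Percentiles for the window could then be computed from the merged counts.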

3. Integration with Metrics System

namespace metric_names {
    // Histogram metrics
    constexpr const char* LATENCY_HISTOGRAM = "network.latency.histogram";
    constexpr const char* CONNECTION_TIME_HISTOGRAM = 
        "network.connection_time.histogram";
    constexpr const char* REQUEST_DURATION_HISTOGRAM = 
        "network.request_duration.histogram";
}

class metric_reporter {
public:
    // Existing methods...
    
    // NEW: Histogram methods
    static void record_latency(double ms);
    static void record_connection_time(double ms);
    static void record_request_duration(double ms);
    
    // Get histogram statistics
    static auto get_latency_p50() -> double;
    static auto get_latency_p95() -> double;
    static auto get_latency_p99() -> double;
    
    // Get all histograms for export
    static auto get_all_histograms() 
        -> std::map<std::string, histogram_snapshot>;

private:
    static histogram& latency_histogram();
    static histogram& connection_time_histogram();
    static histogram& request_duration_histogram();
};

4. Histogram Snapshot for Export

struct histogram_snapshot {
    uint64_t count;
    double sum;
    double min;
    double max;
    std::map<double, double> percentiles;  // percentile -> value
    std::vector<std::pair<double, uint64_t>> buckets;  // boundary -> count
    
    // Serialize for export
    auto to_prometheus() const -> std::string;
    auto to_json() const -> std::string;
};

5. Usage Examples

// Record connection latency
auto start = std::chrono::steady_clock::now();
client->connect(host, port);
auto elapsed = std::chrono::steady_clock::now() - start;
metric_reporter::record_connection_time(
    std::chrono::duration<double, std::milli>(elapsed).count());

// Get percentiles for alerting
if (metric_reporter::get_latency_p99() > 100.0) {
    log_warning("P99 latency exceeds 100ms threshold");
}

// Export for Prometheus
auto snapshot = metric_reporter::get_all_histograms();
for (const auto& [name, hist] : snapshot) {
    std::cout << hist.to_prometheus() << "\n";
}

6. Prometheus Format Output

# HELP network_latency_histogram Network operation latency in milliseconds
# TYPE network_latency_histogram histogram
network_latency_histogram_bucket{le="1"} 100
network_latency_histogram_bucket{le="5"} 250
network_latency_histogram_bucket{le="10"} 400
network_latency_histogram_bucket{le="25"} 480
network_latency_histogram_bucket{le="50"} 495
network_latency_histogram_bucket{le="100"} 499
network_latency_histogram_bucket{le="+Inf"} 500
network_latency_histogram_sum 12500
network_latency_histogram_count 500
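
Note that the _bucket values are cumulative: each le line counts every observation at or below that bound. A rough sketch of a serializer that produces this layout from per-bucket counts is shown below. The free function format_prometheus and its parameters are hypothetical; the proposal exposes the equivalent as histogram_snapshot::to_prometheus().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper. counts holds per-bucket (non-cumulative) counts and
// has one trailing entry for the overflow bucket above the last boundary.
std::string format_prometheus(const std::string& name,
                              const std::string& help,
                              const std::vector<double>& boundaries,
                              const std::vector<std::uint64_t>& counts,
                              double sum) {
    assert(counts.size() == boundaries.size() + 1);
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n';
    out << "# TYPE " << name << " histogram\n";

    // Convert per-bucket counts into running totals, one line per bound.
    std::uint64_t cumulative = 0;
    for (std::size_t i = 0; i < boundaries.size(); ++i) {
        cumulative += counts[i];
        out << name << "_bucket{le=\"" << boundaries[i] << "\"} "
            << cumulative << '\n';
    }

    // The +Inf bucket includes the overflow bucket and equals the total count.
    cumulative += counts.back();
    out << name << "_bucket{le=\"+Inf\"} " << cumulative << '\n';
    out << name << "_sum " << sum << '\n';
    out << name << "_count " << cumulative << '\n';
    return out.str();
}
```

Calling it with boundaries {1, 5, 10, 25, 50, 100}, per-bucket counts {100, 150, 150, 80, 15, 4, 1}, and a sum of 12500 reproduces the sample output above.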

Default Bucket Boundaries

Optimized for network latencies:

// Milliseconds: sub-ms to 10 seconds
{0.1, 0.5, 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000}
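
The list above is hand-picked. For the generated case (min_value, max_value, bucket_count in histogram::config), a geometric progression is one option. The helper below is a sketch under that assumption; the name exponential_boundaries is hypothetical, and it requires a strictly positive lower bound because a geometric series cannot start at 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns `count` bucket upper bounds spaced
// geometrically from min_value to max_value (both inclusive).
std::vector<double> exponential_boundaries(double min_value,
                                           double max_value,
                                           std::size_t count) {
    assert(min_value > 0.0 && max_value > min_value && count >= 2);
    // Constant ratio between consecutive bounds.
    const double ratio =
        std::pow(max_value / min_value, 1.0 / static_cast<double>(count - 1));

    std::vector<double> result;
    result.reserve(count);
    double bound = min_value;
    for (std::size_t i = 0; i + 1 < count; ++i) {
        result.push_back(bound);
        bound *= ratio;
    }
    // Pin the last bound so accumulated rounding error cannot move it.
    result.push_back(max_value);
    return result;
}
```

For example, exponential_boundaries(0.1, 10000.0, 6) yields bounds close to 0.1, 1, 10, 100, 1000, and exactly 10000 for the last one.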

Tasks

  • Implement histogram class
  • Implement sliding_histogram class
  • Add thread-safe atomic operations
  • Implement percentile calculation
  • Add histogram metrics to metric_reporter
  • Instrument key operations (connect, send, receive)
  • Add Prometheus format export
  • Add JSON format export
  • Unit tests for histogram accuracy
  • Benchmark performance overhead
  • Documentation
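
For the thread-safe atomic operations task, the sum_, min_ and max_ members are std::atomic<double>. C++20 adds fetch_add for floating-point atomics, but there is no built-in atomic min or max. A compare-exchange loop works on C++11 and later. The helper names below are hypothetical and are meant as a sketch, not as the final implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers showing lock-free updates of std::atomic<double>.

inline void atomic_add(std::atomic<double>& target, double value) {
    double current = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `current`, so the next
    // attempt adds to the freshly observed value.
    while (!target.compare_exchange_weak(current, current + value,
                                         std::memory_order_relaxed)) {
    }
}

inline void atomic_min(std::atomic<double>& target, double value) {
    double current = target.load(std::memory_order_relaxed);
    // Stop as soon as the stored value is already <= value.
    while (value < current &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
    }
}

inline void atomic_max(std::atomic<double>& target, double value) {
    double current = target.load(std::memory_order_relaxed);
    // Stop as soon as the stored value is already >= value.
    while (value > current &&
           !target.compare_exchange_weak(current, value,
                                         std::memory_order_relaxed)) {
    }
}
```

Relaxed ordering is used here on the assumption that these statistics do not need to synchronize with other memory. Each field is updated independently, so a reader may briefly see a count and a sum that come from slightly different moments.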

Acceptance Criteria

  • Histograms capture value distributions
  • Percentiles (p50, p95, p99) calculated accurately, within the resolution of the containing bucket
  • Thread-safe for concurrent recording
  • Minimal performance overhead (<1% CPU)
  • Prometheus-compatible export
  • All tests pass

Files to Create/Modify

  • New: include/kcenon/network/metrics/histogram.h
  • New: include/kcenon/network/metrics/sliding_histogram.h
  • New: src/metrics/histogram.cpp
  • Modify: include/kcenon/network/metrics/network_metrics.h
  • Modify: src/metrics/network_metrics.cpp
  • New: tests/unit/test_histogram.cpp

Related
