3 changes: 2 additions & 1 deletion docs/docs.json
@@ -461,7 +461,8 @@
"pages": [
"router/traffic-shaping",
"router/traffic-shaping/retry",
"router/traffic-shaping/timeout"
"router/traffic-shaping/timeout",
"router/traffic-shaping/circuit-breaker"
]
},
"router/storage-providers",
38 changes: 38 additions & 0 deletions docs/router/configuration.mdx
@@ -532,6 +532,8 @@ telemetry:
| METRICS_OTLP_GRAPHQL_CACHE | graphql_cache | <Icon icon="square" /> | Enable the collection of metrics for the GraphQL operation router caches. | false |
| METRICS_OTLP_EXCLUDE_METRICS | exclude_metrics | <Icon icon="square" /> | The metrics to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] |
| METRICS_OTLP_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | The metric labels to exclude from the OTEL metrics. Accepts a list of Go regular expressions. Use https://regex101.com/ to test your regular expressions. | [] |
| METRICS_OTLP_CONNECTION_STATS | connection_stats | <Icon icon="square" /> | Enable connection metrics. | false |
| METRICS_OTLP_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Enable circuit breaker metrics for OTEL. | false |

### Attributes

@@ -575,9 +577,11 @@ telemetry:
| PROMETHEUS_HTTP_PATH | path | <Icon icon="square" /> | The HTTP path where metrics are exposed. | "/metrics" |
| PROMETHEUS_LISTEN_ADDR | listen_addr | <Icon icon="square" /> | The address to listen on for the prometheus metrics endpoint. | "127.0.0.1:8088" |
| PROMETHEUS_GRAPHQL_CACHE | graphql_cache | <Icon icon="square" /> | Enable the collection of metrics for the GraphQL operation router caches. | false |
| PROMETHEUS_CONNECTION_STATS | connection_stats | <Icon icon="square" /> | Enable connection metrics. | false |
| PROMETHEUS_EXCLUDE_METRICS | exclude_metrics | <Icon icon="square" /> | | |
| PROMETHEUS_EXCLUDE_METRIC_LABELS | exclude_metric_labels | <Icon icon="square" /> | | |
| PROMETHEUS_EXCLUDE_SCOPE_INFO | exclude_scope_info | <Icon icon="square" /> | Exclude scope info from Prometheus metrics. | false |
| PROMETHEUS_CIRCUIT_BREAKER | circuit_breaker | <Icon icon="square" /> | Enable circuit breaker metrics for Prometheus. | false |

### Example YAML config:

@@ -1127,6 +1131,19 @@ traffic_shaping:
max_attempts: 5
interval: 3s
max_duration: 10s
# Circuit Breaker
circuit_breaker:
enabled: true
request_threshold: 20
error_threshold_percentage: 50
sleep_window: 30s
half_open_attempts: 5
required_successful: 3
rolling_duration: 60s
num_buckets: 10
execution_timeout: 60s
max_concurrent_requests: -1

subgraphs: # allows you to create subgraph specific traffic shaping rules
products: # Will only affect this subgraph, and override the options in "all" for that subgraph
request_timeout: 60s
@@ -1139,6 +1156,9 @@ traffic_shaping:
max_idle_conns: 1024
max_conns_per_host: 100
max_idle_conns_per_host: 20
# Circuit breakers can also be configured per subgraph, using the same options as above
circuit_breaker:
enabled: false
```

### Subgraph Request Rules
@@ -1148,6 +1168,7 @@ These rules apply to requests being made from the Router to all Subgraphs.
| Environment Variable | YAML | Required | Description | Default Value |
| -------------------- | ------------------------- | --------------------------------------------- | ----------------------------------------------------------------------------------- | ------------- |
| | retry | <Icon icon="square" /> | [#traffic-shaping-jitter-retry](/router/configuration#traffic-shaping-jitter-retry) | |
| | circuit_breaker | <Icon icon="square" /> | [#circuit-breaker](/router/configuration#circuit-breaker) | |
| | request_timeout | <Icon icon="square-check" iconType="solid" /> | | 60s |
| | dial_timeout | <Icon icon="square" /> | | 30s |
| | response_header_timeout | <Icon icon="square" /> | | 0s |
@@ -1175,6 +1196,23 @@ In addition to the general traffic shaping rules, we also allow users to set sub
| | max_conns_per_host | <Icon icon="square" /> | | 100 |
| | max_idle_conns_per_host | <Icon icon="square" /> | | 20 |

### Circuit Breaker

Configure the circuit breaker either for all subgraphs or for a specific subgraph. For more information, see the [circuit breaker documentation](/router/traffic-shaping/circuit-breaker).

| Environment Variable | YAML | Required | Description | Default Value |
| -------------------- | -------------------------- | -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| | enabled | <Icon icon="square" /> | Enable the circuit breaker for the target (all subgraphs or a specific subgraph). | false |
| | error_threshold_percentage | <Icon icon="square" /> | The percentage of failed requests within the rolling window that triggers the circuit to open. For example, with a 50% threshold and 100 total requests in the window, the circuit opens when 50 or more requests fail. | 50 |
| | request_threshold | <Icon icon="square" /> | The minimum number of requests that must be received before the circuit breaker evaluates error rates for potential state transitions. | 20 |
| | sleep_window | <Icon icon="square" /> | How long the circuit blocks all requests after transitioning to the open state. This gives the failing service time to recover without being overwhelmed by continued request attempts. | 5s |
| | half_open_attempts | <Icon icon="square" /> | The number of test requests allowed in the half-open state. | 1 |
| | required_successful | <Icon icon="square" /> | The number of successful requests needed to close the circuit from the half-open state. Works in conjunction with `half_open_attempts` to determine recovery behavior. | 1 |
| | rolling_duration | <Icon icon="square" /> | The time window for collecting error and request statistics. Only data from within this window influences circuit breaker decisions, so the circuit responds to current conditions rather than to historical problems that may already be resolved. | 10s |
| | num_buckets | <Icon icon="square" /> | The number of buckets used for statistics in the rolling window. A higher value gives finer granularity. | 10 |
| | execution_timeout | <Icon icon="square" /> | The maximum time allowed before a request is marked as failed due to timeout. This timeout is used only for circuit breaker error tracking and operates independently of actual request timeouts. | 60s |
| | max_concurrent_requests | <Icon icon="square" /> | The maximum number of concurrent requests that the circuit breaker will process. The default value of -1 means there is no limit. | -1 |
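
To make the interaction between `request_threshold` and `error_threshold_percentage` concrete, here is a minimal Python sketch of the rule described in the table, evaluated for a single rolling window. It is an illustration only, not the router's implementation. The function names `should_open` and `bucket_width_seconds` are hypothetical, and two details are assumptions: that a window with exactly `request_threshold` requests is already evaluated, and that buckets split `rolling_duration` evenly.

```python
def should_open(total_requests: int, failed_requests: int,
                request_threshold: int = 20,
                error_threshold_percentage: int = 50) -> bool:
    """Return True if the window counts would trip the circuit."""
    # Below the minimum request volume, error rates are not evaluated.
    # Assumption: exactly `request_threshold` requests is enough to evaluate.
    if total_requests == 0 or total_requests < request_threshold:
        return False
    error_percentage = 100 * failed_requests / total_requests
    return error_percentage >= error_threshold_percentage


def bucket_width_seconds(rolling_duration_s: float = 10,
                         num_buckets: int = 10) -> float:
    """Width of one statistics bucket, assuming buckets split the window evenly."""
    return rolling_duration_s / num_buckets
```

With the defaults from the table, `should_open(100, 50)` is true (the 50-of-100 example above), while `should_open(10, 10)` is false because the window has not yet seen 20 requests, even though every request failed.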

### Jitter Retry

| Environment Variable | YAML | Required | Description | Default Value |
18 changes: 18 additions & 0 deletions docs/router/metrics-and-monitoring.mdx
@@ -49,6 +49,24 @@ Additionally, we expose the following metric:
* `wg_router_version`: The version of the router that is running
* `wg_feature_flag`: (Optional) The name of the feature flag if this is a feature flag configuration

#### Circuit Breaker specific metrics

We currently expose two metrics for monitoring circuit breakers. To enable them, set at least one of the following options:
```yaml
telemetry:
  metrics:
    otlp:
      circuit_breaker: true
    prometheus:
      circuit_breaker: true
```

All of the metrics below carry the `wg.subgraph.name` dimension. Because a circuit breaker can be shared across subgraphs that have the same routing URL, this dimension is a string slice rather than a single string.

* `router.circuit_breaker.state`: The current state of a circuit. `0` means the circuit is not open, and `1` means it is open.
* `router.circuit_breaker.short_circuits`: The number of requests for this circuit that failed without being processed because the circuit was open.
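
The following Python sketch illustrates why `wg.subgraph.name` is a list: if breakers are keyed by routing URL, every subgraph that shares a URL ends up on the same breaker. This is a conceptual model only, not the router's code. The function name and the subgraph names and URLs are hypothetical.

```python
from collections import defaultdict


def group_subgraphs_by_routing_url(subgraphs: dict[str, str]) -> dict[str, list[str]]:
    """Map each routing URL to the sorted names of the subgraphs that use it."""
    grouped = defaultdict(list)
    for name, routing_url in subgraphs.items():
        grouped[routing_url].append(name)
    return {url: sorted(names) for url, names in grouped.items()}


# Hypothetical subgraphs: two of them share one routing URL.
subgraphs = {
    "products": "http://catalog.internal/graphql",
    "inventory": "http://catalog.internal/graphql",
    "reviews": "http://reviews.internal/graphql",
}
breakers = group_subgraphs_by_routing_url(subgraphs)
```

In this model, the breaker for `http://catalog.internal/graphql` would report both `inventory` and `products` in its `wg.subgraph.name` dimension.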


#### GraphQL specific metrics

`router.graphql.operation.planning_time`: Time taken to plan the operation. An additional attribute `wg.engine.plan_cache_hit` indicates if the plan was served from the cache.
@@ -43,6 +43,15 @@ These metrics ensure efficient request handling, operation planning, and system

* The value will always be 1.

#### Circuit Breaker Metrics

* `router_circuit_breaker_state`: Info metric that provides information on the current state of the circuit breaker.

* This indicates the current state of a circuit: `0` represents closed, and `1` represents open.

* `router_circuit_breaker_short_circuits`: Metric for the number of short-circuited requests.

* This indicates how many requests for this circuit have failed without even being processed because the circuit was open.

### GraphQL Operation Cache Metrics

@@ -106,7 +115,7 @@ telemetry:

* `router_http_client_connection_max`: Static configuration values with the maximum connections allowed per host with a subgraph dimension.

* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While its less common, multiple subgraphs can share the same host, which is why both dimensions are included.
* `router_http_client_connection_active`: The number of currently active connections, grouped by both subgraph and host. A connection is considered active once it has completed DNS resolution, TLS handshake, and dialing. While it's less common, multiple subgraphs can share the same host, which is why both dimensions are included.

* `router_http_client_connection_acquire_duration`: The time in ms that a connection took to initialize, including DNS resolution, the TLS handshake, and dialing the host.

@@ -383,9 +392,51 @@ This means that we can assume for the last N (in this case 20) seconds that ther
**Reason for Monitoring:**

1. **Router Change Detection:** Monitoring whenever a new router execution configuration was pushed to the router.

2. **Uptime Detection:** Whenever the router is running, the `router_info` metric is available. This can be used to detect when the router is down.

## `router_circuit_breaker_state`

**Description:**

Indicates the current state of the circuit breaker for a subgraph. `0` means the circuit is closed (requests are allowed), and `1` means the circuit is open (requests are blocked). Includes the `wg_subgraph` and `wg_feature_flag` dimensions for granular monitoring.

**Example PromQL Query:**

```bash
max by(wg_subgraph) (router_circuit_breaker_state)
```

**Reason for Monitoring:**

- Alert when a subgraph's circuit breaker is open (value is `1`).
- Track the health of subgraphs and feature-flagged variants.

**Error Cases Addressed:**

* Subgraph is unhealthy or experiencing repeated failures.
* Circuit breaker is open for extended periods.
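
As a sketch of how the query above could feed an alerting script, the Python below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and extracts the subgraphs whose circuit is open from a response in the standard instant-vector format. The helper names and the `http://localhost:9090` address are assumptions for illustration. Fetching the URL is left out so the example stays self-contained.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def open_circuits(response: dict) -> list[str]:
    """Return the `wg_subgraph` label of every sample whose value is 1 (open)."""
    open_subgraphs = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # Prometheus encodes values as strings
        if float(value) == 1.0:
            open_subgraphs.append(sample["metric"].get("wg_subgraph", "<unknown>"))
    return open_subgraphs


# Hypothetical Prometheus address.
url = instant_query_url(
    "http://localhost:9090",
    "max by(wg_subgraph) (router_circuit_breaker_state)",
)
```

Passing the decoded JSON body of that request to `open_circuits` yields the list of subgraphs to alert on.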

## `router_circuit_breaker_short_circuits`

**Description:**

Counts how many requests have been immediately failed (short-circuited) because the circuit was open. Includes the `wg_subgraph` and `wg_feature_flag` dimensions for granular monitoring.

**Example PromQL Query:**

```bash
increase(router_circuit_breaker_short_circuits[5m])
```

**Reason for Monitoring:**

- Alert if many requests are being short-circuited, indicating persistent subgraph issues.
- Understand the impact of circuit breaker activity on request flow over time.

**Error Cases Addressed:**

* High rate of short-circuited requests due to subgraph instability.
* Circuit breaker is frequently protecting the system from cascading failures.
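
To show what the `increase(...[5m])` query measures, the Python sketch below computes the growth of a counter across a series of scraped samples, treating any drop as a counter reset (for example, after a router restart). It assumes `router_circuit_breaker_short_circuits` behaves as a monotonically increasing counter, and it is a simplification: PromQL's `increase` also extrapolates to the window boundaries, which this version omits. The function name is hypothetical.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across consecutive counter samples, handling resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so it restarted from zero; count the new value.
            total += current
    return total
```

For the samples `[10, 15, 3, 8]`, the function counts 5 before the reset, then 3 and 5 after it, for a total of 13 short-circuited requests.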

## `go_memstats_alloc_bytes`

2 changes: 1 addition & 1 deletion docs/router/open-telemetry/custom-attributes.mdx
@@ -135,7 +135,7 @@ cors:
- "x-service"

telemetry:
trace:
tracing:
attributes:
- key: service
default: "static"