outbound: Configure balancers with service metadata #2374
Conversation
When performing policy-based routing, proxies may dispatch requests through per-route backend configurations. In order to illustrate how routing rules apply and how backend distributions are being honored, this change adds two new metrics: * `outbound_http_route_backend_requests_total`; and * `outbound_grpc_route_backend_requests_total` Each of these metrics includes labels that identify a routes parent (i.e. a Service), the route resource being used, and the backend resource being used. This implementation does NOT implement any form of metrics eviction for these new metrics. This is tolerable for the short term as the cardinality of services and routes is generally much less than the cardinality of individual endpoints (where we do require timeout/eviction for metrics).
There are no explicit indications of failure accrual related behavior.
This change adds INFO-level logs when a failure accrual breaker is
tripped or reeopened. These log messages are scoped within the
'endpoint' log context so that log lines include endpoint and balancer
addresses.
INFO balance{addr=logical.test.svc.cluster.local:666}:endpoint{addr=192.0.2.41:666}:consecutive_failures: linkerd_app_outbound::http::breaker::consecutive_failures: Consecutive failure-accrual breaker tripped
INFO balance{addr=logical.test.svc.cluster.local:666}:endpoint{addr=192.0.2.41:666}:consecutive_failures: linkerd_app_outbound::http::breaker::consecutive_failures: Consecutive failure-accrual breaker reopened
The OutboundPolicies API includes resource references with its
responses. These references allow us to accurately identify an arbitrary
(Kubernetes style) resource by its group, kind, namespace, and name.
To support new metrics that include these resource coordinates as
labels, we need access to the parent and backend references from the
balancer stack.
This change makes the following changes to accomodate this:
1. We introduce `ParentRef`, `BackendRef`, and `EndpointRef` new-types
to serve as explicit markers for different types of metadata we may
encounter.
2. The profile stack is updated to parse out metadata from service names
in the form <name>.<ns>.svc.*.
3. We also support discovering pod metadata from the endpoint labels
`dst_pod` and `dst_namespace`.
4. When we can't infer any metadata from a profile response, we use an
"unknown" metadata.
5. When using policy, we use control-plane provided metadata.
In practice, pod metadata won't be surfaced by the existing code path;
but we have a relatively simple path forward to using it.
This change splits the balancer stack into its own layer so it's
distinct from the rest of the concrete stack configuration.
There are only minor functional changes in this branch:
1. The `balance{addr=...}` tracing context is replace by
`service{ns=..,name=...}` for clarity;
2. Errors returned from the balance stack are now rendered as:
"Service <name>.<ns>: <error>"
This comment was marked as outdated.
This comment was marked as outdated.
hawkw
left a comment
There was a problem hiding this comment.
Overall this looks good to me! I left a handful of fairly minor comments.
| policy::BackendDispatcher::Fail { ref message } => { | ||
| mk_concrete(concrete::Dispatch::Fail { | ||
| policy::BackendDispatcher::Fail { ref message } => mk_concrete( | ||
| BackendRef(policy::Meta::new_default("fail")), |
There was a problem hiding this comment.
not a blocker for now, but it occurs to me that the failing backends could potentially have non-default BackendRef metadata, since they're returned by the controller when a backend service doesn't exist...so they should have the metadata of the referenced (non-existent) service.
| } | ||
|
|
||
| impl EndpointRef { | ||
| fn new(md: &Metadata, port: NonZeroU16) -> Self { |
There was a problem hiding this comment.
nit, take it or leave it (definitely not important): since this always returns metadata for a Pod, should this maybe be
| fn new(md: &Metadata, port: NonZeroU16) -> Self { | |
| fn new_pod(md: &Metadata, port: NonZeroU16) -> Self { |
or something?
| }; | ||
| let name = match md.labels().get("dst_pod") { | ||
| Some(pod) => pod.clone(), | ||
| None => return Self(UNKNOWN_META.clone()), |
There was a problem hiding this comment.
not that this will ever happen in practice, since we expect the controller to populate both the namespace and pod labels, but would it make sense to return a metadata that has the namespace field populated but the name field set to <unknown>, rather than throwing out the information that we do have (the ns and the kind)?
There was a problem hiding this comment.
Yes probably but way out of scope for now
This proxy release adds new `outbound_http_route_backend_requests_total` and `outbound_grpc_route_backend_requests_total` metrics, which can be used to track how routing rules and backend distributions apply to requests. These metrics contain labels describing the route's parent (i.e. a Service), the route resource being used, and the backend resource being used by each request. In addition, this proxy release adds new `INFO`-level logs for failure accrual, and updates logging contexts with Kubernetes resource metadata. --- * outbound: Emit INFO-level logs on failure accrual changes (linkerd/linkerd2-proxy#2373) * app: Rename Metrics types as InboundMetrics and OutboundMetrics (linkerd/linkerd2-proxy#2376) * outbound: Configure balancers with service metadata (linkerd/linkerd2-proxy#2374) * outbound: Report per-route-backend request count metrics (linkerd/linkerd2-proxy#2372) * outbound: Report per-route-backend request count metrics (linkerd/linkerd2-proxy#2377)
This proxy release adds new `outbound_http_route_backend_requests_total` and `outbound_grpc_route_backend_requests_total` metrics, which can be used to track how routing rules and backend distributions apply to requests. These metrics contain labels describing the route's parent (i.e. a Service), the route resource being used, and the backend resource being used by each request. In addition, this proxy release adds new `INFO`-level logs for failure accrual, and updates logging contexts with Kubernetes resource metadata. --- * outbound: Emit INFO-level logs on failure accrual changes (linkerd/linkerd2-proxy#2373) * app: Rename Metrics types as InboundMetrics and OutboundMetrics (linkerd/linkerd2-proxy#2376) * outbound: Configure balancers with service metadata (linkerd/linkerd2-proxy#2374) * outbound: Report per-route-backend request count metrics (linkerd/linkerd2-proxy#2372) * outbound: Report per-route-backend request count metrics (linkerd/linkerd2-proxy#2377)
Looks like we accidentally merged PR #2375 without a CI build against the latest state of `main`. In the meantime since #2375 was last built on CI, PR #2374 added an additional metadata field to `policy::HttpParams`, which the `HttpParams` constructed in the test added from #2375 doesn't populate. Therefore, merging this PR broke the build. Whoops! This commit populates the `meta` field, fixing it.
Looks like we accidentally merged PR #2375 without a CI build against the latest state of `main`. In the meantime since #2375 was last built on CI, PR #2374 added an additional metadata field to `policy::HttpParams`, which the `HttpParams` constructed in the test added from #2375 doesn't populate. Therefore, merging this PR broke the build. Whoops! This commit populates the `meta` field, fixing it.
The OutboundPolicies API includes resource references with its
responses. These references allow us to accurately identify an arbitrary
(Kubernetes style) resource by its group, kind, namespace, and name.
To support new metrics that include these resource coordinates as
labels, we need access to the parent and backend references from the
balancer stack.
This change makes the following changes to accommodate this:
ParentRef,RouteRef,BackendRef, andEndpointRefnew-types to serve as explicit markers fordifferent types of metadata we may encounter.
in the form ..svc.*.
dst_podanddst_namespace."unknown" metadata.
In practice, pod metadata won't be surfaced by the existing code path;
but we have a relatively simple path forward to using it.
This change splits the balancer stack into its own layer so it's
distinct from the rest of the concrete stack configuration.
There are only minor functional changes in this branch:
balance{addr=...}tracing context is replace byservice{ns=..,name=...}for clarity;Service {name}.{ns}: {error}