Skip to content

Add dst and route labels to request and response metrics#296

Closed
adleong wants to merge 4 commits intomasterfrom
alex/dst-metrics
Closed

Add dst and route labels to request and response metrics#296
adleong wants to merge 4 commits intomasterfrom
alex/dst-metrics

Conversation

@adleong
Copy link
Member

@adleong adleong commented Aug 5, 2019

The request_total and response_total metrics are measured in the endpoint stack and do not include a label which indicates the dst of the request. These metrics do have an authority label, but this label's value is taken from the target of the endpoint stack. In the case of a traffic-split, the authority label corresponds to the overridden (concrete) dst, and not to the authority of the request itself. In order to know the logical dst of the request, a new label is needed.

We add a dst label by recording the target of the dst stack into the request extensions. We then extract this value in the endpoint stack and use it as a label value. Similarly, we also record the route (the target of the route stack) into the request extensions and then use this as a label in the endpoint stat metrics as well. The route label is not related to traffic-splits, but seems like useful information anyway.

Labels

authority: the concrete dst (authors-v1.default.svc.cluster.local:7001)
dst_service: the concrete service (authors-v1)
dst: the logical dst (authors.default.svc.cluster.local:7001) [NEW!]
rt_rotue: the logical route (GET /authors.json) [NEW!]

One side-effect of this change is that the default route now has a route label with an empty value. This means that proxy metrics for requests without a route will now include the rt_route="" label. Prometheus seems to filter out these empty value labels and these are not present in Prometheus.

adleong added 4 commits August 5, 2019 11:04
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@adleong adleong requested review from olix0r and scottcarol August 5, 2019 22:00
@adleong adleong self-assigned this Aug 5, 2019
@scottcarol
Copy link

Excited to check this out. What's the best way to test this out and look at the Prometheus stats? This repo's README recommends using Cargo and make build, are there more steps I need to do after that to run with my local Linkerd2 branch?

@adleong
Copy link
Member Author

adleong commented Aug 5, 2019

The easiest way would be to use the proxy image I've already pushed: --proxy-version alex-dst-metrics-3. You can use this flag with either linkerd install or linkerd upgrade

@scottcarol
Copy link

Can you confirm that I'm testing this correctly? I have a trafficsplit setup in booksapp; requests are being split between authors and authors-clone service. The apex is authors.

I installed:

linkerd upgrade --proxy-version alex-dst-metrics-3 | kubectl apply -f -

Then ran linkerd check --proxy which passed, with these warnings:

‼ data plane is up-to-date
    linkerd/linkerd-controller-6495cf7bc4-z724w: unsupported version channel: alex-dst-metrics-3
    see https://linkerd.io/checks/#l5d-data-plane-version for hints
‼ data plane and cli versions match
    linkerd/linkerd-web-6486cb6d7c-8jplx running alex-dst-metrics-3 but cli running edge-19.8.1
    see https://linkerd.io/checks/#l5d-data-plane-cli-version for hints

Then I port forwarded Prometheus:

kubectl -n linkerd port-forward $(kubectl --namespace=linkerd get po --selector=linkerd.io/control-plane-component=prometheus -o jsonpath='{.items[*].metadata.name}') 9090:9090

When I went to localhost:9090, this request:

sum(irate(request_total{direction="outbound",namespace="default",dst_service!=""}[1m]))by(namespace, dst_service, authority) 

gave me the results :

{authority="authors-clone.default.svc.cluster.local:7001",dst_service="authors-clone",namespace="default"} | 0.7000500070005
{authority="authors.default.svc.cluster.local:7001",dst_service="authors",namespace="default"} | 6.000160060001601
{authority="books.default.svc.cluster.local:7002",dst_service="books",namespace="default"} | 8.200740082007401
{authority="webapp.default.svc.cluster.local:7000",dst_service="webapp",namespace="default"} | 7.401480296059212

I was expecting one of the result rows to be going from authority authors to dst_service authors-clone.

@adleong
Copy link
Member Author

adleong commented Aug 5, 2019

@scottcarol your setup is correct except that you should be looking for a result row that goes from dst=authors.default.svc.cluster.local:7001 to dst_service=authors-clone. The (unfortunately named) authority label and the dst_service will always match because they both refer to the concrete destination.

@scottcarol
Copy link

👍 Got it, just needed to delete the pods in my namespace so they could restart. The correct data is showing up!

{dst="authors.default.svc.cluster.local:7001",dst_service="authors-clone",namespace="default"} | 0.7002811659432091
{dst="authors.default.svc.cluster.local:7001",dst_service="authors",namespace="default"} | 5.806794395816288
{dst="books.default.svc.cluster.local:7002",dst_service="books",namespace="default"} | 7.89981257158902
{dst="webapp.default.svc.cluster.local:7000",dst_service="webapp",namespace="default"} | 7.22383866760309

@scottcarol
Copy link

Nice, and I'm seeing rt_route in my Prometheus metrics when I created a service profile. 💯


impl FmtLabels for dst::Route {
fn fmt_labels(&self, f: &mut fmt::Formatter) -> fmt::Result {
let labels = prefix_labels("rt", self.labels().as_ref().into_iter());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was your thinking behind the label name rt_route? Is the rt for route as well? target_route seems easier to understand to me, would that be technically correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rt_route label already exists on the route (route_response_total et al.) metrics. This change just puts that label on the endpoint (response_total et al.) metrics as well.

The rt prefix indicates that this is a route label and route is the name of the route label. The naming is not terribly obvious, but since the metrics are only user-facing for advanced users who are interacting directly with prometheus, the standards for clarity can be somewhat relaxed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation. Do we have a doc anywhere where we list the available user-facing labels, and what they mean? Happy to do this if there isn't one.

impl Default for Labels {
fn default() -> Self {
let mut labels = IndexMap::with_capacity(1);
labels.insert("route".to_string(), "".to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without more context, this is super weird. Why is there a magic string in here? Why is this necessary at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, why is it necessary to hydrate an empty label?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if a route has no labels, its FmtLabels implementation will format empty which will cause the total label string to include ,, which breaks Prometheus. Adding an empty label is hack to work around this.

last_update: Instant,
total: Counter,
by_retry_skipped: IndexMap<RetrySkipped, Counter>,
by_dst: IndexMap<Option<Addr>, DstMetrics<R, C>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why the metrics registry has to be aware of any of this... It seems like a it should all be a detail of the stack targets...

Do we not have both the logical and concrete dsts in the endpoint stack target? If not, can we?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have the logical and concrete dsts in the endpoint stack target, just the concrete. We could have the logical dst in there, but this would make it so that concrete dst load balancers would not be shared between logical dsts.

This is why we propagate the logical dst in the request extensions. Having the logical dst in the target of the endpoint stack would certainly simplify this a lot. Not sure how to evaluate the tradeoff, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion: because a concrete lb is not shared across multiple logical names, we actually can have all of the needed metadata in the endpoint stack's target, which should greatly simplify the change.

@olix0r
Copy link
Member

olix0r commented Aug 6, 2019

Yakshave:

  • authority is intended to reflect the logical destination (i.e. what the request refers to). That it refers to the concrete dst is an accident introduced by traffic split.
  • dst_* is generally used as a prefix in our metrics, so i'd prefer to continue that with i.e. dst_concrete
  • I don't understand what rt_route has to do with this change. Can it be split into a separate change? Or is this coupled in some way?

@adleong
Copy link
Member Author

adleong commented Aug 6, 2019

rt_route isn't necessarily coupled to this change. While plumbing logical dst through the request extensions, I just though I would bring rt_route along for the ride since it seems like potentially useful information and is similar to logical dst in that they are both from the dst-route stack target. I can take it off this PR if you think it doesn't make sense here.

@olix0r
Copy link
Member

olix0r commented Aug 6, 2019

rt_route isn't necessarily coupled to this change.

let's split out that change, then. there's likely a simplification for that change as well, but it will be easier to address in a follow-up change.

@adleong
Copy link
Member Author

adleong commented Aug 6, 2019

Closed in favor of #300

@adleong adleong closed this Aug 6, 2019
adleong added a commit that referenced this pull request Aug 7, 2019
Supersedes #296

The authority label reflects the concrete dst of the request in the endpoint stack.  In the case of a traffic split, this may be different from the actual authority of the request (i.e. the logical dst).

We add the logical dst as a field on the endpoint target and use this value for the authority label instead of the concrete dst.

The net result is that the authority field reflects the apex service of a traffic split and the `dst_service` label reflects the leaf service.  

Signed-off-by: Alex Leong <alex@buoyant.io>
@olix0r olix0r deleted the alex/dst-metrics branch August 17, 2019 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants