Skip to content

tracing: inject context after decoding headers#5529

Closed
cgilmour wants to merge 1 commit intoenvoyproxy:masterfrom
DataDog:cgilmour/fix-healthcheck-sampled
Closed

tracing: inject context after decoding headers#5529
cgilmour wants to merge 1 commit intoenvoyproxy:masterfrom
DataDog:cgilmour/fix-healthcheck-sampled

Conversation

@cgilmour
Copy link
Copy Markdown
Contributor

@cgilmour cgilmour commented Jan 8, 2019

Signed-off-by: Caleb Gilmour cgilmour@gmail.com

Description: A simple fix for #5504 by making the header injection occur after the healthcheck filter extension has examined the request.

Risk Level: Low
Testing: Unit tests, E2E for datadog and zipkin tracers.

Docs Changes: N/A
Release Notes: N/A

Fixes #5504

Signed-off-by: Caleb Gilmour <cgilmour@gmail.com>
@mattklein123
Copy link
Copy Markdown
Member

I don't think this is the correct fix, as the headers may have already been forwarded to upstream by the time you do the injection.

We do already have: https://github.com/envoyproxy/envoy/blob/master/include/envoy/tracing/http_tracer.h#L23, but from my few minutes look at the code, it seems like we only end up using this for access logging at the end of a request and not for finalizing the span (I really thought we excluded HC requests from being traced so now I'm confused and might be missing something that I've forgotten).

I think we might need to think a bit more deeply about this issue. It does seem like we should have a way of deferring span creation until some set of filters has run. Since the HTTP filter API has access to the active span, I wonder if instead the HC filter should somehow cancel the span or effect it somehow dirctly?

@objectiser @rnburn any thoughts here?

@mattklein123 mattklein123 self-assigned this Jan 8, 2019
@mattklein123
Copy link
Copy Markdown
Member

OK I see the HC filter is already modifying the span by setting the sampling to false. Hmm yeah this is tricky. Curious what thoughts @objectiser and @rnburn have.

@cgilmour
Copy link
Copy Markdown
Contributor Author

cgilmour commented Jan 8, 2019

Happy to defer to other suggestions about the "ideal" moment to inject the context, and retest things.
From my testing, it solved the reported issue and didn't impact the zipkin tracer - the easiest one to test against using the example.

@mattklein123
Copy link
Copy Markdown
Member

I'm guessing you tested with an HTTP/1.1 upstream that had a delayed connection pool connection. It's for sure possible for the headers to be forwarded before the span is injected.

Can you describe in more detail what is happening? Does the sample state get injected into the header?

@objectiser
Copy link
Copy Markdown
Contributor

As mentioned by Matt in #5529 (comment), the zipkin tracer doesn't report health check spans due to the sampling being set to false on the active span.

Is it possible for the datadog tracer to also act upon this action?

@cgilmour
Copy link
Copy Markdown
Contributor Author

cgilmour commented Jan 8, 2019

@mattklein123:

I'm guessing you tested with an HTTP/1.1 upstream that had a delayed connection pool connection.

The test setup is essentially the same as https://github.com/envoyproxy/envoy/tree/master/examples/zipkin-tracing with the following changes:

  • datadog tracer instead of zipkin
  • healthcheck from front-envoy to service1
  • healthcheck filter in service1

Client requests come from either a browser or curl to http://.../trace/1, since that endpoint covers all of the relevant features.
Healthcheck requests occur on their own schedule.

It's for sure possible for the headers to be forwarded before the span is injected.

OK.

Can you describe in more detail what is happening?
Does the sample state get injected into the header?

Sure. The current envoy behaviour is

  • receive HTTP request
  • create active_span and inject active_span into request headers
  • change sampled state to false if it's a healthcheck

This means the value of sampled in active_span is now different from the value in the request headers.
For the datadog tracer, when the span is injected, it makes the sampled state immutable.

The PR tries to change the behaviour to

  • receive HTTP request
  • create active_span
  • change sampled state to false if it's a healthcheck
  • inject active_span into request headers

Other tracers don't have this problem because they don't make things immutable when the span is injected.
Right now, there's only one place that changes the sampled state: the healthcheck filter.
And healthchecks are handled internally, so the injected headers don't get sent anywhere.

If there were other parts in core or extensions that changed sampled state after injection and still sent the request onward, other tracers would be affected and need a fix.
The fix for it would be making sure the injection occurs after any changes to sampled state.

@objectiser:

Is it possible for the datadog tracer to also act upon this action?

See above. It could be done, but it'd be an envoy-specific hack and I'm trying to avoid that.

@mattklein123
Copy link
Copy Markdown
Member

When the trace driver gets the sampled call, can you just re-inject the headers? TBH I think this is going to be the simplest fix.

@cgilmour
Copy link
Copy Markdown
Contributor Author

cgilmour commented Jan 9, 2019

That'd be a bit tricky, because we don't (and probably shouldn't) retain a reference to the HeaderMap.

If there's a strong opinion that this isn't the right fix for the problem, then let's close it and I'll work on alternatives instead :-)

@mattklein123
Copy link
Copy Markdown
Member

This is definitely not the right fix so I would close and we can discuss in the tracking issue, or we can discuss here until we figure out the solution. Any change to when we do injection is going to be very complicated, so I would prefer to figure out some type of workaround for this case.

@cgilmour
Copy link
Copy Markdown
Contributor Author

cgilmour commented Jan 9, 2019

Closing as incorrect fix. Discussion to continue on #5504

@cgilmour cgilmour closed this Jan 9, 2019
@cgilmour cgilmour deleted the cgilmour/fix-healthcheck-sampled branch January 30, 2019 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

datadog tracer samples heathchecks

3 participants