Skip to content

Introduce tracing interfaces#87921

Merged
elasticsearchmachine merged 24 commits intoelastic:mainfrom
pugnascotia:apm-integration-part-2
Jul 25, 2022
Merged

Introduce tracing interfaces#87921
elasticsearchmachine merged 24 commits intoelastic:mainfrom
pugnascotia:apm-integration-part-2

Conversation

@pugnascotia
Copy link
Copy Markdown
Contributor

Part of #84369. Split out from #87696. Introduce tracing interfaces in advance of adding APM support to Elasticsearch. The only implementation at this point is a no-op class.

@pugnascotia pugnascotia requested a review from rjernst June 22, 2022 14:36
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jun 22, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@pugnascotia pugnascotia marked this pull request as draft June 22, 2022 15:36
@pugnascotia
Copy link
Copy Markdown
Contributor Author

@rjernst this is actually a draft since it builds on #87917.

@pugnascotia pugnascotia force-pushed the apm-integration-part-2 branch from c913b8b to 8f3f5f1 Compare July 7, 2022 12:43
@pugnascotia pugnascotia marked this pull request as ready for review July 8, 2022 14:52
@pugnascotia
Copy link
Copy Markdown
Contributor Author

@rjernst I think this is ready for review. We can continue the discussion from the PR #87696 here.

@pugnascotia
Copy link
Copy Markdown
Contributor Author

@original-brownbear I'd appreciate your thoughts on this PR, it's a precursor to #88443, which I asked you about yesterday.

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a quick read and added a few questions that would help with giving better feedback here.

* A channel used to construct bytes / builder based outputs, and send responses.
*/
public interface RestChannel {
public interface RestChannel extends Traceable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the need for making this interface more complex.
Can't we simply compute these things inline where we currently pass the channel to the tracer? What's the value of making this interface more complex and associating more state (looking at the new empty methods added here) with the channel potentially?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to simplify this by moving the default implementations to AbstractRestChannel, which is a better home for them anyway. Does that help?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not better by much I'm afraid. My main concern is simply with the fact that the RestChannel as an abstraction is already kind of broken and on the list of things to refactor away.

With this change we get a lot less flexible in fixing the abstractions here. If you just inline the logic from here to where we call the tracer on the other hand, the change has zero negative impact on the flexibility we have in refactoring the REST layer.
Looking at this from a different perspective: There really isn't a clear definition of the lifecycle of a RestChannel today so either this doesn't need to live on the rest channel and is just needless complexity or it introduces new subtle requirements on its lifecycle. Just inlining the code and making it completely independent of the channel instances avoids trouble either way.

return attributes;
}

void setTracePath(String path);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this going to be used for? Unless we have to, we shouldn't start associating more state with the implementations of this interface. They are already hard to maintain and on the list of things to refactor, due to the difficulty behind the releasing of the x-content builders that come out of them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When tracing REST requests, APM typically names spans and transactions using a concatenation of the HTTP method and request path. However, it's better for the name to be low cardinality, so if you're using path variables, you want to use paths that still have the /{var} placeholders. This trace path stuff is all about getting access to the path without variables.

/**
* Something which maps onto a <i>span</i> in a distributed trace.
*/
public interface Traceable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the motivation or need for this interface.
It appears a Traceable is just a tuple of two strings and a map and each call-site of a Tracer will only ever receive a single type anyway.
Why can't we just pass two strings and a map instead and put the logic for stuff like the fact that the span id is the task id as a string to the call sites? That way we avoid the added complexity of blowing up various of interfaces and having to implement rather redundant things like the overrides in the delegate rest response channel.
Unless we have call sites where we would e.g. pass either a channel or a task, this seems like the wrong level of abstraction to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The abstraction was introduced in the original SpaceTime project. TBH I never thought about changing it.

We have this abstraction because we won't stop with tracing tasks. For example, in pugnascotia@06d6105 (and bear in mind that is just a quickly hacked-together example, hence it does some questionable things), we're tracing FetchSubPhaseProcessor instances, and there's quite a few implementations of that interface. It's not difficult to imagine that each span might want to add different attributes to the span, and it would be better for that logic to live on the individual implementations that rather in the FetchPhase class.

It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.

Having said all that, I would be particularly interested for us to add tracing to e.g. repository API requests (S3), so I would be very interested to know whether you think this API could work for that, or if we need to make some adjustments. Would you be able to do that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.

But it's not really "one place". Now we're forcing new implicit life-cycle requirements on things like the rest-channel (which to me seems the worst one from this perspective) and making our whole codebase less flexible because any change to functionality that implements this interface now has to account for the tracing logic.

If you build the three objects here in-line where they're actually used you avoid complicating the codebase like that and we retain all flexibility. For the specific case of the fetch phase we could still add an abstraction to cover different implementations maybe but for all the other spots forcing this interface on top of random existing classes will increase the cost of future changes needlessly as far as I can tell.
If you think it's easiest to piggy back onto existing classes at this point, a better solution to this may be to just have RestChannel or Task return a Traceable via a getter without introducing a common abstraction across these very different classes. Then if we want to change or remove any of these we can just pull the Traceable out of them and pass it around some other way pretty easily? (also doing this Traceable could just become a record and doesn't even need to be an interface maybe because we don't have any state associated with an instance anyway as its getters only?)


private BytesStream bytesOut;

private String tracePath;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the trace path?


void onTraceException(Traceable traceable, Throwable throwable);

void setAttribute(Traceable traceable, String key, boolean value);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some javadoc on what these are supposed to do in a real implementation here? Would this be recorded to somewhere like a log line or would we create state associated with a specific Traceable here that we'd have to take care of at some point?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more class-level Javadoc to this class, which I hope covers this.

Instead of messing around with supporting tracing methods directly on
a `RestChannel`, construct a `RestController` with a `Tracer` and trace
requests the usual way. As a result, strip out added code from
`RestChannel` and `HttpTracer`.

Also add Javadoc to `Tracer`.
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the responses Rory, I think I understand the ideas here better now. I tried to come up with an alternative suggestion here that retains the current level of flexibility in the code. Let me know what you think :)

* A channel used to construct bytes / builder based outputs, and send responses.
*/
public interface RestChannel {
public interface RestChannel extends Traceable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not better by much I'm afraid. My main concern is simply with the fact that the RestChannel as an abstraction is already kind of broken and on the list of things to refactor away.

With this change we get a lot less flexible in fixing the abstractions here. If you just inline the logic from here to where we call the tracer on the other hand, the change has zero negative impact on the flexibility we have in refactoring the REST layer.
Looking at this from a different perspective: There really isn't a clear definition of the lifecycle of a RestChannel today so either this doesn't need to live on the rest channel and is just needless complexity or it introduces new subtle requirements on its lifecycle. Just inlining the code and making it completely independent of the channel instances avoids trouble either way.

/**
* Something which maps onto a <i>span</i> in a distributed trace.
*/
public interface Traceable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.

But it's not really "one place". Now we're forcing new implicit life-cycle requirements on things like the rest-channel (which to me seems the worst one from this perspective) and making our whole codebase less flexible because any change to functionality that implements this interface now has to account for the tracing logic.

If you build the three objects here in-line where they're actually used you avoid complicating the codebase like that and we retain all flexibility. For the specific case of the fetch phase we could still add an abstraction to cover different implementations maybe but for all the other spots forcing this interface on top of random existing classes will increase the cost of future changes needlessly as far as I can tell.
If you think it's easiest to piggy back onto existing classes at this point, a better solution to this may be to just have RestChannel or Task return a Traceable via a getter without introducing a common abstraction across these very different classes. Then if we want to change or remove any of these we can just pull the Traceable out of them and pass it around some other way pretty easily? (also doing this Traceable could just become a record and doesn't even need to be an interface maybe because we don't have any state associated with an instance anyway as its getters only?)

@pugnascotia
Copy link
Copy Markdown
Contributor Author

@original-brownbear I experimented with dropping Traceable entirely in f0f0d77, please take a look and see what you think.

@pugnascotia
Copy link
Copy Markdown
Contributor Author

Argh, I fumbled the commit, please see 00d4b50 instead.

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rory this is pretty much exactly what I had in mind! The approach now LGTM :)

String contentLength = null;
final Runnable onFinish = () -> {
Releasables.close(toClose);
tracer.onTraceStopped(traceId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just add () -> tracer.onTraceStopped(traceId) to toClose?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea, it's so obvious now you've pointed it out 🤷‍♂️

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on code details and one on docs, looks good otherwise!

name = restPath;
}

Map<String, Object> attributes = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pre-size this correctly e.g. using Maps. utilities? This isn't on the super hot path but sill it's on a network thread :)

* and the exception is captured in the span.
*/
public void testDispatchUnsupportedHttpMethodTracesException() {
final RestRequest request = RestRequest.request(parserConfig(), new HttpRequest() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we override FakeHttpRequest here? Seems we only want the method() override right?

}
},
null,
new ActionListener<>() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want this shorter, you can use ActionTestUtils.assertNoFailure(r -> {})


private void startTrace(ThreadContext threadContext, Task task) {
TaskId parentTask = task.getParentTaskId();
Map<String, Object> attributes = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could pre-size the map correctly here or just use Map.of( (with a slight bit of repetition for the task id key but whatever) to save some memory and make this compile nicer since its on the hot path.


final String traceId = "rest-" + this.request.getRequestId();

final ArrayList<Releasable> toClose = new ArrayList<>(3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this a 4 now :)

startTrace(threadContext, channel, null);
}

private void startTrace(ThreadContext threadContext, RestChannel channel, String restPath) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can pass the RestRequest here, we don't need the channel I think (and should have the request available in a variable at most callsites already anyway.
Mainly asking for this since I noticed we already have some strange redundancy in caller methods where we pass both the channel and the request (which we also have on the channel) which is confusing.

* methods, e.g. {@link #setAttribute(String, String, String)}. This allows you
* to attach data that is not available when the span is opened.
*/
public interface Tracer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a comment here that the methods must have performance characteristics similar to logging since they're all used on hot transport threads.
If we implement this interface to send out network requests on any method we need to make sure that it's all non-blocking and with appropriate queuing to avoid slowing down transport threads.

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if CI is happy, though Ryan should probably have a look as well to make sure he's good with the overall approach also.
Thanks for the extra iterations here!

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:06
Copy link
Copy Markdown
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting this out from the larger PR. I left a few thoughts.

}

final HttpTracer trace = tracer.maybeTraceRequest(restRequest, exception);
final HttpTracer maybeHttpTracer = httpTracer.maybeTraceRequest(restRequest, exception);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's confusing to any reader why there is both Tracer and HttpTracer in this class. Perhaps at least a comment somewhere explaining the difference? Is this something that can eventually be subsumed by the new tracer? I see it also bleeds into other classes, maybe also add the explanation there, or reference to where the explanation exists.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it is confusing, and for a while I have HttpTracer implementing Tracer, but I ended up taking it as part of simplifying some stuff with Armin. The fundamental problem is that HttpTracer is actually just a logger, but we reference it in our docs so renaming the class isn't 100% trivial. However I've renamed many of the uses / instances of HttpTracer so make it clearer that it's about logging.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rename of the member, it does help in the reading. However it is still confusing since the class is still called HttpTracer.

we reference it in our docs so renaming the class isn't 100% trivial.

I think this could be handled through keeping the HttpTracer class, but with no impl, so that whatever the class is renamed to can still have LogManager.getLogger(HttpTracer.class). We do need a way deprecate such logging; it is an existing problem that our class and package names leak into logger configuration since it restricts us from easily changing, or we accidentally make a configuration breaking change without realizing it.

The rename to HttpLogger I suggest above isn't crucial for this PR, but I do think it should be done as a followup.

/**
* Called with a Tracer so that each plugin has a chance to work with it.
*/
public void onTracer(Tracer tracer) {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"onXXX" doesn't really fit with other plugin methods (the one exception is onIndexModule, but that is a holdover from the older style push based model, which index specific plugin extensions still use). The purpose of this seems to be to give plugins this global Tracer object, so that they can stash it to use later.

If the tracer object needs to be passed around the entire system, then it doesn't seem any better than just having actual global state for this purpose. Otherwise every time we realize something else needs access to the tracer, we have to plumb it through, which it seems like could eventually be the entire system? If I've misread the purpose of this object, then it would seem better as an argument to createComponents, where other "common" systems are passed to plugins so they can be utilized in plugin service construction.

}

private Tracer getTracer(Collection<Object> pluginComponents) {
final List<Tracer> tracers = pluginComponents.stream().map(c -> c instanceof Tracer t ? t : null).filter(Objects::nonNull).toList();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to java modules, we have a newer way of achieving this type of extensibility that we only intend a single implementation. See RestInterceptorActionPlugin. Basically you would have a specific interface for creating the tracer, and that package would only be exported to the module implementing it. Since this implementation is coming in a followup PR, it could be in a new unqualified export in this PR, then make it qualified in the followup.

The class `HttpTracer` is really a logging class, which is unfortunate
because we're introducing the `Tracer` interface for actual distributed
tracing.

Change how the class is used to make it clearer that its purpose is
logging. Note that we can't rename the class without also affecting how
the HTTP logging is configured and documented, so that is left for
another time.
@pugnascotia
Copy link
Copy Markdown
Contributor Author

@rjernst I'm not sure if this is exactly what you intended, but: I introduced a TracerPlugin interface, which is a marker for plugins that can create a Tracer. This allows me to create a tracer and pass it into createComponents(), instead of waiting until createComponents() is called on all the plugins and then pulling it out of the components.

The TracerPlugin class could later have its visibility restricted. Does that work?

@pugnascotia pugnascotia self-assigned this Jul 25, 2022
@rjernst
Copy link
Copy Markdown
Member

rjernst commented Jul 25, 2022

The TracerPlugin class could later have its visibility restricted. Does that work?

Yes, the TracerPlugin + createComponents changes are what I was thinking there.

Copy link
Copy Markdown
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming you agree with "startTrace" and "stopTrace"

@pugnascotia pugnascotia added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 25, 2022
@pugnascotia
Copy link
Copy Markdown
Contributor Author

LGTM assuming you agree with "startTrace" and "stopTrace"

Oh man, I meant to rename them that way before. 🤦‍♂️

@elasticsearchmachine elasticsearchmachine merged commit 5c5981d into elastic:main Jul 25, 2022
@pugnascotia pugnascotia deleted the apm-integration-part-2 branch July 25, 2022 20:02
pugnascotia added a commit to pugnascotia/elasticsearch that referenced this pull request Jul 28, 2022
Split out from elastic#88443. Part of elastic#84369. Use the tracing API that was
added in elastic#87921 in TaskManager. This won't actually do anything until we
provide a tracer with an actual implemenation.
pugnascotia added a commit that referenced this pull request Jul 28, 2022
Split out from #88443. Part of #84369. Use the tracing API that was
added in #87921 in TaskManager. This won't actually do anything until we
provide a tracer with an actual implemenation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Core/Infra/Core Core issues without another label >feature >non-issue Team:Core/Infra Meta label for core/infra team v8.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants