Introduce tracing interfaces#87921
Conversation
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
c913b8b to
8f3f5f1
Compare
|
@original-brownbear I'd appreciate your thoughts on this PR, it's a precursor to #88443, which I asked you about yesterday. |
original-brownbear
left a comment
There was a problem hiding this comment.
I gave this a quick read and added a few questions that would help with giving better feedback here.
server/src/main/java/org/elasticsearch/http/DefaultRestChannel.java
Outdated
Show resolved
Hide resolved
| * A channel used to construct bytes / builder based outputs, and send responses. | ||
| */ | ||
| public interface RestChannel { | ||
| public interface RestChannel extends Traceable { |
There was a problem hiding this comment.
I'm not sure I understand the need for making this interface more complex.
Can't we simply compute these things inline where we currently pass the channel to the tracer? What's the value of making this interface more complex and associating more state (looking at the new empty methods added here) with the channel potentially?
There was a problem hiding this comment.
I've tried to simplify this by moving the default implementations to AbstractRestChannel, which is a better home for them anyway. Does that help?
There was a problem hiding this comment.
Not better by much I'm afraid. My main concern is simply with the fact that the RestChannel as an abstraction is already kind of broken and on the list of things to refactor away.
With this change we get a lot less flexible in fixing the abstractions here. If you just inline the logic from here to where we call the tracer on the other hand, the change has zero negative impact on the flexibility we have in refactoring the REST layer.
Looking at this from a different perspective: There really isn't a clear definition of the lifecycle of a RestChannel today so either this doesn't need to live on the rest channel and is just needless complexity or it introduces new subtle requirements on its lifecycle. Just inlining the code and making it completely independent of the channel instances avoids trouble either way.
| return attributes; | ||
| } | ||
|
|
||
| void setTracePath(String path); |
There was a problem hiding this comment.
What is this going to be used for? Unless we have to, we shouldn't start associating more state with the implementations of this interface. They are already hard to maintain and on the list of things to refactor, due to the difficulty behind the releasing of the x-content builders that come out of them.
There was a problem hiding this comment.
When tracing REST requests, APM typically names spans and transactions using a concatenation of the HTTP method and request path. However, it's better for the name to be low cardinality, so if you're using path variables, you want to use paths that still have the /{var} placeholders. This trace path stuff is all about getting access to the path without variables.
| /** | ||
| * Something which maps onto a <i>span</i> in a distributed trace. | ||
| */ | ||
| public interface Traceable { |
There was a problem hiding this comment.
I'm not sure I understand the motivation or need for this interface.
It appears a Traceable is just a tuple of two strings and a map and each call-site of a Tracer will only ever receive a single type anyway.
Why can't we just pass two strings and a map instead and put the logic for stuff like the fact that the span id is the task id as a string to the call sites? That way we avoid the added complexity of blowing up various of interfaces and having to implement rather redundant things like the overrides in the delegate rest response channel.
Unless we have call sites where we would e.g. pass either a channel or a task, this seems like the wrong level of abstraction to me.
There was a problem hiding this comment.
The abstraction was introduced in the original SpaceTime project. TBH I never thought about changing it.
We have this abstraction because we won't stop with tracing tasks. For example, in pugnascotia@06d6105 (and bear in mind that is just a quickly hacked-together example, hence it does some questionable things), we're tracing FetchSubPhaseProcessor instances, and there's quite a few implementations of that interface. It's not difficult to imagine that each span might want to add different attributes to the span, and it would be better for that logic to live on the individual implementations that rather in the FetchPhase class.
It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.
Having said all that, I would be particularly interested for us to add tracing to e.g. repository API requests (S3), so I would be very interested to know whether you think this API could work for that, or if we need to make some adjustments. Would you be able to do that?
There was a problem hiding this comment.
It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.
But it's not really "one place". Now we're forcing new implicit life-cycle requirements on things like the rest-channel (which to me seems the worst one from this perspective) and making our whole codebase less flexible because any change to functionality that implements this interface now has to account for the tracing logic.
If you build the three objects here in-line where they're actually used you avoid complicating the codebase like that and we retain all flexibility. For the specific case of the fetch phase we could still add an abstraction to cover different implementations maybe but for all the other spots forcing this interface on top of random existing classes will increase the cost of future changes needlessly as far as I can tell.
If you think it's easiest to piggy back onto existing classes at this point, a better solution to this may be to just have RestChannel or Task return a Traceable via a getter without introducing a common abstraction across these very different classes. Then if we want to change or remove any of these we can just pull the Traceable out of them and pass it around some other way pretty easily? (also doing this Traceable could just become a record and doesn't even need to be an interface maybe because we don't have any state associated with an instance anyway as its getters only?)
|
|
||
| private BytesStream bytesOut; | ||
|
|
||
| private String tracePath; |
There was a problem hiding this comment.
What is the trace path?
|
|
||
| void onTraceException(Traceable traceable, Throwable throwable); | ||
|
|
||
| void setAttribute(Traceable traceable, String key, boolean value); |
There was a problem hiding this comment.
Can you add some javadoc on what these are supposed to do in a real implementation here? Would this be recorded to somewhere like a log line or would we create state associated with a specific Traceable here that we'd have to take care of at some point?
There was a problem hiding this comment.
I've added more class-level Javadoc to this class, which I hope covers this.
Instead of messing around with supporting tracing methods directly on a `RestChannel`, construct a `RestController` with a `Tracer` and trace requests the usual way. As a result, strip out added code from `RestChannel` and `HttpTracer`. Also add Javadoc to `Tracer`.
original-brownbear
left a comment
There was a problem hiding this comment.
Thanks for the responses Rory, I think I understand the ideas here better now. I tried to come up with an alternative suggestion here that retains the current level of flexibility in the code. Let me know what you think :)
| * A channel used to construct bytes / builder based outputs, and send responses. | ||
| */ | ||
| public interface RestChannel { | ||
| public interface RestChannel extends Traceable { |
There was a problem hiding this comment.
Not better by much I'm afraid. My main concern is simply with the fact that the RestChannel as an abstraction is already kind of broken and on the list of things to refactor away.
With this change we get a lot less flexible in fixing the abstractions here. If you just inline the logic from here to where we call the tracer on the other hand, the change has zero negative impact on the flexibility we have in refactoring the REST layer.
Looking at this from a different perspective: There really isn't a clear definition of the lifecycle of a RestChannel today so either this doesn't need to live on the rest channel and is just needless complexity or it introduces new subtle requirements on its lifecycle. Just inlining the code and making it completely independent of the channel instances avoids trouble either way.
| /** | ||
| * Something which maps onto a <i>span</i> in a distributed trace. | ||
| */ | ||
| public interface Traceable { |
There was a problem hiding this comment.
It's also important the the logic for the span ID is the same when opened and closing spans, so having it in once place via the interface ensures this.
But it's not really "one place". Now we're forcing new implicit life-cycle requirements on things like the rest-channel (which to me seems the worst one from this perspective) and making our whole codebase less flexible because any change to functionality that implements this interface now has to account for the tracing logic.
If you build the three objects here in-line where they're actually used you avoid complicating the codebase like that and we retain all flexibility. For the specific case of the fetch phase we could still add an abstraction to cover different implementations maybe but for all the other spots forcing this interface on top of random existing classes will increase the cost of future changes needlessly as far as I can tell.
If you think it's easiest to piggy back onto existing classes at this point, a better solution to this may be to just have RestChannel or Task return a Traceable via a getter without introducing a common abstraction across these very different classes. Then if we want to change or remove any of these we can just pull the Traceable out of them and pass it around some other way pretty easily? (also doing this Traceable could just become a record and doesn't even need to be an interface maybe because we don't have any state associated with an instance anyway as its getters only?)
|
@original-brownbear I experimented with dropping |
|
Argh, I fumbled the commit, please see 00d4b50 instead. |
original-brownbear
left a comment
There was a problem hiding this comment.
Thanks Rory this is pretty much exactly what I had in mind! The approach now LGTM :)
| String contentLength = null; | ||
| final Runnable onFinish = () -> { | ||
| Releasables.close(toClose); | ||
| tracer.onTraceStopped(traceId); |
There was a problem hiding this comment.
why not just add () -> tracer.onTraceStopped(traceId) to toClose?
There was a problem hiding this comment.
I have no idea, it's so obvious now you've pointed it out 🤷♂️
original-brownbear
left a comment
There was a problem hiding this comment.
Some comments on code details and one on docs, looks good otherwise!
| name = restPath; | ||
| } | ||
|
|
||
| Map<String, Object> attributes = new HashMap<>(); |
There was a problem hiding this comment.
Maybe pre-size this correctly e.g. using Maps. utilities? This isn't on the super hot path but sill it's on a network thread :)
| * and the exception is captured in the span. | ||
| */ | ||
| public void testDispatchUnsupportedHttpMethodTracesException() { | ||
| final RestRequest request = RestRequest.request(parserConfig(), new HttpRequest() { |
There was a problem hiding this comment.
Can't we override FakeHttpRequest here? Seems we only want the method() override right?
| } | ||
| }, | ||
| null, | ||
| new ActionListener<>() { |
There was a problem hiding this comment.
If you want this shorter, you can use ActionTestUtils.assertNoFailure(r -> {})
|
|
||
| private void startTrace(ThreadContext threadContext, Task task) { | ||
| TaskId parentTask = task.getParentTaskId(); | ||
| Map<String, Object> attributes = new HashMap<>(); |
There was a problem hiding this comment.
We could pre-size the map correctly here or just use Map.of( (with a slight bit of repetition for the task id key but whatever) to save some memory and make this compile nicer since its on the hot path.
|
|
||
| final String traceId = "rest-" + this.request.getRequestId(); | ||
|
|
||
| final ArrayList<Releasable> toClose = new ArrayList<>(3); |
There was a problem hiding this comment.
can make this a 4 now :)
| startTrace(threadContext, channel, null); | ||
| } | ||
|
|
||
| private void startTrace(ThreadContext threadContext, RestChannel channel, String restPath) { |
There was a problem hiding this comment.
We can pass the RestRequest here, we don't need the channel I think (and should have the request available in a variable at most callsites already anyway.
Mainly asking for this since I noticed we already have some strange redundancy in caller methods where we pass both the channel and the request (which we also have on the channel) which is confusing.
| * methods, e.g. {@link #setAttribute(String, String, String)}. This allows you | ||
| * to attach data that is not available when the span is opened. | ||
| */ | ||
| public interface Tracer { |
There was a problem hiding this comment.
I think we should add a comment here that the methods must have performance characteristics similar to logging since they're all used on hot transport threads.
If we implement this interface to send out network requests on any method we need to make sure that it's all non-blocking and with appropriate queuing to avoid slowing down transport threads.
original-brownbear
left a comment
There was a problem hiding this comment.
LGTM if CI is happy, though Ryan should probably have a look as well to make sure he's good with the overall approach also.
Thanks for the extra iterations here!
rjernst
left a comment
There was a problem hiding this comment.
Thanks for splitting this out from the larger PR. I left a few thoughts.
| } | ||
|
|
||
| final HttpTracer trace = tracer.maybeTraceRequest(restRequest, exception); | ||
| final HttpTracer maybeHttpTracer = httpTracer.maybeTraceRequest(restRequest, exception); |
There was a problem hiding this comment.
It's confusing to any reader why there is both Tracer and HttpTracer in this class. Perhaps at least a comment somewhere explaining the difference? Is this something that can eventually be subsumed by the new tracer? I see it also bleeds into other classes, maybe also add the explanation there, or reference to where the explanation exists.
There was a problem hiding this comment.
I agree, it is confusing, and for a while I have HttpTracer implementing Tracer, but I ended up taking it as part of simplifying some stuff with Armin. The fundamental problem is that HttpTracer is actually just a logger, but we reference it in our docs so renaming the class isn't 100% trivial. However I've renamed many of the uses / instances of HttpTracer so make it clearer that it's about logging.
There was a problem hiding this comment.
Thanks for the rename of the member, it does help in the reading. However it is still confusing since the class is still called HttpTracer.
we reference it in our docs so renaming the class isn't 100% trivial.
I think this could be handled through keeping the HttpTracer class, but with no impl, so that whatever the class is renamed to can still have LogManager.getLogger(HttpTracer.class). We do need a way deprecate such logging; it is an existing problem that our class and package names leak into logger configuration since it restricts us from easily changing, or we accidentally make a configuration breaking change without realizing it.
The rename to HttpLogger I suggest above isn't crucial for this PR, but I do think it should be done as a followup.
| /** | ||
| * Called with a Tracer so that each plugin has a chance to work with it. | ||
| */ | ||
| public void onTracer(Tracer tracer) {} |
There was a problem hiding this comment.
"onXXX" doesn't really fit with other plugin methods (the one exception is onIndexModule, but that is a holdover from the older style push based model, which index specific plugin extensions still use). The purpose of this seems to be to give plugins this global Tracer object, so that they can stash it to use later.
If the tracer object needs to be passed around the entire system, then it doesn't seem any better than just having actual global state for this purpose. Otherwise every time we realize something else needs access to the tracer, we have to plumb it through, which it seems like could eventually be the entire system? If I've misread the purpose of this object, then it would seem better as an argument to createComponents, where other "common" systems are passed to plugins so they can be utilized in plugin service construction.
| } | ||
|
|
||
| private Tracer getTracer(Collection<Object> pluginComponents) { | ||
| final List<Tracer> tracers = pluginComponents.stream().map(c -> c instanceof Tracer t ? t : null).filter(Objects::nonNull).toList(); |
There was a problem hiding this comment.
Thanks to java modules, we have a newer way of achieving this type of extensibility that we only intend a single implementation. See RestInterceptorActionPlugin. Basically you would have a specific interface for creating the tracer, and that package would only be exported to the module implementing it. Since this implementation is coming in a followup PR, it could be in a new unqualified export in this PR, then make it qualified in the followup.
The class `HttpTracer` is really a logging class, which is unfortunate because we're introducing the `Tracer` interface for actual distributed tracing. Change how the class is used to make it clearer that its purpose is logging. Note that we can't rename the class without also affecting how the HTTP logging is configured and documented, so that is left for another time.
|
@rjernst I'm not sure if this is exactly what you intended, but: I introduced a The |
Yes, the TracerPlugin + createComponents changes are what I was thinking there. |
rjernst
left a comment
There was a problem hiding this comment.
LGTM assuming you agree with "startTrace" and "stopTrace"
Oh man, I meant to rename them that way before. 🤦♂️ |
Split out from elastic#88443. Part of elastic#84369. Use the tracing API that was added in elastic#87921 in TaskManager. This won't actually do anything until we provide a tracer with an actual implemenation.
Part of #84369. Split out from #87696. Introduce tracing interfaces in advance of adding APM support to Elasticsearch. The only implementation at this point is a no-op class.