Refactor tasks to improve APM support#87917
Conversation
Part of elastic#84369. Split out from elastic#87696. Rework how some work is executed by creating child tasks for them, so that when traced by APM, it results in more meaningful parent and child tasks in the UI. It also improves how Elasticsearch is modelling the work.
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
@elasticmachine run elasticsearch-ci/part-1 because the failure doesn't reproduce. |
henningandersen
left a comment
There was a problem hiding this comment.
This looks mostly good. I hope we can fix the constructor order. Also, I would like to see a bit of shallow testing that we actually utilize the task manager/generate child tasks during cluster state publishing and recovery, just to retain the functionality for the future.
server/src/main/java/org/elasticsearch/cluster/service/ClusterService.java
Outdated
Show resolved
Hide resolved
|
|
||
| static final String MASTER_UPDATE_THREAD_NAME = "masterService#updateTask"; | ||
|
|
||
| public static final String STATE_UPDATE_ACTION_NAME = "internal:cluster/coordination/update_state"; |
There was a problem hiding this comment.
This looks like a transport action name. In other places where we register a task, we pick a simpler name with no "internal:" prefix, for instance for enrich, it is just policy_execution. I'd prefer the same here to avoid the confusion. publish_cluster_state_update for instance.
There was a problem hiding this comment.
But there are also examples in the same form - for example, JoinHelper has 3 action names that all start with internal:cluster/coordination. The internal: prefix will also make it easier to filter out cluster management tasks when capturing traces, because we can filter in and filter out by span name, which here equates to task name.
There was a problem hiding this comment.
Those 3 are different, they are all 3 registered as transport actions and used on the wire. A common prefix for all artificial actions would be fine though, we just have not had it. We could start one now or leave this without one. I'd prefer not to reuse the space we have. I think filtering internal: is not going to be useful anyway, so many actions under that anyway.
There was a problem hiding this comment.
I agree that filtering on prefixes like internal: is probably not helpful, and indeed end-users probably don't know much about the structure of these action names. It would have been good to have all action names following the same structure as the ones for transport actions, but I think we've already crossed that bridge unfortunately. So yes a bare publish_cluster_state_update would be ok with me.
There was a problem hiding this comment.
I updated the task name, and had a go at adding a test in MasterServiceTests. Please let me know what you think, I hacked it together from other examples.
I looked into adding something similar for recoveries, but I'm a bit lost. I looked at PeerRecoverySourceServiceTests since it's already being changed to create a task to pass in, but the existing test is very narrow in scope and it's not obvious how to problem the task-executing behaviour without lots of new setup code or by opening up method visibility, neither of which is appealing. I'm hoping someone can point somewhere and say "oh, you just need something a bit like that."
...anyone?
There was a problem hiding this comment.
I had something like IndexRecoveryIT.testOngoingRecoveryAndMasterFailOver in mind, i.e., ensure a recovery, then block it through transport, then check that the task is registered correctly (not doing the master failover part of the test).
I'd be happy to hack something together.
Did not look at your test yet, might be sufficient.
|
@henningandersen CI is green, can you take another look? |
henningandersen
left a comment
There was a problem hiding this comment.
The Node construction part looks good to me. I left a couple more comments.
Also, I would like to see a bit of shallow testing that we actually utilize the task manager/generate child tasks during cluster state publishing and recovery, just to retain the functionality for the future.
I did not see this added, will you look into that please?
| @@ -204,6 +225,28 @@ public TransportService( | |||
| @Nullable ClusterSettings clusterSettings, | |||
| Set<String> taskHeaders, | |||
| ConnectionManager connectionManager | |||
There was a problem hiding this comment.
Ideally we'd get rid of both the two old constructors, but one of them is widely used in tests. This one however looks like it is only used in a couple of tests and I'd prefer to just use the one below, creating a task manager in the test.
Can we add a comment to the other constructor that accepts "task headers" that it is used in tests only?
server/src/main/java/org/elasticsearch/cluster/service/ClusterService.java
Outdated
Show resolved
Hide resolved
|
|
||
| static final String MASTER_UPDATE_THREAD_NAME = "masterService#updateTask"; | ||
|
|
||
| public static final String STATE_UPDATE_ACTION_NAME = "internal:cluster/coordination/update_state"; |
There was a problem hiding this comment.
Those 3 are different, they are all 3 registered as transport actions and used on the wire. A common prefix for all artificial actions would be fine though, we just have not had it. We could start one now or leave this without one. I'd prefer not to reuse the space we have. I think filtering internal: is not going to be useful anyway, so many actions under that anyway.
|
This class can demonstrate it works for recovery too: Click to show class |
|
@henningandersen brilliant! Thank you very much for the test 🙏 Can I ask for a final approval now? |
henningandersen
left a comment
There was a problem hiding this comment.
LGTM, thanks for the extra iterations.
Part of #84369. Split out from #87696. Rework how some work is executed
by creating child tasks for them, so that when traced by APM, it results
in more meaningful parent and child tasks in the UI. It also improves
how Elasticsearch is modelling the work.