[ML] Introduce a "starting" datafeed state for lazy jobs#53918
[ML] Introduce a "starting" datafeed state for lazy jobs#53918droberts195 merged 5 commits intoelastic:masterfrom
Conversation
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Relates elastic#53763
|
Pinging @elastic/ml-core (:ml) |
| * `stopping`: The {dfeed} has been requested to stop gracefully and is | ||
| completing its final action. |
There was a problem hiding this comment.
The stopping state is not introduced by this PR - it has existed for a couple of years. But I thought I'd add it while I was modifying the list.
(This PR exposes the starting state externally for the first time, so it's understandable it wasn't previously documented.)
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Looks good. Just left a question.
|
|
||
| // This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation | ||
| if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) { | ||
| opened = true; |
There was a problem hiding this comment.
Do we want to return "opened": true even though the job is not in the opened state yet?
There was a problem hiding this comment.
This is what I changed my mind on - look at the difference between the two commits on this PR to see.
If we return "opened" : false" for lazy jobs then the UI as it stands won't start the datafeed - see https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L84-L97 and https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L57-L59
You could say that the UI should be changed to accept "opened" : false as an acceptable response to start the datafeed if the job is a lazy job, but that's not great becasue:
- Currently that part of the UI source code does not have access to the job configs
- There are two ways of specifying laziness (the global lazy nodes setting and the per-job laziness setting), and even if the UI were changed to check the per-job setting it couldn't make a sensible decision on the global lazy nodes setting
So, the response from the open job endpoint has to be sufficient for the UI to answer the question "is it appropriate to start the datafeed?"
I think there are two options:
- Leave the response with a single
openedfield that basically answers that question and change the docs to say that - Revert the second commit on this PR and instead add a second field,
"awaiting_lazy_assignment": true, to all three responses (i.e. open anomaly detection job, start datafeed, start data frame analytics job) and change the UI to||the two booleans for the result of itsopenJob()function
Interestingly start data frame analytics currently does return true for lazily assigned jobs - see
Which do you prefer?
There was a problem hiding this comment.
May I suggest a 3rd option?
We can stick to interpreting "opened" and "started" as "it is now opened|started".
The UI could simply check the status code for 200 to interpret "the request to open|start was successful" as it doesn't care about whether it is already "opened|started" by the time it receives the response.
There was a problem hiding this comment.
Yes, sure, that is another option. I think that would be a trivial change in the UI code because already >= 400 status codes go through a different control path so it would basically be removing a single if.
I noticed that docs currently don't mention the possibility that opened could be false; it's currently documented that opened is always true, like an acknowledged response but with the field name changed. So doing it this way would mean updating the docs (which is not a problem if we go this way).
The final thing to consider is that starting a data frame analytics job does currently return an acknowledged response with a value hardcoded to true. So if it's important for the response to say whether the job was lazily opened then maybe in a followup PR we should change data frame analytics to return started with a true or false value depending on whether it was lazily assigned for consistency?
None of these options are particularly difficult to implement so I will wait until Monday morning once others have had a chance to comment before making any more code changes to avoid unnecessary churn.
There was a problem hiding this comment.
Cool. Note that I don't feel strongly about whether opened is true when the job started lazily. I just wonder if it could be misleading. If we conclude that we're not concerned with this, then the simplest thing is to do what the PR is doing already.
There was a problem hiding this comment.
Parity with analytics and anomaly is useful, as is treatment for the UI.
So opened: true plus a docs change seems acceptable.
Is there a possibility to add a second response status:opening?
We document that opened: true|false means that the instruction to open the job has been accepted (I suspect we would use acknoweldged:true if we were designing today) . And we can use the status to explain if this is happening immediately or slowly. Just a thought.
There was a problem hiding this comment.
I suspect we would use
acknoweldged:trueif we were designing today
Yes, I think the least BWC-breaking way we can move towards that is to always have opened: true and that is what is documented (the documentation currently says the value is always true).
Is there a possibility to add a second response
status:opening?
This is quite similar to the awaiting_lazy_assignment: true of my second option (although in my second option I was proposing to switch back to opened:false as well, which I think we are now agreeing is not the way to go).
I prefer awaiting_lazy_assignment: true over status:opening because if I was an end user and saw:
{
"opened": true,
"status": "opening"
}
then I would probably have to read the docs to work out how to interpret that. Also:
{
"opened": true,
"status": "opened"
}
is what most users would see, and that looks like it contains redundancy.
Whereas:
{
"opened": true,
"awaiting_lazy_assignment": true
}
and:
{
"opened": true,
"awaiting_lazy_assignment": false
}
seem easier to interpret without reading the docs.
So unless anyone strongly objects I will eventually implement that. But I'd rather get this PR merged ASAP since it fixes a nasty bug and then add awaiting_lazy_assignment across both job types and datafeeds in a different PR. It will touch quite a few files (since it means HLRC and docs changes for 3 endpoints) and will obscure the main fix in this PR.
There was a problem hiding this comment.
Agreed, let's merge PR to fix the bug.
And let's treat the second field as a separate discussion for a future release.
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Backport of elastic#53918
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Backport of #53918
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true. Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.
This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started. The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.
Fixes #53763