[ML] Store failure reason in ML job task status #34431
Description
At present, if an ML job fails, its persistent task is not deleted but remains in the cluster state with a status of failed. However, to find out why it failed you currently need to look in the log file of the node that was running the persistent task at the time of the failure. In support cases this almost invariably involves multiple back-and-forth cycles to obtain the right logs.
It would be relatively easy to store the getMessage() of the exception that caused the job to fail in the job task status for ML job persistent tasks. This would be helpful in the case where the logs for the node the job was running on are not initially available. In some cases it may prevent the need to ask for those logs at all.
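As a rough illustration of the idea, here is a minimal sketch of a task status that carries the failure message. The class `JobTaskStatusSketch`, its `State` enum, and the `reason` field are hypothetical names chosen for this example; they are not the actual Elasticsearch classes. The sketch also falls back to the exception class name, because `getMessage()` can return null.

```java
import java.util.Objects;

// Hypothetical stand-in for the ML job task status; NOT the real Elasticsearch class.
final class JobTaskStatusSketch {
    enum State { OPENED, FAILED }

    private final State state;
    private final String reason; // assumed field name; null unless the job failed

    private JobTaskStatusSketch(State state, String reason) {
        this.state = Objects.requireNonNull(state);
        this.reason = reason;
    }

    static JobTaskStatusSketch opened() {
        return new JobTaskStatusSketch(State.OPENED, null);
    }

    // Record the failure cause at the moment the job is marked as failed.
    static JobTaskStatusSketch failed(Exception cause) {
        String message = cause.getMessage();
        if (message == null) {
            // Throwable.getMessage() may be null, so keep at least the exception type.
            message = cause.getClass().getName();
        }
        return new JobTaskStatusSketch(State.FAILED, message);
    }

    State getState() { return state; }

    String getReason() { return reason; }
}
```

With something along these lines, the reason would travel with the task status in the cluster state, so it could be read without access to the node's log file.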
The only problem with making this change is backwards compatibility (BWC) of the job task status. If the status is strictly parsed in 6.0 or above, then 6.7 first needs to be changed to leniently ignore unknown fields in the job task status, and the new exception field can only be added in 7.0.
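The difference between strict and lenient parsing can be sketched as follows. This is a simplified model that treats the serialized status as a map of field names to values; `TaskStatusParserSketch` and the field names `state`, `allocation_id`, and `reason` are assumptions made for illustration, and the code does not use Elasticsearch's own parser classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of how an older node might parse a task status sent by a newer node.
final class TaskStatusParserSketch {
    // Fields the older node knows about (illustrative names only).
    private static final Set<String> KNOWN_FIELDS =
            new HashSet<>(Arrays.asList("state", "allocation_id"));

    // lenient == false: reject unknown fields (strict parsing).
    // lenient == true: silently skip unknown fields.
    static Map<String, Object> parse(Map<String, Object> wire, boolean lenient) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : wire.entrySet()) {
            if (KNOWN_FIELDS.contains(field.getKey())) {
                parsed.put(field.getKey(), field.getValue());
            } else if (!lenient) {
                throw new IllegalArgumentException("unknown field [" + field.getKey() + "]");
            }
        }
        return parsed;
    }
}
```

In this model, a strict older node fails as soon as a newer node sends a status containing the extra field, which is why the lenient behaviour has to ship in a release before the one that adds the field.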