[ML] Store failure reason in ML job task status #34431

At present, if an ML job fails, its persistent task is not deleted but remains in the cluster state with a status of `failed`.  However, to find out why it failed you need to look in the log file of the node that was running the persistent task at the time of the failure.  In support cases this almost invariably involves multiple back-and-forth cycles.

It would be relatively easy to store the `getMessage()` of the exception that caused the job to fail in the job task status for ML job persistent tasks.  This would be helpful in the case where the logs for the node the job was running on are not initially available.  In some cases it may prevent the need to ask for those logs at all.
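To illustrate the idea, here is a minimal sketch of a task status that carries a failure reason alongside its state. It does not use the real Elasticsearch classes: `JobTaskStatusSketch`, its `State` enum and the `failed(...)` factory are hypothetical names chosen for this example. The fallback to the exception class name when `getMessage()` returns `null` is also an assumption, not something the proposal specifies.

```java
import java.util.Objects;

// Hypothetical stand-in for the ML job task status; not the real Elasticsearch class.
class JobTaskStatusSketch {

    enum State { OPENED, FAILED }

    private final State state;
    // Null unless the job failed; would hold the message of the causing exception.
    private final String reason;

    JobTaskStatusSketch(State state, String reason) {
        this.state = Objects.requireNonNull(state);
        this.reason = reason;
    }

    // Builds a failed status from the exception that caused the failure.
    static JobTaskStatusSketch failed(Exception cause) {
        String message = cause.getMessage();
        // Assumption: getMessage() may be null, so fall back to the exception class name.
        String reason = message != null ? message : cause.getClass().getName();
        return new JobTaskStatusSketch(State.FAILED, reason);
    }

    State getState() {
        return state;
    }

    String getReason() {
        return reason;
    }
}
```

With something along these lines, the reason would be visible wherever the task status is already exposed, without needing the logs of the node that ran the job.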

The only problem with making this change is backward compatibility (BWC) of the job task status.  If the status is parsed strictly in 6.0 or above, then 6.7 must first be changed to leniently ignore unknown fields in the job task status, and the new exception field can only be added in 7.0.
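The difference between strict and lenient parsing can be sketched as follows. This is a simplified illustration over a plain `Map`, not the actual Elasticsearch parsing code: `TaskStatusParserSketch`, `parseState`, the `ignoreUnknownFields` flag and the set of known field names are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical parser for a task status; field names are illustrative only.
class TaskStatusParserSketch {

    // Fields an older node would know about (assumed for this sketch).
    private static final Set<String> KNOWN_FIELDS = Set.of("state", "allocation_id");

    // Returns the "state" value. In strict mode any unknown field is rejected;
    // in lenient mode unknown fields are silently skipped.
    static String parseState(Map<String, Object> fields, boolean ignoreUnknownFields) {
        for (String name : fields.keySet()) {
            if (!KNOWN_FIELDS.contains(name) && !ignoreUnknownFields) {
                throw new IllegalArgumentException("unknown field [" + name + "]");
            }
        }
        return (String) fields.get("state");
    }
}
```

In this sketch, an older node parsing strictly would throw on a status containing a new `reason` field written by a newer node, whereas a lenient node would read the state and ignore the rest. That is why the lenient behaviour would need to ship before the new field does.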
