[ML] Store failure reason in ML job task status #34431

@droberts195

Description

At present, if an ML job fails, its persistent task is not deleted but remains in the cluster state with a status of failed. However, to find the reason why it failed, you currently need to look in the log file of the node that was running the persistent task at the time it failed. This almost invariably involves multiple back-and-forth cycles when working on support cases.

It would be relatively easy to store the getMessage() of the exception that caused the job to fail in the job task status for ML job persistent tasks. This would be helpful in the case where the logs for the node the job was running on are not initially available. In some cases it may prevent the need to ask for those logs at all.
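As a rough illustration of the idea (not the actual Elasticsearch code; the class and field names here are hypothetical), the failure handler would capture the exception's `getMessage()` and keep it alongside the failed state so it is visible in the cluster state rather than only in the node's log:

```java
// Hypothetical sketch: record the failure reason in the task status.
public class FailureCapture {
    static String state;
    static String failureReason;

    // Called when the job's persistent task fails; stores the cause's
    // message so it can later be surfaced from the cluster state.
    static void markJobFailed(Exception e) {
        state = "failed";
        failureReason = e.getMessage();
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("native process crashed");
        } catch (Exception e) {
            markJobFailed(e);
        }
        System.out.println(state + ": " + failureReason);
    }
}
```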

The only problem with making this change is backwards compatibility (BWC) of the job task status. If it is strictly parsed in 6.0 or above, then 6.7 needs to be changed to leniently ignore unknown fields in the job task status, and the new exception field can only be added in 7.0.
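The lenient-parsing requirement can be sketched as follows. This is a simplified stand-in for illustration only, assuming a hypothetical `failure_reason` field name and plain Java maps rather than Elasticsearch's actual XContent parsing machinery:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of lenient parsing of a job task status.
public class LenientTaskState {
    final String state;
    final String failureReason;

    LenientTaskState(String state, String failureReason) {
        this.state = state;
        this.failureReason = failureReason;
    }

    // Lenient parse: read only the keys this version knows about and
    // silently ignore anything else, so an older node can still read a
    // status written by a newer node that added extra fields.
    static LenientTaskState fromMap(Map<String, Object> fields) {
        return new LenientTaskState(
            (String) fields.get("state"),
            (String) fields.get("failure_reason"));
    }

    public static void main(String[] args) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("state", "failed");
        parsed.put("failure_reason", "native process exited with code 1");
        parsed.put("some_future_field", 42); // unknown to this version; ignored
        LenientTaskState s = fromMap(parsed);
        System.out.println(s.state + ": " + s.failureReason);
    }
}
```

With strict parsing, the unknown `some_future_field` key would instead raise an error, which is why leniency has to ship one version before the new field does.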
