The UNKWN (unknown) status in LSF does not mean the job has actually died, only that the LSF daemons have lost touch with each other. A job may keep running in the unknown state for a long time and continue writing output via shared disks. It may also recover and terminate with exit code 0. I would suggest mapping it to QueueStatus.HOLD rather than QueueStatus.ERROR.
|
'UNKWN': QueueStatus.ERROR, |
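
To make the suggestion concrete, here is a minimal sketch of what the mapping could look like. The `QueueStatus` enum below is a hypothetical stand-in for the project's own class, and `LSF_STATUS_MAP` and `parse_lsf_status` are illustrative names, not existing code. Only the LSF state strings (`PEND`, `RUN`, `DONE`, `EXIT`, `UNKWN`) come from LSF itself.

```python
from enum import Enum


class QueueStatus(Enum):
    # Hypothetical stand-in for the project's QueueStatus; only HOLD and
    # ERROR are named in this review, the other members are assumptions.
    PENDING = "pending"
    RUNNING = "running"
    HOLD = "hold"
    DONE = "done"
    ERROR = "error"


# Illustrative mapping from LSF job states to QueueStatus.
LSF_STATUS_MAP = {
    "PEND": QueueStatus.PENDING,
    "RUN": QueueStatus.RUNNING,
    "DONE": QueueStatus.DONE,
    "EXIT": QueueStatus.ERROR,
    # UNKWN only means the daemons lost contact; the job may still be
    # running and may finish cleanly, so it is not treated as failed.
    "UNKWN": QueueStatus.HOLD,
}


def parse_lsf_status(raw: str) -> QueueStatus:
    """Translate a raw LSF state string into a QueueStatus.

    Falling back to ERROR for unrecognised states is an assumption made
    for this sketch, not something the project is known to do.
    """
    return LSF_STATUS_MAP.get(raw.strip().upper(), QueueStatus.ERROR)
```

With this mapping a job reported as `UNKWN` stays in a non-terminal state, so a later poll can still observe it moving to `DONE` or `EXIT` once the daemons reconnect.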