plex waits for bacalhau jobs to complete for up to 60 min#633
acashmoney merged 16 commits into main
…x will wait for a bacalhau job to complete
LAB-585 Error processing IO entry
context
We completed the creation and testing of a Protein Design container image and tool, called colabdesign. To generate 32 protein binder candidates, the tool requires 20 minutes on a T4-equivalent GPU. We use the notebook below to trigger jobs on the lab-exchange.
https://colab.research.google.com/drive/1Co4baV_fopBnq1N5XbhItgzrJ-j1xVTL#scrollTo=VONRnnzlpI57
Our input is a protein template (to design against) and a configuration file. Both are downloaded within the notebook for debugging. The container itself is fairly large, totalling ca. 28GB without any optimisation for size. At the time this error report was generated, the exact same container image had already been triggered a couple of times through plex.

issue description
[[error processing IO entry]]
About 12 minutes after submission, the client returns the error listed below: "error processing IO entry". When we submit the same job to the cluster again, we receive the exact same error, again after 12 minutes. When we check the status of the job using
[[bacalhau errored job]]
If we keep pushing jobs to the cluster, we eventually get a new error that comes directly from bacalhau. When we check the state of the task, we see that the system rejected the job bid.

interpretation
We have two problems:
proposed solution
experiment 1 - reduce the job scope and test the failure timepoint
Reduce the scope of the job by cutting the number of binders from 32 to 1. This brings the runtime down to 8 minutes (note the substantial installation and initialisation overhead in this container). The job completes as expected after 8 minutes.
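The two runtimes reported above (20 minutes for 32 binders, 8 minutes for 1 binder) can be used for a rough back-of-the-envelope split between fixed container overhead and per-binder cost. The sketch below assumes runtime grows linearly with the number of binders, which the issue does not confirm; the function name `fit_linear_runtime` is hypothetical and only for illustration.

```python
import math


def fit_linear_runtime(n1, t1, n2, t2):
    """Fit t = overhead + per_item * n through two (n, t) measurements."""
    per_item = (t2 - t1) / (n2 - n1)
    overhead = t1 - per_item * n1
    return overhead, per_item


# Measurements quoted in the issue: 1 binder -> 8 min, 32 binders -> 20 min.
overhead_min, per_binder_min = fit_linear_runtime(1, 8.0, 32, 20.0)

# Largest job that would still finish inside the old ~12 min client limit,
# under the (assumed) linear model.
old_limit_min = 12.0
max_binders_under_limit = math.floor((old_limit_min - overhead_min) / per_binder_min)

print(f"fixed overhead: {overhead_min:.1f} min")
print(f"per binder:     {per_binder_min:.2f} min")
print(f"max binders under {old_limit_min:.0f} min: {max_binders_under_limit}")
```

Under this assumption most of the 8-minute single-binder run is fixed overhead, which is consistent with the note about installation and initialisation cost.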
Summary
plex (client-side) previous job time limit ~12 min
Previously, plex returned an error whenever a job took longer than ~12 min to process. The wait limit now defaults to 60 min.
maxTime flag added for plex run
The new flag determines how long, in minutes, plex should wait for a Bacalhau job to complete.
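As an illustration of the behaviour described here, the following is a minimal sketch of a client-side wait loop with a configurable limit in minutes. It is not plex's actual implementation (plex is not written in Python); `wait_for_job`, `get_state`, and the state names are hypothetical, and the 60-minute default simply mirrors the value stated in this PR.

```python
import time
from typing import Callable


def wait_for_job(
    get_state: Callable[[], str],
    max_time_min: float = 60,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until a terminal state is seen or max_time_min elapses."""
    # Hypothetical terminal state names, chosen for this sketch only.
    terminal_states = {"completed", "error", "cancelled"}
    deadline = clock() + max_time_min * 60  # convert minutes to seconds
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"job still '{state}' after {max_time_min} min"
            )
        sleep(poll_interval_s)
```

The clock and sleep functions are injectable so the timeout path can be exercised without actually waiting; a 20-minute job like the one in the linked issue would hit the timeout at a 12-minute limit and pass at 60.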