Description
context
We completed the creation and testing of a protein design container image and tool called colabdesign.
To generate 32 protein binder candidates, the tool requires 20 minutes on a T4-equivalent GPU.
We use the notebook below to trigger jobs on the lab-exchange.
https://colab.research.google.com/drive/1Co4baV_fopBnq1N5XbhItgzrJ-j1xVTL#scrollTo=VONRnnzlpI57
Our input is a protein template (to design against) and a configuration file; both are downloaded within the notebook for debugging. The container itself is fairly large, totalling ca. 28 GB without any optimisation for size.
At the time this error report was generated, the exact same container image had already been triggered a couple of times through plex.
issue description
[[error processing IO entry]]
About 12 minutes after submission, the client returns the error listed below: "error processing IO entry". When we submit the same job to the cluster again, we receive the exact same error, again after 12 minutes.
Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.
Pinned IO JSON CID: QmR1Der61pRwrV9WAX8zJdyhGuPt1K8oaVmQkkhNWG7Kx7
Plex version (v0.10.1) up to date.
Created working directory: /jobs/16c49441-303f-4693-8100-2a24e826a56c
Initialized IO file at: /jobs/16c49441-303f-4693-8100-2a24e826a56c/io.json
Processing IO Entries
Starting to process IO entry 0
Job running...
Bacalhau job id: 5cad28b8-1bf6-4249-944c-b58402d78d34
Error processing IO entry 0
error cleaning Bacalhau output directory: open /jobs/16c49441-303f-4693-8100-2a24e826a56c/entry-0/outputs/outputs: no such file or directory
Finished processing, results written to /jobs/16c49441-303f-4693-8100-2a24e826a56c/io.json
Completed IO JSON CID: QmT8X5KzqU2y4istxJn1KmFMhtrCWkUn6bxFUgb1siSNne
When we check the status of the job using bacalhau describe at the time of the plex failure, we see that the job is still running on the Bacalhau network.
When we check the status again using bacalhau describe after ca. 20 minutes, we see that the job did not in fact error but completed as expected.
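This mismatch (client errors at ~12 minutes, job completes at ~20) is what you would see if the client's polling loop gives up before the job's actual runtime. A minimal sketch of such a loop, with all names hypothetical stand-ins rather than the real plex or Bacalhau API:

```python
import time

# Hypothetical stand-in for `bacalhau describe <job-id>`: simulates a job
# that only reaches a terminal state on the third poll.
_calls = {"n": 0}

def describe_job(job_id):
    _calls["n"] += 1
    return "Completed" if _calls["n"] >= 3 else "InProgress"

def wait_for_job(job_id, timeout_s=30.0, poll_s=0.01):
    """Poll until the job reaches a terminal state or the timeout expires.

    If the client-side timeout is shorter than the job runtime (plex appears
    to give up after ~12 minutes while this job needs ~20), the client errors
    out while Bacalhau later completes the job, matching the behaviour above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = describe_job(job_id)
        if state in ("Completed", "Error", "Cancelled"):
            return state
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")

print(wait_for_job("5cad28b8"))  # prints "Completed" after three polls
```

If this is the mechanism, raising the client timeout above the job's expected runtime (or making it configurable per tool) would remove the spurious failure.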
(base) rindtorff@NiklasTR Downloads % export BACALHAU_API_HOST=54.210.19.52
(base) rindtorff@NiklasTR Downloads % bacalhau describe 5cad28b8-1bf6-4249-944c-b58402d78d34
Job:
APIVersion: V1beta1
Metadata:
ClientID: 7fbadd066bbf71c93ce1f67403a91e4fb9d4a056b0549f83f37097f5861a8450
CreatedAt: "2023-09-05T07:59:21.362554648Z"
ID: 5cad28b8-1bf6-4249-944c-b58402d78d34
Requester:
RequesterNodeID: QmbETsVtL1sQ97KKV1jPQA5ng8RSyzPWUiDgRBQp7AcjRt
RequesterPublicKey: CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDaeKS6+PcOSU4Qk9g4Vy/JFXlQK2I/jzd4GDb+jaVXUJxMDZMsBnuHSiX1DD+j/W8/TT2odg3ItGzY3JFex1ba5xTK4fdEfhMHeWHWUxXQ0z0ngF+kr7pnc29viMpgWkK4OsHjyHRTJKW/Ry9SZABrtvrtxNINeNzo+kLI30vmu91dbV/R88N2ROz6NHo24CafcRxM2LS6bi7fWCA5KUv7gVZIus3AdnIz4fHAcgtIvSBUVfVrrU7N60y7rui1r8vGDU/mv50ROyJn8vvdr1d+PvGRrVk7IweiBwW4rg8VjT4wNYOMRI5ROcYw2mF3dLO0+D7WXytYp0rpua3uRJhRAgMBAAE=
Spec:
Annotations:
- python
- userId=0xcaf6A0c4468087d76e6B2917cea10F0E1aA2f9D4
Deal:
Concurrency: 1
Docker:
Entrypoint:
- /bin/bash
- -c
- /bin/bash -c "wget https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/install.py;
wget https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/main.py;
mv /inputs/*.pdb /inputs/target_protein.pdb && mv /inputs/*.yaml /inputs/config.yaml;
python install.py; python main.py; mv *.zip /outputs;"
Image: docker.io/niklastr/dev@sha256:01dd6d1eb418d67e97e7517de220e76f9403517502efa411f68c2eeb88977d0f
Engine: Docker
Language:
JobContext: {}
Network:
Type: Full
NodeSelectors:
- Key: owner
Operator: =
Values:
- labdao
Publisher: Estuary
PublisherSpec:
Type: Estuary
Resources:
GPU: "1"
Timeout: 1800
Verifier: Noop
Wasm:
EntryModule: {}
inputs:
- CID: QmfQ1rDG7HFYAHTFA6CXFW7Rw9yF3D4Fcc6T1yGqEDd5kx
StorageSource: IPFS
path: /inputs
outputs:
- Name: outputs
StorageSource: IPFS
path: /outputs
State:
CreateTime: "2023-09-05T07:59:21.362584564Z"
Executions:
- AcceptedAskForBid: true
ComputeReference: e-485b2e0c-8fef-4ec6-9293-117b2c0f5088
CreateTime: "2023-09-05T08:19:42.121454297Z"
JobID: 5cad28b8-1bf6-4249-944c-b58402d78d34
NodeId: QmfAHirFbd3Rjw7X6tT2zmpXbmYzgun5foMYZWvyvfsqg5
PublishedResults:
CID: QmcZZpxwbxd8i9Vn3JVStyEV9NAGtX8sq96Phh87wkc5hh
Name: ipfs://QmcZZpxwbxd8i9Vn3JVStyEV9NAGtX8sq96Phh87wkc5hh
StorageSource: IPFS
RunOutput:
exitCode: 0
runnerError: ""
stderr: "--2023-09-05 07:59:21-- https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/install.py\nResolving
raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133,
185.199.111.133, 185.199.108.133, ...\nConnecting to raw.githubusercontent.com
(raw.githubusercontent.com)|185.199.110.133|:443... connected.\nHTTP request
sent, awaiting response... 200 OK\nLength: 1848 (1.8K) [text/plain]\nSaving
to: ‘install.py’\n\n 0K . 100%
37.4M=0s\n\n2023-09-05 07:59:22 (37.4 MB/s) - ‘install.py’ saved [1848/1848]\n\n--2023-09-05
07:59:22-- https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/main.py\nResolving
raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133,
185.199.109.133, 185.199.110.133, ...\nConnecting to raw.githubusercontent.com
(raw.githubusercontent.com)|185.199.108.133|:443... connected.\nHTTP request
sent, awaiting response... 200 OK\nLength: 15687 (15K) [text/plain]\nSaving
to: ‘main.py’\n\n 0K .......... ..... 100%
\ 114M=0s\n\n2023-09-05 07:59:23 (114 MB/s) - ‘main.py’ saved [15687/15687]\n\nmv:
'/inputs/config.yaml' and '/inputs/config.yaml' are the same file\nCloning
into 'RFdiffusion'...\nERROR: pip's dependency resolver does not currently
take into account all the packages that are installed. This behaviour is the
source of the following dependency conflicts.\nmoviepy 1.0.3 requires decorator<5.0,>=4.0.2,
but you have decorator 5.1.0 which is incompatible.\n/app/main.py:5: UserWarning:
\nThe version_base parameter is not specified.\nPlease specify a compatability
version level, or None.\nWill assume defaults for version 1.1\n initialize(config_path=\"../inputs\")\n/usr/local/lib/python3.10/dist-packages/dgl/backend/pytorch/tensor.py:445:
UserWarning: TypedStorage is deprecated. It will be removed in the future
and UntypedStorage will be the only storage class. This should only matter
to you if you are using storages directly. To acc"
stderrtruncated: true
stdout: "Reading package lists...\nBuilding dependency tree...\nReading state
information...\nThe following additional packages will be installed:\n libaria2-0
libc-ares2\nThe following NEW packages will be installed:\n aria2 libaria2-0
libc-ares2\n0 upgraded, 3 newly installed, 0 to remove and 69 not upgraded.\nNeed
to get 1,513 kB of archives.\nAfter this operation, 5,441 kB of additional
disk space will be used.\nGet:1 http://archive.ubuntu.com/ubuntu jammy-updates/main
amd64 libc-ares2 amd64 1.18.1-1ubuntu0.22.04.2 [45.0 kB]\nGet:2 http://archive.ubuntu.com/ubuntu
jammy/universe amd64 libaria2-0 amd64 1.36.0-1 [1,086 kB]\nGet:3 http://archive.ubuntu.com/ubuntu
jammy/universe amd64 aria2 amd64 1.36.0-1 [381 kB]\nFetched 1,513 kB in 1s
(1,783 kB/s)\nSelecting previously unselected package libc-ares2:amd64.\r\n(Reading
database ... \r(Reading database ... 5%\r(Reading database ... 10%\r(Reading
database ... 15%\r(Reading database ... 20%\r(Reading database ... 25%\r(Reading
database ... 30%\r(Reading database ... 35%\r(Reading database ... 40%\r(Reading
database ... 45%\r(Reading database ... 50%\r(Reading database ... 55%\r(Reading
database ... 60%\r(Reading database ... 65%\r(Reading database ... 70%\r(Reading
database ... 75%\r(Reading database ... 80%\r(Reading database ... 85%\r(Reading
database ... 90%\r(Reading database ... 95%\r(Reading database ... 100%\r(Reading
database ... 120573 files and directories currently installed.)\r\nPreparing
to unpack .../libc-ares2_1.18.1-1ubuntu0.22.04.2_amd64.deb ...\r\nUnpacking
libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.2) ...\r\nSelecting previously unselected
package libaria2-0:amd64.\r\nPreparing to unpack .../libaria2-0_1.36.0-1_amd64.deb
...\r\nUnpacking libaria2-0:amd64 (1.36.0-1) ...\r\nSelecting previously unselected
package aria2.\r\nPreparing to unpack .../aria2_1.36.0-1_amd64.deb ...\r\nUnpacking
aria2 (1.36.0-1) ...\r\nSetting up libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.2)
...\r\nSetting up libaria2-0:amd64 (1.36.0-1) ...\r\nSetting up aria2 (1.36.0-1)
...\r\nProcessing triggers for man-db (2.10.2-1) ...\r\nProcessing triggers
f"
stdouttruncated: true
State: Completed
UpdateTime: "2023-09-05T08:19:41.815301553Z"
VerificationResult:
Complete: true
Result: true
Version: 6
JobID: 5cad28b8-1bf6-4249-944c-b58402d78d34
State: Completed
TimeoutAt: "0001-01-01T00:00:00Z"
UpdateTime: "2023-09-05T08:19:42.121487889Z"
Version: 5
[[bacalhau errored job]]
If we keep pushing jobs to the cluster, we eventually get a new error that comes directly from Bacalhau: "bacalhau errored job". These errors arrive in quick succession.
Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.
Pinned IO JSON CID: QmR1Der61pRwrV9WAX8zJdyhGuPt1K8oaVmQkkhNWG7Kx7
Plex version (v0.10.1) up to date.
Created working directory: /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f
Initialized IO file at: /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
Processing IO Entries
Starting to process IO entry 0
Job running...
Bacalhau job id: da4b77f3-953c-41bb-b650-e44219d98b98
Error processing IO entry 0
error getting Bacalhau job results: bacalhau errored job; please run `bacalhau describe da4b77f3-953c-41bb-b650-e44219d98b98` for more details
Finished processing, results written to /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
Completed IO JSON CID: QmPn3rEhZd8vPWAmBN5oGySj7bjA3ZyKd9phc9rv89ZhzA
DAG filepath /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
DAG location QmPn3rEhZd8vPWAmBN5oGySj7bjA3ZyKd9phc9rv89ZhzA
Plex version (v0.10.1) up to date.
When we check the state of the task, we see that the system rejected the job bid.
interpretation
We have two problems:
- an unexplained plex failure that throws errors for long-running tasks. This error emerges consistently after 12 minutes.
- the lack of a well-explained plex error message informing the user that a bid was rejected because the cluster is busy.
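For the second problem, a small mapping layer from raw Bacalhau failure events to actionable user-facing messages would suffice. A sketch; the event names here are assumptions for illustration, not the real Bacalhau API:

```python
# Hypothetical event name for a rejected bid; the real identifier would
# come from the Bacalhau client library.
BID_REJECTED = "AskForBidRejected"

def explain_failure(event: str, job_id: str = "<job-id>") -> str:
    """Translate a raw failure event into a message the user can act on."""
    messages = {
        BID_REJECTED: (
            "The cluster rejected the job bid, most likely because no node "
            "with a free GPU was available. Please retry in a few minutes."
        ),
        "Error": (
            f"The job errored on the compute node; run "
            f"`bacalhau describe {job_id}` for details."
        ),
    }
    return messages.get(event, f"Job failed with unrecognised event: {event}")

print(explain_failure(BID_REJECTED))
```

This would replace the bare "bacalhau errored job" string with an explanation of why the submission failed and what to do next.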
proposed solution
experiment 1 - reduce the job scope and test the failure timepoint
Reduce the job to a shorter run by lowering the number of binders from 32 to 1. This brings the runtime down to 8 minutes (note the significant installation and initialisation overhead in this container).
basic_settings:
  experiment_name: input_32_16
  binder_length: 50
  pdb: /inputs/target_protein.pdb
  pdb_chain: D
  num_designs: 1
advanced_settings:
  pdb_start_residue: 50
  pdb_end_residue: 100
  hotspot: ''
  min_binder_length: null
  max_binder_length: null
  use_beta_model: false
expert_settings:
  RFDiffusion_Binder:
    contigs_override: ''
    # the contigs_override completely overrules all other contig-related settings. Make sure it is '' if you do not want to overwrite settings.
    iterations: 50
    visual: none
  RFDiffusion_Symmetry:
    symmetry: none
    order: 1
    chains: ''
    add_potential: true
  ProteinMPNN:
    num_seqs: 16
    rm_aa: C
    mpnn_sampling_temp: 0.1
    use_solubleMPNN: true
    initial_guess: true
  Alphafold:
    use_multimer: false
    num_recycles: 3
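A reduced-scope config like the one above can be sanity-checked before submission. A minimal sketch, using a plain dict in place of the parsed YAML; the check function and its rules are hypothetical, not part of plex:

```python
# Parsed form of (a subset of) the reduced config above, as a YAML loader
# would produce it.
config = {
    "basic_settings": {
        "experiment_name": "input_32_16",
        "binder_length": 50,
        "pdb": "/inputs/target_protein.pdb",
        "pdb_chain": "D",
        "num_designs": 1,
    },
    "advanced_settings": {"pdb_start_residue": 50, "pdb_end_residue": 100},
}

def check_reduced_scope(cfg):
    """Hypothetical pre-flight check: fail fast if the job would be long."""
    basic = cfg["basic_settings"]
    if basic["num_designs"] != 1:
        raise ValueError("expected a single-binder test job")
    if not basic["pdb"].startswith("/inputs/"):
        raise ValueError("pdb must live under /inputs, where plex mounts it")
    start = cfg["advanced_settings"]["pdb_start_residue"]
    end = cfg["advanced_settings"]["pdb_end_residue"]
    if start >= end:
        raise ValueError("residue range must be non-empty")
    return True

print(check_reduced_scope(config))  # prints True
```

Failing fast on an oversized `num_designs` keeps the test cycle inside the window where the plex client does not yet time out.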
The job completes as expected after 8 minutes.
Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.
Pinned IO JSON CID: Qmapbi5p3UU4qU2Er3Hed44FLZMfXcWSTBqcDabWQtN4vq
Plex version (v0.10.1) up to date.
Created working directory: /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0
Initialized IO file at: /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
Processing IO Entries
Starting to process IO entry 0
Job running...
Bacalhau job id: 0dd2fd6a-7a5c-4f56-8158-5d5e6d3c1f0a
Computing default go-libp2p Resource Manager limits based on:
- 'Swarm.ResourceMgr.MaxMemory': "6.8 GB"
- 'Swarm.ResourceMgr.MaxFileDescriptors': 524288
Applying any user-supplied overrides on top.
Run 'ipfs swarm limit all' to see the resulting limits.
Success processing IO entry 0
Finished processing, results written to /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
Completed IO JSON CID: QmeRkTZ8Kd5pRXkygyqypZLJPPKkHhsq89vJBgkRcwqphD
2023/09/05 09:18:23 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
DAG filepath /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
DAG location QmeRkTZ8Kd5pRXkygyqypZLJPPKkHhsq89vJBgkRcwqphD
From SyncLinear.com | LAB-585