
[LAB-585] Error processing IO entry #632

@NiklasTR

Description


context

We have finished building and testing a protein design container image and tool called colabdesign.

Generating 32 protein binder candidates takes the tool about 20 minutes on a T4-equivalent GPU.

We use the notebook below to trigger jobs on the lab-exchange.

https://colab.research.google.com/drive/1Co4baV_fopBnq1N5XbhItgzrJ-j1xVTL#scrollTo=VONRnnzlpI57

Our input is a protein template (to design against) and a configuration file. Both are downloaded within the notebook for debugging. The container image itself is fairly large, totalling ca. 28 GB without any size optimisation.

At the time this error report was generated, the exact same container image had already been triggered a couple of times through plex.

issue description

[[error processing IO entry]]

About 12 minutes after submission, the client returns the error listed below: "error processing IO entry". When we resubmit the same job to the cluster, we receive the exact same error, again after 12 minutes.
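The fixed 12-minute window suggests a client-side limit rather than an actual job failure. Below is a minimal, hypothetical sketch of the suspected pattern (the function names and the deadline are our assumptions, not plex code): a client that polls for job outputs but gives up after a hard deadline, so it reports an error even though the remote job keeps running and later completes.

```python
import time

def wait_for_outputs(check_done, poll_interval=10, deadline_s=12 * 60,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until check_done() is True, or raise once the deadline passes.

    A client built like this reports failure after exactly `deadline_s`,
    even when the remote job later completes successfully.
    """
    start = clock()
    while clock() - start < deadline_s:
        if check_done():
            return True
        sleep(poll_interval)
    raise TimeoutError("error processing IO entry: outputs not available yet")

# Simulated run with a fake clock: the job needs 20 "minutes",
# the client gives up after 12, so it raises even though the job
# would eventually have finished.
t = {"now": 0.0}
job_done_at = 20 * 60

def fake_clock():
    return t["now"]

def fake_sleep(s):
    t["now"] += s

try:
    wait_for_outputs(lambda: t["now"] >= job_done_at,
                     clock=fake_clock, sleep=fake_sleep)
    outcome = "completed"
except TimeoutError:
    outcome = "timed out"
```

In this simulation `outcome` ends up as "timed out", mirroring the behaviour we observe: the client errors at the deadline while the Bacalhau job is still running.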


Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.

Pinned IO JSON CID: QmR1Der61pRwrV9WAX8zJdyhGuPt1K8oaVmQkkhNWG7Kx7 

Plex version (v0.10.1) up to date. 

Created working directory: /jobs/16c49441-303f-4693-8100-2a24e826a56c
Initialized IO file at: /jobs/16c49441-303f-4693-8100-2a24e826a56c/io.json

Processing IO Entries 

Starting to process IO entry 0
Job running...

Bacalhau job id: 5cad28b8-1bf6-4249-944c-b58402d78d34 

Error processing IO entry 0 error cleaning Bacalhau output directory: open /jobs/16c49441-303f-4693-8100-2a24e826a56c/entry-0/outputs/outputs: no such file or directory 

Finished processing, results written to /jobs/16c49441-303f-4693-8100-2a24e826a56c/io.json 

Completed IO JSON CID: QmT8X5KzqU2y4istxJn1KmFMhtrCWkUn6bxFUgb1siSNne

When we check the status of the job using bacalhau describe at the time of the plex failure, we see that the job is still in progress on the Bacalhau network.

When we check the status again after ca. 20 minutes, we see that the job did not in fact error; it completed as expected.


(base) rindtorff@NiklasTR Downloads % export BACALHAU_API_HOST=54.210.19.52

(base) rindtorff@NiklasTR Downloads % bacalhau describe 5cad28b8-1bf6-4249-944c-b58402d78d34

Job:

  APIVersion: V1beta1

  Metadata:

    ClientID: 7fbadd066bbf71c93ce1f67403a91e4fb9d4a056b0549f83f37097f5861a8450

    CreatedAt: "2023-09-05T07:59:21.362554648Z"

    ID: 5cad28b8-1bf6-4249-944c-b58402d78d34

    Requester:

      RequesterNodeID: QmbETsVtL1sQ97KKV1jPQA5ng8RSyzPWUiDgRBQp7AcjRt

      RequesterPublicKey: CAASpgIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDaeKS6+PcOSU4Qk9g4Vy/JFXlQK2I/jzd4GDb+jaVXUJxMDZMsBnuHSiX1DD+j/W8/TT2odg3ItGzY3JFex1ba5xTK4fdEfhMHeWHWUxXQ0z0ngF+kr7pnc29viMpgWkK4OsHjyHRTJKW/Ry9SZABrtvrtxNINeNzo+kLI30vmu91dbV/R88N2ROz6NHo24CafcRxM2LS6bi7fWCA5KUv7gVZIus3AdnIz4fHAcgtIvSBUVfVrrU7N60y7rui1r8vGDU/mv50ROyJn8vvdr1d+PvGRrVk7IweiBwW4rg8VjT4wNYOMRI5ROcYw2mF3dLO0+D7WXytYp0rpua3uRJhRAgMBAAE=

  Spec:

    Annotations:

    - python

    - userId=0xcaf6A0c4468087d76e6B2917cea10F0E1aA2f9D4

    Deal:

      Concurrency: 1

    Docker:

      Entrypoint:

      - /bin/bash

      - -c

      - /bin/bash -c "wget https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/install.py;

        wget https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/main.py;

        mv /inputs/*.pdb /inputs/target_protein.pdb && mv /inputs/*.yaml /inputs/config.yaml;

        python install.py; python main.py; mv *.zip /outputs;"

      Image: docker.io/niklastr/dev@sha256:01dd6d1eb418d67e97e7517de220e76f9403517502efa411f68c2eeb88977d0f

    Engine: Docker

    Language:

      JobContext: {}

    Network:

      Type: Full

    NodeSelectors:

    - Key: owner

      Operator: =

      Values:

      - labdao

    Publisher: Estuary

    PublisherSpec:

      Type: Estuary

    Resources:

      GPU: "1"

    Timeout: 1800

    Verifier: Noop

    Wasm:

      EntryModule: {}

    inputs:

    - CID: QmfQ1rDG7HFYAHTFA6CXFW7Rw9yF3D4Fcc6T1yGqEDd5kx

      StorageSource: IPFS

      path: /inputs

    outputs:

    - Name: outputs

      StorageSource: IPFS

      path: /outputs

State:

  CreateTime: "2023-09-05T07:59:21.362584564Z"

  Executions:

  - AcceptedAskForBid: true

    ComputeReference: e-485b2e0c-8fef-4ec6-9293-117b2c0f5088

    CreateTime: "2023-09-05T08:19:42.121454297Z"

    JobID: 5cad28b8-1bf6-4249-944c-b58402d78d34

    NodeId: QmfAHirFbd3Rjw7X6tT2zmpXbmYzgun5foMYZWvyvfsqg5

    PublishedResults:

      CID: QmcZZpxwbxd8i9Vn3JVStyEV9NAGtX8sq96Phh87wkc5hh

      Name: ipfs://QmcZZpxwbxd8i9Vn3JVStyEV9NAGtX8sq96Phh87wkc5hh

      StorageSource: IPFS

    RunOutput:

      exitCode: 0

      runnerError: ""

      stderr: "--2023-09-05 07:59:21--  https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/install.py\nResolving

        raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133,

        185.199.111.133, 185.199.108.133, ...\nConnecting to raw.githubusercontent.com

        (raw.githubusercontent.com)|185.199.110.133|:443... connected.\nHTTP request

        sent, awaiting response... 200 OK\nLength: 1848 (1.8K) [text/plain]\nSaving

        to: ‘install.py’\n\n     0K .                                                     100%

        37.4M=0s\n\n2023-09-05 07:59:22 (37.4 MB/s) - ‘install.py’ saved [1848/1848]\n\n--2023-09-05

        07:59:22--  https://raw.githubusercontent.com/labdao/plex/add_colabdesign_frfr/tools/colabdesign/main.py\nResolving

        raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133,

        185.199.109.133, 185.199.110.133, ...\nConnecting to raw.githubusercontent.com

        (raw.githubusercontent.com)|185.199.108.133|:443... connected.\nHTTP request

        sent, awaiting response... 200 OK\nLength: 15687 (15K) [text/plain]\nSaving

        to: ‘main.py’\n\n     0K .......... .....                                      100%

        \ 114M=0s\n\n2023-09-05 07:59:23 (114 MB/s) - ‘main.py’ saved [15687/15687]\n\nmv:

        '/inputs/config.yaml' and '/inputs/config.yaml' are the same file\nCloning

        into 'RFdiffusion'...\nERROR: pip's dependency resolver does not currently

        take into account all the packages that are installed. This behaviour is the

        source of the following dependency conflicts.\nmoviepy 1.0.3 requires decorator<5.0,>=4.0.2,

        but you have decorator 5.1.0 which is incompatible.\n/app/main.py:5: UserWarning:

        \nThe version_base parameter is not specified.\nPlease specify a compatability

        version level, or None.\nWill assume defaults for version 1.1\n  initialize(config_path=\"../inputs\")\n/usr/local/lib/python3.10/dist-packages/dgl/backend/pytorch/tensor.py:445:

        UserWarning: TypedStorage is deprecated. It will be removed in the future

        and UntypedStorage will be the only storage class. This should only matter

        to you if you are using storages directly.  To acc"

      stderrtruncated: true

      stdout: "Reading package lists...\nBuilding dependency tree...\nReading state

        information...\nThe following additional packages will be installed:\n  libaria2-0

        libc-ares2\nThe following NEW packages will be installed:\n  aria2 libaria2-0

        libc-ares2\n0 upgraded, 3 newly installed, 0 to remove and 69 not upgraded.\nNeed

        to get 1,513 kB of archives.\nAfter this operation, 5,441 kB of additional

        disk space will be used.\nGet:1 http://archive.ubuntu.com/ubuntu jammy-updates/main

        amd64 libc-ares2 amd64 1.18.1-1ubuntu0.22.04.2 [45.0 kB]\nGet:2 http://archive.ubuntu.com/ubuntu

        jammy/universe amd64 libaria2-0 amd64 1.36.0-1 [1,086 kB]\nGet:3 http://archive.ubuntu.com/ubuntu

        jammy/universe amd64 aria2 amd64 1.36.0-1 [381 kB]\nFetched 1,513 kB in 1s

        (1,783 kB/s)\nSelecting previously unselected package libc-ares2:amd64.\r\n(Reading

        database ... \r(Reading database ... 5%\r(Reading database ... 10%\r(Reading

        database ... 15%\r(Reading database ... 20%\r(Reading database ... 25%\r(Reading

        database ... 30%\r(Reading database ... 35%\r(Reading database ... 40%\r(Reading

        database ... 45%\r(Reading database ... 50%\r(Reading database ... 55%\r(Reading

        database ... 60%\r(Reading database ... 65%\r(Reading database ... 70%\r(Reading

        database ... 75%\r(Reading database ... 80%\r(Reading database ... 85%\r(Reading

        database ... 90%\r(Reading database ... 95%\r(Reading database ... 100%\r(Reading

        database ... 120573 files and directories currently installed.)\r\nPreparing

        to unpack .../libc-ares2_1.18.1-1ubuntu0.22.04.2_amd64.deb ...\r\nUnpacking

        libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.2) ...\r\nSelecting previously unselected

        package libaria2-0:amd64.\r\nPreparing to unpack .../libaria2-0_1.36.0-1_amd64.deb

        ...\r\nUnpacking libaria2-0:amd64 (1.36.0-1) ...\r\nSelecting previously unselected

        package aria2.\r\nPreparing to unpack .../aria2_1.36.0-1_amd64.deb ...\r\nUnpacking

        aria2 (1.36.0-1) ...\r\nSetting up libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.2)

        ...\r\nSetting up libaria2-0:amd64 (1.36.0-1) ...\r\nSetting up aria2 (1.36.0-1)

        ...\r\nProcessing triggers for man-db (2.10.2-1) ...\r\nProcessing triggers

        f"

      stdouttruncated: true

    State: Completed

    UpdateTime: "2023-09-05T08:19:41.815301553Z"

    VerificationResult:

      Complete: true

      Result: true

    Version: 6

  JobID: 5cad28b8-1bf6-4249-944c-b58402d78d34

  State: Completed

  TimeoutAt: "0001-01-01T00:00:00Z"

  UpdateTime: "2023-09-05T08:19:42.121487889Z"

  Version: 5
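The mismatch between the plex-side error and the actual job outcome can be checked programmatically from the describe output. A minimal sketch (the helper function is ours, not existing plex or Bacalhau tooling), using only the fields that appear in the output above:

```python
def job_succeeded(describe: dict) -> bool:
    """True when the Bacalhau job itself finished cleanly,
    regardless of what the plex client reported."""
    state = describe.get("State", {})
    executions = state.get("Executions", [])
    return state.get("State") == "Completed" and any(
        e.get("State") == "Completed"
        and e.get("RunOutput", {}).get("exitCode") == 0
        for e in executions
    )

# Fields copied (abridged) from the describe output above.
describe = {
    "State": {
        "State": "Completed",
        "Executions": [
            {"State": "Completed", "RunOutput": {"exitCode": 0}},
        ],
    },
}
```

For this job, `job_succeeded(describe)` is True: State is Completed with exitCode 0, so the "error processing IO entry" reported by plex did not come from the job itself.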

[[bacalhau errored job]]

If we keep pushing jobs to the cluster, we eventually get a new error that comes directly from Bacalhau: "bacalhau errored job". These errors arrive in quick succession.


Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.
Pinned IO JSON CID: QmR1Der61pRwrV9WAX8zJdyhGuPt1K8oaVmQkkhNWG7Kx7
Plex version (v0.10.1) up to date.
Created working directory: /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f
Initialized IO file at: /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
Processing IO Entries
Starting to process IO entry 0
Job running...
Bacalhau job id: da4b77f3-953c-41bb-b650-e44219d98b98
Error processing IO entry 0 error getting Bacalhau job results: bacalhau errored job; please run `bacalhau describe da4b77f3-953c-41bb-b650-e44219d98b98` for more details
Finished processing, results written to /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
Completed IO JSON CID: QmPn3rEhZd8vPWAmBN5oGySj7bjA3ZyKd9phc9rv89ZhzA
DAG filepath /jobs/281bca8f-fb04-4ef5-867f-3fcabb17db6f/io.json
DAG location QmPn3rEhZd8vPWAmBN5oGySj7bjA3ZyKd9phc9rv89ZhzA
Plex version (v0.10.1) up to date.

When we check the state of the task, we see that the system rejected the job bid.

interpretation

We have two problems:

  1. an unexplained plex failure that throws errors for long-running tasks. The error emerges consistently 12 minutes after submission.
  2. plex lacks a well-explained error message that informs the user about a bid rejection when the cluster is busy.
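For problem 2, a hedged sketch of what a fix could look like (the function, the message strings, and the exact state names are illustrative assumptions, not existing plex code; consult the Bacalhau job model for the authoritative state set): map terminal job states to actionable user-facing messages instead of a generic failure.

```python
def explain_job_state(state: str) -> str:
    """Translate a Bacalhau job state into a user-facing message.

    NOTE: state names and wording are hypothetical; they illustrate
    the kind of mapping plex could ship, not its current behaviour.
    """
    messages = {
        "Completed": "Job completed successfully.",
        "Cancelled": "Job was cancelled.",
        "Error": ("Job errored on the compute node; run "
                  "`bacalhau describe <job-id>` for details."),
        "BidRejected": ("The cluster rejected the job bid, likely because "
                        "no node had free capacity. Please retry later."),
    }
    return messages.get(state, f"Job is still in progress (state: {state}).")
```

With a mapping like this, the busy-cluster case would surface as a bid-rejection message rather than the opaque "bacalhau errored job".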

proposed solution

experiment 1 - reduce the job scope and test the failure timepoint

Shorten the job by reducing the number of binders from 32 to 1. This brings the runtime down to 8 minutes (note the substantial installation and initialisation overhead in this container).


basic_settings:

  experiment_name: input_32_16

  binder_length: 50

  pdb: /inputs/target_protein.pdb

  pdb_chain: D

  num_designs: 1

advanced_settings:

  pdb_start_residue: 50

  pdb_end_residue: 100

  hotspot: ''

  min_binder_length: null

  max_binder_length: null

  use_beta_model: false

expert_settings:

  RFDiffusion_Binder:

    contigs_override: ''

    # the contigs_override completely overrules all other contig related settings. Make sure it is '' if you do not want to overwrite settings.

    iterations: 50

    visual: none

  RFDiffusion_Symmetry:

    symmetry: none

    order: 1

    chains: ''

    add_potential: true

  ProteinMPNN:

    num_seqs: 16

    rm_aa: C

    mpnn_sampling_temp: 0.1

    use_solubleMPNN: true

    initial_guess: true

  Alphafold:

    use_multimer: false

    num_recycles: 3

The job completes as expected after 8 minutes.


Requirement already satisfied: PlexLabExchange in /usr/local/lib/python3.10/dist-packages (0.9.1)
plex init -t /content/_colabdesign-dev.json -i {"config": ["/content/config.yaml"], "protein": ["/content/6vja_stripped.pdb"]} --scatteringMethod=dotProduct --autoRun=false
Plex version (v0.10.1) up to date.
Pinned IO JSON CID: Qmapbi5p3UU4qU2Er3Hed44FLZMfXcWSTBqcDabWQtN4vq
Plex version (v0.10.1) up to date.
Created working directory: /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0
Initialized IO file at: /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
Processing IO Entries
Starting to process IO entry 0
Job running...
Bacalhau job id: 0dd2fd6a-7a5c-4f56-8158-5d5e6d3c1f0a
Computing default go-libp2p Resource Manager limits based on:
- 'Swarm.ResourceMgr.MaxMemory': "6.8 GB"
- 'Swarm.ResourceMgr.MaxFileDescriptors': 524288
Applying any user-supplied overrides on top.
Run 'ipfs swarm limit all' to see the resulting limits.
Success processing IO entry 0
Finished processing, results written to /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
Completed IO JSON CID: QmeRkTZ8Kd5pRXkygyqypZLJPPKkHhsq89vJBgkRcwqphD
2023/09/05 09:18:23 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size for details.
DAG filepath /jobs/0c779a51-3c30-49a5-885c-9854577ad4d0/io.json
DAG location QmeRkTZ8Kd5pRXkygyqypZLJPPKkHhsq89vJBgkRcwqphD
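The timings observed across the runs are consistent with a fixed client-side window. As a quick sanity check on the numbers reported above (the notion of a 12-minute client window is our hypothesis; the 1800-second timeout comes from the job spec):

```python
CLIENT_FAILURE_MIN = 12            # plex errors after ~12 minutes on both long runs
LONG_JOB_MIN = 20                  # 32-binder job runtime on a T4-equivalent GPU
SHORT_JOB_MIN = 8                  # 1-binder job runtime
BACALHAU_TIMEOUT_MIN = 1800 / 60   # "Timeout: 1800" from the job spec

# The short job finishes inside the suspected window, the long job does not,
# and neither comes close to the 30-minute Bacalhau-side timeout.
assert SHORT_JOB_MIN < CLIENT_FAILURE_MIN < LONG_JOB_MIN < BACALHAU_TIMEOUT_MIN
```

This is why experiment 1 succeeds: only the client-side window is crossed by the long job, not the Bacalhau timeout.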

From SyncLinear.com | LAB-585
