Usage

Quick Start

Embed sequences with language model

Sequences should be in .fasta format.

dscript embed --seqs [sequences] --outfile [embedding file]
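As an illustration of the expected input, here is a minimal Python sketch that writes and re-reads a small .fasta file. The helper names `write_fasta` and `read_fasta_names`, the protein names, and the sequences are hypothetical and not part of D-SCRIPT; the sketch only assumes the usual FASTA layout of a `>` header line followed by sequence lines.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, width=60):
    """Write a {name: sequence} mapping as FASTA, wrapping sequences at `width` characters."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


def read_fasta_names(path):
    """Return header names (text after '>' up to the first whitespace) in file order."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                names.append(line[1:].split()[0])
    return names


# Hypothetical toy sequences, for illustration only
toy_records = {"protA": "MKTAYIAKQR" * 7, "protB": "MSDNELLQ"}
fasta_path = Path(tempfile.mkdtemp()) / "toy.fasta"
write_fasta(toy_records, fasta_path)
print(read_fasta_names(fasta_path))
```

A file written this way could then be passed as the `--seqs` argument.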

Predict a new network using a trained model

Pre-trained models can be downloaded from here. To predict interactions between all pairs of proteins, list the protein names one per line with no header. To predict specific candidate pairs instead, provide a tab-separated (.tsv) file with no header and columns for [protein name 1], [protein name 2]. A pairs file may contain additional columns (for example, a [label] column in training or test data files); these are ignored.

dscript predict --proteins [list of proteins] --embeddings [embedding file] --outfile [outfile] --model [model file]
dscript predict --pairs [list of pairs] --embeddings [embedding file] --outfile [outfile] --model [model file]
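The two input formats described above can be sketched in Python as follows. The protein names and file names are hypothetical. Enumerating unordered pairs of distinct proteins is only one plausible reading of "all pairs"; whether `dscript predict --proteins` also scores self-pairs is not stated here.

```python
import csv
import itertools
import tempfile
from pathlib import Path

proteins = ["P1", "P2", "P3", "P4"]  # hypothetical protein names
workdir = Path(tempfile.mkdtemp())

# Input for --proteins: one name per line, no header
proteins_path = workdir / "proteins.txt"
proteins_path.write_text("\n".join(proteins) + "\n")

# Input for --pairs: tab-separated, no header, two name columns
# (assumption: unordered pairs of distinct proteins)
candidate_pairs = list(itertools.combinations(proteins, 2))
pairs_path = workdir / "pairs.tsv"
with open(pairs_path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(candidate_pairs)

print(len(candidate_pairs), "candidate pairs written to", pairs_path)
```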

Train and save a model

Training and validation data should be in tab-separated (.tsv) format with no header, and columns for [protein name 1], [protein name 2], [label].

dscript train --train [training data] --test [validation data] --embedding [embedding file] --save-prefix [prefix]
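Before training, it can help to sanity-check the data files. The following Python sketch validates a headerless three-column .tsv; the function name `check_pairs_tsv` is hypothetical, and the check that labels are written as `0` or `1` is an assumption, since the label encoding is not specified above.

```python
import csv
import io


def check_pairs_tsv(handle):
    """Return (n_rows, n_positive) for a headerless TSV; raise ValueError on malformed rows."""
    n_rows = 0
    n_positive = 0
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) < 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        label = row[2]
        if label not in {"0", "1"}:  # assumption: binary labels written as 0 or 1
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
        n_rows += 1
        n_positive += int(label == "1")
    return n_rows, n_positive


# Hypothetical toy data: [protein name 1], [protein name 2], [label]
example = io.StringIO("P1\tP2\t1\nP1\tP3\t0\nP2\tP3\t0\n")
print(check_pairs_tsv(example))
```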

Evaluate a trained model

dscript evaluate --model [model file] --test [test data] --embedding [embedding file] --outfile [result file]

Blocked, Multi-GPU Prediction

usage: dscript predict [-h] [--proteins PROTEINS] [--pairs PAIRS] [--model MODEL] --embeddings EMBEDDINGS [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                   [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocks BLOCKS] [--sparse_loading]

Make new predictions with a pre-trained model using blocked, multi-GPU pairwise inference. One of --proteins and --pairs is required.

options:
  -h, --help            show this help message and exit
  --proteins PROTEINS   File with protein IDs for which to predict all pairs, one per line; specify one of proteins or pairs
  --pairs PAIRS         File with candidate protein pairs to predict, one pair per line; specify one of proteins or pairs
  --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub
                        [default: samsl/topsy_turvy_human_v1]
  --embeddings EMBEDDINGS
                        h5 file with (a superset of) pre-embedded sequences. Generate with dscript embed.
  --foldseek_fasta FOLDSEEK_FASTA
                        3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                        D-SCRIPT/TT will be run.
  -o OUTFILE, --outfile OUTFILE
                        File for predictions
  -d DEVICE, --device DEVICE
                        Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                        beforehand and use 'all'. [default: all]
  --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
  --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
  --load_proc LOAD_PROC
                        Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the
                        # of CPUs are allowed.
  --blocks BLOCKS       Number of equal-sized blocks to split proteins into. In the multi-block case, maximum (embedding) memory usage should be 3 blocks' worth. When
                        multiple GPUs are used, memory usage may briefly be higher when different GPUs are working on tasks from different blocks. Small blocks
                        may also lead to occasional brief hangs with multiple GPUs. Default 1.
  --sparse_loading      Load only the proteins required from each block, but do not reuse loaded blocks in memory. Recommended when predicting with many blocks on
                        sparse pairs, such that many pairs of blocks might contain no pairs of proteins of interest. Only available when blocks > 1 and pairs
                        specified. Maximum (embedding) memory usage with this option is 4 blocks' worth.
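To give some intuition for `--blocks` and `--sparse_loading`, the Python sketch below splits a protein list into near-equal blocks and counts how many block pairs a sparse set of candidate pairs actually touches. This is only an illustration: the contiguous partitioning, the helper name `split_into_blocks`, and the protein names are assumptions and do not describe how dscript partitions proteins internally.

```python
import itertools


def split_into_blocks(items, n_blocks):
    """Split a list into n_blocks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_blocks)
    blocks = []
    start = 0
    for i in range(n_blocks):
        end = start + base + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


proteins = [f"P{i}" for i in range(10)]  # hypothetical protein IDs
blocks = split_into_blocks(proteins, 3)
block_of = {name: idx for idx, block in enumerate(blocks) for name in block}

# With B blocks there are B * (B + 1) / 2 unordered block pairs in total
block_pairs = list(itertools.combinations_with_replacement(range(len(blocks)), 2))

# A sparse pair list may touch only a few of those block pairs
sparse_pairs = [("P0", "P1"), ("P2", "P9")]
needed = {tuple(sorted((block_of[a], block_of[b]))) for a, b in sparse_pairs}

print([len(b) for b in blocks], len(block_pairs), sorted(needed))
```

In this toy case only two of the six block pairs contain a pair of interest, which is the situation where loading only the required proteins per block is recommended.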

Bipartite Prediction

usage: dscript predict_bipartite [-h] --protA PROTA --protB PROTB [--model MODEL] --embedA EMBEDA [--embedB EMBEDB] [--foldseekA FOLDSEEKA] [--foldseekB FOLDSEEKB] [-o OUTFILE] [-d DEVICE] [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC] [--blocksA BLOCKSA]
                             [--blocksB BLOCKSB]

Make new predictions between two protein sets using blocked, multi-GPU pairwise inference with a pre-trained model.

options:
  -h, --help            show this help message and exit
  --protA PROTA         A text file with protein IDs, one on each line. All pairs between proteins in this file and proteins in protB will be predicted
  --protB PROTB         A text file with protein IDs, one on each line. All pairs between proteins in protA and proteins in this file will be predicted
  --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default: samsl/topsy_turvy_human_v1]
  --embedA EMBEDA       h5 file with (a superset of) pre-embedded sequences from the file protA. Generate with dscript embed. If a single file contains embeddings for both protA and protB, specify it as embedA.
  --embedB EMBEDB       h5 file with (a superset of) pre-embedded sequences from the file protB. Generate with dscript embed.
  --foldseekA FOLDSEEKA
                        3di sequences in .fasta format for proteins in protA. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run. If a single file contains 3di sequences for both protA and protB,
                        specify it as foldseekA.
  --foldseekB FOLDSEEKB
                        3di sequences in .fasta format for proteins in protB. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default D-SCRIPT/TT will be run.
  -o OUTFILE, --outfile OUTFILE
                        File for predictions
  -d DEVICE, --device DEVICE
                        Compute device to use. Options: 'cpu', 'all' (all GPUs), or GPU index (0, 1, 2, etc.). To use specific GPUs, set CUDA_VISIBLE_DEVICES
                        beforehand and use 'all'. [default: all]
  --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
  --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
  --load_proc LOAD_PROC
                        Number of processes to use when loading embeddings (-1 = # of available CPUs, default=16). Because loading is IO-bound, values larger than the # of CPUs are allowed.
  --blocksA BLOCKSA     Number of equal-sized blocks to split proteins in protA into. If one set is much smaller, it is recommended to set the corresponding # of blocks to 1. Default 1.
  --blocksB BLOCKSB     Number of equal-sized blocks to split proteins in protB into. Default 1.
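The Python sketch below shows which pairs a bipartite run covers and how an invocation restricted to specific GPUs might be assembled from the options listed above. The protein IDs, file names, block counts, and GPU indices are hypothetical; the command is only built and printed, not executed.

```python
import itertools
import os
import shlex

prot_a = ["A1", "A2"]        # hypothetical IDs listed in the --protA file
prot_b = ["B1", "B2", "B3"]  # hypothetical IDs listed in the --protB file

# Every (A, B) combination is a candidate pair: len(prot_a) * len(prot_b) in total
bipartite_pairs = list(itertools.product(prot_a, prot_b))

# Restrict the run to two specific GPUs, then ask dscript for all visible devices
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
command = [
    "dscript", "predict_bipartite",
    "--protA", "protA.txt",
    "--protB", "protB.txt",
    "--embedA", "embeddings.h5",  # a single file holding embeddings for both sets
    "--blocksA", "1",             # the smaller set is kept in one block
    "--blocksB", "2",
    "-o", "bipartite_predictions",
    "-d", "all",
]

print(len(bipartite_pairs))
print(shlex.join(command))
# To launch for real: subprocess.run(command, env=env, check=True)
```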

Serial Prediction

usage: dscript predict_serial [-h] --pairs PAIRS [--model MODEL] [--seqs SEQS] [--embeddings EMBEDDINGS] [--foldseek_fasta FOLDSEEK_FASTA] [-o OUTFILE] [-d DEVICE]
                          [--store_cmaps] [--thresh THRESH] [--load_proc LOAD_PROC]

Make new predictions with a pre-trained model using legacy (serial) inference. One of --seqs or --embeddings is required.

options:
  -h, --help            show this help message and exit
  --pairs PAIRS         Candidate protein pairs to predict
  --model MODEL         Pretrained Model. If this is a `.sav` or `.pt` file, it will be loaded. Otherwise, we will try to load `[model]` from HuggingFace hub [default:
                        samsl/topsy_turvy_human_v1]
  --seqs SEQS           Protein sequences in .fasta format
  --embeddings EMBEDDINGS
                        h5 file with embedded sequences
  --foldseek_fasta FOLDSEEK_FASTA
                        3di sequences in .fasta format. Can be generated using `dscript extract-3di`. Default is None. If provided, TT3D will be run, otherwise default
                        D-SCRIPT/TT will be run.
  -o OUTFILE, --outfile OUTFILE
                        File for predictions
  -d DEVICE, --device DEVICE
                        Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).
  --store_cmaps         Store contact maps for predicted pairs above `--thresh` in an h5 file
  --thresh THRESH       Positive prediction threshold - used to store contact maps and predictions in a separate file. [default: 0.5]
  --load_proc LOAD_PROC
                        Number of processes to use when loading embeddings (-1 = # of CPUs, default=32)
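The role of `--thresh` can be illustrated with a short Python sketch that keeps only the predictions at or above a threshold. The input layout (tab-separated name, name, score), the helper name `positives`, and the use of `>=` rather than `>` are all assumptions made for illustration; the actual prediction file format is not described in this section.

```python
import csv
import io


def positives(handle, thresh=0.5):
    """Yield (name1, name2, score) for rows whose score is at or above thresh."""
    for name1, name2, score in csv.reader(handle, delimiter="\t"):
        value = float(score)
        if value >= thresh:  # assumption: the comparison is inclusive
            yield name1, name2, value


# Hypothetical prediction rows: [protein name 1], [protein name 2], [score]
toy_predictions = io.StringIO("P1\tP2\t0.91\nP1\tP3\t0.12\nP2\tP3\t0.50\n")
kept = list(positives(toy_predictions))
print(kept)
```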

Embedding

usage: dscript embed [-h] --seqs SEQS -o OUTFILE [-d DEVICE]

Generate new embeddings using a pre-trained language model

optional arguments:
  -h, --help            show this help message and exit
  --seqs SEQS           Sequences to be embedded
  -o OUTFILE, --outfile OUTFILE
                        h5 file to write results
  -d DEVICE, --device DEVICE
                        Compute device to use. Options: 'cpu' or GPU index (0, 1, 2, etc.).

Training

usage: dscript train [-h] --train TRAIN --test TEST --embedding EMBEDDING
                 [--no-augment] [--input-dim INPUT_DIM]
                 [--projection-dim PROJECTION_DIM] [--dropout-p DROPOUT_P]
                 [--hidden-dim HIDDEN_DIM] [--kernel-width KERNEL_WIDTH]
                 [--no-w] [--no-sigmoid] [--do-pool]
                 [--pool-width POOL_WIDTH] [--num-epochs NUM_EPOCHS]
                 [--batch-size BATCH_SIZE] [--weight-decay WEIGHT_DECAY]
                 [--lr LR] [--lambda INTERACTION_WEIGHT] [--topsy-turvy]
                 [--glider-weight GLIDER_WEIGHT]
                 [--glider-thresh GLIDER_THRESH] [-o OUTFILE]
                 [--save-prefix SAVE_PREFIX] [-d DEVICE]
                 [--checkpoint CHECKPOINT]

Train a new model.

optional arguments:
  -h, --help            show this help message and exit

Data:
  --train TRAIN         list of training pairs
  --test TEST           list of validation/testing pairs
  --embedding EMBEDDING
                        h5py path containing embedded sequences
  --no-augment          data is automatically augmented by adding (B A) for
                        all pairs (A B). Set this flag to not augment data

Projection Module:
  --input-dim INPUT_DIM
                        dimension of input language model embedding (per amino
                        acid) (default: 6165)
  --projection-dim PROJECTION_DIM
                        dimension of embedding projection layer (default: 100)
  --dropout-p DROPOUT_P
                        parameter p for embedding dropout layer (default: 0.5)

Contact Module:
  --hidden-dim HIDDEN_DIM
                        number of hidden units for comparison layer in contact
                        prediction (default: 50)
  --kernel-width KERNEL_WIDTH
                        width of convolutional filter for contact prediction
                        (default: 7)

Interaction Module:
  --no-w                don't use weight matrix in interaction prediction
                        model
  --no-sigmoid          don't use sigmoid activation at end of interaction
                        model
  --do-pool             use max pool layer in interaction prediction model
  --pool-width POOL_WIDTH
                        size of max-pool in interaction model (default: 9)

Training:
  --num-epochs NUM_EPOCHS
                        number of epochs (default: 10)
  --batch-size BATCH_SIZE
                        minibatch size (default: 25)
  --weight-decay WEIGHT_DECAY
                        L2 regularization (default: 0)
  --lr LR               learning rate (default: 0.001)
  --lambda INTERACTION_WEIGHT
                        weight on the similarity objective (default: 0.35)
  --topsy-turvy         run in Topsy-Turvy mode -- use top-down GLIDER scoring
                        to guide training (reference TBD)
  --glider-weight GLIDER_WEIGHT
                        weight on the GLIDER accuracy objective (default: 0.2)
  --glider-thresh GLIDER_THRESH
                        proportion of GLIDER scores treated as positive edges
                        (0 < gt < 1) (default: 0.925)

Output and Device:
  -o OUTFILE, --outfile OUTFILE
                        output file path (default: stdout)
  --save-prefix SAVE_PREFIX
                        path prefix for saving models
  -d DEVICE, --device DEVICE
                        compute device to use
  --checkpoint CHECKPOINT
                        checkpoint model to start training from
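The default augmentation described under `--no-augment` (adding (B A) for all pairs (A B)) can be sketched in a few lines of Python. The function name `augment_pairs` and the toy rows are hypothetical; this shows the idea rather than dscript's own implementation.

```python
def augment_pairs(rows):
    """Return the original rows plus a swapped copy (B, A, label) of every (A, B, label)."""
    rows = list(rows)
    return rows + [(b, a, label) for a, b, label in rows]


# Hypothetical training rows: (protein name 1, protein name 2, label)
train_rows = [("P1", "P2", 1), ("P3", "P4", 0)]
augmented = augment_pairs(train_rows)
print(augmented)
```

Augmenting this way doubles the number of training rows while keeping each label attached to both orderings of its pair.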

Evaluation

usage: dscript evaluate [-h] --model MODEL --test TEST --embedding EMBEDDING
                        [-o OUTFILE] [-d DEVICE]

Evaluate a trained model

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Trained prediction model
  --test TEST           Test Data
  --embedding EMBEDDING
                        h5 file with embedded sequences
  -o OUTFILE, --outfile OUTFILE
                        Output file to write results
  -d DEVICE, --device DEVICE
                        Compute device to use