Conversation
taylormjs commented on May 6, 2025
- Fixed several peer tasks that were buggy
- Added evaluate method to peer callback for streamlined eval
* dataloader callback
* utils
* ume
* gitignore dev
* tests
* lock
* torch 2.5
* torch 2.5
* part
* .env
* scheduler
* fix scheduler
* fix scheduler
Processed Atomica interactions dataset
…nizers (#65)

Add two special tokens, <convert> and <interact>, for later stages of Ume training. They will be used like this (or something like that):

[CLS] PROT_SEQ [SEP] <convert> PROT_STRUCT(masked) [SEP]
[CLS] PROT_SEQ [SEP] <interact> SMILES(masked) [SEP]

- Extend functionality of UmeTokenizerTransform to handle dual modalities
- Change the name of the Ume embedding method and allow embedding from existing input_ids
- Fix existing tokenizers: add a lowercase normalizer to the nucleotide tokenizer (the OG2 dataset contains a mix of upper- and lowercase letters)
- BPE handled SMILES tokenization incorrectly; switch to WordLevel

* tokenizer
* fix tests
* lowercase normalizer for nt
* tests
* remove mod conv dataset
* embed
* Test
* merge 2mod into UmeTokenizerTransform
* fix tests
* all
* type hints
* docstrings
* tests
* fix SMILES tokenizer
* switch all tokenizer to BPE
* Revert "switch all tokenizer to BPE" (reverts commit 367e77d)
* tok
* fix SMILES tokenizer
* remove print statement
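The bracketed dual-modality layout described above can be sketched as a simple string template. This helper is purely illustrative (not the actual UmeTokenizerTransform); the function name and example sequences are hypothetical:

```python
def build_dual_modality_input(seq_a: str, seq_b: str, task_token: str) -> str:
    """Join two modality sequences with a task token such as <convert> or <interact>."""
    return f"[CLS] {seq_a} [SEP] {task_token} {seq_b} [SEP]"

# A protein sequence paired with a (masked) SMILES string for the interaction task:
example = build_dual_modality_input("MKTAYIAK", "CCO", "<interact>")
# → "[CLS] MKTAYIAK [SEP] <interact> CCO [SEP]"
```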
* pplx
* tests
* src
* ignore torchmetrics warnings
* docstrings
* docstrings
* pplx as attr
* pplx as attr
* pplx
* comments
* on step
* comment
Main file to review
Updated peer constants too
Updated peer tests
@karinazad @ncfrey Fixed the peer callback and added an evaluate method. I was thinking of pushing this first and then integrating into cmdline; lobster-evaluate-ume in a separate MR.
```python
x = {k: v.to(pl_module.device) for k, v in tokenized_inputs.items() if isinstance(v, Tensor)}

# Extract embeddings
embeddings = pl_module.model.tokens_to_latents(**x)
```
I added an embed method to Ume which takes input_ids directly, so it could be used here instead:
https://github.com/prescient-design/lobster/blob/main/src/lobster/model/_ume.py#L292
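For illustration, here is the filter-and-move-to-device pattern from the snippet above, with a torch-free FakeTensor stand-in so it runs anywhere; all names here are hypothetical, not the lobster code:

```python
class FakeTensor:
    """Torch-free stand-in for torch.Tensor, used only for illustration."""
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        # Mimic Tensor.to(device) by returning a copy on the target device.
        return FakeTensor(device)

tokenized_inputs = {"input_ids": FakeTensor(), "modality": "protein"}
device = "cuda:0"
# Keep only tensor-like values and move them to the module's device;
# non-tensor entries (e.g. the modality string) are dropped.
x = {k: v.to(device) for k, v in tokenized_inputs.items() if isinstance(v, FakeTensor)}
```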
The ignore index is -100, used for padding and other positions excluded from the loss.
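A quick sketch of the -100 convention: positions labeled -100 are skipped when computing token-level metrics, mirroring the default ignore_index of torch.nn.CrossEntropyLoss. The helper below is illustrative only, not the callback's actual code:

```python
IGNORE_INDEX = -100  # default ignore_index used by torch.nn.CrossEntropyLoss

def masked_token_accuracy(predictions, labels):
    """Accuracy over positions whose label is not the ignore index."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not kept:
        return 0.0
    return sum(p == l for p, l in kept) / len(kept)

# The -100 at position 2 (e.g. a padding token) is excluded from the count.
acc = masked_token_accuracy([1, 2, 3, 4], [1, 2, -100, 0])
```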
Pull Request Overview
This PR addresses multiple issues with peer tasks and evaluation while also adding new features and updates. The key changes include:
- Fixes to peer task definitions and classification types.
- Addition of the AtomicaDataset and improvements to the evaluation callback (including a new evaluate method and enhanced metric logging).
- Updates to dataset handling, S3 utilities, tokenizers, SLURM scripts, and dependency constraints.
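The "evaluate method on the callback" idea listed above can be sketched roughly as follows; the class and task names are hypothetical, not the actual PEER callback API:

```python
class EvaluationCallback:
    """Toy callback exposing evaluation as a directly callable method,
    so it can be invoked on demand rather than only at validation epochs."""

    def __init__(self, tasks):
        # tasks: mapping of task name -> function(model) -> metric value
        self.tasks = tasks

    def evaluate(self, model):
        """Run every task and return a {task: metric} dict."""
        return {task: fn(model) for task, fn in self.tasks.items()}

# Hypothetical usage with a stubbed metric function:
cb = EvaluationCallback({"stability": lambda m: 0.82})
results = cb.evaluate(None)  # → {'stability': 0.82}
```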
Reviewed Changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lobster/datasets/_latent_generator_3d_coordinates_dataset.py | Added a new "columns" parameter to allow dynamic key selection based on provided columns. |
| src/lobster/datasets/_atomica_dataset.py | Introduced a new dataset class for Atomica with detailed documentation and custom getitem behavior. |
| src/lobster/constants/_peer_tasks.py | Updated task definitions to reflect binary/multiclass settings and improved clarity for peer tasks. |
| src/lobster/callbacks/_peer_evaluation_callback.py | Refactored evaluation logic with new helper methods, improved metric logging, and added an evaluate method. |
| src/lobster/callbacks/_dataloader_checkpoint_callback.py | Added a callback to checkpoint dataloader states, including S3 upload support. |
| Assets & tokenizers files | Updated special tokens and tokenizer configuration to replace deprecated tokens and adjust pre-tokenizer patterns. |
| slurm/scripts/train_ume.sh | Adjusted SLURM settings for resource allocation and runtime format. |
| pyproject.toml | Updated dependency constraints with new versions and added dotenv, ensuring compatibility with flash-attn releases. |
Comments suppressed due to low confidence (1)
pyproject.toml:122
- [nitpick] Ensure that the boolean portion of the wheel filename ('cxx11abiFALSE') is intended, as inconsistent casing (FALSE vs False) might cause confusion or deployment issues.
```toml
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl", marker = "sys_platform == 'linux' and python_version == '3.11'"},
```
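For context on the casing nitpick: the 'cxx11abiFALSE' segment sits inside the wheel's local version tag as published by the flash-attention project, so the string must match the released filename byte for byte. A quick parse (illustrative only):

```python
# Split the wheel filename at the PEP 440 local-version separator ("+")
# to isolate the build-variant tag that contains "cxx11abiFALSE".
wheel = "flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
local_version = wheel.split("+", 1)[1].split("-", 1)[0]
# local_version → "cu12torch2.5cxx11abiFALSE"
```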
```python
if len(x) == 1:
    x = x[0]
```

Flattening the tuple when it contains a single element may lead to inconsistent return types compared to when multiple elements are present. Consider always returning a tuple to maintain a consistent API. Suggested change: remove the two lines above.
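A minimal sketch of the two behaviors the comment contrasts (function names hypothetical):

```python
def flatten_singleton(x):
    """Original behavior: unwrap a single-element tuple, so callers get
    either a bare element or a tuple depending on length."""
    return x[0] if len(x) == 1 else x

def keep_tuple(x):
    """Suggested behavior: always return a tuple, one consistent type."""
    return tuple(x)
```

With `keep_tuple`, callers can always iterate or index uniformly instead of branching on the return type.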
```python
self.scheduler,
optimizer,
num_training_steps=self._num_training_steps,
num_warmup_steps=self._num_warmup_steps,
```
this might fail outside of specific schedulers
@ncfrey Just checked: if a scheduler doesn't use num_warmup_steps or num_training_steps, it simply ignores them, so this is safe: https://github.com/huggingface/transformers/blob/5f4ecf2d9f867a1255131d2461d75793c0cf1db2/src/transformers/optimization.py#L513
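The behavior described above, forwarding only the keyword arguments a scheduler factory actually declares, can be sketched like this (an illustrative pattern, not the transformers implementation, and all names are hypothetical):

```python
import inspect

def call_with_supported_kwargs(fn, **kwargs):
    """Drop any kwargs the target function does not declare, then call it."""
    params = inspect.signature(fn).parameters
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(**supported)

def constant_schedule(optimizer):
    # A scheduler factory that declares no warmup/training-step arguments;
    # the extra kwargs below are silently filtered out before the call.
    return ("constant", optimizer)

result = call_with_supported_kwargs(
    constant_schedule, optimizer="opt", num_warmup_steps=100, num_training_steps=1000
)
# → ("constant", "opt")
```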
* add <cls_modality> tokens
* add <cls_modality> tokens
* docstring
* add <cls_modality> tokens
* add <cls_modality> tokens
* modality embeddings
* module dict
* embeddings
* tests
* modality and device
* rank zero only
* rank zero
* fix back modality mask
* sync dist
* RNS implementation
* restore from main
* restore
* docstrings
* docstrings
* review
* test
* add initial smiles to peptide and peptide to smiles transforms
* remove smiles -> * transforms and touch up conversion functions
* rename
* add option to randomize smiles and caps

Co-authored-by: Colin Grambow <grambowc@gene.com>