
Fix peer#70

Merged
karinazad merged 20 commits into main from fix-peer
May 14, 2025
Conversation

@taylormjs
Collaborator

  • Fixed several buggy peer tasks
  • Added an evaluate method to the peer callback for streamlined evaluation

Taylor Joren and others added 13 commits April 11, 2025 04:49
* dataloader callback

* utils

* ume

* gitignore dev

* tests
* lock

* torch 2.5

* torch 2.5

* part

* .env
* scheduler

* fix scheduler

* fix scheduler
Processed Atomica interactions dataset
…nizers (#65)

add two special tokens, <convert> and <interact>, for later stages of Ume training;
they will be used like this (or something similar):
[CLS]  PROT_SEQ  [SEP] <convert> PROT_STRUCT(masked)  [SEP]
[CLS]  PROT_SEQ  [SEP] <interact> SMILES(masked)  [SEP]
extend the functionality of UmeTokenizerTransform to handle dual modalities
rename the Ume embedding method and allow embedding from existing input_ids
fix existing tokenizers:

add a lowercase normalizer to the nucleotide tokenizer (the OG2 dataset contains a mix of upper- and lowercase letters)
BPE handled SMILES tokenization incorrectly; switch to WordLevel
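The WordLevel switch can be illustrated with a minimal sketch (this is not lobster's actual tokenizer; the regex below is a simplified, assumed pattern): learned subword merges or naive character splits can break multi-character atom symbols like Cl and Br, or bracket atoms like [NH3+], whereas a regex-driven word-level split keeps them atomic.

```python
import re

# Simplified, assumed SMILES token pattern -- multi-character alternatives
# (bracket atoms, Cl, Br, Si) must come before the single-character classes.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[BCNOPSFIbcnops]|\d|\(|\)|=|#|\+|-|/|\\|\.|@)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful word-level tokens."""
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(list("CC(=O)Cl"))             # a char split breaks 'Cl' into 'C', 'l'
```

The key design point is alternation order: because `Cl` appears before the single-atom character class, acyl chloride's chlorine stays a single token instead of being split into carbon plus a stray `l`.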
* tokenizer

* fix tests

* lowercase normalizer for nt

* tests

* remove mod conv dataset

* embed

* Test

* merge 2mod into UmeTokenizerTransform

* fix tests

* all

* type hints

* docstrings

* tests

* fix SMILES tokenizer

* switch all tokenizer to BPE

* Revert "switch all tokenizer to BPE"

This reverts commit 367e77d.

* tok

* fix SMILES tokenizer

* remove print statement
* pplx

* tests

* src

* ignore torchmetrics warnings

* docstrings

* docstrings
* pplx as attr

* pplx as attr

* pplx

* comments

* on step

* comment
@taylormjs taylormjs temporarily deployed to test.pypi.org May 6, 2025 16:21 — with GitHub Actions Inactive
@taylormjs taylormjs temporarily deployed to test.pypi.org May 6, 2025 16:22 — with GitHub Actions Inactive
Collaborator Author

Main file to review

Collaborator Author

Updated peer constants too

Collaborator Author

Updated peer tests

@taylormjs taylormjs requested review from karinazad and ncfrey May 6, 2025 17:08
@taylormjs
Collaborator Author

@karinazad @ncfrey Fixed the peer callback and added an evaluate method. I was thinking of pushing this first and then integrating it into the command line; lobster-evaluate-ume in a separate MR
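The value of a standalone evaluate method can be sketched as follows (names and signatures here are assumptions for illustration, not the actual lobster PEER callback API): the same evaluation logic becomes callable both from a trainer hook and directly, which is what makes a later lobster-evaluate-ume command-line entry point straightforward.

```python
# Hypothetical sketch: a callback whose trainer hook delegates to a public
# evaluate() method, so evaluation can also run outside the training loop.
class PeerEvaluationCallbackSketch:
    def __init__(self, tasks):
        # tasks: mapping of task name -> callable(module) -> metric value
        self.tasks = tasks

    def evaluate(self, module) -> dict[str, float]:
        """Run all peer tasks against `module` and return metrics by name."""
        return {name: task_fn(module) for name, task_fn in self.tasks.items()}

    def on_validation_epoch_end(self, trainer, module):
        # The trainer hook just delegates to evaluate() and logs the result.
        for name, value in self.evaluate(module).items():
            print(f"{name}: {value:.3f}")

callback = PeerEvaluationCallbackSketch({"stability": lambda m: 0.82})
print(callback.evaluate(object()))  # {'stability': 0.82}
```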

# Move tokenized tensors onto the module's device before embedding
x = {k: v.to(pl_module.device) for k, v in tokenized_inputs.items() if isinstance(v, Tensor)}

# Extract embeddings
embeddings = pl_module.model.tokens_to_latents(**x)
Collaborator

I added an embed method to Ume that takes input_ids directly, so it could be used here instead

https://github.com/prescient-design/lobster/blob/main/src/lobster/model/_ume.py#L292
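The shape of that refactor can be sketched with stubs (the classes and method signatures below are assumptions for illustration only, not the actual Ume API): the callback stops reaching into the model and calls a single embedding entry point.

```python
# Hypothetical stubs standing in for the real model, to show the refactor shape.
class StubUmeModel:
    def tokens_to_latents(self, input_ids, attention_mask):
        # Pretend "latents": one float per token id.
        return [float(i) for i in input_ids]

class StubUme:
    def __init__(self):
        self.model = StubUmeModel()

    def embed(self, inputs):
        # Assumed convenience wrapper: one entry point that could also handle
        # device placement internally, instead of the callback doing it.
        return self.model.tokens_to_latents(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )

tokenized = {"input_ids": [5, 7, 9], "attention_mask": [1, 1, 1]}

# Before: the callback reaches into the model internals.
before = StubUme().model.tokens_to_latents(**tokenized)
# After: the callback uses the single embed entry point.
after = StubUme().embed(tokenized)
assert before == after
```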

Collaborator

the ignore index is -100 for padding, etc.
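The -100 convention is the default ignore_index of torch's cross-entropy loss; a minimal stdlib sketch (not the PR's actual metric code) shows how positions labeled -100, such as padding, are excluded from a metric:

```python
# -100 is the conventional "ignore this position" label (torch's
# cross_entropy default ignore_index); such positions are excluded
# from both the numerator and denominator of the metric.
IGNORE_INDEX = -100

def masked_accuracy(predictions, labels):
    """Accuracy over positions whose label is not the ignore index."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not kept:
        return 0.0
    return sum(p == l for p, l in kept) / len(kept)

labels = [3, 7, IGNORE_INDEX, IGNORE_INDEX]   # last two are padding
preds  = [3, 2, 9, 9]
print(masked_accuracy(preds, labels))  # 0.5 -- padding positions don't count
```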

@ncfrey ncfrey requested a review from Copilot May 8, 2025 17:48
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses multiple issues with peer tasks and evaluation while also adding new features and updates. The key changes include:

  • Fixes to peer task definitions and classification types.
  • Addition of the AtomicaDataset and improvements to the evaluation callback (including a new evaluate method and enhanced metric logging).
  • Updates to dataset handling, S3 utilities, tokenizers, SLURM scripts, and dependency constraints.

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/lobster/datasets/_latent_generator_3d_coordinates_dataset.py: Added a new "columns" parameter to allow dynamic key selection based on provided columns.
  • src/lobster/datasets/_atomica_dataset.py: Introduced a new dataset class for Atomica with detailed documentation and custom __getitem__ behavior.
  • src/lobster/constants/_peer_tasks.py: Updated task definitions to reflect binary/multiclass settings and improved clarity for peer tasks.
  • src/lobster/callbacks/_peer_evaluation_callback.py: Refactored evaluation logic with new helper methods, improved metric logging, and added an evaluate method.
  • src/lobster/callbacks/_dataloader_checkpoint_callback.py: Added a callback to checkpoint dataloader states, including S3 upload support.
  • Assets & tokenizer files: Updated special tokens and tokenizer configuration to replace deprecated tokens and adjust pre-tokenizer patterns.
  • slurm/scripts/train_ume.sh: Adjusted SLURM settings for resource allocation and runtime format.
  • pyproject.toml: Updated dependency constraints with new versions and added dotenv, ensuring compatibility with flash-attn releases.
Comments suppressed due to low confidence (1)

pyproject.toml:122

  • [nitpick] Ensure that the boolean portion of the wheel filename ('cxx11abiFALSE') is intended, as inconsistent casing (FALSE vs False) might cause confusion or deployment issues.
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl", marker = "sys_platform == 'linux' and python_version == '3.11'"},

Comment on lines +141 to +143
if len(x) == 1:
x = x[0]

Copilot AI May 8, 2025

Flattening the tuple when it contains a single element may lead to inconsistent return types compared to when multiple elements are present. Consider always returning a tuple to maintain a consistent API.

Suggested change
if len(x) == 1:
x = x[0]

self.scheduler,
optimizer,
num_training_steps=self._num_training_steps,
num_warmup_steps=self._num_warmup_steps,
Contributor

this might fail outside of specific schedulers

Collaborator Author

@ncfrey Just checked - if a scheduler doesn't use num_warmup_steps or num_training_steps, it just ignores them. So this is safe: https://github.com/huggingface/transformers/blob/5f4ecf2d9f867a1255131d2461d75793c0cf1db2/src/transformers/optimization.py#L513
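The pattern in the linked transformers code can be sketched in simplified form (this is not the real get_scheduler implementation; names are illustrative): every scheduler factory accepts the common keyword arguments, and any it doesn't need are simply ignored, so the caller can always pass both step counts.

```python
# Simplified sketch of the dispatch pattern: factories share one signature
# and silently ignore arguments they don't use.
def constant_schedule(optimizer, num_warmup_steps=None, num_training_steps=None):
    # Uses neither step count; passing them is harmless.
    return f"constant({optimizer})"

def linear_schedule(optimizer, num_warmup_steps=None, num_training_steps=None):
    return f"linear({optimizer}, warmup={num_warmup_steps}, total={num_training_steps})"

SCHEDULERS = {"constant": constant_schedule, "linear": linear_schedule}

def get_scheduler(name, optimizer, num_warmup_steps=None, num_training_steps=None):
    # Callers pass both step counts unconditionally, whatever the scheduler.
    return SCHEDULERS[name](optimizer, num_warmup_steps=num_warmup_steps,
                            num_training_steps=num_training_steps)

print(get_scheduler("constant", "sgd", num_warmup_steps=100, num_training_steps=1000))
print(get_scheduler("linear", "sgd", num_warmup_steps=100, num_training_steps=1000))
```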

self.scheduler,
optimizer,
num_training_steps=self._num_training_steps,
num_warmup_steps=self._num_warmup_steps,
Contributor

same as above

karinazad added 3 commits May 14, 2025 01:40
* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring
* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test
* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist
cgrambow and others added 2 commits May 14, 2025 01:40
* add initial smiles to peptide and peptide to smiles transforms

* remove smiles -> * transforms and touch up conversion functions

* rename

* add option to randomize smiles and caps

---------

Co-authored-by: Colin Grambow <grambowc@gene.com>
@karinazad karinazad merged commit 04b32e5 into main May 14, 2025
5 checks passed
@karinazad karinazad deleted the fix-peer branch May 14, 2025 02:04