Conversation
taylormjs commented on May 6, 2025
- Fixed several peer tasks that were buggy
- Added evaluate method to peer callback for streamlined eval
* dataloader callback
* utils
* ume
* gitignore dev
* tests
* lock
* torch 2.5
* torch 2.5
* part
* .env
* scheduler
* fix scheduler
* fix scheduler
Processed Atomica interactions dataset
…nizers (#65)

Add two special tokens, <convert> and <interact>, for later stages of Ume training. They will be used like this (or something like that):

[CLS] PROT_SEQ [SEP] <convert> PROT_STRUCT(masked) [SEP]
[CLS] PROT_SEQ [SEP] <interact> SMILES(masked) [SEP]

- Extend functionality of UmeTokenizerTransform to handle dual modalities
- Change the name of the Ume embedding method and allow embedding from existing input_ids
- Fix existing tokenizers: add a lowercase normalizer to the nucleotide tokenizer (the OG2 dataset contains a mix of upper- and lowercase letters)
- BPE handled SMILES tokenization incorrectly; switch to WordLevel

* tokenizer
* fix tests
* lowercase normalizer for nt
* tests
* remove mod conv dataset
* embed
* Test
* merge 2mod into UmeTokenizerTransform
* fix tests
* all
* type hints
* docstrings
* tests
* fix SMILES tokenizer
* switch all tokenizer to BPE
* Revert "switch all tokenizer to BPE" (reverts commit 367e77d)
* tok
* fix SMILES tokenizer
* remove print statement
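The bracketed dual-modality layout described above can be sketched as a simple string template. This helper is purely illustrative (not the actual UmeTokenizerTransform); the function name and example sequences are hypothetical:

```python
def build_dual_modality_input(seq_a: str, seq_b: str, task_token: str) -> str:
    """Join two modality sequences with a task token such as <convert> or <interact>."""
    return f"[CLS] {seq_a} [SEP] {task_token} {seq_b} [SEP]"

# A protein sequence paired with a (masked) SMILES string for the interaction task:
example = build_dual_modality_input("MKTAYIAK", "CCO", "<interact>")
# → "[CLS] MKTAYIAK [SEP] <interact> CCO [SEP]"
```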
* pplx
* tests
* src
* ignore torchmetrics warnings
* docstrings
* docstrings
* pplx as attr
* pplx as attr
* pplx
* comments
* on step
* comment
Main file to review
Updated peer constants too
Updated peer tests
@karinazad @ncfrey Fixed the peer callback and added an evaluate method. I was thinking of pushing this first and then integrating into cmdline; lobster-evaluate-ume in a separate MR.
```python
x = {k: v.to(pl_module.device) for k, v in tokenized_inputs.items() if isinstance(v, Tensor)}

# Extract embeddings
embeddings = pl_module.model.tokens_to_latents(**x)
```
I added an embed method to Ume which takes input_ids directly, so it could be used here instead:
https://github.com/prescient-design/lobster/blob/main/src/lobster/model/_ume.py#L292
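For illustration, here is the filter-and-move-to-device pattern from the snippet above, with a torch-free FakeTensor stand-in so it runs anywhere; all names here are hypothetical, not the lobster code:

```python
class FakeTensor:
    """Torch-free stand-in for torch.Tensor, used only for illustration."""
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        # Mimic Tensor.to(device) by returning a copy on the target device.
        return FakeTensor(device)

tokenized_inputs = {"input_ids": FakeTensor(), "modality": "protein"}
device = "cuda:0"
# Keep only tensor-like values and move them to the module's device;
# non-tensor entries (e.g. the modality string) are dropped.
x = {k: v.to(device) for k, v in tokenized_inputs.items() if isinstance(v, FakeTensor)}
```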
The ignore index is -100, used for padding and other positions excluded from the loss.
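A quick sketch of the -100 convention: positions labeled -100 are skipped when computing token-level metrics, mirroring the default ignore_index of torch.nn.CrossEntropyLoss. The helper below is illustrative only, not the callback's actual code:

```python
IGNORE_INDEX = -100  # default ignore_index used by torch.nn.CrossEntropyLoss

def masked_token_accuracy(predictions, labels):
    """Accuracy over positions whose label is not the ignore index."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not kept:
        return 0.0
    return sum(p == l for p, l in kept) / len(kept)

# The -100 at position 2 (e.g. a padding token) is excluded from the count.
acc = masked_token_accuracy([1, 2, 3, 4], [1, 2, -100, 0])
```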
Pull Request Overview
This PR addresses multiple issues with peer tasks and evaluation while also adding new features and updates. The key changes include:
- Fixes to peer task definitions and classification types.
- Addition of the AtomicaDataset and improvements to the evaluation callback (including a new evaluate method and enhanced metric logging).
- Updates to dataset handling, S3 utilities, tokenizers, SLURM scripts, and dependency constraints.
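The "evaluate method on the callback" idea listed above can be sketched roughly as follows; the class and task names are hypothetical, not the actual PEER callback API:

```python
class EvaluationCallback:
    """Toy callback exposing evaluation as a directly callable method,
    so it can be invoked on demand rather than only at validation epochs."""

    def __init__(self, tasks):
        # tasks: mapping of task name -> function(model) -> metric value
        self.tasks = tasks

    def evaluate(self, model):
        """Run every task and return a {task: metric} dict."""
        return {task: fn(model) for task, fn in self.tasks.items()}

# Hypothetical usage with a stubbed metric function:
cb = EvaluationCallback({"stability": lambda m: 0.82})
results = cb.evaluate(None)  # → {'stability': 0.82}
```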
Reviewed Changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lobster/datasets/_latent_generator_3d_coordinates_dataset.py | Added a new "columns" parameter to allow dynamic key selection based on provided columns. |
| src/lobster/datasets/_atomica_dataset.py | Introduced a new dataset class for Atomica with detailed documentation and custom getitem behavior. |
| src/lobster/constants/_peer_tasks.py | Updated task definitions to reflect binary/multiclass settings and improved clarity for peer tasks. |
| src/lobster/callbacks/_peer_evaluation_callback.py | Refactored evaluation logic with new helper methods, improved metric logging, and added an evaluate method. |
| src/lobster/callbacks/_dataloader_checkpoint_callback.py | Added a callback to checkpoint dataloader states, including S3 upload support. |
| Assets & tokenizers files | Updated special tokens and tokenizer configuration to replace deprecated tokens and adjust pre-tokenizer patterns. |
| slurm/scripts/train_ume.sh | Adjusted SLURM settings for resource allocation and runtime format. |
| pyproject.toml | Updated dependency constraints with new versions and added dotenv, ensuring compatibility with flash-attn releases. |
Comments suppressed due to low confidence (1)
pyproject.toml:122
- [nitpick] Ensure that the boolean portion of the wheel filename ('cxx11abiFALSE') is intended, as inconsistent casing (FALSE vs False) might cause confusion or deployment issues.
```toml
{ url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl", marker = "sys_platform == 'linux' and python_version == '3.11'"},
```
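For context on the casing nitpick: the 'cxx11abiFALSE' segment sits inside the wheel's local version tag as published by the flash-attention project, so the string must match the released filename byte for byte. A quick parse (illustrative only):

```python
# Split the wheel filename at the PEP 440 local-version separator ("+")
# to isolate the build-variant tag that contains "cxx11abiFALSE".
wheel = "flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
local_version = wheel.split("+", 1)[1].split("-", 1)[0]
# local_version → "cu12torch2.5cxx11abiFALSE"
```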
```python
if len(x) == 1:
    x = x[0]
```

Flattening the tuple when it contains a single element may lead to inconsistent return types compared to when multiple elements are present. Consider always returning a tuple to maintain a consistent API. Suggested change: remove the two lines above.
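A minimal sketch of the two behaviors the comment contrasts (function names hypothetical):

```python
def flatten_singleton(x):
    """Original behavior: unwrap a single-element tuple, so callers get
    either a bare element or a tuple depending on length."""
    return x[0] if len(x) == 1 else x

def keep_tuple(x):
    """Suggested behavior: always return a tuple, one consistent type."""
    return tuple(x)
```

With `keep_tuple`, callers can always iterate or index uniformly instead of branching on the return type.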
```python
self.scheduler,
optimizer,
num_training_steps=self._num_training_steps,
num_warmup_steps=self._num_warmup_steps,
```
this might fail outside of specific schedulers
@ncfrey Just checked: if a scheduler doesn't use num_warmup_steps or num_training_steps, it simply ignores them, so this is safe: https://github.com/huggingface/transformers/blob/5f4ecf2d9f867a1255131d2461d75793c0cf1db2/src/transformers/optimization.py#L513
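The behavior described above, forwarding only the keyword arguments a scheduler factory actually declares, can be sketched like this (an illustrative pattern, not the transformers implementation, and all names are hypothetical):

```python
import inspect

def call_with_supported_kwargs(fn, **kwargs):
    """Drop any kwargs the target function does not declare, then call it."""
    params = inspect.signature(fn).parameters
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(**supported)

def constant_schedule(optimizer):
    # A scheduler factory that declares no warmup/training-step arguments;
    # the extra kwargs below are silently filtered out before the call.
    return ("constant", optimizer)

result = call_with_supported_kwargs(
    constant_schedule, optimizer="opt", num_warmup_steps=100, num_training_steps=1000
)
# → ("constant", "opt")
```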
* add <cls_modality> tokens
* add <cls_modality> tokens
* docstring
* add <cls_modality> tokens
* add <cls_modality> tokens
* modality embeddings
* module dict
* embeddings
* tests
* modality and device
* rank zero only
* rank zero
* fix back modality mask
* sync dist
* RNS implementation
* restore from main
* restore
* docstrings
* docstrings
* review
* test
* add initial smiles to peptide and peptide to smiles transforms
* remove smiles -> * transforms and touch up conversion functions
* rename
* add option to randomize smiles and caps

Co-authored-by: Colin Grambow <grambowc@gene.com>