
Add <cls_modality> to Ume tokenizers #71

Merged
karinazad merged 3 commits into main from
ume-tokenizer-cls-modality
May 9, 2025

Conversation


@karinazad karinazad commented May 7, 2025

For single modality tracks:

"<cls_amino_acid> V Y F <eos>"
"<cls_smiles> C C O <eos>"
"<cls_nucleotide> a c g t <eos>"

For conversion/interaction tasks:

<cls_convert> <cls_nucleotide> a g u <sep> <cls_amino_acid> S <sep> <pad> <pad>
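The layout above can be sketched in plain Python. The helper names below are illustrative only, not part of the actual Ume API; padding is left to the caller:

```python
def add_modality_cls(tokens: list[str], modality: str) -> list[str]:
    """Prepend a modality-specific CLS token and append <eos>."""
    return [f"<cls_{modality}>"] + tokens + ["<eos>"]

def build_conversion(src: list[str], src_mod: str,
                     tgt: list[str], tgt_mod: str) -> list[str]:
    """Join two modality tracks for a conversion task (padding omitted)."""
    return (["<cls_convert>", f"<cls_{src_mod}>"] + src + ["<sep>"]
            + [f"<cls_{tgt_mod}>"] + tgt + ["<sep>"])

print(add_modality_cls(["V", "Y", "F"], "amino_acid"))
# ['<cls_amino_acid>', 'V', 'Y', 'F', '<eos>']
```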

@karinazad karinazad requested a review from ncfrey May 7, 2025 15:28
@ncfrey ncfrey requested a review from Copilot May 7, 2025 16:06

Copilot AI left a comment


Pull Request Overview

This PR adds support for modality-specific CLS tokens and updates reserved token naming in the Ume tokenizer implementations. Key changes include renaming latent generator tokens to 3D coordinate tokens, adding modality-specific CLS tokens across tokenizers, and updating tests and asset JSON files accordingly.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Reviewed files:

  • tests/lobster/tokenization/test__ume_tokenizers.py: updates reserved token names and test expectations; renames test_ume_amino_acid_tokenizer for clarity.
  • src/lobster/tokenization/_ume_tokenizers.py: modifies special token definitions, reserved token generation, and post-processor configuration; renames functions and parameters to support modality-specific tokens.
  • src/lobster/assets/ume_tokenizers/*: updates tokenizer configuration and special tokens map files to reflect the new CLS token names and reserved token ordering.
Comments suppressed due to low confidence (1)

src/lobster/tokenization/_ume_tokenizers.py:334

  • The function was renamed to _make_3d_coordinates_tokenizer_fast while the class remains named UmeLatentGenerator3DCoordTokenizerFast. Consider renaming the class to maintain a consistent naming convention for 3D coordinate tokenizers.
def _make_3d_coordinates_tokenizer_fast(vocab: list[str]) -> PreTrainedTokenizerFast:
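A factory with this signature could be sketched with the Hugging Face tokenizers library roughly as follows. This is an assumption about the general shape (WordLevel model over a flat vocab list), not the actual Ume implementation, and the function name here is a stand-in:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

def make_tokenizer_fast(vocab: list[str]) -> PreTrainedTokenizerFast:
    """Hypothetical sketch: wrap a whitespace WordLevel model as a fast tokenizer."""
    # WordLevel needs a token -> id dict, so assign sequential ids.
    model = WordLevel({tok: i for i, tok in enumerate(vocab)}, unk_token="<unk>")
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = WhitespaceSplit()
    return PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="<unk>",
        eos_token="<eos>",
    )

tok = make_tokenizer_fast(["<unk>", "<eos>", "gly", "ala"])
print(tok.encode("gly ala", add_special_tokens=False))  # [2, 3]
```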

@karinazad karinazad temporarily deployed to test.pypi.org May 9, 2025 13:23 with GitHub Actions
@karinazad karinazad merged commit 4c1ea1e into main May 9, 2025
5 checks passed
@karinazad karinazad deleted the ume-tokenizer-cls-modality branch May 9, 2025 13:28
taylormjs pushed a commit that referenced this pull request May 14, 2025
* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring
karinazad added a commit that referenced this pull request May 14, 2025
* peer fixes, add evaluate method

* dataloader checkpoint callback (#60)

* dataloader callback

* utils

* ume

* gitignore dev

* tests

* update flash attention wheels (#61)

* lock

* torch 2.5

* torch 2.5

* part

* .env

* unpin flash attn (#62)

* fix scheduler params (#64)

* scheduler

* fix scheduler

* fix scheduler

* Add AtomicaDataset (#63)

Processed Atomica interactions dataset

* Ume conversion/interaction tokenizer + fix SMILES and nucleotide tokenizers (#65)

Add two special tokens, <convert> and <interact>, for later stages of Ume training; they will be used roughly like this:
[CLS]  PROT_SEQ  [SEP] <convert> PROT_STRUCT(masked)  [SEP]
[CLS]  PROT_SEQ  [SEP] <interact> SMILES(masked)  [SEP] 
extend functionality of UmeTokenizerTransform to handle dual modalities
change the name of Ume embedding method and allow embedding from existing input_ids
fix existing tokenizers:

add lowercase normalizer to nucleotide tokenizer (the OG2 dataset contains a mix of upper- and lowercase letters)
BPE handled SMILES tokenization incorrectly; switch to WordLevel
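The WordLevel switch mentioned above can be illustrated with the tokenizers library. The vocabulary below is a toy assumption, not the real Ume SMILES vocab:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit

# Toy vocabulary: every pre-split SMILES symbol maps to exactly one id,
# so no learned merges can fuse atoms the way a BPE model might.
vocab = {"<unk>": 0, "<cls_smiles>": 1, "<eos>": 2, "C": 3, "O": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tokenizer.pre_tokenizer = WhitespaceSplit()

print(tokenizer.encode("C C O").ids)  # [3, 3, 4]
```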

* Ume SMILES tokenizer fix (#66)

* tokenizer

* fix tests

* lowercase normalizer for nt

* tests

* remove mod conv dataset

* embed

* Test

* merge 2mod into UmeTokenizerTransform

* fix tests

* all

* type hints

* docstrings

* tests

* fix SMILES tokenizer

* switch all tokenizer to BPE

* Revert "switch all tokenizer to BPE"

This reverts commit 367e77d.

* tok

* fix SMILES tokenizer

* remove print statement

* Ume perplexity logging (#67)

* pplx

* tests

* src

* ignore torchmetrics warnings

* docstrings

* docstrings

* Update README.md (#69)

* Ume fix perplexity device (#68)

* pplx as attr

* pplx as attr

* pplx

* comments

* on step

* comment

* update tests, fix ruff

* ruff

* ruff ruff

* Add <cls_modality> to Ume tokenizers (#71)

* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring

* RNS metric implementation  (#73)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test

* Ume modality-specific embeddings (#72)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* add conversion transforms (#74)

* add initial smiles to peptide and peptide to smiles transforms

* remove smiles -> * transforms and touch up conversion functions

* rename

* add option to randomize smiles and caps

---------

Co-authored-by: Colin Grambow <grambowc@gene.com>

* fix def pad token, replace process_and_embed w/ ume.embed

* update tests w -100 pad token

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Karina Zadorozhny <karina.zadorozhny@gmail.com>
Co-authored-by: Nathan Frey <ncfrey@users.noreply.github.com>
Co-authored-by: Colin Grambow <17198155+cgrambow@users.noreply.github.com>
Co-authored-by: Colin Grambow <grambowc@gene.com>