RNS metric implementation #73

Merged
karinazad merged 19 commits into main from ume-evaluation on May 13, 2025
Conversation


@karinazad karinazad requested a review from ncfrey May 9, 2025 16:47
```
RNS quantifies the fraction of non-biological (random) neighbors among the
k-nearest neighbors of an embedding. A higher RNS value indicates greater
uncertainty in the embedding's representation. Lower is better.
```
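The description above can be made concrete with a minimal sketch. This is not the PR's actual implementation; the function name and signature are illustrative:

```python
import torch


def random_neighbor_score(query, biological, random_ref, k=500, metric="euclidean"):
    """Fraction of random reference embeddings among each query's k nearest neighbors.

    Illustrative sketch only: pools biological and random reference embeddings,
    finds the k nearest references per query, and reports the mean fraction of
    random ones. Lower is better.
    """
    reference = torch.cat([biological, random_ref], dim=0)  # (n_bio + n_rand, d)
    if metric == "cosine":
        q = torch.nn.functional.normalize(query, dim=-1)
        r = torch.nn.functional.normalize(reference, dim=-1)
        dist = 1.0 - q @ r.T
    else:
        dist = torch.cdist(query, reference)  # euclidean
    k = min(k, reference.shape[0])
    nn_idx = dist.topk(k, largest=False).indices  # (n_query, k)
    # Random references were concatenated after the biological ones.
    is_random = nn_idx >= biological.shape[0]
    return is_random.float().mean()
```

For example, queries that sit near the biological cluster score near 0, while queries near the random cluster score near 1.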
Contributor:

can you add a Reference heading with the citation?

Prabakaran R, Bromberg Y. "Quantifying Uncertainty in Protein Representations Across Models and Tasks."
bioRxiv 2025.04.30.651545; doi: https://doi.org/10.1101/2025.04.30.651545

karinazad (Collaborator, Author):

100%! forgot to add this

```
random_embeddings : Tensor
    Reference embeddings from randomly generated non-biological sequences.
k : int, optional
    Number of nearest neighbors to consider, by default 500.
```
Contributor:

can you add any comment on a reasonable range of values to consider?


```
Parameters
----------
biological_embeddings : Tensor
```
Contributor:

any shape constraints on these?

karinazad (Collaborator, Author):

added

```
----------
biological_embeddings : Tensor
    Reference embeddings from biological sequences.
random_embeddings : Tensor
```
Contributor:

TODO: cache some reasonable default set

karinazad (Collaborator, Author):

Ideally, we'd just provide the name of the eval set (e.g. OAS, ZINC, etc.) and an embedding function. Then the random sequences would just be generated automatically with the same vocab. I'll leave that as a TODO
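That TODO could look roughly like the following sketch. The function name, signature, and vocabularies are hypothetical, not part of this PR:

```python
import random


def generate_random_sequences(vocab, lengths, seed=0):
    """Draw non-biological sequences by sampling uniformly from a vocabulary.

    Hypothetical sketch: `vocab` would be the token alphabet of the chosen
    eval set (e.g. amino acids for OAS, SMILES characters for ZINC), and
    `lengths` gives one target length per sequence, typically matched to the
    biological set's length distribution.
    """
    rng = random.Random(seed)  # seeded for reproducible reference sets
    return ["".join(rng.choices(vocab, k=n)) for n in lengths]
```

The generated strings would then be passed through the same embedding function as the biological sequences to build the random reference set.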

@ncfrey ncfrey requested a review from Copilot May 9, 2025 17:00
Copilot AI (Contributor) left a comment:
Pull Request Overview

This PR implements the Random Neighbor Score (RNS) metric based on a referenced paper, adding core functionality to compute the metric and corresponding tests.

  • Introduces the RandomNeighborScore class with support for cosine and euclidean distance metrics.
  • Implements balancing of reference embeddings and the computation logic for the metric.
  • Adds parameterized tests covering different scenarios for the metric.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File — Description
  • tests/lobster/metrics/test__random_neighbor_score.py — Provides tests for initialization and compute behavior.
  • src/lobster/metrics/_random_neighbor_score.py — Implements the RandomNeighborScore metric with balancing logic.
  • src/lobster/metrics/__init__.py — Exports the new RandomNeighborScore metric.
Comments suppressed due to low confidence (1)

tests/lobster/metrics/test__random_neighbor_score.py:8

  • Consider adding a test case to verify that the compute method raises a ValueError when no query embeddings have been added, ensuring full test coverage for error conditions.

```python
class TestRandomNeighborScore:
```

Comment on lines +70 to +74

```python
    self.biological_embeddings = self.biological_embeddings[indices]

elif n_rand > n_bio:
    indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
    self.random_embeddings = self.random_embeddings[indices]
```
Copilot AI (May 9, 2025):

[nitpick] Reassigning registered buffers in _balance_reference_sets may lead to potential issues with the registered state in torchmetrics; consider updating them in place or re-registering buffers after re-sampling for clarity and consistency.

Suggested change:

```python
    self.biological_embeddings[:n_rand] = self.biological_embeddings[indices]
    self.biological_embeddings = self.biological_embeddings[:n_rand]
elif n_rand > n_bio:
    indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
    self.random_embeddings[:n_bio] = self.random_embeddings[indices]
    self.random_embeddings = self.random_embeddings[:n_bio]
```
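For context on Copilot's buffer concern, here is a minimal self-contained sketch, with a plain `nn.Module` standing in for `torchmetrics.Metric` and illustrative class and method names. Reassigning the attribute keeps it registered as a buffer (`nn.Module.__setattr__` routes the assignment into the buffer dict), but the subsampling changes its shape, so state dicts saved before and after balancing become incompatible:

```python
import torch
from torch import nn


class Refs(nn.Module):
    """Sketch of balancing two registered reference buffers to equal size."""

    def __init__(self, bio, rand):
        super().__init__()
        self.register_buffer("biological_embeddings", bio)
        self.register_buffer("random_embeddings", rand)

    def balance(self):
        n_bio = self.biological_embeddings.shape[0]
        n_rand = self.random_embeddings.shape[0]
        if n_bio > n_rand:
            # Subsample the larger set; the attribute stays a registered
            # buffer after reassignment, but its shape changes.
            idx = torch.randperm(n_bio, device=self.biological_embeddings.device)[:n_rand]
            self.biological_embeddings = self.biological_embeddings[idx]
        elif n_rand > n_bio:
            idx = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
            self.random_embeddings = self.random_embeddings[idx]
```

After `balance()`, both buffers have the smaller set's row count, which is the shape mismatch the suggestion is trying to make explicit.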

@karinazad karinazad merged commit f9bfcf0 into main May 13, 2025
5 checks passed
@karinazad karinazad deleted the ume-evaluation branch May 13, 2025 01:28
taylormjs pushed a commit that referenced this pull request May 14, 2025
* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test
karinazad added a commit that referenced this pull request May 14, 2025
* peer fixes, add evaluate method

* dataloader checkpoint callback (#60)

* dataloader callback

* utils

* ume

* gitignore dev

* tests

* update flash attention wheels (#61)

* lock

* torch 2.5

* torch 2.5

* part

* .env

* unpin flash attn (#62)

* fix scheduler params (#64)

* scheduler

* fix scheduler

* fix scheduler

* Add AtomicaDataset (#63)

Processed Atomica interactions dataset

* Ume conversion/interaction tokenizer + fix SMILES and nucleotide tokenizers (#65)

add two special tokens: <convert> and <interact> for later stages of Ume training:
will be used as this: (or something like that)
[CLS]  PROT_SEQ  [SEP] <convert> PROT_STRUCT(masked)  [SEP]
[CLS]  PROT_SEQ  [SEP] <interact> SMILES(masked)  [SEP] 
extend functionality of UmeTokenizerTransform to handle dual modalities
change the name of Ume embedding method and allow embedding from existing input_ids
fix existing tokenizers:

add lowercase normalizer to nucleotide tokenizer (OG2 dataset contains a mix of upper- and lowercase letters)
BPE handled SMILES tokenization incorrectly, switch to WordLevel
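The BPE-vs-WordLevel point can be illustrated: naive character or subword splits can break multi-character SMILES tokens such as Cl, Br, or bracket atoms, while an atom-level regex keeps them intact. The pattern below is a widely used sketch, not the PR's actual tokenizer:

```python
import re

# Atom-level SMILES pattern: bracket atoms, two-letter halogens, stereo
# markers, ring-closure digits, bonds, and branches. Illustrative only.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|/|\\|:|~|@|\?|\*|\$|\d)"
)


def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Round-trip check: every character must be covered by some token.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

For example, chlorobenzene `c1ccccc1Cl` tokenizes with `Cl` as a single token rather than `C` + `l`, which is the behavior a byte-pair or character-level scheme does not guarantee.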

* Ume SMILES tokenizer fix (#66)

* tokenizer

* fix tests

* lowercase normalizer for nt

* tests

* remove mod conv dataset

* embed

* Test

* merge 2mod into UmeTokenizerTransform

* fix tests

* all

* type hints

* docstrings

* tests

* fix SMILES tokenizer

* switch all tokenizer to BPE

* Revert "switch all tokenizer to BPE"

This reverts commit 367e77d.

* tok

* fix SMILES tokenizer

* remove print statement

* Ume perplexity logging (#67)

* pplx

* tests

* src

* ignore torchmetrics warnings

* docstrings

* docstrings

* Update README.md (#69)

* Ume fix perplexity device (#68)

* pplx as attr

* pplx as attr

* pplx

* comments

* on step

* comment

* update tests, fix ruff

* ruff

* ruff ruff

* Add <cls_modality> to Ume tokenizers (#71)

* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring

* RNS metric implementation  (#73)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test

* Ume modality-specific embeddings (#72)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* add conversion transforms (#74)

* add initial smiles to peptide and peptide to smiles transforms

* remove smiles -> * transforms and touch up conversion functions

* rename

* add option to randomize smiles and caps

---------

Co-authored-by: Colin Grambow <grambowc@gene.com>

* fix def pad token, replace process_and_embed w/ ume.embed

* update tests w -100 pad token

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Karina Zadorozhny <karina.zadorozhny@gmail.com>
Co-authored-by: Nathan Frey <ncfrey@users.noreply.github.com>
Co-authored-by: Colin Grambow <17198155+cgrambow@users.noreply.github.com>
Co-authored-by: Colin Grambow <grambowc@gene.com>