RNS metric implementation #73

Merged
karinazad merged 19 commits into main from ume-evaluation on May 13, 2025
Conversation


@karinazad karinazad requested a review from ncfrey May 9, 2025 16:47
```
RNS quantifies the fraction of non-biological (random) neighbors among the
k-nearest neighbors of an embedding. A higher RNS value indicates greater
uncertainty in the embedding's representation. Lower is better.
```
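The description above can be made concrete with a minimal sketch. This is not the PR's actual implementation; the function name and signature are illustrative:

```python
import torch


def random_neighbor_score(query, biological, random_ref, k=500, metric="euclidean"):
    """Fraction of random reference embeddings among each query's k nearest neighbors.

    Illustrative sketch only: pools biological and random reference embeddings,
    finds the k nearest references per query, and reports the mean fraction of
    random ones. Lower is better.
    """
    reference = torch.cat([biological, random_ref], dim=0)  # (n_bio + n_rand, d)
    if metric == "cosine":
        q = torch.nn.functional.normalize(query, dim=-1)
        r = torch.nn.functional.normalize(reference, dim=-1)
        dist = 1.0 - q @ r.T
    else:
        dist = torch.cdist(query, reference)  # euclidean
    k = min(k, reference.shape[0])
    nn_idx = dist.topk(k, largest=False).indices  # (n_query, k)
    # Random references were concatenated after the biological ones.
    is_random = nn_idx >= biological.shape[0]
    return is_random.float().mean()
```

For example, queries that sit near the biological cluster score near 0, while queries near the random cluster score near 1.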
Contributor:

can you add a Reference heading with the citation?

Prabakaran R, Bromberg Y. "Quantifying Uncertainty in Protein Representations Across Models and Tasks."
bioRxiv 2025.04.30.651545; doi: https://doi.org/10.1101/2025.04.30.651545

karinazad (Collaborator, Author):

100%! forgot to add this

```
random_embeddings : Tensor
    Reference embeddings from randomly generated non-biological sequences.
k : int, optional
    Number of nearest neighbors to consider, by default 500.
```
Contributor:

can you add any comment on a reasonable range of values to consider?


```
Parameters
----------
biological_embeddings : Tensor
```
Contributor:

any shape constraints on these?

karinazad (Collaborator, Author):

added

```
----------
biological_embeddings : Tensor
    Reference embeddings from biological sequences.
random_embeddings : Tensor
```
Contributor:

TODO: cache some reasonable default set

karinazad (Collaborator, Author):

Ideally, we'd just provide the name of the eval set (e.g. OAS, ZINC, etc.) and an embedding function. Then the random sequences would just be generated automatically with the same vocab. I'll leave that as a TODO
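That TODO could look roughly like the following sketch. The function name, signature, and vocabularies are hypothetical, not part of this PR:

```python
import random


def generate_random_sequences(vocab, lengths, seed=0):
    """Draw non-biological sequences by sampling uniformly from a vocabulary.

    Hypothetical sketch: `vocab` would be the token alphabet of the chosen
    eval set (e.g. amino acids for OAS, SMILES characters for ZINC), and
    `lengths` gives one target length per sequence, typically matched to the
    biological set's length distribution.
    """
    rng = random.Random(seed)  # seeded for reproducible reference sets
    return ["".join(rng.choices(vocab, k=n)) for n in lengths]
```

The generated strings would then be passed through the same embedding function as the biological sequences to build the random reference set.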

@ncfrey ncfrey requested a review from Copilot May 9, 2025 17:00
Copilot AI (Contributor) left a comment:
Pull Request Overview

This PR implements the Random Neighbor Score (RNS) metric based on a referenced paper, adding core functionality to compute the metric and corresponding tests.

  • Introduces the RandomNeighborScore class with support for cosine and euclidean distance metrics.
  • Implements balancing of reference embeddings and the computation logic for the metric.
  • Adds parameterized tests covering different scenarios for the metric.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File — Description
  • tests/lobster/metrics/test__random_neighbor_score.py — Provides tests for initialization and compute behavior.
  • src/lobster/metrics/_random_neighbor_score.py — Implements the RandomNeighborScore metric with balancing logic.
  • src/lobster/metrics/__init__.py — Exports the new RandomNeighborScore metric.
Comments suppressed due to low confidence (1)

tests/lobster/metrics/test__random_neighbor_score.py:8

  • Consider adding a test case to verify that the compute method raises a ValueError when no query embeddings have been added, ensuring full test coverage for error conditions.

```python
class TestRandomNeighborScore:
```

Comment on lines +70 to +74

```python
    self.biological_embeddings = self.biological_embeddings[indices]

elif n_rand > n_bio:
    indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
    self.random_embeddings = self.random_embeddings[indices]
```
Copilot AI (May 9, 2025):

[nitpick] Reassigning registered buffers in _balance_reference_sets may lead to potential issues with the registered state in torchmetrics; consider updating them in place or re-registering buffers after re-sampling for clarity and consistency.

Suggested change:

```python
    self.biological_embeddings[:n_rand] = self.biological_embeddings[indices]
    self.biological_embeddings = self.biological_embeddings[:n_rand]
elif n_rand > n_bio:
    indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
    self.random_embeddings[:n_bio] = self.random_embeddings[indices]
    self.random_embeddings = self.random_embeddings[:n_bio]
```
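For context on Copilot's buffer concern, here is a minimal self-contained sketch, with a plain `nn.Module` standing in for `torchmetrics.Metric` and illustrative class and method names. Reassigning the attribute keeps it registered as a buffer (`nn.Module.__setattr__` routes the assignment into the buffer dict), but the subsampling changes its shape, so state dicts saved before and after balancing become incompatible:

```python
import torch
from torch import nn


class Refs(nn.Module):
    """Sketch of balancing two registered reference buffers to equal size."""

    def __init__(self, bio, rand):
        super().__init__()
        self.register_buffer("biological_embeddings", bio)
        self.register_buffer("random_embeddings", rand)

    def balance(self):
        n_bio = self.biological_embeddings.shape[0]
        n_rand = self.random_embeddings.shape[0]
        if n_bio > n_rand:
            # Subsample the larger set; the attribute stays a registered
            # buffer after reassignment, but its shape changes.
            idx = torch.randperm(n_bio, device=self.biological_embeddings.device)[:n_rand]
            self.biological_embeddings = self.biological_embeddings[idx]
        elif n_rand > n_bio:
            idx = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
            self.random_embeddings = self.random_embeddings[idx]
```

After `balance()`, both buffers have the smaller set's row count, which is the shape mismatch the suggestion is trying to make explicit.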

@karinazad karinazad merged commit f9bfcf0 into main May 13, 2025
5 checks passed
@karinazad karinazad deleted the ume-evaluation branch May 13, 2025 01:28
taylormjs pushed a commit that referenced this pull request May 14, 2025
* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test
karinazad added a commit that referenced this pull request May 14, 2025
* peer fixes, add evaluate method

* dataloader checkpoint callback (#60)

* dataloader callback

* utils

* ume

* gitignore dev

* tests

* update flash attention wheels (#61)

* lock

* torch 2.5

* torch 2.5

* part

* .env

* unpin flash attn (#62)

* fix scheduler params (#64)

* scheduler

* fix scheduler

* fix scheduler

* Add AtomicaDataset (#63)

Processed Atomica interactions dataset

* Ume conversion/interaction tokenizer + fix SMILES and nucleotide tokenizers (#65)

add two special tokens: <convert> and <interact> for later stages of Ume training:
will be used as this: (or something like that)
[CLS]  PROT_SEQ  [SEP] <convert> PROT_STRUCT(masked)  [SEP]
[CLS]  PROT_SEQ  [SEP] <interact> SMILES(masked)  [SEP] 
extend functionality of UmeTokenizerTransform to handle dual modalities
change the name of Ume embedding method and allow embedding from existing input_ids
fix existing tokenizers:

add lowercase normalizer to nucleotide tokenizer (OG2 dataset contains a mix of upper- and lowercase letters)
BPE handled SMILES tokenization incorrectly, switch to WordLevel
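The BPE-vs-WordLevel point can be illustrated: naive character or subword splits can break multi-character SMILES tokens such as Cl, Br, or bracket atoms, while an atom-level regex keeps them intact. The pattern below is a widely used sketch, not the PR's actual tokenizer:

```python
import re

# Atom-level SMILES pattern: bracket atoms, two-letter halogens, stereo
# markers, ring-closure digits, bonds, and branches. Illustrative only.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|/|\\|:|~|@|\?|\*|\$|\d)"
)


def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Round-trip check: every character must be covered by some token.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

For example, chlorobenzene `c1ccccc1Cl` tokenizes with `Cl` as a single token rather than `C` + `l`, which is the behavior a byte-pair or character-level scheme does not guarantee.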

* Ume SMILES tokenizer fix (#66)

* tokenizer

* fix tests

* lowercase normalizer for nt

* tests

* remove mod conv dataset

* embed

* Test

* merge 2mod into UmeTokenizerTransform

* fix tests

* all

* type hints

* docstrings

* tests

* fix SMILES tokenizer

* switch all tokenizer to BPE

* Revert "switch all tokenizer to BPE"

This reverts commit 367e77d.

* tok

* fix SMILES tokenizer

* remove print statement

* Ume perplexity logging (#67)

* pplx

* tests

* src

* ignore torchmetrics warnings

* docstrings

* docstrings

* Update README.md (#69)

* Ume fix perplexity device (#68)

* pplx as attr

* pplx as attr

* pplx

* comments

* on step

* comment

* update tests, fix ruff

* ruff

* ruff ruff

* Add <cls_modality> to Ume tokenizers (#71)

* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring

* RNS metric implementation  (#73)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test

* Ume modality-specific embeddings (#72)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* add conversion transforms (#74)

* add initial smiles to peptide and peptide to smiles transforms

* remove smiles -> * transforms and touch up conversion functions

* rename

* add option to randomize smiles and caps

---------

Co-authored-by: Colin Grambow <grambowc@gene.com>

* fix def pad token, replace process_and_embed w/ ume.embed

* update tests w -100 pad token

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Karina Zadorozhny <karina.zadorozhny@gmail.com>
Co-authored-by: Nathan Frey <ncfrey@users.noreply.github.com>
Co-authored-by: Colin Grambow <17198155+cgrambow@users.noreply.github.com>
Co-authored-by: Colin Grambow <grambowc@gene.com>