Conversation
…to ume-modality-embeddings
    RNS quantifies the fraction of non-biological (random) neighbors among the
    k-nearest neighbors of an embedding.
    A higher RNS value indicates greater uncertainty in the embedding's representation.
    Lower is better.
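As a reading aid, the definition above can be sketched directly. The function name `rns` and the euclidean `cdist`/`topk` nearest-neighbor search are illustrative assumptions, not the PR's actual `RandomNeighborScore` implementation (which also supports cosine distance and balances the reference sets):

```python
# Hedged sketch of the Random Neighbor Score: for each query embedding, find
# its k nearest neighbors in the union of biological and random reference
# embeddings, then report the mean fraction of those neighbors that are random.
import torch

def rns(query: torch.Tensor, biological: torch.Tensor,
        random_ref: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Mean fraction of random neighbors among each query's k nearest refs."""
    reference = torch.cat([biological, random_ref], dim=0)   # (n_bio + n_rand, d)
    is_random = torch.cat([
        torch.zeros(len(biological)),                        # 0.0 marks biological rows
        torch.ones(len(random_ref)),                         # 1.0 marks random rows
    ])
    distances = torch.cdist(query, reference)                # (n_query, n_ref) euclidean
    knn_idx = distances.topk(k, largest=False).indices       # k smallest distances per query
    return is_random[knn_idx].mean()                         # value in [0, 1]; lower is better
```

A query set drawn from the biological manifold should score near 0, while queries that resemble the random references should score near 1.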
can you add a Reference heading with the citation
Quantifying Uncertainty in Protein Representations Across Models and Tasks
R. Prabakaran, Y. Bromberg
bioRxiv 2025.04.30.651545; doi: https://doi.org/10.1101/2025.04.30.651545
100%! forgot to add this
    random_embeddings : Tensor
        Reference embeddings from randomly generated non-biological sequences.
    k : int, optional
        Number of nearest neighbors to consider, by default 500.
can you add a comment on a reasonable range of values to consider?
    Parameters
    ----------
    biological_embeddings : Tensor
any shape constraints on these?
    ----------
    biological_embeddings : Tensor
        Reference embeddings from biological sequences.
    random_embeddings : Tensor
TODO: cache some reasonable default set
Ideally, we'd just provide the name of the eval set (e.g. OAS, ZINC, etc.) and an embedding function. Then the random sequences would just be generated automatically with the same vocab. I'll leave that as a TODO
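A rough, hedged sketch of that TODO: the helper name `random_sequences_like` and the choice to sample characters uniformly from the observed vocabulary (while matching the eval set's length distribution) are assumptions, not anything in the repo.

```python
import random

def random_sequences_like(sequences: list[str], n: int, seed: int = 0) -> list[str]:
    """Sample n random strings using the character vocabulary and the
    length distribution of the given eval-set sequences (e.g. OAS, ZINC)."""
    rng = random.Random(seed)
    vocab = sorted(set("".join(sequences)))      # alphabet observed in the eval set
    lengths = [len(s) for s in sequences]        # empirical length distribution
    return [
        "".join(rng.choices(vocab, k=rng.choice(lengths)))
        for _ in range(n)
    ]
```

The resulting strings would then go through the same embedding function as the biological set to produce `random_embeddings`.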
Pull Request Overview
This PR implements the Random Neighbor Score (RNS) metric based on a referenced paper, adding core functionality to compute the metric and corresponding tests.
- Introduces the RandomNeighborScore class with support for cosine and euclidean distance metrics.
- Implements balancing of reference embeddings and the computation logic for the metric.
- Adds parameterized tests covering different scenarios for the metric.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/lobster/metrics/test__random_neighbor_score.py | Provides tests for initialization and compute behavior. |
| src/lobster/metrics/_random_neighbor_score.py | Implements the RandomNeighborScore metric with balancing logic. |
| src/lobster/metrics/__init__.py | Exports the new RandomNeighborScore metric. |
Comments suppressed due to low confidence (1)
tests/lobster/metrics/test__random_neighbor_score.py:8
- Consider adding a test case to verify that the compute method raises a ValueError when no query embeddings have been added, ensuring full test coverage for error conditions.
class TestRandomNeighborScore:
            self.biological_embeddings = self.biological_embeddings[indices]

        elif n_rand > n_bio:
            indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
            self.random_embeddings = self.random_embeddings[indices]
[nitpick] Reassigning registered buffers in _balance_reference_sets may lead to potential issues with the registered state in torchmetrics; consider updating them in place or re-registering buffers after re-sampling for clarity and consistency.
Suggested change:

    -        self.biological_embeddings = self.biological_embeddings[indices]
    +        self.biological_embeddings[:n_rand] = self.biological_embeddings[indices]
    +        self.biological_embeddings = self.biological_embeddings[:n_rand]
         elif n_rand > n_bio:
             indices = torch.randperm(n_rand, device=self.random_embeddings.device)[:n_bio]
    -        self.random_embeddings = self.random_embeddings[indices]
    +        self.random_embeddings[:n_bio] = self.random_embeddings[indices]
    +        self.random_embeddings = self.random_embeddings[:n_bio]
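For illustration, the in-place pattern the nitpick and suggestion point at can be shown on a plain `nn.Module`. This is a simplification: the PR's class is a torchmetrics `Metric`, and the method name `subsample_` is invented.

```python
import torch
from torch import nn

class BalancedRefs(nn.Module):
    def __init__(self, random_embeddings: torch.Tensor):
        super().__init__()
        self.register_buffer("random_embeddings", random_embeddings)

    def subsample_(self, n_keep: int) -> None:
        # Write the sampled rows into the registered storage first, then
        # re-register the narrowed view so the buffer dict stays consistent.
        idx = torch.randperm(len(self.random_embeddings))[:n_keep]
        self.random_embeddings[:n_keep] = self.random_embeddings[idx]
        self.register_buffer("random_embeddings", self.random_embeddings[:n_keep])
```

Writing through the registered storage before re-registering the narrowed view keeps `state_dict()` and `named_buffers()` aligned with the attribute, which is the consistency concern the nitpick raises.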
Commits:
- add <cls_modality> tokens
- add <cls_modality> tokens
- modality embeddings
- module dict
- embeddings
- tests
- modality and device
- rank zero only
- rank zero
- fix back modality mask
- sync dist
- RNS implementation
- restore from main
- restore
- docstrings
- docstrings
- review
- test
Commits:
- peer fixes, add evaluate method
- dataloader checkpoint callback (#60)
- dataloader callback
- utils
- ume
- gitignore dev
- tests
- update flash attention wheels (#61)
- lock
- torch 2.5
- torch 2.5
- part
- .env
- unpin flash attn (#62)
- fix scheduler params (#64)
- scheduler
- fix scheduler
- fix scheduler
- Add AtomicaDataset (#63): Processed Atomica interactions dataset
- Ume conversion/interaction tokenizer + fix SMILES and nucleotide tokenizers (#65): add two special tokens, <convert> and <interact>, for later stages of Ume training; will be used as this (or something like that): [CLS] PROT_SEQ [SEP] <convert> PROT_STRUCT(masked) [SEP] and [CLS] PROT_SEQ [SEP] <interact> SMILES(masked) [SEP]; extend functionality of UmeTokenizerTransform to handle dual modalities; change the name of the Ume embedding method and allow embedding from existing input_ids; fix existing tokenizers: add lowercase normalizer to nucleotide tokenizer (OG2 dataset contains a mix of upper and lowercase letters); BPE handled SMILES tokenization incorrectly, switch to WordLevel
- Ume SMILES tokenizer fix (#66)
- tokenizer
- fix tests
- lowercase normalizer for nt
- tests
- remove mod conv dataset
- embed
- Test
- merge 2mod into UmeTokenizerTransform
- fix tests
- all
- type hints
- docstrings
- tests
- fix SMILES tokenizer
- switch all tokenizer to BPE
- Revert "switch all tokenizer to BPE" (reverts commit 367e77d)
- tok
- fix SMILES tokenizer
- remove print statement
- Ume perplexity logging (#67)
- pplx
- tests
- src
- ignore torchmetrics warnings
- docstrings
- docstrings
- Update README.md (#69)
- Ume fix perplexity device (#68)
- pplx as attr
- pplx as attr
- pplx
- comments
- on step
- comment
- update tests, fix ruff
- ruff
- ruff ruff
- Add <cls_modality> to Ume tokenizers (#71)
- add <cls_modality> tokens
- add <cls_modality> tokens
- docstring
- RNS metric implementation (#73)
- add <cls_modality> tokens
- add <cls_modality> tokens
- modality embeddings
- module dict
- embeddings
- tests
- modality and device
- rank zero only
- rank zero
- fix back modality mask
- sync dist
- RNS implementation
- restore from main
- restore
- docstrings
- docstrings
- review
- test
- Ume modality-specific embeddings (#72)
- add <cls_modality> tokens
- add <cls_modality> tokens
- modality embeddings
- module dict
- embeddings
- tests
- modality and device
- rank zero only
- rank zero
- fix back modality mask
- sync dist
- add conversion transforms (#74)
- add initial smiles to peptide and peptide to smiles transforms
- remove smiles -> * transforms and touch up conversion functions
- rename
- add option to randomize smiles and caps

Co-authored-by: Colin Grambow <grambowc@gene.com>

- fix def pad token, replace process_and_embed w/ ume.embed
- update tests w -100 pad token

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Karina Zadorozhny <karina.zadorozhny@gmail.com>
Co-authored-by: Nathan Frey <ncfrey@users.noreply.github.com>
Co-authored-by: Colin Grambow <17198155+cgrambow@users.noreply.github.com>
Co-authored-by: Colin Grambow <grambowc@gene.com>
Based on a paper shared by @ncfrey https://www.biorxiv.org/content/10.1101/2025.04.30.651545v1