Add Symile multi-modal contrastive loss to Ume #109
Pull Request Overview
This PR adds multi-modal contrastive learning functionality using Symile loss to Ume, along with supporting transforms and a new streaming dataset. Key changes include:
- Integration of a new Symile loss function in the Ume model and updating contrastive loss scaling.
- Addition of UmeStreamingDataset leveraging litdata for multi-modal tokenization and data loading.
- Renaming and updating several transform classes (e.g. AminoAcidToNucleotidePairTransform, AminoAcidToSmilesPairTransform) to reflect modality-specific processing.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/lobster/transforms/test_modality_aware_transform.py | Added tests for modality-aware transforms. |
| tests/lobster/transforms/test__equivalence_transforms.py | Updated transform test cases and renamed tests to reflect amino-acid-based transforms. |
| src/lobster/transforms/functional/_convert_seqs.py | Updated probabilistic conversion with an additional skip_unknown parameter. |
| src/lobster/transforms/_modality_aware_transform.py | Introduced modality-aware transform wrappers and composition. |
| src/lobster/transforms/_equivalence_transforms.py | Renamed and updated equivalence transforms to amino acid centric versions with new parameters. |
| src/lobster/model/_ume.py | Integrated Symile loss, updated contrastive loss scaling, and enhanced batch splitting for multi-view inputs. |
| src/lobster/model/_symile_loss.py | Added a new Symile loss implementation supporting two negative sampling strategies. |
| src/lobster/hydra_config/trainer.yaml | Updated trainer configuration with new dependencies and settings. |
| src/lobster/datasets/_ume_streaming_dataset.py | Added a new streaming dataset class supporting modality-specific tokenization via litdata. |
| pyproject.toml | Added required dependencies (litdata and polars) for the new functionality. |
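As a rough illustration of the modality-aware transform idea summarized above — all names in this sketch are hypothetical, not the PR's actual API — a wrapper can tag a transform's output with its modality so downstream tokenizers can route each view to the right tokenizer:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModalityAwareTransform:
    """Wrap a plain transform with input/output modality metadata (illustrative only)."""

    fn: Callable[[str], str]
    input_modality: str
    output_modality: str

    def __call__(self, x: str) -> tuple[str, str]:
        # Return the transformed value together with its output modality,
        # so a multi-modal dataset can pick the matching tokenizer.
        return self.fn(x), self.output_modality


# Hypothetical amino-acid -> SMILES conversion; the lambda is a placeholder,
# not a real chemical conversion.
to_smiles = ModalityAwareTransform(
    fn=lambda seq: f"SMILES({seq})",
    input_modality="amino_acid",
    output_modality="smiles",
)

value, modality = to_smiles("MKT")
```

A composed pipeline would then carry `(value, modality)` pairs through each stage rather than bare strings.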
Comments suppressed due to low confidence (2)
`src/lobster/transforms/_equivalence_transforms.py:340`
- Consider updating the init docstring for `AminoAcidToNucleotidePairTransform` to document the new `skip_unknown` parameter.

```python
skip_unknown: bool = False,
```
`src/lobster/model/_ume.py:566`
- Confirm that switching the scaling from division to multiplication with `self.contrastive_temperature` is intentional and consistent with the `logit_scale` initialization in `SymileLoss`.

```python
similarities = embeddings_a @ embeddings_b.T * self.contrastive_temperature
```
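For context on this comment: dividing similarities by a temperature τ is equivalent to multiplying by a logit scale of 1/τ (the convention used when constructing `SymileLoss` below), whereas multiplying by τ itself goes the other way unless τ = 1. A minimal sketch of the arithmetic, with illustrative values:

```python
import math

temperature = 0.07    # a typical CLIP-style contrastive temperature
raw_similarity = 0.5  # one cosine-similarity entry, for illustration

divided = raw_similarity / temperature                   # conventional scaling
via_logit_scale = raw_similarity * (1.0 / temperature)   # logit_scale = 1/τ convention
multiplied = raw_similarity * temperature                # the line flagged above

# Division and logit_scale multiplication agree; direct multiplication by τ does not.
print(math.isclose(divided, via_logit_scale))  # True
print(math.isclose(divided, multiplied))       # False (unless τ == 1)
```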
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```toml
"B007",  # unused-loop-control-variable
"E741",  # ambiguous-variable-name
"E902",  # file not found error
"UP038", # Use X | Y in isinstance call instead of (X, Y)
```
ruff complains about the use of `isinstance(item, (str, float))`, which is the standard form
```python
self.contrastive_temperature = contrastive_temperature

# Initialize SymileLoss with the correct logit scale
self.symile_loss_fn = SymileLoss(logit_scale=1.0 / contrastive_temperature)
```
Thinking about whether we should make the loss configurable. Probably fine for now since we're pretty set on losses for this version.
I was planning on making a refactor of the Ume model class since it's a bit bloated right now but didn't want to include it in this MR since there is already a lot of code change
- `symile` package for multi-modal contrastive learning
- `litdata`, which supports transforms that return multiple modality views/representations

Notes
Credits to Omar Mahmood for suggesting this loss for contrastive learning with multiple modalities
Reference:
https://github.com/rajesh-lab/symile
https://arxiv.org/pdf/2411.01053
Here's how InfoNCE and Symile loss compare on 2 inputs:
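A minimal sketch of that comparison, assuming the formulation in the Symile paper (arXiv:2411.01053): Symile replaces the pairwise dot product with a multilinear inner product (MIP), which for two inputs reduces to the ordinary dot product — so on two modalities Symile coincides with InfoNCE — while three or more inputs are scored jointly by one product term per dimension. Function names here are illustrative, not the PR's implementation:

```python
import math


def mip(*vectors):
    """Multilinear inner product: sum over dimensions of the elementwise product.

    With two vectors this is exactly the dot product.
    """
    return sum(math.prod(components) for components in zip(*vectors))


def info_nce(anchor, candidates, positive_index=0, logit_scale=1.0):
    """Standard InfoNCE: cross-entropy of the positive among scaled dot-product logits."""
    logits = [logit_scale * mip(anchor, c) for c in candidates]
    log_denominator = math.log(sum(math.exp(logit) for logit in logits))
    return -(logits[positive_index] - log_denominator)


a = [1.0, 0.0]
b_positive = [0.9, 0.1]
b_negative = [0.0, 1.0]

# With two inputs the MIP is just a dot product: 1*0.9 + 0*0.1 = 0.9 ...
pairwise_score = mip(a, b_positive)

# ... while a third vector joins the same product term-by-term: 1*0.9*1 + 0*0.1*1 = 0.9.
c = [1.0, 1.0]
three_way_score = mip(a, b_positive, c)

loss = info_nce(a, [b_positive, b_negative])
```

The point of the comparison is that InfoNCE only ever sees one pair at a time, while the MIP lets a single score capture higher-order agreement across all modalities at once.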