UME-2 with auxiliary tasks#190

Merged
karinazad merged 15 commits into main from k/ume2-auxiliary-tasks
Sep 10, 2025
Conversation

@karinazad
Collaborator

@karinazad karinazad commented Sep 5, 2025

Description

  • Add UME-2 class for sequence-only encoders
  • Add support for auxiliary training tasks

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

import importlib


def ensure_package(
Collaborator

Great idea!

from ._calm_tasks import CALM_TASK_SPECIES, CALM_TASKS, CALMSpecies, CALMTask, MAX_SEQUENCE_LENGTH
from ._codon_table import CODON_TABLE_PATH, CODON_TABLE_PATH_VENDOR
from ._descriptor_descs import RDKIT_DESCRIPTOR_DISTRIBUTIONS
from ._rdkit_descriptor_distributions import RDKIT_DESCRIPTOR_DISTRIBUTIONS
Collaborator

Are these for normalization or are these directly predicted?

Collaborator Author

David added these for normalization

@@ -61,6 +60,7 @@ def __init__(
seed: int = 0,
cache_dir: str | None = None,
transform_fn: Callable | None = None,
Collaborator

Why not have just one transform_fns instead of both transform_fn and extra_transform_fns?

Collaborator Author

transform_fn would be applied to the sequence before it goes to tokenization (e.g. replace `|` with `.` in protein complexes), and extra_transform_fns are applied alongside the tokenized result to give something like

{input_ids: ..., attention_mask: ..., extra_output_1: ..., extra_output_2: ...}

maybe there is a better name for the parameter though
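The division of labor described above can be sketched as follows; the function name and exact merging behavior are assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch of how transform_fn and extra_transform_fns could compose
# (parameter names from the discussion above; not the real code).
def tokenize_with_extras(sequence, tokenizer, transform_fn=None, extra_transform_fns=None):
    if transform_fn is not None:
        # Pre-tokenization edit, e.g. replacing "|" with "." in protein complexes
        sequence = transform_fn(sequence)
    batch = tokenizer(sequence)  # -> {"input_ids": ..., "attention_mask": ...}
    for name, fn in (extra_transform_fns or {}).items():
        # Extra outputs are merged alongside the tokenized result
        batch[name] = fn(sequence)
    return batch
```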

num_warmup_steps: 10_000
weight_decay: 0.01
mask_token_id: 8
pad_token_id: 6
Collaborator

Why are the mask and pad ids 8 and 6, respectively?

Collaborator Author

it's a pretty arbitrary order, but that's what our tokenizers are using: https://github.com/prescient-design/lobster/blob/main/src/lobster/tokenization/_ume_tokenizers.py#L105


masked_embeddings = output["last_hidden_state"] * mask

sum_embeddings = masked_embeddings.sum(dim=1)
Collaborator

In a separate MR, we should probably update this with a pooling function that could be passed to the model. Something like making aggregate (bool) into aggregator (fn)
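The suggested `aggregator` function could be a minimal masked mean pool like the snippet below; this is a sketch under assumed tensor shapes, not the model's actual pooling code:

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions only.

    Hypothetical aggregator sketch: could replace aggregate (bool)
    with aggregator (fn) as suggested above.
    """
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)
    masked_embeddings = last_hidden_state * mask
    sum_embeddings = masked_embeddings.sum(dim=1)
    # Clamp avoids division by zero for fully padded rows
    return sum_embeddings / mask.sum(dim=1).clamp(min=1e-9)
```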

raise ValueError(f"Unsupported task type: {self.task_type}")


class AuxiliaryRegressionTaskHead(nn.Module):
Collaborator

I wonder if this should live elsewhere, like lobster.finetune (lobster.post_train?). I know we're still pre-training here, but having a dedicated place for pooling, regression heads, etc. that could be used for auxiliary task and post-training might be better organizationally

Collaborator Author

yeah I think that's a good idea, we can move it once finetune is ready
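For context, a regression head of this kind is typically a small MLP over pooled embeddings; the layer sizes and activation below are illustrative assumptions, not the PR's actual implementation:

```python
import torch
from torch import nn


class AuxiliaryRegressionTaskHead(nn.Module):
    """Illustrative sketch of a regression head over pooled encoder embeddings.

    Hypothetical layer choices; see the PR diff for the real class.
    """

    def __init__(self, hidden_size: int, output_size: int = 1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.GELU()
        self.out_proj = nn.Linear(hidden_size, output_size)

    def forward(self, pooled_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, hidden) -> (batch, output_size)
        x = self.activation(self.dense(pooled_embeddings))
        return self.out_proj(x)
```

Housing heads like this in a shared module (e.g. the proposed lobster.finetune) would let pre-training auxiliary tasks and post-training reuse the same code.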

)
def test_smiles_to_rdkit_descs(mock_calc, smiles, expected):
mock_calc.return_value = {"desc1": 1.0, "desc2": 2.0}
def test_smiles_to_rdkit_descs(mock_smiles_to_desc, smiles, expected):
Collaborator

Are there tests for ume2 outputting both masked token & auxiliary task preds?

Collaborator Author

will add!
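The requested test could check that a forward pass returns both masked-token logits and auxiliary predictions; the sketch below uses a toy stand-in model (FakeUME2 is invented for illustration) rather than the real UME-2 class:

```python
import torch
from torch import nn


class FakeUME2(nn.Module):
    """Toy stand-in (hypothetical, not the real UME-2) that returns both
    masked-token logits and an auxiliary regression prediction."""

    def __init__(self, hidden: int = 4, vocab: int = 10):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab)
        self.aux_head = nn.Linear(hidden, 1)

    def forward(self, hidden_states: torch.Tensor) -> dict:
        return {
            "logits": self.lm_head(hidden_states),            # per-token MLM logits
            "aux_pred": self.aux_head(hidden_states.mean(dim=1)),  # pooled aux output
        }


def test_outputs_masked_and_auxiliary():
    model = FakeUME2()
    out = model(torch.randn(2, 5, 4))
    assert out["logits"].shape == (2, 5, 10)
    assert out["aux_pred"].shape == (2, 1)
```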

@karinazad karinazad merged commit 40ed9dd into main Sep 10, 2025
2 of 4 checks passed
@karinazad karinazad deleted the k/ume2-auxiliary-tasks branch September 10, 2025 14:23
