Conversation
```diff
 import importlib

 def ensure_package(
 from ._calm_tasks import CALM_TASK_SPECIES, CALM_TASKS, CALMSpecies, CALMTask, MAX_SEQUENCE_LENGTH
 from ._codon_table import CODON_TABLE_PATH, CODON_TABLE_PATH_VENDOR
-from ._descriptor_descs import RDKIT_DESCRIPTOR_DISTRIBUTIONS
+from ._rdkit_descriptor_distributions import RDKIT_DESCRIPTOR_DISTRIBUTIONS
```
Are these for normalization or are these directly predicted?
David added these for normalization.
```diff
@@ -61,6 +60,7 @@ def __init__(
     seed: int = 0,
     cache_dir: str | None = None,
     transform_fn: Callable | None = None,
```
Why not have just one transform_fns instead of both transform_fn and extra_transform_fns?
`transform_fn` would be applied to the sequence before it goes to tokenization (e.g. replacing `|` with `.` in protein complexes), and `extra_transform_fns` are applied alongside the tokenized result to give something like
`{"input_ids": ..., "attention_mask": ..., "extra_output_1": ..., "extra_output_2": ...}`.
Maybe there is a better name for the parameter, though.
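To make the distinction concrete, here is a minimal sketch of how the two hooks could interact. All names, the toy tokenizer, and the descriptor function are illustrative assumptions, not the actual lobster API:

```python
# Hypothetical sketch of the two transform hooks; names and signatures
# are assumptions for illustration, not the real lobster interface.
def transform_fn(sequence: str) -> str:
    # Pre-tokenization: normalize chain separators in protein complexes.
    return sequence.replace("|", ".")

def rdkit_desc_fn(sequence: str) -> dict:
    # Extra output merged alongside the tokenized result
    # (placeholder values for illustration).
    return {"rdkit_descs": [0.0, 1.0]}

def tokenize(sequence: str) -> dict:
    # Toy stand-in for a real tokenizer.
    ids = [ord(c) % 32 for c in sequence]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

def encode(sequence, transform_fn=None, extra_transform_fns=None):
    if transform_fn is not None:
        sequence = transform_fn(sequence)  # applied to the raw sequence
    item = tokenize(sequence)              # tokenized result
    for fn in (extra_transform_fns or []):
        item.update(fn(sequence))          # extra keys merged into the item
    return item

item = encode("AB|CD", transform_fn=transform_fn, extra_transform_fns=[rdkit_desc_fn])
```

Under this sketch, `item` carries the tokenizer keys plus one extra key per extra transform, which matches the dict shape described above.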
```yaml
num_warmup_steps: 10_000
weight_decay: 0.01
mask_token_id: 8
pad_token_id: 6
```
Why are the mask and pad token IDs 8 and 6, respectively?
It's a pretty arbitrary order, but that's what our tokenizers are using: https://github.com/prescient-design/lobster/blob/main/src/lobster/tokenization/_ume_tokenizers.py#L105
```python
masked_embeddings = output["last_hidden_state"] * mask

sum_embeddings = masked_embeddings.sum(dim=1)
```
In a separate MR, we should probably update this with a pooling function that could be passed to the model, something like turning `aggregate` (a bool) into `aggregator` (a function).
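The `aggregate` → `aggregator` idea could look roughly like the sketch below. The function names and the `embed` signature are assumptions made for illustration, not the actual lobster API:

```python
# Hedged sketch of passing a pooling callable instead of an aggregate flag;
# names and signatures are illustrative assumptions.
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Mask out padding positions, then average over the sequence dimension.
    masked = hidden * mask.unsqueeze(-1)
    return masked.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)

def embed(hidden, mask, aggregator=None):
    # aggregator=None returns per-token embeddings; otherwise pool with
    # whatever callable the caller supplies (mean, CLS, attention, ...).
    if aggregator is None:
        return hidden
    return aggregator(hidden, mask)

hidden = torch.randn(2, 4, 8)   # (batch, seq_len, dim)
mask = torch.ones(2, 4)         # no padding in this toy example
pooled = embed(hidden, mask, aggregator=mean_pool)  # shape (2, 8)
```

This keeps the current masked-sum behavior available as one choice of `aggregator` while letting callers swap in other pooling strategies without touching the model.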
```python
        raise ValueError(f"Unsupported task type: {self.task_type}")


class AuxiliaryRegressionTaskHead(nn.Module):
```
I wonder if this should live elsewhere, like `lobster.finetune` (`lobster.post_train`?). I know we're still pre-training here, but having a dedicated place for pooling, regression heads, etc. that could be used for auxiliary tasks and post-training might be better organizationally.
Yeah, I think that's a good idea. We can move it once `finetune` is ready.
```diff
 )
-def test_smiles_to_rdkit_descs(mock_calc, smiles, expected):
-    mock_calc.return_value = {"desc1": 1.0, "desc2": 2.0}
+def test_smiles_to_rdkit_descs(mock_smiles_to_desc, smiles, expected):
```
Are there tests for UME2 outputting both masked-token and auxiliary-task predictions?
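A test along those lines might follow the shape sketched below, asserting that one forward pass yields both output heads. The model object and its output keys are illustrative assumptions, not the actual UME2 interface:

```python
from unittest.mock import MagicMock

# Hypothetical sketch: stand in a mock for the model to show the shape of
# the assertion; a real test would call the actual UME2 forward pass.
model = MagicMock()
model.forward.return_value = {
    "logits": [[0.1, 0.9]],       # masked-token predictions
    "aux_predictions": [[0.5]],   # auxiliary regression head output
}

output = model.forward(input_ids=[[1, 2, 3]])
```

The real test would replace the mock with a small UME2 instance and check both keys (and their shapes) in a single forward call.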
…to k/ume2-auxiliary-tasks
Description

UME-2 class for sequence-only encoders

Type of Change