Skip to content

Ume remove paired tokenization#92

Merged
karinazad merged 4 commits intomainfrom
ume-tokenizer-simplify
May 27, 2025
Merged

Ume remove paired tokenization#92
karinazad merged 4 commits intomainfrom
ume-tokenizer-simplify

Conversation

@karinazad
Copy link
Collaborator

We switched from masked pairs to contrastive training and this tokenization scheme is no longer needed

@karinazad karinazad requested a review from ncfrey May 27, 2025 14:35
@ncfrey ncfrey requested a review from Copilot May 27, 2025 14:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes the dual modality tokenization functionality as the project has transitioned to contrastive training.

  • Removed dual modality test cases and related parameter handling from tests.
  • Updated UmeTokenizerTransform in the source to support only single modality inputs and adjusted corresponding documentation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
tests/lobster/tokenization/test__ume_tokenizers.py Dual modality tests removed to align with the new tokenization scheme
src/lobster/tokenization/_ume_tokenizers.py Dual modality branch and parameters removed; docstrings and type hints updated for single modality
Comments suppressed due to low confidence (1)

src/lobster/tokenization/_ume_tokenizers.py:511

  • Since dual modality support has been removed, consider removing unused functions such as _encode_no_padding_no_special_tokens and _combine_and_pad to further simplify and clean up the code.
def _encode_no_padding_no_special_tokens(self, item: str | list[str], tokenizer, modality: Modality) -> dict[str, list[int]]:

Examples: "amino_acid", "smiles", "nucleotide",
or ("amino_acid", "smiles"),...
modality : ModalityType or str
Single modality. Examples: "amino_acid", "smiles", "nucleotide", "coordinates_3d"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literal with options

@karinazad karinazad merged commit e78e6fc into main May 27, 2025
5 checks passed
@karinazad karinazad deleted the ume-tokenizer-simplify branch May 27, 2025 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants