
extend smiles tokenizer #49

Merged
karinazad merged 15 commits into main from extend-smiles-tokenizer
Mar 24, 2025

Conversation

@cgrambow
Collaborator

  • Tokenized ChEMBL, M3-20M, GEOM, and ZINC20 SMILES
  • Updated the SMILES tokenizer vocab in order of token count (some tokens already existed; new tokens were appended to the end of the file)
  • Ran _make_smiles_tokenizer to update tokenizer.json and tokenizer_config.json in lobster/assets/smiles_tokenizer
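The frequency-ordered vocab rebuild described above can be sketched roughly as follows. `build_vocab` and the sample token lists are illustrative assumptions, not lobster's actual helpers; the real pipeline tokenizes the full ChEMBL/M3-20M/GEOM/ZINC20 corpora.

```python
from collections import Counter


def build_vocab(tokenized_smiles, existing_vocab):
    """Order new tokens by count; keep existing tokens in place and append
    tokens not already in the vocab (hypothetical sketch of the PR's approach)."""
    counts = Counter(tok for smiles in tokenized_smiles for tok in smiles)
    known = set(existing_vocab)
    # most_common() sorts by descending count, stable on insertion order.
    new_tokens = [tok for tok, _ in counts.most_common() if tok not in known]
    return list(existing_vocab) + new_tokens


# Toy corpus: acetic acid and benzene, pre-tokenized character-wise.
vocab = build_vocab(
    [["C", "C", "(", "=", "O", ")", "O"],
     ["c", "1", "c", "c", "c", "c", "c", "1"]],
    existing_vocab=["[PAD]", "[UNK]", "C", "O"],
)
```

Existing tokens retain their positions (so token ids stay stable), and only genuinely new tokens land at the end of the file, matching the behavior described in the bullet above.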

@cgrambow cgrambow requested review from karinazad and ncfrey March 18, 2025 23:41
@karinazad
Collaborator

  • Adds Ume utilities: modalities and get_vocab
  • Enables instantiating Ume without a checkpoint and adds Ume.load_from_checkpoint
  • Applies @cgrambow's SMILES tokenizer vocab in the Ume tokenizers
  • Distinguishes which reserved tokens are used (extra special tokens, tokens reserved for amino acids, tokens reserved for SMILES, ...)
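A minimal stand-in for the Ume API changes above: a class that can be constructed without a checkpoint plus a `load_from_checkpoint` classmethod. This mirrors the names from the PR but is an illustrative sketch, not lobster's implementation; the modality list is an assumption.

```python
class Ume:
    """Toy stand-in for the PR's Ume class (not the real implementation)."""

    def __init__(self, checkpoint_path=None):
        # None means a fresh, untrained instance -- the PR enables exactly this.
        self.checkpoint_path = checkpoint_path

    @classmethod
    def load_from_checkpoint(cls, checkpoint_path):
        # The real method would restore weights; here we only record the path.
        return cls(checkpoint_path=checkpoint_path)

    @property
    def modalities(self):
        # Hypothetical modality names based on the tokenizers this PR touches.
        return ["SMILES", "amino_acid", "nucleotide"]
```

The point of the classmethod is that `Ume()` and `Ume.load_from_checkpoint(path)` become two distinct, explicit entry points rather than one constructor that must always be given a checkpoint.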

@cgrambow
Collaborator Author

  • Removes the duplicate lobster/assets/ume_tokenizers/smiles_tokenizer/vocab.txt and lobster/assets/ume_tokenizers/latent_generator_tokenizer/vocab.txt files
  • Modifies _load_vocabularies in _ume_tokenizers.py to read from lobster/assets/smiles_tokenizer/vocab.txt and lobster/assets/latent_generator_tokenizer/vocab.txt instead, and removes special tokens from these files
  • Does not modify the amino acid and nucleotide tokenizers because they do not have vocab files in the non-UME tokenizers (they are also less likely to change, so duplication is less of an issue)
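The deduplication in the bullets above amounts to reading the shared one-token-per-line vocab file and filtering out its special tokens so the UME tokenizers can supply their own. A hedged sketch, assuming a `load_vocabulary` helper; the name and signature are illustrative, not lobster's actual `_load_vocabularies`.

```python
def load_vocabulary(vocab_file, special_tokens):
    """Read a one-token-per-line vocab file and drop the given special tokens.

    Illustrative sketch: the UME tokenizers read the shared files (e.g.
    lobster/assets/smiles_tokenizer/vocab.txt) instead of keeping copies,
    and strip special tokens so UME-specific ones can be prepended later.
    """
    with open(vocab_file, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f if line.strip()]
    special = set(special_tokens)
    return [tok for tok in tokens if tok not in special]
```

Reading from a single source file means a vocab update (like the SMILES extension in this PR) propagates to the UME tokenizers automatically instead of requiring two files to be kept in sync.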

Colin Grambow added 3 commits March 21, 2025 16:21
@karinazad karinazad merged commit 9b80e99 into main Mar 24, 2025
5 checks passed
@karinazad karinazad deleted the extend-smiles-tokenizer branch March 24, 2025 12:29