
dataloader checkpoint callback #60

Merged

karinazad merged 5 commits into main from dataloader-callback on Apr 11, 2025

Conversation

@karinazad
Collaborator

No description provided.

if self._is_s3_uri:
    with tempfile.NamedTemporaryFile() as tmp_file:
        temp_path = tmp_file.name
        torch.save(dataloader.state_dict(), temp_path)
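For context, a minimal sketch of the save-then-upload pattern this hunk implies; the URI parsing and the boto3 upload step are assumptions for illustration, not the callback's actual implementation:

import tempfile
from urllib.parse import urlparse

import boto3
import torch

def save_dataloader_state_to_s3(dataloader, s3_uri: str) -> None:
    """Hypothetical helper: persist dataloader.state_dict() to an S3 URI."""
    parsed = urlparse(s3_uri)  # e.g. s3://bucket/checkpoints/dataloader.pt
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    with tempfile.NamedTemporaryFile() as tmp_file:
        # Serialize the state dict to a local temp file, then upload that file.
        torch.save(dataloader.state_dict(), tmp_file.name)
        boto3.client("s3").upload_file(tmp_file.name, bucket, key)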
Collaborator

Just curious: roughly how large are the dataloader state dicts?

Collaborator Author

I don't have an example on hand, but I think they are pretty small, since the state dict just stores the item indices and some metadata about input_dir, etc.
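For a quick sanity check, a sketch that measures the serialized size; it assumes dataloader is any loader exposing state_dict(), as in the hunk above:

import os
import tempfile

import torch

# Rough size check for a dataloader state dict (illustrative only).
with tempfile.NamedTemporaryFile(suffix=".pt") as tmp_file:
    torch.save(dataloader.state_dict(), tmp_file.name)
    size_kib = os.path.getsize(tmp_file.name) / 1024
    print(f"dataloader state dict: {size_kib:.1f} KiB")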

Collaborator Author

# Instantiate the model
if ckpt_path is not None:
    self.model = FlexBERT.load_from_checkpoint(ckpt_path)
Collaborator

Good catch. So by default, always use Ume.load_from_checkpoint rather than specifying a ckpt_path in model instantiation?

Collaborator Author

Yes, exactly. ckpt_path in the model parameters is only needed because of this line: https://github.com/prescient-design/lobster/blob/main/src/lobster/cmdline/_train.py#L54
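For anyone reading along, a minimal sketch of the recommended pattern; load_from_checkpoint is the standard Lightning classmethod, and the Ume import path here is an assumption for illustration:

from lobster.model import Ume  # assumed import path, for illustration only

# Recommended: restore a trained model directly from its checkpoint file.
model = Ume.load_from_checkpoint("path/to/checkpoint.ckpt")
model.eval()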

@taylormjs (Collaborator) left a comment

Looks great!

karinazad merged commit de5caba into main on Apr 11, 2025
5 checks passed
karinazad deleted the dataloader-callback branch on April 11, 2025 at 23:02
taylormjs pushed a commit that referenced this pull request Apr 29, 2025
* dataloader callback

* utils

* ume

* gitignore dev

* tests
karinazad added a commit that referenced this pull request May 14, 2025
* peer fixes, add evaluate method

* dataloader checkpoint callback (#60)

* dataloader callback

* utils

* ume

* gitignore dev

* tests

* update flash attention wheels (#61)

* lock

* torch 2.5

* torch 2.5

* part

* .env

* unpin flash attn (#62)

* fix scheduler params (#64)

* scheduler

* fix scheduler

* fix scheduler

* Add AtomicaDataset (#63)

Processed Atomica interactions dataset

* Ume conversion/interaction tokenizer + fix SMILES and nucleotide tokenizers (#65)

add two special tokens: <convert> and <interact> for later stages of Ume training:
they will be used like this (or something along these lines):
[CLS]  PROT_SEQ  [SEP] <convert> PROT_STRUCT(masked)  [SEP]
[CLS]  PROT_SEQ  [SEP] <interact> SMILES(masked)  [SEP] 
extend functionality of UmeTokenizerTransform to handle dual modalities
change the name of Ume embedding method and allow embedding from existing input_ids
fix existing tokenizers:

add lowercase normalizer to nucleotide tokenizer (the OG2 dataset contains a mix of upper- and lowercase letters)
BPE handled SMILES tokenization incorrectly; switch to WordLevel

* Ume SMILES tokenizer fix (#66)

* tokenizer

* fix tests

* lowercase normalizer for nt

* tests

* remove mod conv dataset

* embed

* Test

* merge 2mod into UmeTokenizerTransform

* fix tests

* all

* type hints

* docstrings

* tests

* fix SMILES tokenizer

* switch all tokenizer to BPE

* Revert "switch all tokenizer to BPE"

This reverts commit 367e77d.

* tok

* fix SMILES tokenizer

* remove print statement

* Ume perplexity logging (#67)

* pplx

* tests

* src

* ignore torchmetrics warnings

* docstrings

* docstrings

* Update README.md (#69)

* Ume fix perplexity device (#68)

* pplx as attr

* pplx as attr

* pplx

* comments

* on step

* comment

* update tests, fix ruff

* ruff

* ruff ruff

* Add <cls_modality> to Ume tokenizers (#71)

* add <cls_modality> tokens

* add <cls_modality> tokens

* docstring

* RNS metric implementation  (#73)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* RNS implementation

* restore from main

* restore

* docstrings

* docstrings

* review

* test

* Ume modality-specific embeddings (#72)

* add <cls_modality> tokens

* add <cls_modality> tokens

* modality embeddings

* module dict

* embeddings

* tests

* modality and device

* rank zero only

* rank zero

* fix back modality mask

* sync dist

* add conversion transforms (#74)

* add initial smiles to peptide and peptide to smiles transforms

* remove smiles -> * transforms and touch up conversion functions

* rename

* add option to randomize smiles and caps

---------

Co-authored-by: Colin Grambow <grambowc@gene.com>

* fix def pad token, replace process_and_embed w/ ume.embed

* update tests w -100 pad token

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Karina Zadorozhny <karina.zadorozhny@gmail.com>
Co-authored-by: Nathan Frey <ncfrey@users.noreply.github.com>
Co-authored-by: Colin Grambow <17198155+cgrambow@users.noreply.github.com>
Co-authored-by: Colin Grambow <grambowc@gene.com>
