PEER dataset(s) + callback#47

Merged
taylormjs merged 25 commits into main from peer
Mar 18, 2025

Conversation

@taylormjs
Collaborator

  • PEER dataset loading of 17 protein and protein-ligand benchmark tasks (datasets added to HF)
  • Callback for all tasks
  • Tests for callbacks
    (Note: dataset loading was tested offline rather than with unit tests.)
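For orientation, benchmark suites like this are often organized around a small registry of task constants. A minimal sketch of that idea; the task names and type labels below are illustrative assumptions, not the actual constants added in this PR:

```python
# Hypothetical sketch of a PEER task registry; names and labels here are
# assumptions for illustration, not the constants defined in this PR.
PEER_TASKS = {
    # task name: (prediction type, label granularity)
    "fluorescence": ("regression", "sequence"),
    "stability": ("regression", "sequence"),
    "secondary_structure": ("classification", "residue"),
    "subcellular_localization": ("classification", "sequence"),
}


def is_per_residue(task: str) -> bool:
    """True if the task predicts one label per residue rather than per sequence."""
    return PEER_TASKS[task][1] == "residue"
```

A registry like this lets a single callback or data module branch on task type without per-task code paths.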

@taylormjs
Collaborator Author

@karinazad @ncfrey The diffs include some other non-peer commits. I'll highlight the changes to focus on.

Collaborator Author

@karinazad @ncfrey Updated constants

Collaborator Author

@karinazad @ncfrey Main peer callback, one for all 17 tasks
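A rough, framework-agnostic sketch of the "one callback for all tasks" shape; the class and method names here are hypothetical, and the real callback presumably hooks into the training framework's callback API:

```python
class PEEREvaluationCallback:
    """Hypothetical sketch: a single callback that evaluates every PEER task."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def on_validation_end(self, evaluate_fn):
        # One callback covers all tasks by iterating over the task list,
        # rather than registering 17 separate callbacks.
        return {task: evaluate_fn(task) for task in self.tasks}


cb = PEEREvaluationCallback(["fluorescence", "stability"])
scores = cb.on_validation_end(lambda task: 0.0)  # dummy metric for illustration
```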

Collaborator Author

@karinazad @ncfrey constants

Collaborator Author

@karinazad @ncfrey Dataset, loading from HF

Collaborator Author

self.cache_path.parent.mkdir(parents=True, exist_ok=True)
dataset = load_dataset(huggingface_repo, data_files=self.hf_data_file, split="train")
df = dataset.to_pandas()
df.to_parquet(self.cache_path, index=False)
Collaborator

I think this (and maybe the whole download arg) is unnecessary, since HF uses the cached local version by default anyway (and re-downloads it if it's stale)
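For context, the pattern under discussion writes a second, local cache on top of HF's own. A self-contained sketch of that round-trip; here a temp dir stands in for cache_path, CSV stands in for parquet (to avoid the pyarrow dependency), and the DataFrame is a stand-in for what load_dataset(...).to_pandas() would return:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the DataFrame that load_dataset(...).to_pandas() would return.
df = pd.DataFrame({"sequence": ["MKV", "MAL"], "label": [0.1, 0.9]})

cache_path = Path(tempfile.mkdtemp()) / "peer" / "train.csv"
cache_path.parent.mkdir(parents=True, exist_ok=True)  # mirrors the PR's mkdir call
df.to_csv(cache_path, index=False)

# Reloading hits the local copy; since load_dataset already caches downloads
# under ~/.cache/huggingface by default, this extra layer may be redundant.
reloaded = pd.read_csv(cache_path)
```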

@taylormjs taylormjs marked this pull request as ready for review March 18, 2025 18:41
@taylormjs taylormjs self-assigned this Mar 18, 2025
@taylormjs taylormjs merged commit e61fdb3 into main Mar 18, 2025
5 checks passed
@taylormjs taylormjs deleted the peer branch March 18, 2025 23:44
ncfrey pushed a commit that referenced this pull request Mar 19, 2025
* add peer datasets

* fix ruff

* add peer datasets, fix get item for all tasks

* lg tokenizer

* lg tokenizer assets

* lint

* added test and new word-level model vs BPE

* rename to include coord tokenization explicitly

* ruff tests

* dataset hg

* ruff

* Nathan's comments

* forgot to add files

* Add 3D Pinder to Ume datamodule (#45)

* Ume tokenizers and dataset sampling options (#46)

Creates mutually-compatible multi-modal Ume tokenizers
Ensures that the vocab size is a multiple of 64
Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
Adds stopping conditions to round robin concatenation

* add peer datasets

* peer callback 1

* add tests

* fix ruff reformat

* fix per-residue task logic

* ruff readd spaces

* ruff check

* remove unnecessary download arg

* ruff checks

* uv update

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Sidney Lisanza <lisanzas@gene.com>
Co-authored-by: karinazad <karina.zadorozhny@gmail.com>