PEER dataset(s) + callback#47

Merged
taylormjs merged 25 commits into main from peer
Mar 18, 2025

Conversation

@taylormjs
Collaborator

  • PEER dataset loading of 17 protein and protein-ligand benchmark tasks (datasets added to HF)
  • Callback for all tasks
  • Tests for callbacks
    (Note: dataset loading was tested offline rather than with unit tests.)
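For orientation, benchmark suites like this are often organized around a small registry of task constants. A minimal sketch of that idea; the task names and type labels below are illustrative assumptions, not the actual constants added in this PR:

```python
# Hypothetical sketch of a PEER task registry; names and labels here are
# assumptions for illustration, not the constants defined in this PR.
PEER_TASKS = {
    # task name: (prediction type, label granularity)
    "fluorescence": ("regression", "sequence"),
    "stability": ("regression", "sequence"),
    "secondary_structure": ("classification", "residue"),
    "subcellular_localization": ("classification", "sequence"),
}


def is_per_residue(task: str) -> bool:
    """True if the task predicts one label per residue rather than per sequence."""
    return PEER_TASKS[task][1] == "residue"
```

A registry like this lets a single callback or data module branch on task type without per-task code paths.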

@taylormjs
Collaborator Author

@karinazad @ncfrey The diffs include some other non-peer commits. I'll highlight the changes to focus on.

Collaborator Author

@karinazad @ncfrey Updated constants

Collaborator Author

@karinazad @ncfrey Main peer callback, one for all 17 tasks
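A rough, framework-agnostic sketch of the "one callback for all tasks" shape; the class and method names here are hypothetical, and the real callback presumably hooks into the training framework's callback API:

```python
class PEEREvaluationCallback:
    """Hypothetical sketch: a single callback that evaluates every PEER task."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def on_validation_end(self, evaluate_fn):
        # One callback covers all tasks by iterating over the task list,
        # rather than registering 17 separate callbacks.
        return {task: evaluate_fn(task) for task in self.tasks}


cb = PEEREvaluationCallback(["fluorescence", "stability"])
scores = cb.on_validation_end(lambda task: 0.0)  # dummy metric for illustration
```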

Collaborator Author

@karinazad @ncfrey constants

Collaborator Author

@karinazad @ncfrey Dataset, loading from HF

Collaborator Author

self.cache_path.parent.mkdir(parents=True, exist_ok=True)
dataset = load_dataset(huggingface_repo, data_files=self.hf_data_file, split="train")
df = dataset.to_pandas()
df.to_parquet(self.cache_path, index=False)
Collaborator

I think this (and maybe the whole download arg) is unnecessary, since HF uses the cached local version by default anyway (and re-downloads it if it's stale)
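For context, the pattern under discussion writes a second, local cache on top of HF's own. A self-contained sketch of that round-trip; here a temp dir stands in for cache_path, CSV stands in for parquet (to avoid the pyarrow dependency), and the DataFrame is a stand-in for what load_dataset(...).to_pandas() would return:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the DataFrame that load_dataset(...).to_pandas() would return.
df = pd.DataFrame({"sequence": ["MKV", "MAL"], "label": [0.1, 0.9]})

cache_path = Path(tempfile.mkdtemp()) / "peer" / "train.csv"
cache_path.parent.mkdir(parents=True, exist_ok=True)  # mirrors the PR's mkdir call
df.to_csv(cache_path, index=False)

# Reloading hits the local copy; since load_dataset already caches downloads
# under ~/.cache/huggingface by default, this extra layer may be redundant.
reloaded = pd.read_csv(cache_path)
```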

@taylormjs taylormjs marked this pull request as ready for review March 18, 2025 18:41
@taylormjs taylormjs self-assigned this Mar 18, 2025
@taylormjs taylormjs merged commit e61fdb3 into main Mar 18, 2025
5 checks passed
@taylormjs taylormjs deleted the peer branch March 18, 2025 23:44
ncfrey pushed a commit that referenced this pull request Mar 19, 2025
* add peer datasets

* fix ruff

* add peer datasets, fix get item for all tasks

* lg tokenizer

* lg tokenizer assets

* lint

* added test and new word-level model vs BPE

* rename to include coord tokenization explicitly

* ruff tests

* dataset hg

* ruff

* Nathan's comments

* forgot to add files

* Add 3D Pinder to Ume datamodule (#45)

* Ume tokenizers and dataset sampling options (#46)

Creates mutually-compatible multi-modal Ume tokenizers
Ensures that the vocab size is a multiple of 64
Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
Adds stopping conditions to round robin concatenation

* add peer datasets

* peer callback 1

* add tests

* fix ruff reformat

* fix per-residue task logic

* ruff readd spaces

* ruff check

* remove unnecessary download arg

* ruff checks

* uv update

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Sidney Lisanza <lisanzas@gene.com>
Co-authored-by: karinazad <karina.zadorozhny@gmail.com>