Collaborator
Author
@karinazad @ncfrey The diffs include some other non-peer commits. I'll highlight the changes to focus on.
taylormjs
commented
Mar 17, 2025
Collaborator
Author
@karinazad @ncfrey Main peer callback, one for all 17 tasks
taylormjs
commented
Mar 17, 2025
Collaborator
Author
@karinazad @ncfrey Dataset, loading from HF
taylormjs
commented
Mar 17, 2025
karinazad
reviewed
Mar 17, 2025
self.cache_path.parent.mkdir(parents=True, exist_ok=True)
dataset = load_dataset(huggingface_repo, data_files=self.hf_data_file, split="train")
df = dataset.to_pandas()
df.to_parquet(self.cache_path, index=False)
Collaborator
I think this, and maybe the whole download arg, isn't necessary, since HF uses the locally cached version by default anyway (and re-downloads it if it's stale).
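A minimal sketch of what the suggestion could look like, assuming the dataset class only needs the table in memory. The function name `load_peer_dataframe` and the injectable `loader` parameter are hypothetical (the latter is there only so the function can be exercised offline); `load_dataset` is called with the same arguments as in the snippet under review, and caching is left entirely to the `datasets` library's defaults.

```python
from typing import Any, Callable, Optional


def load_peer_dataframe(
    huggingface_repo: str,
    hf_data_file: str,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load one data file from a Hugging Face repo as a pandas DataFrame.

    No explicit parquet cache and no `download` flag: the `datasets`
    library keeps its own local cache, so repeated calls can reuse it.
    """
    if loader is None:
        # Imported lazily so an offline check can pass its own loader
        # (hypothetical convenience, not part of the PR).
        from datasets import load_dataset

        loader = load_dataset

    # Same call as in the reviewed diff, minus the manual caching step.
    dataset = loader(huggingface_repo, data_files=hf_data_file, split="train")
    return dataset.to_pandas()
```

With this shape, the `cache_path` bookkeeping and the `to_parquet` write drop out of the dataset class altogether.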
Creates mutually-compatible multi-modal Ume tokenizers
Ensures that the vocab size is a multiple of 64
Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
Adds stopping conditions to round robin concatenation
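For reference, the "multiple of 64" constraint amounts to rounding the vocabulary size up to the next multiple. The helper below is an illustrative sketch only; `pad_vocab_size` is a hypothetical name, and how the referenced commit actually pads the tokenizers is not shown in this conversation.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    if vocab_size < 0:
        raise ValueError("vocab_size must be non-negative")
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Integer ceiling division, then scale back up.
    return ((vocab_size + multiple - 1) // multiple) * multiple


# A size that is already aligned is left unchanged.
assert pad_vocab_size(1280) == 1280
# Anything else is rounded up: 1000 -> 1024.
assert pad_vocab_size(1000) == 1024
```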
added 2 commits
March 18, 2025 18:25
karinazad
approved these changes
Mar 18, 2025
ncfrey
pushed a commit
that referenced
this pull request
Mar 19, 2025
* add peer datasets
* fix ruff
* add peer datasets, fix get item for all tasks
* lg tokenizer
* lg tokenizer assets
* lint
* added test and new wor level model v bpe
* rename to include coord tokenization explicity
* ruff tests
* dataset hg
* ruff
* nathans comments
* forgot to add files
* Add 3D Pinder to Ume datamodule (#45)
* Ume tokenizers and dataset sampling options (#46)
  Creates mutually-compatible multi-modal Ume tokenizers
  Ensures that the vocab size is a multiple of 64
  Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
  Adds stopping conditions to round robin concatenation
* add peer datasets
* peer callback 1
* add tests
* fix ruff reformat
* fix per-residue task logic
* ruff readd spaces
* ruff check
* remove unnecessary download arg
* ruff checks
* uv update
---------
Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Sidney Lisanza <lisanzas@gene.com>
Co-authored-by: karinazad <karina.zadorozhny@gmail.com>
(Btw, dataset loading tests were done offline rather than with unit tests.)