
Ume tokenizers and dataset sampling options#46

Merged
karinazad merged 7 commits into main from ume-tokenizers
Mar 14, 2025
Conversation

@karinazad
Collaborator

@karinazad karinazad commented Mar 14, 2025

  • Creates mutually-compatible multi-modal Ume tokenizers
  • Ensures that the vocab size is a multiple of 64
  • Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
  • Adds stopping conditions to round robin concatenation
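
Rounding the vocab size up to a multiple of 64 is a common choice for GPU efficiency (tensor-core-friendly embedding and output matrix shapes). A minimal sketch of the rounding, assuming a standalone helper rather than the actual Ume implementation:

```python
def round_up_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round a vocabulary size up to the nearest multiple of `multiple`.

    The padding slots are typically filled with unused placeholder tokens
    so the embedding table has an efficient shape.
    """
    return ((vocab_size + multiple - 1) // multiple) * multiple
```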

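Round-robin concatenation draws from each dataset in turn, and a stopping condition decides whether iteration ends as soon as the first dataset is exhausted or continues until all datasets are. A hedged sketch of the two conditions (the names `any_exhausted`/`all_exhausted` are illustrative, not the Ume datamodule's API):

```python
def round_robin(iterables, stop="all_exhausted"):
    """Interleave items from several iterables in round-robin order.

    stop="any_exhausted": stop as soon as one source runs out.
    stop="all_exhausted": keep drawing from the remaining sources
    until every source runs out.
    """
    iterators = [iter(it) for it in iterables]
    if stop == "any_exhausted":
        while iterators:
            for it in iterators:
                try:
                    yield next(it)
                except StopIteration:
                    return
    else:  # "all_exhausted"
        while iterators:
            alive = []
            for it in iterators:
                try:
                    yield next(it)
                    alive.append(it)
                except StopIteration:
                    pass  # drop the exhausted source
            iterators = alive
```

A multiplex sampler, by contrast, would draw the next item from a randomly chosen (possibly weighted) source rather than strictly alternating.
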
@karinazad karinazad changed the title Ume tokenizers Ume tokenizers and dataset sampling options Mar 14, 2025
@karinazad karinazad merged commit b1d4b62 into main Mar 14, 2025
5 checks passed
@karinazad karinazad deleted the ume-tokenizers branch March 14, 2025 16:14
taylormjs pushed a commit that referenced this pull request Mar 17, 2025
taylormjs pushed a commit that referenced this pull request Mar 18, 2025
taylormjs added a commit that referenced this pull request Mar 18, 2025
* add peer datasets

* fix ruff

* add peer datasets, fix get item for all tasks

* lg tokenizer

* lg tokenizer assets

* lint

* added test and new word level model vs bpe

* rename to include coord tokenization explicitly

* ruff tests

* dataset hg

* ruff

* Nathan's comments

* forgot to add files

* Add 3D Pinder to Ume datamodule (#45)

* Ume tokenizers and dataset sampling options (#46)

Creates mutually-compatible multi-modal Ume tokenizers
Ensures that the vocab size is a multiple of 64
Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
Adds stopping conditions to round robin concatenation

* add peer datasets

* peer callback 1

* add tests

* fix ruff reformat

* fix per-residue task logic

* ruff readd spaces

* ruff check

* remove unnecessary download arg

* ruff checks

* uv update

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Sidney Lisanza <lisanzas@gene.com>
Co-authored-by: karinazad <karina.zadorozhny@gmail.com>
ncfrey pushed a commit that referenced this pull request Mar 19, 2025