
Ume tokenizers and dataset sampling options#46

Merged
karinazad merged 7 commits into main from ume-tokenizers
Mar 14, 2025
Conversation

@karinazad
Collaborator

@karinazad karinazad commented Mar 14, 2025

  • Creates mutually-compatible multi-modal Ume tokenizers
  • Ensures that the vocab size is a multiple of 64
  • Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
  • Adds stopping conditions to round robin concatenation
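
Rounding the vocab size up to a multiple of 64 is a common choice for GPU efficiency (tensor-core-friendly embedding and output matrix shapes). A minimal sketch of the rounding, assuming a standalone helper rather than the actual Ume implementation:

```python
def round_up_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round a vocabulary size up to the nearest multiple of `multiple`.

    The padding slots are typically filled with unused placeholder tokens
    so the embedding table has an efficient shape.
    """
    return ((vocab_size + multiple - 1) // multiple) * multiple
```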

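Round-robin concatenation draws from each dataset in turn, and a stopping condition decides whether iteration ends as soon as the first dataset is exhausted or continues until all datasets are. A hedged sketch of the two conditions (the names `any_exhausted`/`all_exhausted` are illustrative, not the Ume datamodule's API):

```python
def round_robin(iterables, stop="all_exhausted"):
    """Interleave items from several iterables in round-robin order.

    stop="any_exhausted": stop as soon as one source runs out.
    stop="all_exhausted": keep drawing from the remaining sources
    until every source runs out.
    """
    iterators = [iter(it) for it in iterables]
    if stop == "any_exhausted":
        while iterators:
            for it in iterators:
                try:
                    yield next(it)
                except StopIteration:
                    return
    else:  # "all_exhausted"
        while iterators:
            alive = []
            for it in iterators:
                try:
                    yield next(it)
                    alive.append(it)
                except StopIteration:
                    pass  # drop the exhausted source
            iterators = alive
```

A multiplex sampler, by contrast, would draw the next item from a randomly chosen (possibly weighted) source rather than strictly alternating.
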
@karinazad karinazad changed the title Ume tokenizers Ume tokenizers and dataset sampling options Mar 14, 2025
@karinazad karinazad merged commit b1d4b62 into main Mar 14, 2025
5 checks passed
@karinazad karinazad deleted the ume-tokenizers branch March 14, 2025 16:14
taylormjs pushed a commit that referenced this pull request Mar 17, 2025
taylormjs pushed a commit that referenced this pull request Mar 18, 2025
taylormjs added a commit that referenced this pull request Mar 18, 2025
* add peer datasets

* fix ruff

* add peer datasets, fix get item for all tasks

* lg tokenizer

* lg tokenizer assets

* lint

* added test and new word level model vs bpe

* rename to include coord tokenization explicitly

* ruff tests

* dataset hg

* ruff

* Nathan's comments

* forgot to add files

* Add 3D Pinder to Ume datamodule (#45)

* Ume tokenizers and dataset sampling options (#46)

Creates mutually-compatible multi-modal Ume tokenizers
Ensures that the vocab size is a multiple of 64
Adds an option to use multiplex sampler in Ume datamodule (in addition to round robin concatenation)
Adds stopping conditions to round robin concatenation

* add peer datasets

* peer callback 1

* add tests

* fix ruff reformat

* fix per-residue task logic

* ruff readd spaces

* ruff check

* remove unnecessary download arg

* ruff checks

* uv update

---------

Co-authored-by: Taylor Joren <joren.taylor@gene.com>
Co-authored-by: Sidney Lisanza <lisanzas@gene.com>
Co-authored-by: karinazad <karina.zadorozhny@gmail.com>
ncfrey pushed a commit that referenced this pull request Mar 19, 2025