Add OpenGenome2 by taylormjs · Pull Request #53 · prescient-design/lobster

taylormjs · 2025-03-24T19:35:47Z

Iterable dataset for OpenGenome2 (8.84T tokens from genomic sequences)
Added DatasetInfo to UME datamodule (WARNING: about 64 times larger than AMPLIFY in terms of token count)

taylormjs · 2025-03-24T20:26:55Z

src/lobster/data/_ume_datamodule.py

        modality=Modality.NUCLEOTIDE,
        supported_splits={Split.TRAIN},  # TODO: add splits
-        train_size=8_780_000,
+        train_size=8_780_000,  # NOTE - this is an underestimate (whole genomes much longer)


Though if we are just truncating to the first 512-1024 tokens, then I suppose this is accurate, assuming every train_size is estimated as a number of samples and not a number of tokens.
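To make the samples-versus-tokens distinction concrete, here is a rough back-of-the-envelope sketch. It assumes `train_size` counts samples and that every sample is truncated to a fixed `max_length`; the helper name `max_tokens_seen` is hypothetical and not part of lobster.

```python
def max_tokens_seen(num_samples: int, max_length: int) -> int:
    """Upper bound on tokens consumed in one pass when each sample
    is truncated to at most max_length tokens."""
    return num_samples * max_length


TRAIN_SIZE = 8_780_000  # sample count taken from the diff above

for max_length in (512, 1024):
    # Prints the truncation length and the resulting token upper bound.
    print(max_length, max_tokens_seen(TRAIN_SIZE, max_length))
```

Under these assumptions a single pass touches at most about 4.5B to 9.0B tokens, well below the 8.84T tokens in the full dataset, which is why a sample-based `train_size` can still be a reasonable estimate when sequences are truncated.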

karinazad · 2025-03-25T14:47:12Z

looks good! should we implement a cap on the training size? could be in the HF iterator dataset

taylormjs · 2025-03-25T16:28:13Z

@karinazad A cap's a great idea. Would opengenome2 exceed the cap size?
Closing for now, but perhaps we can add a cap in the next MR
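For reference, one way such a cap could look is sketched below. This is a generic, standard-library-only illustration, not the lobster implementation; the function name `capped` and the `max_samples` parameter are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def capped(samples: Iterable[T], max_samples: Optional[int] = None) -> Iterator[T]:
    """Yield at most max_samples items from samples.

    max_samples=None disables the cap, so the source is consumed as-is.
    """
    if max_samples is not None and max_samples < 0:
        raise ValueError("max_samples must be non-negative")
    # islice stops after max_samples items without reading the rest of the stream.
    return islice(samples, max_samples)


# Usage: cap a (potentially very long) stream at three samples.
first_three = list(capped(iter(range(10)), max_samples=3))
```

Because `islice` is lazy, the cap never materializes the underlying stream, which matters for a dataset of this size. Hugging Face `datasets.IterableDataset` also provides a `take(n)` method that should serve the same purpose if the cap lives in the HF iterable dataset itself.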

Taylor Joren and others added 2 commits March 24, 2025 19:33

add opengenome2

019cf5f

typo fix

d7f7495

taylormjs marked this pull request as ready for review March 24, 2025 20:10

taylormjs temporarily deployed to test.pypi.org March 24, 2025 20:10 with GitHub Actions (inactive)

taylormjs requested review from karinazad and ncfrey March 24, 2025 20:11


ncfrey approved these changes Mar 25, 2025


taylormjs merged commit 8a02c6b into main Mar 25, 2025
5 checks passed

taylormjs deleted the opengenome2 branch March 25, 2025 16:30

