Add OpenGenome2 #53

Merged
taylormjs merged 2 commits into main from opengenome2
Mar 25, 2025

Conversation

@taylormjs
Collaborator

  • Iterable dataset for OpenGenome2 (8.84T tokens from genomic sequences)
  • Added DatasetInfo to UME datamodule (WARNING: about 64 times larger than AMPLIFY in terms of token count)
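
For scale, the two figures above imply a rough AMPLIFY token count. A back-of-the-envelope sketch, assuming the 8.84T total and the factor of 64 are both taken at face value (the variable names are illustrative, and the result is an implied estimate, not a measured number):

```python
# Figures quoted in the PR description; both are approximate.
OPENGENOME2_TOKENS = 8_840_000_000_000  # 8.84T tokens
SIZE_RATIO = 64  # "about 64 times larger than AMPLIFY"

# Implied AMPLIFY size under those two assumptions.
implied_amplify_tokens = OPENGENOME2_TOKENS // SIZE_RATIO

print(f"{implied_amplify_tokens / 1e9:.0f}B tokens")
```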

Taylor Joren and others added 2 commits March 24, 2025 19:33
@taylormjs taylormjs marked this pull request as ready for review March 24, 2025 20:10
@taylormjs taylormjs requested review from karinazad and ncfrey March 24, 2025 20:11
  modality=Modality.NUCLEOTIDE,
  supported_splits={Split.TRAIN}, # TODO: add splits
- train_size=8_780_000,
+ train_size=8_780_000, # NOTE - this is an underestimate (whole genomes much longer)
Collaborator Author

Though if we are just truncating to the first 512-1024 tokens, then I suppose this is accurate, assuming all train_size values are estimated as number of samples and not number of tokens.
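
To make the point above concrete: with truncation, each sample contributes at most `max_length` tokens, so the sample count stays the same no matter how long the genomes are. A small sketch, assuming hypothetical sequence lengths and a hypothetical `truncated_token_count` helper (neither is from this PR):

```python
from typing import Iterable


def truncated_token_count(sequence_lengths: Iterable[int], max_length: int) -> int:
    """Total tokens seen when every sequence is cut to its first max_length tokens."""
    return sum(min(length, max_length) for length in sequence_lengths)


# Hypothetical lengths: a short read, a gene-sized region, a whole genome.
lengths = [300, 5_000, 2_000_000]

num_samples = len(lengths)  # what train_size would count
tokens_at_1024 = truncated_token_count(lengths, max_length=1024)
tokens_untruncated = sum(lengths)

print(num_samples, tokens_at_1024, tokens_untruncated)
```

Under truncation the token total is bounded by `num_samples * max_length`, which is why a sample-based `train_size` remains meaningful even for very long genomes.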

@karinazad
Collaborator

Looks good! Should we implement a cap on the training size? It could live in the HF iterable dataset.

@taylormjs
Collaborator Author

@karinazad A cap's a great idea. Would opengenome2 exceed the cap size?
Closing for now, but perhaps we can add a cap in the next MR
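
One possible shape for such a cap, as a sketch only: wrap the sample stream and stop after a fixed number of items. `cap_samples` and `max_samples` are hypothetical names, and the fake stream below stands in for the real dataset. For a Hugging Face `datasets.IterableDataset`, the built-in `take(n)` method should give the same effect.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def cap_samples(samples: Iterable[T], max_samples: Optional[int]) -> Iterator[T]:
    """Return an iterator over at most max_samples items; None disables the cap."""
    if max_samples is None:
        return iter(samples)
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    # islice is lazy, so the underlying stream is never read past the cap.
    return islice(samples, max_samples)


def fake_genome_stream() -> Iterator[Dict[str, object]]:
    """Stand-in for a streaming dataset: an endless generator of records."""
    index = 0
    while True:
        yield {"index": index, "sequence": "ACGT"}
        index += 1


capped = list(cap_samples(fake_genome_stream(), max_samples=3))
print([record["index"] for record in capped])
```

Because the cap is applied lazily, it works on an unbounded stream and does not require knowing the dataset size up front.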

@taylormjs taylormjs merged commit 8a02c6b into main Mar 25, 2025
5 checks passed
@taylormjs taylormjs deleted the opengenome2 branch March 25, 2025 16:30

3 participants