Calm splits#58

Merged
taylormjs merged 11 commits into main from calm-splits
Apr 2, 2025

@taylormjs (Collaborator) commented Apr 1, 2025

  • Add the calm heldout split from the paper and create IID train and val splits (95/5)
  • Update the calm data module
  • Change the HF directory to taylor-joren/calm
  • Prevent caching in ~/.cache to avoid filling up homefs
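
The 95/5 IID split from the first bullet could be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `iid_split`, the seed, and the use of in-memory lists are all assumptions (the actual dataset class is iterable and hosted on Hugging Face).

```python
import random


def iid_split(records, val_fraction=0.05, seed=0):
    """Shuffle records and split them into (train, val) lists.

    Hypothetical helper: the name, seed, and defaults are assumptions.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy stand-in for the training sequences.
train_iid, val_iid = iid_split([f"seq_{i}" for i in range(1000)])
```

Fixing the seed means the same records land in `val_iid` on every run, which keeps validation metrics comparable across experiments.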

Taylor Joren and others added 5 commits March 31, 2025 23:08
karinazad and others added 2 commits April 2, 2025 05:42
* config

* max length

* max length

* beignet
Co-authored-by: freyn6 <freyn6@gene.com>
@taylormjs taylormjs marked this pull request as ready for review April 2, 2025 05:42
@taylormjs taylormjs requested review from karinazad and ncfrey April 2, 2025 05:43
  dataset_class=CalmIterableDataset,
  modality=Modality.NUCLEOTIDE,
- supported_splits={Split.TRAIN}, # TODO: add splits
+ supported_splits={"train_full", "train_iid", "val_iid", "heldout"},
Collaborator Author (taylormjs):
Updated the available splits for calm. I didn't add an enum for these, but can add one if we want.
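
If an enum for these names were added, it might look like the sketch below. `CalmSplit` is a hypothetical name that does not appear in this PR; only the four string values come from the diff above.

```python
from enum import Enum


class CalmSplit(str, Enum):
    """Hypothetical enum for the calm split names; not part of this PR."""

    TRAIN_FULL = "train_full"
    TRAIN_IID = "train_iid"
    VAL_IID = "val_iid"
    HELDOUT = "heldout"


# Members subclass str, so they compare equal to the plain strings
# already used in supported_splits.
supported_splits = {split.value for split in CalmSplit}
```

Looking up a name with `CalmSplit("heldout")` returns the matching member, and an unknown name raises `ValueError`, which gives early validation of split arguments.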

Collaborator:
I wonder whether, for training, we should keep it simple with just train and test, and do more robust evaluation outside of the datamodule?

Training data is originally from FASTA files containing coding DNA sequences.
"""

SUPPORTED_SPLITS: ClassVar[list[str]] = ["train_full", "train_iid", "val_iid", "heldout"]
Collaborator Author (taylormjs):
New calm splits

Function or transform to apply to the data, by default None.
split_mode : str, optional
Split mode to use. Options:
- "pre_split": Use pre-created IID splits (train_iid, val_iid) and heldout for test
Collaborator:
I think we could make this the only default option and not allow customized splits so it's more in line with the rest of the HF datasets?
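
A minimal sketch of what a single, fixed "pre_split" mode could resolve to, based on the docstring line above (train_iid and val_iid for training and validation, heldout for test). The stage names and the `resolve_split` helper are assumptions for illustration, not the data module's actual interface.

```python
# Hypothetical mapping from training stage to calm split name.
# The stage keys ("train", "val", "test") are assumptions.
PRE_SPLIT = {"train": "train_iid", "val": "val_iid", "test": "heldout"}


def resolve_split(stage, split_mode="pre_split"):
    """Return the split name for a stage under the given split mode."""
    if split_mode != "pre_split":
        raise ValueError(f"Unsupported split_mode: {split_mode!r}")
    try:
        return PRE_SPLIT[stage]
    except KeyError:
        raise ValueError(f"Unknown stage: {stage!r}") from None
```

With only one mode supported, the mapping is a constant and there is no custom split logic to maintain.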

@taylormjs taylormjs self-assigned this Apr 2, 2025
@taylormjs taylormjs merged commit 3e71dbc into main Apr 2, 2025
5 checks passed
@taylormjs taylormjs deleted the calm-splits branch April 2, 2025 22:00
3 participants