Adding integration for HuggingFace datasets by peymanvahidi · Pull Request #124 · sacdallago/biotrainer

peymanvahidi · 2024-11-29T21:50:11Z

Description

This PR adds support for integrating HuggingFace datasets into the project, addressing Issue #118. Users can now directly specify a HuggingFace dataset in their configuration, eliminating the need for manually prepared sequence and label files.

Key Features

HuggingFace Dataset Configuration: New options for hf_dataset (path, subset, sequence_column, target_column).
Subset Support: Allows loading specific dataset subsets if specified.
FASTA Conversion: Added a hf_to_fasta utility to convert datasets into FASTA format.
Improved Config Handling: Updated parsing and validation to include HuggingFace datasets, ensuring compatibility with existing protocols and rules.
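
To illustrate the FASTA conversion idea, here is a minimal sketch that turns dataset rows into sequence and label FASTA text. It is not the PR's hf_to_fasta implementation: the function name rows_to_fasta, the SeqN ID scheme, and the SET= header attribute are assumptions made for illustration, and plain dicts stand in for the rows that a HuggingFace dataset would yield.

```python
from typing import Iterable, Mapping, Tuple


def rows_to_fasta(rows: Iterable[Mapping[str, str]],
                  sequence_column: str,
                  target_column: str,
                  set_name: str) -> Tuple[str, str]:
    """Build sequence and label FASTA text from dataset rows (hypothetical helper)."""
    seq_lines, label_lines = [], []
    for idx, row in enumerate(rows):
        seq_id = f"Seq{idx}"  # assumed ID scheme, not taken from the PR
        # Sequence file: header line followed by the raw sequence.
        seq_lines += [f">{seq_id}", row[sequence_column]]
        # Label file: the header carries the split name (assumed SET= attribute).
        label_lines += [f">{seq_id} SET={set_name}", row[target_column]]
    return "\n".join(seq_lines) + "\n", "\n".join(label_lines) + "\n"


# Plain dicts using the column names from the example configuration.
rows = [
    {"protein_sequence": "SEQWENCE", "secondary_structure": "HHHEEECC"},
    {"protein_sequence": "PRTEIN", "secondary_structure": "CCHHEE"},
]
sequences_fasta, labels_fasta = rows_to_fasta(
    rows, "protein_sequence", "secondary_structure", "train"
)
```

The column names are passed in rather than hard-coded, mirroring the sequence_column and target_column options listed above.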

Example Configuration

sequence_file: sequences.fasta
labels_file: labels.fasta
protocol: residue_to_class
hf_dataset:
  path: heispv/protein_data_test
  subset: split_3
  sequence_column: protein_sequence
  target_column: secondary_structure
model_choice: FNN
optimizer_choice: adam
loss_choice: cross_entropy_loss
num_epochs: 200
use_class_weights: False
learning_rate: 1e-3
batch_size: 128
device: cpu
embedder_name: one_hot_encoding

This simplifies dataset integration and leverages HuggingFace’s dataset ecosystem for streamlined workflows.

Closes #118.

- Updating all_options_dict to include HuggingFace dataset options
- Adding methods to load, split, and process HuggingFace datasets
- Implementing _get_hf_dataset_map to parse HuggingFace dataset settings
- Modifying _get_config_maps to include HuggingFace dataset mappings
- Updating get_verified_config to process and include HuggingFace dataset configurations in the final config
- Adding logging for dataset processing steps to inform about file overwrites and processing status
- Ensuring compatibility with existing protocol and validation rules

- Adding explicit handling for HuggingFace dataset 'subset' option with detailed error messages.
- Improving handling of datasets with varying numbers of splits:
  - Automatically merge and split datasets if three predefined splits are not available.
- Enhancing logging for split processing.
- Streamlining _load_and_split_dataset method for better clarity and robustness.
- Minor logging and documentation improvements.
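
The merge-and-split behaviour described in this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the code of _load_and_split_dataset: the function name merge_and_resplit, the split names train/validation/test, the 80/10/10 fractions, and the fixed seed are all hypothetical, and plain lists of dicts stand in for dataset splits.

```python
import random
from typing import Dict, List, Sequence, Tuple

EXPECTED_SPLITS = ("train", "validation", "test")  # assumed split names


def merge_and_resplit(splits: Dict[str, Sequence[dict]],
                      fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                      seed: int = 42) -> Dict[str, List[dict]]:
    """Return three splits, re-splitting only when the expected ones are missing."""
    if set(splits) == set(EXPECTED_SPLITS):
        # All three predefined splits exist: use them unchanged.
        return {name: list(splits[name]) for name in EXPECTED_SPLITS}

    # Otherwise merge every available split and shuffle deterministically.
    merged = [row for name in sorted(splits) for row in splits[name]]
    random.Random(seed).shuffle(merged)

    n_train = int(len(merged) * fractions[0])
    n_val = int(len(merged) * fractions[1])
    return {
        "train": merged[:n_train],
        "validation": merged[n_train:n_train + n_val],
        "test": merged[n_train + n_val:],  # remainder goes to test
    }


# A dataset that only ships a single split.
single = {"train": [{"id": i} for i in range(20)]}
resplit = merge_and_resplit(single)
```

Seeding the shuffle keeps the generated splits reproducible between runs, which matters when the resulting files are overwritten on each run.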

peymanvahidi · 2024-11-30T17:06:19Z

Everything is done @SebieF :)
I'm also preparing the unit tests. Could you please review the files whenever you have time?

SebieF

Thanks for your efforts :) The changes look good overall. I'd suggest some restructuring to ensure separation of concerns, plus a few smaller adjustments that we can also discuss tomorrow!

biotrainer/config/hf_dataset_options.py

biotrainer/config/configurator.py

examples/hf_dataset/README.md

… and refactoring FASTA utilities
- Making hf_dataset mutually exclusive with sequence_file, labels_file, and mask_file
- Supporting only datasets with three splits, removing handling for single or dual splits
- Moving FASTA processing and dataset transformation functions to utilities/fasta.py
- Saving processed files in a dedicated directory
- Implementing additional minor improvements
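
The two constraints introduced here (mutual exclusivity with the file options, and exactly three splits) can be sketched as a small validation step. This is only an illustration under stated assumptions: check_hf_dataset_config is a hypothetical helper, not the configurator's rule mechanism, and the error messages are invented.

```python
from typing import Any, Mapping, Sequence

FILE_OPTIONS = ("sequence_file", "labels_file", "mask_file")
REQUIRED_SPLITS = 3


def check_hf_dataset_config(config: Mapping[str, Any],
                            available_splits: Sequence[str]) -> None:
    """Raise ValueError if hf_dataset is combined with file options or lacks three splits."""
    if "hf_dataset" not in config:
        return  # nothing to check for file-based configurations

    # hf_dataset must not be combined with manually provided files.
    conflicting = [opt for opt in FILE_OPTIONS if opt in config]
    if conflicting:
        raise ValueError(
            f"hf_dataset is mutually exclusive with: {', '.join(conflicting)}"
        )

    # Only datasets that already provide three splits are accepted.
    if len(available_splits) != REQUIRED_SPLITS:
        raise ValueError(
            f"Expected {REQUIRED_SPLITS} splits, found {len(available_splits)}"
        )


# A configuration that passes both checks.
check_hf_dataset_config(
    {"hf_dataset": {"path": "heispv/protein_data_test"}, "protocol": "residue_to_class"},
    ["train", "validation", "test"],
)
```

Under this rule, a configuration that sets both hf_dataset and sequence_file would be rejected.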

peymanvahidi · 2024-12-04T12:20:13Z

All the reviews have been applied in these commits. I should mention that you suggested writing the options for the configurator manually in the config_dict, but since I also wanted to check some rules, I preferred to manually add the hf_dataset option for the files to the config_maps. This way, there is no need to modify the _check_rules function or how the ConfigRule.apply method works inside it.

SebieF

Thanks for applying the suggestions. I have a couple of notes left, then we are ready to merge :)

biotrainer/utilities/fasta.py

docs/config_file_options.md

biotrainer/utilities/fasta.py

examples/hf_dataset/README.md

biotrainer/config/configurator.py

biotrainer/utilities/fasta.py

peymanvahidi · 2024-12-04T19:29:08Z

I think this naming convention is good. It shows that we're dealing with subsets from HuggingFace, and it is no longer confusing like before.

SebieF · 2024-12-05T08:35:16Z

Yes, you can keep that new naming :)

- Merging load_and_split_hf_dataset and hf_to_fasta into process_hf_dataset_to_fasta in utilities
- Simplifying _create_hf_files by delegating to the new utility function
- Reducing configurator complexity by moving all HuggingFace dataset preprocessing to hf_dataset_to_fasta.py, keeping only configuration options in the configurator.

- Moving process_split, load_and_split_hf_dataset, and related HuggingFace preprocessing functions from fasta.py to hf_dataset_to_fasta.py in utilities
- Improving code organization by separating concerns between general FASTA utilities and HuggingFace dataset processing

peymanvahidi · 2024-12-05T18:41:35Z

All the updates are done :)

SebieF

Great, thank you :) I also did one test run with proteinea/fluorescence, and it worked as expected.

peymanvahidi added 10 commits November 28, 2024 21:41

Adding HuggingFace dataset configuration options for dynamic dataset handling

e6c7437

Adding hf_to_fasta function for converting HuggingFace dataset to FASTA format

ca3ec09

Including hf_to_fasta in module exports

bae8883

Ensuring sequence_file and labels_file paths are resolved relative to config.yml

4b6caed

Renaming HFSubsetName's name to subset

125cee0

Adding support for loading specific subsets of HuggingFace datasets

0b7219c

Adding datasets library

4499c35

Adding example for hf_dataset configuration

da67328

SebieF self-requested a review December 1, 2024 17:42

SebieF requested changes Dec 1, 2024

peymanvahidi added 6 commits December 4, 2024 12:20

Moving and refactoring dataset functions from config/configurator.py to this file

98841b4

- Refactoring and separating functions for better readability and modularity.
- Enhancing hf_to_fasta to support masks and multiple outputs.

Adding load_and_split_hf_dataset to imports

baba049

Adding support for mask column in HuggingFace dataset options

83ff752

Separating the doc with the example README file

f412eeb

Writing unittests for hf_dataset option

c1e8776

SebieF self-requested a review December 4, 2024 12:44

SebieF requested changes Dec 4, 2024

peymanvahidi added 3 commits December 5, 2024 19:24

Making minor improvements in docs and examples

7f1069f

Aligning test assertion with updated naming convention

296f909

SebieF self-requested a review December 6, 2024 10:05

SebieF approved these changes Dec 9, 2024

SebieF merged commit c3da0c4 into sacdallago:develop Dec 9, 2024

SebieF mentioned this pull request Dec 9, 2024

v0.9.5 #129

Merged
