Adding integration for HuggingFace datasets #124

Merged
SebieF merged 20 commits into sacdallago:develop from peymanvahidi:feature/hf_dataset
Dec 9, 2024

Conversation

@peymanvahidi
Contributor

@peymanvahidi peymanvahidi commented Nov 29, 2024

Description

This PR adds support for integrating HuggingFace datasets into the project, addressing Issue #118. Users can now specify a HuggingFace dataset directly in their configuration, eliminating the need for manually prepared sequence and label files.

Key Features

  • HuggingFace Dataset Configuration: New options for hf_dataset (path, subset, sequence_column, target_column).
  • Subset Support: Allows loading specific dataset subsets if specified.
  • FASTA Conversion: Added a hf_to_fasta utility to convert datasets into FASTA format.
  • Improved Config Handling: Updated parsing and validation to include HuggingFace datasets, ensuring compatibility with existing protocols and rules.
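
The FASTA conversion step can be pictured roughly as follows. This is a minimal sketch, not the PR's actual hf_to_fasta: the function name `records_to_fasta`, the `Seq{i}` ID scheme, and the plain list-of-dicts input are assumptions made for illustration (rows of a HuggingFace dataset can likewise be read as dicts keyed by column name).

```python
from typing import Iterable, Mapping, Tuple


def records_to_fasta(
    records: Iterable[Mapping[str, str]],
    sequence_column: str,
    target_column: str,
) -> Tuple[str, str]:
    """Return (sequences_fasta, labels_fasta) as text. Hypothetical helper."""
    seq_lines, label_lines = [], []
    for i, row in enumerate(records, start=1):
        seq_id = f"Seq{i}"  # assumed ID scheme, not taken from the PR
        # Each record becomes a header line plus one content line per file.
        seq_lines += [f">{seq_id}", row[sequence_column]]
        label_lines += [f">{seq_id}", row[target_column]]
    return "\n".join(seq_lines) + "\n", "\n".join(label_lines) + "\n"


# Toy rows using the column names from the example configuration.
rows = [
    {"protein_sequence": "MKT", "secondary_structure": "HHC"},
    {"protein_sequence": "GA", "secondary_structure": "CC"},
]
sequences_fasta, labels_fasta = records_to_fasta(
    rows, "protein_sequence", "secondary_structure"
)
```

Sequences and labels share the same IDs, so the two files can be matched entry by entry.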

Example Configuration

sequence_file: sequences.fasta
labels_file: labels.fasta
protocol: residue_to_class
hf_dataset:
  path: heispv/protein_data_test
  subset: split_3
  sequence_column: protein_sequence
  target_column: secondary_structure
model_choice: FNN
optimizer_choice: adam
loss_choice: cross_entropy_loss
num_epochs: 200
use_class_weights: False
learning_rate: 1e-3
batch_size: 128
device: cpu
embedder_name: one_hot_encoding

This simplifies dataset integration and leverages HuggingFace’s dataset ecosystem for streamlined workflows.
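
A minimal sketch of how the `hf_dataset` block might be validated after the YAML is parsed into a dict. The function name `parse_hf_dataset_options` is hypothetical, and treating `path`, `sequence_column`, and `target_column` as required with `subset` optional is an assumption based on the feature list above, not the PR's actual validation code. With the `datasets` library, the resulting options would presumably be passed on as something like `load_dataset(path, name=subset)`.

```python
from typing import Any, Dict, Optional

REQUIRED_KEYS = ("path", "sequence_column", "target_column")  # assumed required
OPTIONAL_KEYS = ("subset",)  # assumed optional


def parse_hf_dataset_options(config: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Validate the hf_dataset block of a parsed config. Hypothetical helper."""
    hf = config.get("hf_dataset")
    if not isinstance(hf, dict):
        raise ValueError("hf_dataset must be a mapping of options")
    missing = [key for key in REQUIRED_KEYS if not hf.get(key)]
    if missing:
        raise ValueError(f"hf_dataset is missing required option(s): {missing}")
    unknown = sorted(set(hf) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS))
    if unknown:
        raise ValueError(f"hf_dataset has unknown option(s): {unknown}")
    # Normalize: every known key is present, subset is None when not given.
    return {key: hf.get(key) for key in REQUIRED_KEYS + OPTIONAL_KEYS}


example_config = {
    "protocol": "residue_to_class",
    "hf_dataset": {
        "path": "heispv/protein_data_test",
        "subset": "split_3",
        "sequence_column": "protein_sequence",
        "target_column": "secondary_structure",
    },
}
hf_options = parse_hf_dataset_options(example_config)
```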

Closes #118.

- Updating all_options_dict to include HuggingFace dataset options
- Adding methods to load, split, and process HuggingFace datasets
- Implementing _get_hf_dataset_map to parse HuggingFace dataset settings
- Modifying _get_config_maps to include HuggingFace dataset mappings
- Updating get_verified_config to process and include HuggingFace dataset configurations in the final config
- Adding logging for dataset processing steps to inform about file overwrites and processing status
- Ensuring compatibility with existing protocol and validation rules
- Adding explicit handling for HuggingFace dataset 'subset' option with detailed error messages.
- Improving handling of datasets with varying numbers of splits:
  - Automatically merge and split datasets if three predefined splits are not available.
  - Enhancing logging for split processing.
- Streamlining _load_and_split_dataset method for better clarity and robustness.
- Minor logging and documentation improvements.
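
The merge-and-split behaviour described in these commits could look roughly like the sketch below. The split names, the 80/10/10 ratios, the fixed seed, and the function name `ensure_three_splits` are all assumptions for illustration; note that a later commit in this PR drops this handling and supports only datasets that already have three splits.

```python
import random
from typing import Dict, List, Sequence, Tuple

EXPECTED_SPLITS = ("train", "validation", "test")  # assumed split names


def ensure_three_splits(
    splits: Dict[str, Sequence],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 42,
) -> Dict[str, List]:
    """Return train/validation/test, merging and re-splitting if needed."""
    if set(splits) == set(EXPECTED_SPLITS):
        # The three predefined splits exist: keep them unchanged.
        return {name: list(splits[name]) for name in EXPECTED_SPLITS}
    # Otherwise merge all rows and draw a new, reproducible random split.
    merged = [row for name in sorted(splits) for row in splits[name]]
    random.Random(seed).shuffle(merged)
    n_train = int(len(merged) * ratios[0])
    n_val = int(len(merged) * ratios[1])
    return {
        "train": merged[:n_train],
        "validation": merged[n_train:n_train + n_val],
        "test": merged[n_train + n_val:],  # remainder goes to test
    }


resplit = ensure_three_splits({"train": list(range(10))})
```
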
@peymanvahidi
Contributor Author

Everything is done @SebieF :)
I'm also preparing the unit tests. Could you please review the files whenever you have time?

@SebieF SebieF self-requested a review December 1, 2024 17:42
Collaborator

@SebieF SebieF left a comment

Thanks for your efforts :) The changes look good overall. I'd suggest some restructuring to ensure separation of concerns, plus some other rather small adjustments that we can also discuss tomorrow!

… and refactoring FASTA utilities

- Making hf_dataset mutually exclusive with sequence_file, labels_file, and mask_file
- Supporting only datasets with three splits, removing handling for single or dual splits
- Moving FASTA processing and dataset transformation functions to utilities/fasta.py
- Saving processed files in a dedicated directory
- Implementing additional minor improvements
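
The two constraints introduced here (mutual exclusivity with the file options, and exactly three splits) can be sketched as simple checks. The function names and error messages are hypothetical, and the actual PR expresses the exclusivity through the project's config rules rather than a standalone function like this one.

```python
from typing import Any, Dict, Iterable

FILE_OPTIONS = ("sequence_file", "labels_file", "mask_file")


def check_hf_exclusive(config: Dict[str, Any]) -> None:
    """Reject configs that combine hf_dataset with local file options."""
    if "hf_dataset" not in config:
        return
    conflicting = [key for key in FILE_OPTIONS if key in config]
    if conflicting:
        raise ValueError(
            f"hf_dataset cannot be combined with: {', '.join(conflicting)}"
        )


def check_three_splits(split_names: Iterable[str]) -> None:
    """Require exactly three splits; fewer or more are rejected."""
    names = list(split_names)
    if len(names) != 3:
        raise ValueError(f"expected 3 splits, found {len(names)}: {names}")


# A config that uses only hf_dataset passes both checks.
check_hf_exclusive({"hf_dataset": {"path": "heispv/protein_data_test"}})
check_three_splits(["train", "validation", "test"])
```
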
…to this file

- Refactoring and separating functions for better readability and modularity.
- Enhancing hf_to_fasta to support masks and multiple outputs.
@peymanvahidi
Contributor Author

All the reviews have been applied in these commits. I should mention that you suggested writing the options for the configurator manually in the config_dict, but since I also wanted to check some rules, I preferred to manually add the hf_dataset option for the files to the config_maps. This way, there is no need to modify the _check_rules function or how the ConfigRule.apply method works inside it.

@SebieF SebieF self-requested a review December 4, 2024 12:44
Collaborator

@SebieF SebieF left a comment

Thanks for applying the suggestions. I have a couple of notes left, then we are ready to merge :)

@peymanvahidi
Contributor Author

I think this naming convention is good. It shows that we're dealing with subsets from HuggingFace, and it is no longer confusing like before.

@SebieF
Collaborator

SebieF commented Dec 5, 2024

Yes, you can keep that new naming :)

- Merging load_and_split_hf_dataset and hf_to_fasta into process_hf_dataset_to_fasta in utilities
- Simplifying _create_hf_files by delegating to the new utility function
- Reducing configurator complexity by moving all HuggingFace dataset preprocessing to hf_dataset_to_fasta.py, keeping only configuration options in the configurator.
- Moving process_split, load_and_split_hf_dataset, and related HuggingFace preprocessing functions from fasta.py to hf_dataset_to_fasta.py in utilities
- Improving code organization by separating concerns between general FASTA utilities and HuggingFace dataset processing
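
As a rough picture of what a single split-to-FASTA utility could do, the sketch below writes all three splits into one file inside a dedicated directory. It is not the PR's process_hf_dataset_to_fasta: the function name `write_split_fasta`, the `SET=` header annotation, the output file name, and the `hf_db` directory name are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_split_fasta(splits: Dict[str, List[str]], out_dir: Path) -> Path:
    """Write sequences from all splits to one FASTA file. Hypothetical helper."""
    out_dir.mkdir(parents=True, exist_ok=True)  # dedicated output directory
    out_file = out_dir / "sequences.fasta"
    lines = []
    counter = 0
    for split_name, sequences in splits.items():
        for seq in sequences:
            counter += 1
            # Assumed header format: ID followed by the split it belongs to.
            lines.append(f">Seq{counter} SET={split_name}")
            lines.append(seq)
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    toy_splits = {"train": ["MKT"], "validation": ["GA"], "test": ["PLV"]}
    fasta_path = write_split_fasta(toy_splits, Path(tmp) / "hf_db")
    content = fasta_path.read_text(encoding="utf-8")
```
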
@peymanvahidi
Contributor Author

All the updates are done :)

@SebieF SebieF self-requested a review December 6, 2024 10:05
Collaborator

@SebieF SebieF left a comment

Great, thank you :) I also did one test run with proteinea/fluorescence, and it worked as expected.

@SebieF SebieF merged commit c3da0c4 into sacdallago:develop Dec 9, 2024
@SebieF SebieF mentioned this pull request Dec 9, 2024