Adding integration for HuggingFace datasets #124
SebieF merged 20 commits into sacdallago:develop from peymanvahidi:feature/hf_dataset
- Updating all_options_dict to include HuggingFace dataset options
- Adding methods to load, split, and process HuggingFace datasets
- Implementing _get_hf_dataset_map to parse HuggingFace dataset settings
- Modifying _get_config_maps to include HuggingFace dataset mappings
- Updating get_verified_config to process and include HuggingFace dataset configurations in the final config
- Adding logging for dataset processing steps to inform about file overwrites and processing status
- Ensuring compatibility with existing protocol and validation rules
- Adding explicit handling for HuggingFace dataset 'subset' option with detailed error messages.
- Improving handling of datasets with varying numbers of splits:
  - Automatically merge and split datasets if three predefined splits are not available.
- Enhancing logging for split processing.
- Streamlining _load_and_split_dataset method for better clarity and robustness.
- Minor logging and documentation improvements.
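The merge-and-split behaviour described in this commit could look roughly like the sketch below. It is not the PR's code: the function name `merge_and_resplit`, the split names, the 80/10/10 fractions, and the fixed seed are all assumptions made for illustration, and plain Python lists stand in for HuggingFace dataset objects.

```python
import random


def merge_and_resplit(splits, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return three splits from a mapping of split name -> list of examples.

    Hypothetical helper: if exactly three splits are already present they are
    returned unchanged; otherwise all examples are merged, shuffled, and cut
    into train/validation/test parts (fractions are an assumption).
    """
    if len(splits) == 3:
        return dict(splits)

    # Merge every available split into one pool of examples.
    merged = [example for rows in splits.values() for example in rows]

    # Shuffle with a fixed seed so the resulting splits are reproducible.
    rng = random.Random(seed)
    rng.shuffle(merged)

    n_train = int(len(merged) * fractions[0])
    n_val = int(len(merged) * fractions[1])
    return {
        "train": merged[:n_train],
        "validation": merged[n_train:n_train + n_val],
        "test": merged[n_train + n_val:],  # remainder goes to the test split
    }
```

Seeding the shuffle keeps repeated runs on the same dataset comparable, which matters when the processed files overwrite earlier ones.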
Everything is done @SebieF :)
SebieF left a comment
Thanks for your efforts :) The changes look good overall. I'd suggest some restructuring to ensure separation of concerns, plus a few smaller adjustments that we can also discuss tomorrow!
… and refactoring FASTA utilities
- Making hf_dataset mutually exclusive with sequence_file, labels_file, and mask_file
- Supporting only datasets with three splits, removing handling for single or dual splits
- Moving FASTA processing and dataset transformation functions to utilities/fasta.py
- Saving processed files in a dedicated directory
- Implementing additional minor improvements
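A mutual-exclusivity rule of this kind can be sketched as a small validation function. The option names come from the commit message; the function name `check_hf_dataset_exclusive`, the dictionary-based config, and the error wording are assumptions rather than the configurator's actual rule mechanism.

```python
# Option names as listed in the commit message above.
FILE_OPTIONS = ("sequence_file", "labels_file", "mask_file")


def check_hf_dataset_exclusive(config: dict) -> None:
    """Raise ValueError if hf_dataset is combined with a manual file option.

    Hypothetical helper for illustration only.
    """
    if "hf_dataset" not in config:
        return  # nothing to check when no HuggingFace dataset is configured

    conflicting = [name for name in FILE_OPTIONS if name in config]
    if conflicting:
        raise ValueError(
            "hf_dataset cannot be combined with: " + ", ".join(conflicting)
        )
```

Failing early with the names of the conflicting options tells the user which entries to remove from the configuration.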
…to this file
- Refactoring and separating functions for better readability and modularity.
- Enhancing hf_to_fasta to support masks and multiple outputs.
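To make the conversion step concrete, here is a minimal sketch of turning one dataset split into FASTA lines. It is not the PR's hf_to_fasta: the function name `rows_to_fasta_lines`, the column arguments, and the header layout (`TARGET=` and `SET=` attributes) are illustrative assumptions, masks are left out, and a list of dictionaries stands in for a HuggingFace split.

```python
def rows_to_fasta_lines(rows, set_name, sequence_column, target_column):
    """Convert rows of one split into FASTA header/sequence lines.

    Hypothetical helper: each row is a dict; the header format is an
    assumption chosen for illustration.
    """
    lines = []
    for index, row in enumerate(rows):
        # Build a unique identifier from the split name and the row index.
        header = f">{set_name}_{index} TARGET={row[target_column]} SET={set_name}"
        lines.append(header)
        lines.append(row[sequence_column])
    return lines


# Usage: write the lines for a small, made-up training split.
example_rows = [{"sequence": "MKV", "label": 1.5}, {"sequence": "GSH", "label": 0.2}]
print("\n".join(rows_to_fasta_lines(example_rows, "train", "sequence", "label")))
```

Passing the column names as arguments keeps the helper independent of any particular dataset's schema.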
All the reviews have been applied in these commits. I should mention that you suggested writing the options for the configurator manually in the config_dict, but since I also wanted to check some rules, I preferred to manually add the
SebieF left a comment
Thanks for applying the suggestions. I have a couple of notes left, then we are ready to merge :)
I think this naming convention is good. It shows that we're dealing with subsets from HuggingFace, and it is not confusing like before.
Yes, you can keep that new naming :)
- Merging load_and_split_hf_dataset and hf_to_fasta into process_hf_dataset_to_fasta in utilities
- Simplifying _create_hf_files by delegating to the new utility function
- Reducing configurator complexity by moving all HuggingFace dataset preprocessing to hf_dataset_to_fasta.py, keeping only configuration options in the configurator.
- Moving process_split, load_and_split_hf_dataset, and related HuggingFace preprocessing functions from fasta.py to hf_dataset_to_fasta.py in utilities
- Improving code organization by separating concerns between general FASTA utilities and HuggingFace dataset processing
All the updates are done :)
SebieF left a comment
Great, thank you :) I also did one test run with proteinea/fluorescence, and it worked as expected.
Description
This PR adds support for integrating HuggingFace datasets into the project, addressing Issue #118. Users can now directly specify a HuggingFace dataset in their configuration, eliminating the need for manually prepared sequence and label files.
Key Features
Example Configuration
This simplifies dataset integration and leverages HuggingFace’s dataset ecosystem for streamlined workflows.
Closes #118.