Releases: sacdallago/biotrainer
v1.3.0
13.01.2026 - Version 1.3.0
Features
- Adding additional PBC tasks (binding, conservation, disorder, membrane)
- Adding direct input of embeddings via the `input_data` config option
- Adding an improved autoeval report API (including a summary function)
- Adding a random-sampling baseline based on training set distributions
- Adding optimized huggingface embedder models for ProtT5, ProstT5 and the ESM-2 family
- Optimizing async embedding saving to speed up embedding calculation
- Improving memory estimation for embedding batch sizes on MacOS
- Adding an optional scaling step to scale input features if required (via the `scaling_method` config option)
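As a rough sketch, the new option might be used in a config like the following. Only `scaling_method` is named in the notes above; the surrounding keys mirror options mentioned elsewhere in these release notes, and the value `standard` is an assumption, not a documented choice:

```yaml
# Hypothetical biotrainer config excerpt (illustrative only).
protocol: sequence_to_class
input_file: input.fasta
embedder_name: one_hot_encoding
scaling_method: standard  # optional scaling step for input features
```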
Fixes
- Adding a link to the config bank in the plm_eval Jupyter notebook and adapting it to the improved report API
Breaking
- Replacing `eval()` with `ast.literal_eval()` (Breaking: removes support for list comprehensions in cross-validation parameters)
- Optimized autoeval PBC config bank
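The practical effect of this change can be seen with plain stdlib Python: `ast.literal_eval` accepts literal values such as lists of numbers, but rejects expressions like list comprehensions, which `eval()` previously allowed (the parameter strings below are made up for illustration):

```python
import ast

# ast.literal_eval only parses Python literals (numbers, strings,
# lists, tuples, dicts, ...), while eval() would execute arbitrary code.
params = ast.literal_eval("[1e-3, 1e-4, 1e-5]")
assert params == [1e-3, 1e-4, 1e-5]

# A list comprehension is not a literal, so it is now rejected:
rejected = False
try:
    ast.literal_eval("[10 ** -i for i in range(3, 6)]")
except ValueError:
    rejected = True
assert rejected
```

Listing the values explicitly, as in the first call, keeps configs working after the change.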
v1.2.1
18.11.2025 - Version 1.2.1
Fixes
- Added additional verifications for h5 files and embeddings calculation to improve reliability in the autoeval_pipeline.
- Fixed issues with example embedding functions in the plm_eval Jupyter notebook for accurate demonstrations.
- Using Pydantic for the `BiotrainerSeqRecord` class for validation and consistency.
- Exported `OutputData` in the `output_files` module for better modularity.
Features
- Introduced a new `input_data` configuration option, allowing sequence records to be directly provided in code for flexibility.
v1.2.0
29.10.2025 - Version 1.2.0
Features
- Added autoeval framework PBC and example notebook.
- Introduced autoeval_pipeline parameters for loading precomputed embeddings.
- Added automatic dimensionality reduction to the inferencer.
Fixes
- Replaced t-SNE with PCA for embeddings reduction (because t-SNE cannot be retroactively applied).
- Fixed embedding functions compatibility with class_weights and classification protocols.
- Updated dataset links to point to the PBC repository.
- Corrected issues in autoeval tutorial notebooks and inference test configurations.
Refactors
- Added support for Python 3.13 [BETA].
- Enhanced flexibility for test set naming and output paths.
- Centralized shared functionality of data handler in the base class.
Tests
- Added unit tests for autoeval pipeline and class weight training validation.
v1.1.0
25.09.2025 - Version 1.1.0
Features
- Adding a blosum62 predefined embedder via the `blosum` Python package, using the BLOSUM substitution matrix as embeddings
- Adding an AAOntology predefined embedder from https://doi.org/10.1016/j.jmb.2024.168717, using amino acid feature scales
- Adding biotrainer-ready quickstart datasets (subcellular location and secondary structure) in the README.md
- Adding a masked language modeling (MLM) task via the residue_to_class protocol, a CNN decoder and a `random_masking` option in the finetuning config
- Adding LoRA examples for MLM and downstream tasks
- [BETA] Adding a `residue_to_value` protocol
Breaking
- Refactoring confidence range calculation to use the empirical distribution. Bootstrapping and MCD used the assumption of a normal distribution, which is acceptable for large sample sizes due to the CLT, but the empirical distribution gives better upper and lower bounds, especially for small sample sizes
- autoeval: Adding the framework name to the task name in autoeval. This makes it easier to add multiple frameworks in the future
- autoeval: Changing the autoeval FLIP scl protocol to sequence_to_class. This requires fewer resources but is still valid for evaluating PLMs
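The reasoning behind the confidence-range change can be sketched with stdlib Python alone. For a skewed score distribution, percentile-based (empirical) bounds stay inside the observed score range, while a mean ± 1.96·stdev interval can overshoot it. The scores below are made up for illustration and are not real biotrainer output:

```python
import random
import statistics

# Hypothetical bootstrap scores: a skewed distribution where the
# normal approximation is a poor fit (e.g. scores bounded by 1.0).
random.seed(42)
scores = sorted(random.betavariate(8, 2) for _ in range(200))

# Normal-assumption 95% interval: mean +/- 1.96 * stdev.
mean, sd = statistics.mean(scores), statistics.stdev(scores)
normal_bounds = (mean - 1.96 * sd, mean + 1.96 * sd)

# Empirical 95% interval: read the 2.5th and 97.5th percentiles
# directly from the sorted scores.
emp_lower = scores[int(0.025 * len(scores))]
emp_upper = scores[int(0.975 * len(scores)) - 1]

# Empirical bounds never leave the observed score range.
assert scores[0] <= emp_lower < emp_upper <= scores[-1]
```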
Maintenance
- Updating dependencies
Fixes
- Fixing broken `use_half_precision` embeddings mode and adding a comment about downstream float32 precision usage
v1.0.0
03.07.2025 - Version 1.0.0
Features
- Adding an `OutputManager` class that can be customized by adding observers for easier integration with external tools such as MLflow, WandB and TensorBoard (the latter is already supported)
- Adding the `autoeval` module to biotrainer that enables evaluating protein language models on downstream tasks. Currently, a curated subset of the FLIP datasets is supported.
- Adding an improved CLI, including `train`, `predict`, `convert` (deprecated files) and `autoeval` commands
- Adding an `InputValidator` and `InputValidationStep` that validate the given `input_file`. Can be deactivated by setting `validate_input` to `False` in the config file.
- Adding LoRA finetuning via `finetuning_config`. The implementation is currently in beta state (some modes like auto_resume and PPI are not supported), but finetuning can already be applied for all protocols.
- Adding a `random_embedder` to calculate random embeddings as a baseline for predefined embedders.
Maintenance
- Replaced biopython dependency with custom read, write and filter functions
- Refactored the large trainer class into a pipeline with distinct steps for better readability, maintainability and customization
- Enforced bootstrapping for sanity checks
- Refactoring embedding_service to allow embedding computation as a generator function. Embeddings are now directly stored in the h5 file after computation. Experiments show that this is about as efficient as the old batching approach, while allowing for better code readability.
- Adding PyPI release
- Adding official macOS support
Breaking
- Refactoring file input to a single `input_file`. `sequence_file`, `labels_file` and `mask_file` are no longer supported.
- Naming changes in the output file, documented in this issue: #137
- `embedder_name` and `embeddings_file` are no longer mutually exclusive. If an `embeddings_file` is provided, it will be used instead of calculating the embeddings
- Embeddings are now stored by hash in the result h5 file. This behaviour can be turned off for special use cases in the `compute_embeddings` function by setting the `store_by_hash` flag to `False`. In that case, the original sequence id is used (instead of a running integer) as the h5 index. The sequence id is also always saved in the `original_id` attribute of the h5 dataset.
- Ending support for Python 3.10, adding support for Python 3.12
- Migrating the build and dependency system from `poetry` to `uv` for better performance
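A rough sketch of the config migration implied by the first breaking change (key names are taken from the notes above; the file names and protocol value are placeholders, not real examples from the project):

```yaml
# Before (no longer supported):
# sequence_file: sequences.fasta
# labels_file: labels.fasta
# mask_file: mask.fasta

# After: a single input_file carries sequences, labels and masks.
protocol: residue_to_class
input_file: input.fasta
```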
v0.9.8
05.05.2025 - Version 0.9.8
Features
- Adding multiple distinct test sets (e.g. `>Seq SET=test ... >Seq2 SET=test2 ... >Seq3 SET=test5 ...`)
- Adding a prediction set (unlabeled sequences that are predicted with the best model after training, e.g. `>SeqPred SET=pred`)
- Adding a unique model hash (16 chars sha256, based on input files, config options and custom trainer)
- Improving Monte Carlo dropout predictions: adding new test cases, checks for correct parameters, and `all_predictions` and `confidence_range` to the result dict
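A minimal sketch of an input FASTA using the set annotations described above. The `SET=` values come from the release note; the sequence ids and residues are made up, and any label attributes (e.g. targets) that a real file would also carry are omitted here:

```fasta
>Seq1 SET=train
MKTAYIAKQR
>Seq2 SET=test
MKLVFFAEDV
>Seq3 SET=test2
GAVLIPFYWS
>SeqPred SET=pred
MSTNPKPQRK
```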
Maintenance
- Adding validation for per-residue masks (must each contain at least one resolved (1) value)
- Adapting .pt API usage for new PyTorch version
- Updating dependencies
v0.9.7
08.04.2025 - Version 0.9.7
Features
- Adding an `external_writer` config option to (de-)activate the external writer
- Adding embedding via an ONNX embedder. A `custom_tokenizer_config` can be provided if necessary (default: T5Tokenizer). For further information, see the "examples/onnx_embedder" directory.
- Adding a heuristic to set the default batch size for embedder models
- Adding a new CLI function `headless_main_with_custom_trainer`. It can be used to provide a custom trainer that allows overwriting most functionality of biotrainer. Note that using this will add the entry "custom_trainer: True" to the `out.yml` file.
- Improving biotrainer training output in Jupyter notebook environments
v0.9.6
06.02.2025 - Version 0.9.6
Features
- Adding a random model baseline (model with randomly initialized weights)
- Adding bootstrapping to all sanity check baselines
- Adding a biotrainer intro notebook about creating a model with `one_hot_encoding` embeddings and comparing the results to an existing model
Maintenance
- Disabling the test of very long sequences in test_embeddings_service.py for the CI. This should speed up the CI considerably.
- Simplifying the config module by removing config options as classes and replacing them with dataclasses-based objects
- Adding a new overview of all available config options
- Improving stability of ONNX conversion, especially on Windows
- Making bootstrapping results reproducible with a seeded number generator
- Improving the first_steps.md tutorial
- Simplifying the readme
- Updating dependencies
v0.9.5
09.12.2024 - Version 0.9.5
Features
- Added integration for huggingface datasets by @heispv in #124
- Added per-sequence dimension reduction methods by @nadyadevani3112 in #123
- Improving the one_hot_encoding embedder with numpy functions by @SebieF
Maintenance
- Fixing "precission" typo in classification_solver.py
- Updating dependencies
- Improving documentation of the config module by @heispv in #121
- Improving the `compute_embeddings` function to handle `Dict`, `str` and `Path` as `input_data`
- Reducing log level of onnx and dynamo to ERROR to decrease logging output
- Fixing first_steps documentation
- Adding links to the biocentral app, repository and biotrainer documentation
v0.9.4
29.10.2024 - Version 0.9.4
Bug fixes
Maintenance
- Updating dependencies: removing Python 3.9 support
- Updating CI workflow to be compatible with Windows
Known problems
- Currently, there are compatibility problems with ONNX on some machines, please refer to the following issue: #111