Releases: sacdallago/biotrainer
v1.3.0
13.01.2026 - Version 1.3.0
Features
- Adding additional PBC tasks (binding, conservation, disorder, membrane)
- Adding direct input of embeddings via the `input_data` config option
- Adding an improved autoeval report API (including a summary function)
- Adding a random-sampling baseline based on training set distributions
- Adding optimized huggingface embedder models for ProtT5, ProstT5 and the ESM-2 family
- Optimizing async embedding saving to speed up embedding calculation
- Improving memory estimation for embedding batch sizes on MacOS
- Adding an optional scaling step to scale input features if required (via the `scaling_method` config option)
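As a rough sketch, the new option might be used in a config like the following. Only `scaling_method` is named in the notes above; the surrounding keys mirror options mentioned elsewhere in these release notes, and the value `standard` is an assumption, not a documented choice:

```yaml
# Hypothetical biotrainer config excerpt (illustrative only).
protocol: sequence_to_class
input_file: input.fasta
embedder_name: one_hot_encoding
scaling_method: standard  # optional scaling step for input features
```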
Fixes
- Adding a link to the config bank in the plm_eval Jupyter notebook and adapting it to the improved report API
Breaking
- Replacing `eval()` with `ast.literal_eval()` (Breaking: removes support for list comprehensions in cross-validation parameters)
- Optimized autoeval PBC config bank
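The practical effect of this change can be seen with plain stdlib Python: `ast.literal_eval` accepts literal values such as lists of numbers, but rejects expressions like list comprehensions, which `eval()` previously allowed (the parameter strings below are made up for illustration):

```python
import ast

# ast.literal_eval only parses Python literals (numbers, strings,
# lists, tuples, dicts, ...), while eval() would execute arbitrary code.
params = ast.literal_eval("[1e-3, 1e-4, 1e-5]")
assert params == [1e-3, 1e-4, 1e-5]

# A list comprehension is not a literal, so it is now rejected:
rejected = False
try:
    ast.literal_eval("[10 ** -i for i in range(3, 6)]")
except ValueError:
    rejected = True
assert rejected
```

Listing the values explicitly, as in the first call, keeps configs working after the change.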
v1.2.1
18.11.2025 - Version 1.2.1
Fixes
- Added additional verifications for h5 files and embeddings calculation to improve reliability in the autoeval_pipeline.
- Fixed issues with example embedding functions in the plm_eval Jupyter notebook for accurate demonstrations.
- Using Pydantic for the `BiotrainerSeqRecord` class for validation and consistency.
- Exported `OutputData` in the `output_files` module for better modularity.
Features
- Introduced a new `input_data` configuration option, allowing sequence records to be directly provided in code for flexibility.
v1.2.0
29.10.2025 - Version 1.2.0
Features
- Added autoeval framework PBC and example notebook.
- Introduced autoeval_pipeline parameters for loading precomputed embeddings.
- Added automatic dimensionality reduction to the inferencer.
Fixes
- Replaced t-SNE with PCA for embeddings reduction (because t-SNE cannot be retroactively applied).
- Fixed embedding functions compatibility with class_weights and classification protocols.
- Updated dataset links to point to the PBC repository.
- Corrected issues in autoeval tutorial notebooks and inference test configurations.
Refactors
- Added support for Python 3.13 [BETA].
- Enhanced flexibility for test set naming and output paths.
- Centralized shared functionality of data handler in the base class.
Tests
- Added unit tests for autoeval pipeline and class weight training validation.
v1.1.0
25.09.2025 - Version 1.1.0
Features
- Adding a blosum62 predefined embedder via the `blosum` Python package, using the BLOSUM substitution matrix as embeddings
- Adding an AAOntology predefined embedder from https://doi.org/10.1016/j.jmb.2024.168717, using amino acid feature scales
- Adding biotrainer-ready quickstart datasets (subcellular location and secondary structure) in the README.md
- Adding a masked language modeling (MLM) task via the residue_to_class protocol, a CNN decoder and a `random_masking` option in the finetuning config
- Adding LoRA examples for MLM and downstream tasks
- [BETA] Adding a `residue_to_value` protocol
Breaking
- Refactoring confidence range calculation to use the empirical distribution. Bootstrapping and MCD used the assumption of a normal distribution, which is acceptable for large sample sizes due to the CLT, but the empirical distribution gives better upper and lower bounds, especially for small sample sizes
- autoeval: Adding the framework name to the task name in autoeval. This makes it easier to add multiple frameworks in the future
- autoeval: Changing the autoeval FLIP scl protocol to sequence_to_class. This requires fewer resources but is still valid for evaluating PLMs
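The reasoning behind the confidence-range change can be sketched with stdlib Python alone. For a skewed score distribution, percentile-based (empirical) bounds stay inside the observed score range, while a mean ± 1.96·stdev interval can overshoot it. The scores below are made up for illustration and are not real biotrainer output:

```python
import random
import statistics

# Hypothetical bootstrap scores: a skewed distribution where the
# normal approximation is a poor fit (e.g. scores bounded by 1.0).
random.seed(42)
scores = sorted(random.betavariate(8, 2) for _ in range(200))

# Normal-assumption 95% interval: mean +/- 1.96 * stdev.
mean, sd = statistics.mean(scores), statistics.stdev(scores)
normal_bounds = (mean - 1.96 * sd, mean + 1.96 * sd)

# Empirical 95% interval: read the 2.5th and 97.5th percentiles
# directly from the sorted scores.
emp_lower = scores[int(0.025 * len(scores))]
emp_upper = scores[int(0.975 * len(scores)) - 1]

# Empirical bounds never leave the observed score range.
assert scores[0] <= emp_lower < emp_upper <= scores[-1]
```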
Maintenance
- Updating dependencies
Fixes
- Fixing broken `use_half_precision` embeddings mode and adding a comment about downstream float32 precision usage
v1.0.0
03.07.2025 - Version 1.0.0
Features
- Adding an `OutputManager` class that can be customized by adding observers for easier integration with external tools such as MLflow, WandB and TensorBoard (the latter is already supported)
- Adding the `autoeval` module to biotrainer that enables evaluating protein language models on downstream tasks. Currently, a curated subset of the FLIP datasets is supported.
- Adding an improved CLI, including `train`, `predict`, `convert` (deprecated files) and `autoeval` commands
- Adding an `InputValidator` and `InputValidationStep` that validate the given `input_file`. Can be deactivated by setting `validate_input` to `False` in the config file.
- Adding LoRA finetuning via `finetuning_config`. The implementation is currently in beta state (some modes like auto_resume and PPI are not supported), but finetuning can already be applied for all protocols.
- Adding a `random_embedder` to calculate random embeddings as a baseline for predefined embedders.
Maintenance
- Replaced biopython dependency with custom read, write and filter functions
- Refactored the large trainer class into a pipeline with distinct steps for better readability, maintainability and customization
- Enforced bootstrapping for sanity checks
- Refactoring embedding_service to allow embedding computation as a generator function. Embeddings are now directly stored in the h5 file after computation. Experiments show that this is about as efficient as the old batching approach, while allowing for better code readability.
- Adding PyPI release
- Adding official macOS support
Breaking
- Refactoring file input to a single `input_file`. `sequence_file`, `labels_file` and `mask_file` are no longer supported.
- Naming changes in the output file, documented in this issue: #137
- `embedder_name` and `embeddings_file` are no longer mutually exclusive. If an `embeddings_file` is provided, it will be used instead of calculating the embeddings
- Embeddings are now stored by hash in the result h5 file. This behaviour can be turned off for special use cases in the `compute_embeddings` function by setting the `store_by_hash` flag to `False`. In that case, the original sequence id is used (instead of a running integer) as the h5 index. The sequence id is also always saved in the `original_id` attribute of the h5 dataset.
- Ending support for Python 3.10, adding support for Python 3.12
- Migrating the build and dependency system from `poetry` to `uv` for better performance
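A rough sketch of the config migration implied by the first breaking change (key names are taken from the notes above; the file names and protocol value are placeholders, not real examples from the project):

```yaml
# Before (no longer supported):
# sequence_file: sequences.fasta
# labels_file: labels.fasta
# mask_file: mask.fasta

# After: a single input_file carries sequences, labels and masks.
protocol: residue_to_class
input_file: input.fasta
```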
v0.9.8
05.05.2025 - Version 0.9.8
Features
- Adding multiple distinct test sets (e.g. `>Seq SET=test ... >Seq2 SET=test2 ... >Seq3 SET=test5 ...`)
- Adding a prediction set (unlabeled sequences that are predicted with the best model after training, e.g. `>SeqPred SET=pred`)
- Adding a unique model hash (16 chars sha256, based on input files, config options and custom trainer)
- Improving Monte Carlo dropout predictions: adding new test cases, checks for correct parameters, and `all_predictions` and `confidence_range` to the result dict
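A minimal sketch of an input FASTA using the set annotations described above. The `SET=` values come from the release note; the sequence ids and residues are made up, and any label attributes (e.g. targets) that a real file would also carry are omitted here:

```fasta
>Seq1 SET=train
MKTAYIAKQR
>Seq2 SET=test
MKLVFFAEDV
>Seq3 SET=test2
GAVLIPFYWS
>SeqPred SET=pred
MSTNPKPQRK
```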
Maintenance
- Adding validation for per-residue masks (must each contain at least one resolved (1) value)
- Adapting .pt API usage for new PyTorch version
- Updating dependencies
v0.9.7
08.04.2025 - Version 0.9.7
Features
- Adding an `external_writer` config option to (de-)activate the external writer
- Adding embedding via an ONNX embedder. A `custom_tokenizer_config` can be provided if necessary (default: T5Tokenizer). For further information, see the "examples/onnx_embedder" directory.
- Adding a heuristic to set the default batch size for embedder models
- Adding a new CLI function `headless_main_with_custom_trainer`. It can be used to provide a custom trainer that allows overwriting most functionality of biotrainer. Note that using this will add the entry "custom_trainer: True" to the `out.yml` file.
- Improving biotrainer training output in Jupyter notebook environments
v0.9.6
06.02.2025 - Version 0.9.6
Features
- Adding a random model baseline (model with randomly initialized weights)
- Adding bootstrapping to all sanity check baselines
- Adding a biotrainer intro notebook about creating a model with `one_hot_encoding` embeddings and comparing the results to an existing model
Maintenance
- Disabling the test of very long sequences in test_embeddings_service.py for the CI. This should speed up the CI considerably.
- Simplifying the config module by removing config options as classes and replacing them with dataclasses-based objects
- Adding a new overview of all available config options
- Improving stability of ONNX conversion, especially on Windows
- Making bootstrapping results reproducible with a seeded number generator
- Improving the first_steps.md tutorial
- Simplifying the readme
- Updating dependencies
v0.9.5
09.12.2024 - Version 0.9.5
Features
- Added integration for huggingface datasets by @heispv in #124
- Added per-sequence dimension reduction methods by @nadyadevani3112 in #123
- Improving the one_hot_encoding embedder with numpy functions by @SebieF
Maintenance
- Fixing "precission" typo in classification_solver.py
- Updating dependencies
- Improving documentation of the config module by @heispv in #121
- Improving the `compute_embeddings` function to handle `Dict`, `str` and `Path` as `input_data`
- Reducing log level of onnx and dynamo to ERROR to decrease logging output
- Fixing first_steps documentation
- Adding links to the biocentral app, repository and biotrainer documentation
v0.9.4
29.10.2024 - Version 0.9.4
Bug fixes
Maintenance
- Updating dependencies: removing Python 3.9 support
- Updating CI workflow to be compatible with Windows
Known problems
- Currently, there are compatibility problems with ONNX on some machines, please refer to the following issue: #111