Releases: sacdallago/biotrainer

v1.3.0

13.01.2026 - Version 1.3.0

Features

  • Adding additional PBC tasks (binding, conservation, disorder, membrane)
  • Adding direct input of embeddings via the input_data config option
  • Adding an improved autoeval report API (including a summary function)
  • Adding a random-sampling baseline based on training set distributions
  • Adding optimized Hugging Face embedder models for ProtT5, ProstT5 and the ESM-2 family
  • Optimizing async embeddings saving to speed up embeddings calculation
  • Improving memory estimation for embedding batch sizes on macOS
  • Adding an optional scaling step to scale input features if required (scaling_method config option)
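
The optional scaling step could look like the following standard-scaling sketch. This is a minimal NumPy illustration only; the release notes do not list which methods `scaling_method` actually supports.

```python
import numpy as np

def standard_scale(features: np.ndarray) -> np.ndarray:
    # Zero-mean, unit-variance scaling per feature column (one common choice
    # for a scaling_method; the real option names are not documented here).
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (features - mean) / std

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = standard_scale(X)
```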

Fixes

  • Adding a link to the config bank in the plm_eval Jupyter notebook and adapting it to the improved report API

Breaking

  • Replacing eval() with ast.literal_eval (Breaking: removes support for list comprehensions in cross-validation
    parameters)
  • Optimizing the autoeval PBC config bank
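
The behavioural difference behind the eval() replacement can be checked with plain Python: `ast.literal_eval` accepts literal containers but rejects any expression that would require execution, such as a list comprehension.

```python
import ast

# Literal lists still parse fine:
folds = ast.literal_eval("[1, 2, 3]")

# List comprehensions are no longer accepted (eval() would have executed them):
try:
    ast.literal_eval("[i * 2 for i in range(3)]")
    comprehension_ok = True
except ValueError:
    comprehension_ok = False
```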

v1.2.1

18.11.2025 - Version 1.2.1

Fixes

  • Added additional verifications for h5 files and embeddings calculation to improve reliability in autoeval_pipeline.
  • Fixed issues with example embedding functions in the plm_eval Jupyter notebook for accurate demonstrations.
  • Using Pydantic for the BiotrainerSeqRecord class for validation and consistency.
  • Exported OutputData in the output_files module for better modularity.

Features

  • Introduced a new input_data configuration option, allowing sequence records to be directly provided in code for
    flexibility.

v1.2.0

29.10.2025 - Version 1.2.0

Feature

  • Added the autoeval framework PBC and an example notebook.
  • Introduced autoeval_pipeline parameters for loading precomputed embeddings.
  • Added automatic dimensionality reduction to the inferencer.

Fixes

  • Replaced t-SNE with PCA for embeddings reduction (because t-SNE cannot be retroactively applied).
  • Fixed embedding functions compatibility with class_weights and classification protocols.
  • Updated dataset links to point to the PBC repository.
  • Corrected issues in autoeval tutorial notebooks and inference test configurations.
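
The reason for swapping t-SNE for PCA: PCA learns a linear projection from the training embeddings that can later be applied to unseen points, whereas t-SNE has no out-of-sample transform. A minimal NumPy sketch of that idea (toy data, not biotrainer's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 8))   # training-set embeddings
new = rng.normal(size=(5, 8))      # embeddings seen only at inference time

# "Fit" PCA on the training embeddings: center, then take the top-2
# right singular vectors as the projection.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]

# The learned projection is a plain linear map, so it can be applied
# retroactively to new embeddings -- exactly what t-SNE cannot do.
train_2d = (train - mean) @ components.T
new_2d = (new - mean) @ components.T
```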

Refactors

  • Added support for Python 3.13 [BETA].
  • Enhanced flexibility for test set naming and output paths.
  • Centralized shared functionality of data handler in the base class.

Tests

  • Added unit tests for autoeval pipeline and class weight training validation.

v1.1.0

25.09.2025 - Version 1.1.0

Feature

  • Adding a blosum62 predefined embedder via the blosum Python package, using the BLOSUM62 substitution matrix as
    embeddings
  • Adding an AAOntology predefined embedder from https://doi.org/10.1016/j.jmb.2024.168717 using amino acid feature
    scales
  • Adding biotrainer-ready quickstart datasets (subcellular location and secondary structure) to the README.md
  • Adding a masked language modeling (MLM) task via the residue_to_class protocol, CNN decoder and random_masking
    option in the finetuning config
  • Adding LoRA examples for MLM and downstream tasks
  • [BETA] Adding a residue_to_value protocol
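
The random_masking idea can be pictured with this hypothetical helper (function name, defaults and mask token are illustrative, not biotrainer's API): a fraction of residues is replaced by a mask token, and the masked positions become the prediction targets for the residue_to_class decoder.

```python
import random

def random_masking(sequence: str, mask_ratio: float = 0.15,
                   mask_token: str = "X", seed: int = 42):
    # Illustrative sketch: pick a random subset of positions, remember the
    # original residues as targets, and overwrite them with the mask token.
    rng = random.Random(seed)
    n_mask = max(1, int(len(sequence) * mask_ratio))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return "".join(masked), targets

masked_seq, targets = random_masking("MKTAYIAKQRQISFVKSHFSRQ")
```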

Breaking

  • Refactoring confidence range calculation to use the empirical distribution.
    Bootstrapping and MCD assumed a normal distribution, which is acceptable for large sample sizes due to the CLT, but
    the empirical distribution gives better upper and lower bounds, especially for small sample sizes
  • autoeval: Adding the framework name to the task name in autoeval. This makes it easier to add multiple frameworks
    in the future
  • autoeval: Changing the autoeval FLIP scl protocol to sequence_to_class. This requires fewer resources but is
    equally valid for evaluating PLMs
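
The difference between the two confidence-range approaches in a small NumPy sketch (toy data, not biotrainer code): the normal-assumption interval is symmetric around the mean and can exceed the observed range, while the empirical interval is read directly off the bootstrap samples.

```python
import numpy as np

rng = np.random.default_rng(7)
# Bootstrapped metric values with a skewed distribution (small-sample regime).
scores = rng.exponential(scale=0.1, size=30)

# Normal-assumption interval (old behaviour): mean +/- 1.96 * std.
normal_lo = scores.mean() - 1.96 * scores.std()
normal_hi = scores.mean() + 1.96 * scores.std()

# Empirical interval (new behaviour): percentiles of the observed distribution,
# which by construction stay within the observed sample range.
emp_lo, emp_hi = np.percentile(scores, [2.5, 97.5])
```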

Maintenance

  • Updating dependencies

Fixes

  • Fixing broken use_half_precision embeddings mode and adding comment about downstream float32 precision usage

v1.0.0

03.07.2025 - Version 1.0.0

Feature

  • Adding an OutputManager class that can be customized by adding observers for easier integration with external
    tools such as MLflow, WandB and TensorBoard (the latter is already supported)
  • Adding the autoeval module to biotrainer that enables evaluating protein language models on downstream tasks.
    Currently, a curated subset of the FLIP datasets is supported.
  • Adding an improved CLI, including train, predict, convert (deprecated files) and autoeval commands
  • Adding an InputValidator and InputValidationStep that validates the given input_file. Can be
    deactivated by setting validate_input to False in the config file.
  • Adding LoRA finetuning via finetuning_config. Implementation is currently in beta state
    (some modes like auto_resume and ppi are not supported), but finetuning can already be applied for all protocols.
  • Adding a random_embedder to calculate random embeddings as a baseline for predefined embedders.
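
The observer idea behind the OutputManager can be sketched as follows. All class and method names here are hypothetical stand-ins, not biotrainer's actual API; the point is that external tools subscribe to training output rather than being hard-wired in.

```python
class TrainingObserver:
    # Hypothetical observer interface.
    def on_metrics(self, step: int, metrics: dict) -> None:
        raise NotImplementedError

class ListObserver(TrainingObserver):
    # Toy observer that records every update; a real one might forward the
    # metrics to MLflow, WandB or TensorBoard instead.
    def __init__(self):
        self.history = []

    def on_metrics(self, step, metrics):
        self.history.append((step, metrics))

class OutputManagerSketch:
    # Minimal publish/subscribe core of an output manager.
    def __init__(self):
        self._observers = []

    def add_observer(self, observer: TrainingObserver):
        self._observers.append(observer)

    def publish(self, step: int, metrics: dict):
        for obs in self._observers:
            obs.on_metrics(step, metrics)

manager = OutputManagerSketch()
recorder = ListObserver()
manager.add_observer(recorder)
manager.publish(1, {"loss": 0.42})
```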

Maintenance

  • Replaced biopython dependency with custom read, write and filter functions
  • Refactored the large trainer class into a pipeline with distinct steps for better readability, maintainability and
    customization
  • Enforced bootstrapping for sanity checks
  • Refactoring embedding_service to allow embedding computation as generator function. Embeddings are now directly
    stored in the h5 file after computation. Experiments show that this is about as efficient as the old batching approach,
    while allowing for better code readability.
  • Adding a PyPI release
  • Adding official macOS support

Breaking

  • Refactoring file input to a single input_file.
    sequence_file, labels_file and mask_file are no longer supported.
  • Naming changes in the output file, documented in this issue: #137
  • embedder_name and embeddings_file are no longer mutually exclusive. If an embeddings_file is provided,
    it will be used instead of calculating the embeddings
  • Embeddings are now stored by hash in the result h5 file. This behaviour can be turned off for special use cases
    in the compute_embeddings function by setting the store_by_hash flag to False. In that case, the original
    sequence id is used (instead of a running integer) as the h5 index. The sequence id is also always saved in the
    original_id attribute of the h5 dataset.
  • Ending support for Python 3.10, adding support for Python 3.12
  • Migrating build and dependency system from poetry to uv for better performance
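
The hash-based storage scheme might be pictured like this. The exact key derivation is an assumption (the notes only say "stored by hash"), and a plain dict stands in for the h5 file; only the store_by_hash flag and original_id attribute are taken from the notes above.

```python
import hashlib

def embedding_key(sequence: str, store_by_hash: bool = True, seq_id: str = "") -> str:
    # With store_by_hash=True the sequence content determines the key, so the
    # same sequence maps to the same embedding across runs regardless of id.
    # With store_by_hash=False the original sequence id is used instead.
    if store_by_hash:
        return hashlib.sha256(sequence.encode()).hexdigest()
    return seq_id

h5_stub = {}  # stand-in for the result h5 file
seq = "MKTAYIAK"
# The original id is kept alongside the embedding, mirroring the original_id
# attribute of the h5 dataset.
h5_stub[embedding_key(seq)] = {"embedding": [0.1, 0.2], "original_id": "Seq1"}
```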

v0.9.8

05.05.2025 - Version 0.9.8

Features

  • Adding multiple distinct test sets (e.g. >Seq SET=test ... >Seq2 SET=test2 ... >Seq3 SET=test5 ...)
  • Adding a prediction set (unlabeled sequences
    that are predicted with the best model after training, e.g. >SeqPred SET=pred)
  • Adding a unique model hash (16 chars of sha256, based on input files, config options and custom trainer)
  • Improving Monte Carlo dropout predictions: Adding new test cases, checks for correct parameters, and
    all_predictions and confidence_range to the result dict
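
Grouping sequences by their SET= annotation can be sketched with a simplified FASTA header parser. This is an illustration only; defaulting unannotated sequences to "train" is an assumption, not documented behaviour.

```python
def parse_set_annotations(fasta_text: str) -> dict:
    # Group sequence ids by their SET= annotation (e.g. test, test2, pred).
    sets = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0]
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            sets.setdefault(attrs.get("SET", "train"), []).append(seq_id)
    return sets

fasta = """>Seq SET=test
MKT
>Seq2 SET=test2
AYI
>SeqPred SET=pred
AKQ
"""
splits = parse_set_annotations(fasta)
```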

Maintenance

  • Adding validation for per-residue masks (must each contain at least one resolved (1) value)
  • Adapting .pt API usage for new PyTorch version
  • Updating dependencies

v0.9.7

08.04.2025 - Version 0.9.7

Features

  • Adding external_writer config option to (de-)activate external writer
  • Adding embedding via an onnx embedder. A custom_tokenizer_config can be provided if necessary (default: T5Tokenizer).
    For further information, see the "examples/onnx_embedder" directory.
  • Adding a heuristic to set the default batch size for embedder models
  • Adding a new cli function headless_main_with_custom_trainer. It can be used to provide a custom trainer that allows
    overwriting most of biotrainer's functionality. Note that using this will add the entry "custom_trainer: True" to the
    out.yml file.
  • Improving biotrainer training output in jupyter notebook environments
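
One way such a batch-size heuristic could work is shown below. This is purely illustrative; biotrainer's actual rule, constants and clamping range are not documented in these notes. The idea: budget a fixed share of memory for one batch of per-residue embeddings and clamp the result to a sane range.

```python
def default_batch_size(embedding_dim: int, available_memory_gb: float = 4.0) -> int:
    # Rough memory needed for one sequence of ~1024 residues of float32
    # embeddings (all constants here are assumptions for illustration).
    bytes_per_seq = embedding_dim * 1024 * 4
    # Spend at most a quarter of available memory on one batch.
    budget = available_memory_gb * 1e9 * 0.25
    return max(1, min(128, int(budget // bytes_per_seq)))

small = default_batch_size(embedding_dim=320)    # small embedder model
large = default_batch_size(embedding_dim=2560)   # large embedder model
```

Larger embedding dimensions yield smaller default batches, which matches the intuition that bigger embedder models need more memory per sequence.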

v0.9.6

06.02.2025 - Version 0.9.6

Features

  • Adding a random model baseline (model with randomly initialized weights)
  • Adding bootstrapping to all sanity check baselines
  • Adding a biotrainer intro notebook about creating a model with
    one_hot_encodings and comparing the results to an existing model
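
The intuition behind a random model baseline in a toy sketch (not biotrainer code): random guessing on a balanced k-class problem should score about 1/k accuracy, and bootstrapping shows the spread around that value.

```python
import random

def random_baseline_accuracy(labels, num_classes, n_bootstrap=200, seed=42):
    # Resample the labels (bootstrap), guess uniformly at random for each
    # sample, and average the resulting accuracies.
    rng = random.Random(seed)
    accs = []
    for _ in range(n_bootstrap):
        sample = rng.choices(labels, k=len(labels))
        guesses = [rng.randrange(num_classes) for _ in sample]
        accs.append(sum(g == y for g, y in zip(guesses, sample)) / len(sample))
    return sum(accs) / len(accs)

labels = [0, 1, 2, 3] * 100  # balanced 4-class toy labels
baseline = random_baseline_accuracy(labels, num_classes=4)
```

Any trained model worth keeping should clearly beat this baseline, which is what makes it useful as a sanity check.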

Maintenance

  • Disabling the test of very long sequences in test_embeddings_service.py for the CI.
    This should speed up the CI considerably.
  • Simplifying the config module by removing config options as classes and replacing them with dataclasses-based objects
  • Adding a new overview about all available config options
  • Improving stability of ONNX conversion, especially on Windows
  • Making bootstrapping results reproducible with a seeded number generator
  • Improving the first_steps.md tutorial
  • Simplifying the readme
  • Updating dependencies

v0.9.5

09.12.2024 - Version 0.9.5

Features

  • Added integration for huggingface datasets by @heispv in #124
  • Added per-sequence dimension reduction methods by @nadyadevani3112 in #123
  • Improving the one_hot_encoding embedder with numpy functions by @SebieF
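
The numpy-based one_hot_encoding idea can be sketched like this (a minimal version, not the embedder's actual code): instead of looping, index into an identity matrix with the residue indices.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    # Vectorized one-hot: each residue selects one row of the identity matrix.
    idx = np.fromiter((AA_INDEX[aa] for aa in sequence), dtype=np.int64)
    return np.eye(len(AMINO_ACIDS), dtype=np.float32)[idx]

emb = one_hot_encode("MKT")  # shape: (sequence length, 20)
```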

Maintenance

  • Fixing "precission" typo in classification_solver.py
  • Updating dependencies
  • Improving documentation of the config module by @heispv in #121
  • Improving compute_embeddings function to handle Dict, str and Path as input_data
  • Reducing log level of onnx and dynamo to ERROR to decrease logging output
  • Fixing first_steps documentation
  • Adding links to biocentral app, repository and biotrainer documentation

v0.9.4

29.10.2024 - Version 0.9.4

Bug fixes

  • Hotfix for incorrect precision mode setting by @SebieF in #116

Maintenance

  • Updating dependencies: removing Python 3.9 support
  • Updating CI workflow to be compatible with Windows

Known problems

  • Currently, there are compatibility problems with ONNX on some machines, please refer to the following issue: #111