
Rethinking how I use git and reducing my dependency on GitHub

2025-12-25T00:00:00+01:00

I have been wanting to reduce my use of, and dependency on, GitHub for quite some time now.

My first initiative was to join codeberg.org, on 9 June 2024. I am really happy I did so, but Codeberg is a non-profit association and has storage limits, and rightly so. My usage of GitHub exceeds what they can offer, so I had to (1) carefully choose what to migrate and (2) adapt what and how I store files in my repos (by switching to git-annex for larger blobs, for example).

My university has a GitLab server, but it limits the number of private projects. That limitation aside, it is the right place for a small number of projects related to my teaching activities, for example.

I concluded that my best solution was to self-host private repositories that are not meant to become public. Yesterday, I experimented a bit with gitolite, but today I realised how trivial it is to self-host git repositories once you have a server with ssh access: this video and the git book show how.
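As a rough sketch of what that setup involves, the snippet below assembles the handful of commands needed to create a bare repository on a server reachable over ssh and to push an existing local repository to it. The host `me@myserver.example.org`, the path `repos/project.git` and the remote name `selfhosted` are placeholders for illustration, not my actual setup, and the script only prints the commands rather than running them.

```python
import shlex


def self_host_commands(host, repo_path, remote="selfhosted"):
    """Return the commands that create and populate a bare repository over ssh.

    host, repo_path and remote are hypothetical placeholders.
    """
    return [
        # 1. Create an empty bare repository on the server.
        ["ssh", host, "git", "init", "--bare", repo_path],
        # 2. Register the server as an additional remote of the local repository.
        ["git", "remote", "add", remote, f"{host}:{repo_path}"],
        # 3. Push all local branches, then all tags.
        ["git", "push", remote, "--all"],
        ["git", "push", remote, "--tags"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for command in self_host_commands("me@myserver.example.org", "repos/project.git"):
        print(shlex.join(command))
```

The second and following commands are meant to be run from inside the local repository. Anyone with ssh access to the same account can then clone from `host:repo_path`.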

GitHub is still unavoidable: I have many projects/software that have been there for a long time, and that are expected to remain there. GitHub is still the de facto infrastructure for many collaborative projects, and that is unlikely to change.

Ironically, you are (still) reading this page served from GitHub. I will eventually migrate my personal page and other websites to Codeberg. It’s still a work in progress, but since I started cleaning up in October 2025, I have deleted hundreds of repositories and even two inactive organisations - I now have fewer than 100 (mostly public) repos.

EDIT

2025-12-26: here’s a related article by members of rOpenSci: Code Hosting Options Beyond GitHub.
2026-01-11: this website has now migrated to lgatto.codeberg.page.

The tragic death of open source research software

2025-10-21T00:00:00+02:00

This post is a write-up of my contribution to the Fast forward Open Science event organised by Circle U. I’ll be sharing some thoughts about research software maintenance and survival as part of the ‘Open Source Codes and Software’ discussion.

Introduction

Research software has become a central player in scientific research, to the point that it is hard to imagine scientific research without software.

But because of its nature and how it is funded/valued, it can also be a single point of failure.

Setting the stage

Imagine that 6 months ago, you, a brilliant and motivated early career researcher in biomedical sciences, defined the ideal experiment to answer an important biological question in your domain. After several months of hard work and thousands of euros of consumables, you have acquired the precious data.

You have even identified a research paper that tackles a similar question using exactly the same technology and type of data. That paper describes a data analysis method and publishes a piece of software that are ideally suited to answer your question with your data.

Experimental design + data + software = results

You have generated good quality data and found the right software.

Your results are within arm’s reach, aren’t they?

What could go wrong?

Possible causes of death

Sadly, many unfortunate events can stand in the way of your results:

The software isn’t available (anymore).
The software is available, but it can’t be installed.
The software can be installed, but it doesn’t work.
The software “works” on test data, but you can’t get it to run on your data.
The software “runs” with your data, but the results don’t make any sense.
…

Software collapse

The software doesn’t work

Software collapse (or software rot) is the fact that software will stop working at some point if it is not actively maintained. Collapse can be the result of bugs, accidental changes or voluntary breaking changes (i.e. changes that don’t guarantee backward compatibility) in the software itself, changes in software (and service) dependencies, …
Or simply the disappearance of the software (or more generally, of the page where it was available), or the lack of response when it was originally available on request only.
Or maybe the “software” was never meant to last beyond that one use case/paper. In such cases, it should have been clearly labelled as a prototype, not as a tool/software that others can reuse.

Or the software works but

There is no example data, and it’s not clear what the input should look like.
There is no documentation - the software works (with the example/test data or with yours), but the commands and/or output don’t make any sense.
Even though the software (correctly?) runs, the lack of documentation or its inadequacy makes it too difficult to use.

Making software survive longer

There are many steps that one can take to minimise the risks described above. These steps are either technical, to write better and more maintainable software, or non-technical, to grow and foster a community and support around the software and its developers. These opportunities aren’t presented in any order of importance. Different situations and constraints will determine what is possible.

If there’s one thing to take away, it is not to stay alone in the development and maintenance of a piece of software, especially for junior researchers/developers.

Administration

Stay within the law, which includes any legal constraints or limitations, intellectual property, author rights and copyright, funding obligations, licensing, academia vs industry, policies and regulations, … Stay informed and identify allies who have the required experience.

If possible, make your software widely available under an open source license to increase usage, contributions, and visibility (see below).

Open source development

Choose an open source license to publish your software (ideally as a piece of software and as a research paper) and archive it (Zenodo, Software Heritage, …).

Making your software known allows you to foster a collaborative environment and a user and (co-)developer community around your software. To facilitate this, consider having a code of conduct, onboarding documents, contribution guides, a support forum for users, … This is particularly relevant if your software is itself part of a larger ecosystem, where it is possible to adopt or adapt what is already available in that software ecosystem.

Development

Good software development is paramount to minimise software collapse. But everybody starts at some point, and sharing code is a good way to move towards the next steps. Even if one feels that the code isn’t ready for prime time because of a lack of ‘formal’ training (many lack it and still become respected developers and contributors), it is much better in the short and long run to share code.

Here are some tips:

Implement modularity to deal with, for instance, software collapse. It is much easier to maintain and extend small independent components than a large monolithic code base.
Do learn and follow best practices when it comes to research software development. These include automation of manual tasks, unit and integration testing, version control, software versioning, … Finding a well-meaning community will help with this.
Don’t reinvent the wheel, and try to re-use existing and robust infrastructure when possible/available - stand on the shoulders of giants. But beware of fragile dependencies, even though these are difficult to spot without experience.
Document your code and your software. Forget the silly myth that real developers focus on writing code and not documenting it - that certainly holds for bad developers. Writing documentation forces you to put yourself in the position of a user, which is very often enlightening about the usability of what is produced. There are many types of documentation: manuals, tutorials, example data, installation, user and developer guides, slides, videos, web pages, … There’s no need to have all of them - focus on a few high quality ones.
Focus on traceability and reproducibility when analysing data and developing software to do so. Without traceability and reproducibility, there’s no science, only anecdotal evidence, at best.
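To make the last tip a bit more concrete, here is a minimal sketch of one way to make an analysis traceable: store, next to each result, a small provenance record with a checksum of every input file, the parameters used, and the software environment. The function names, the record layout and the example file and parameter are assumptions made for illustration; they do not come from any specific tool.

```python
import hashlib
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path):
    """Checksum a file in chunks, so large data files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance(input_files, parameters):
    """Collect what is needed to trace a result back to its inputs and environment."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": parameters,
        "inputs": {str(path): sha256sum(path) for path in input_files},
    }


if __name__ == "__main__":
    # Hypothetical input file and parameter, created here so the example runs as is.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "example_input.csv"
        data.write_text("protein,intensity\nP1,10.5\n")
        record = provenance([data], {"normalisation": "median"})
        print(json.dumps(record, indent=2))
```

In a real analysis, the record would be saved alongside the results (and the code itself kept under version control), so that anyone can check later which data, parameters and environment produced them.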

Software life cycle

This is a point of particular interest to more senior developers and PIs. There’s pressure to produce new features and software, but planning beyond that is important.

Think of your software’s life cycle: maintenance, new features (if possible), new developers, …
Plan for sunsetting your software. Consider ending, pausing, or handing off.
Also consider disaster planning, for when funding suddenly gets cut off: make a threat model considering social, financial and technical vulnerabilities.

Community

Consider user and developer communities:

Maintain your software, answer questions, accept contributions, credit contributors, … make your software findable, citable (DOI), and re-usable.
Announce your software and promote it, both informally (on social media, mailing lists, forums, …) and through more formal academic publications (think of the different audiences), conference presentations, posters and workshops.

Building a community and addressing its needs is also time consuming. There is some responsibility that comes with releasing a tool that is meant to be used by others. If one isn’t prepared (or able) to consider this investment, it might be better to release a prototype.

Use(rs)

The best way to produce software that is used and useful is to ‘eat your own dog food’ - use the software you develop, to assess, in real time, its usability and relevance.
Produce software that users can install and use - avoid root/admin privileges.
Make it (easy to) run on other people’s computers (no hard coded paths, …). “It runs on my computer” is irrelevant.
Be explicit about the status of the code: are you publishing one-off analysis scripts or prototypes without any guarantees, scripts supporting results that come with some guarantees/efforts to be reproducible, or a tool/software for wider consumption?
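As an illustration of the ‘no hard coded paths’ advice, here is a minimal sketch of a script that takes its input file and output directory from the command line instead of assuming a layout that only exists on the author’s machine. The script’s task (counting lines) and the names `main`, `summarise` and `--outdir` are made up for the example.

```python
import argparse
import tempfile
from pathlib import Path


def summarise(input_path, outdir):
    """Write a one-line summary of input_path into outdir and return the output path."""
    if not input_path.is_file():
        raise FileNotFoundError(f"input file not found: {input_path}")
    # Create the output directory if needed, without any special privileges.
    outdir.mkdir(parents=True, exist_ok=True)
    with input_path.open() as handle:
        n_lines = sum(1 for _ in handle)
    out = outdir / f"{input_path.stem}_summary.txt"
    out.write_text(f"{input_path.name}: {n_lines} lines\n")
    return out


def main(argv=None):
    parser = argparse.ArgumentParser(description="Summarise a data file.")
    parser.add_argument("input", type=Path, help="path to the input data file")
    parser.add_argument("--outdir", type=Path, default=Path.cwd() / "results",
                        help="where to write the results (default: ./results)")
    args = parser.parse_args(argv)
    return summarise(args.input, args.outdir)


if __name__ == "__main__":
    # Demonstration with a temporary file; a real script would simply call main().
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.csv"
        demo.write_text("a,b\n1,2\n")
        print(main([str(demo), "--outdir", str(Path(tmp) / "results")]))
```

Nothing in the script refers to a home directory, a drive letter or a user name, and the default output location is relative to wherever the user runs it.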

Training

Appropriate training in data analysis, data management, and software development/usage is absolutely essential. (Some of) these should be delivered early, and ideally as part of the university curriculum to students who train in a software-heavy field (all STEM).

Here are some well-known examples:

Incentives and funding

It is important to recognise the role of those who develop and maintain software, often called research software engineers, and offer them (stable) career paths.

Dedicated funding schemes for software are rare, but they do exist: Software Sustainability Institute Research Software Maintenance Fund (UK), CZI’s Essential Open Source Software for Science, …

Conclusions

Your results are only as good as the method and the software you use. Without decent software, there’s hardly any trustworthy science.

This means that we need to offer the opportunities and support for developers to develop, release and maintain “good enough” software. Collectively, we can:

Offer training to researchers who write code or develop software, but haven’t had any training. This can happen within an official curriculum, or in workshops such as those cited above.
If researchers offer ways to cite their code/software, let’s give them credit for their work.
Ideally, there should be well defined career paths for researchers whose main tasks centre around software. In many countries, research software engineers are emerging or are already established.
Being able to fund software development and maintenance is essential. Currently, too much software is developed as explicit or implicit side effects of research projects.
It is also crucial to maintain the infrastructure that supports software development and maintenance, such as proper archiving.
There are many actors that can and should support software: researchers, of course, but also their institutions, funders, publishers and libraries.

References

Dealing With Software Collapse. Hinsen K. (2019).
10 quick tips for making your software outlive your job. Littauer R et al. (2025).
For long-term sustainable software in bioinformatics. Coelho LP. (2024).
Ten simple rules for making research software more robust. Taschuk and Wilson (2017).
CODE beyond FAIR. Di Cosmo et al. (2025).

CBIO’s EuroBioc2025 posters and talks

2025-09-06T00:00:00+02:00

Before I forget to post this year’s lab EuroBioc contributions, here are the abstracts of the work we will present at EuroBioc2025 in Barcelona.

PSMatch: a Bioconductor Package for Handling Peptide-Spectrum Matches Data

Guillaume Deflandre, Sebastian Gibb and Laurent Gatto

Loading, exploring and analysing the resulting Peptide-Spectrum Matches (PSMs) from a database search in Mass Spectrometry (MS)-based proteomics can be time-consuming. PSMatch is an R/Bioconductor package designed to handle this process by offering functionalities to streamline exploration and visualisation of PSM data. It provides functions to load PSM data from mzId or tabular files, generate theoretical fragment ions, model peptide-protein relations and facilitate visualisations.

Recent developments in PSMatch have focused on extending these functionalities to support post-translational modifications, enabling more accurate characterisation of modified peptides. Effort in identifying modified peptides is needed as it is these peptides that are expected to constitute a significant proportion of unidentified spectra. In fields such as single-cell proteomics or metaproteomics, where the identification rates pale by comparison with bulk approaches, this becomes even more prominent. Enabling users to benefit from a powerful and flexible R ecosystem to further explore these unidentified spectra is therefore paramount.

PSMatch is part of the R for Mass Spectrometry initiative, which develops an open and collaborative software ecosystem for MS-based proteomics and metabolomics, offering efficient, scalable, and stable infrastructure.

And here’s the relevant preprint.

An Open Software Development-based Ecosystem of R Packages for Mass Spectrometry Data Analysis

Laurent Gatto, Sebastian Gibb and Johannes Rainer

A frequent problem with scientific research software is the lack of support, maintenance and further development. In particular, development by a single researcher can easily result in orphaned and dysfunctional software packages, especially if combined with poor documentation, missing unit tests or lack of adherence to open software development standards. The RforMassSpectrometry (https://www.rformassspectrometry.org/) initiative aims to develop an efficient, scalable, and stable infrastructure for mass spectrometry (MS) based proteomics and metabolomics data analysis. As part of this initiative, a growing ecosystem of R software packages is being developed covering different aspects of metabolomics and proteomics data analysis. To avoid the aforementioned problems, community contributions are fostered, and open development, documentation and long-term support emphasised.

At the heart of the package ecosystem lies the Spectra package that provides the core infrastructure to handle, process and visualise MS data. Its design allows easy expansion to support existing and new file or data formats, including data representations with minimal memory footprint or remote data access. For proteomics data analysis, two packages in particular are dedicated to the analysis of quantitative and identification data. The PSMatch package handles and manages peptide identification data. It also provides functions to model and visualise peptide-protein relations to make informed decisions about shared peptide filtering. The package also provides functions to calculate and visualise MS2 fragment ions, in conjunction with the Spectra package. The QFeatures package is the workhorse for quantitative proteomics data. It builds on the familiar SummarizedExperiment and MultiAssayExperiment infrastructure and provides a familiar Bioconductor user experience to manage bulk and single-cell quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable way.

For metabolomics data analysis, xcms is one of the core software packages for the required preprocessing of LC-MS data. This Bioconductor package was recently updated to reuse the R for Mass Spectrometry infrastructure, now also enabling the analysis of very large, and/or remote, data. In addition, this integration simplifies complete analysis workflows, which can include functionality from the MsFeatures package for compounding, and from the MetaboAnnotation package facilitating annotation of untargeted metabolomics experiments. Public annotation resources can be easily accessed through packages such as MsBackendMassbank, MsBackendMsp or CompoundDb, the latter also allowing users to create and manage lab-specific compound databases. These packages rely on the MsCoreUtils and MetaboCoreUtils packages for efficient implementations of commonly used algorithms, designed to be re-used by other R packages.

In contrast to a monolithic software design, the R for Mass Spectrometry ecosystem enables users to build customised, modular, and reproducible analysis workflows. Future proteomics- and metabolomics-related development will focus on improved data structures and analysis methods, better support for third-party data import, and better interoperability with other open source software.

Gatto, L., Gibb, S., & Rainer, J. (2025). An Open Software Development-based Ecosystem of R Packages for Mass Spectrometry Data Analysis. European Bioconductor Conference (EuroBioc2025), Barcelona, Spain. Zenodo. https://doi.org/10.5281/zenodo.17105729

Benchmark of Module Detection Methods for Single Cell Proteomics

Enes Sefa Ayar and Laurent Gatto

Proteins are the key molecules in executing biological functions, and they cooperate as part of protein complexes or biological pathways. Correlation in their abundance suggests functional interdependence, offering insights into biological functions. Thus, identifying biologically meaningful protein groups (modules) is a critical step in understanding cellular processes. While many module detection methods exist, they were developed for bulk or transcriptomic data and rely on the assumption that gene expression levels can identify functionally related protein groups.

Single-cell proteomics (SCP) quantifies protein levels at single-cell resolution, eliminating this assumption and offering a more accurate view of functional protein relationships. Moreover, SCP preserves cellular heterogeneity, enabling the discovery of dynamic and context-specific protein modules that are often masked in bulk measurements. Despite these advantages, existing module detection methods may not be well suited for SCP data, which presents unique challenges such as batch effects and missing values.

Moreover, these methods also differ by various features, for instance, whether they incorporate (or not) prior biological knowledge, whether they allow (or not) overlapping modules, or to what extent they use differential correlation analysis. Parameter choices further influence the identified modules, often leading to arbitrary decisions. However, all share a more critical limitation: they generate modules even when applied to random data. It is therefore essential to distinguish biologically relevant modules from artifacts.

In this work, we systematically evaluate module detection methods on SCP datasets. Our assessment framework integrates (1) internal clustering metrics to evaluate compactness and separation, (2) external validation against known biological annotations, and (3) network-based analyses incorporating protein-protein interaction data to enhance biological interpretability. Our findings reveal notable differences in the biological relevance of the identified modules and offer practical recommendations for selecting and validating module detection approaches for single-cell proteomics. Furthermore, we propose strategies for addressing missing values and batch effects, thereby improving the accuracy and reliability of module detection.

e-OMIX, a new visual interface for analyzing and managing omics data

Molka Anaghim Ftouhi, Loïc Guille, Jérôme Linden, Sébastien Jodogne and Laurent Gatto

The e-OMIX project (https://www.eomix.be) aims to lower the barrier of entry into omics research by providing an interactive platform where users will be able to perform analyses, as well as store the resulting data and metadata, without the need for advanced coding skills. e-OMIX is developed under the AGPL 3 license as an Angular/Java-based web-app, making use of several innovative technologies. Pre-built pipelines are implemented from nf-core, a repository of publicly available workflows, maximizing their reproducibility and ease of use. Result matrices are stored in a database optimized for fast querying of large datasets (TileDB) and can be exported as several object types, notably Bioconductor’s SingleCellExperiment (SCE) or others (anndata or Seurat), while metadata are stored as per-sample individual documents in a document-oriented database (CouchDB). To increase the interoperability of metadata, e-OMIX also offers the possibility to manage and export them using Fast Healthcare Interoperability Resources (FHIR), a widely used standard in healthcare and clinical research. Finally, data visualization is made possible by using the iSEE R/Bioconductor package. As a first use case, we demonstrate the end-to-end execution of a single-cell RNA-seq pipeline, starting from metadata and raw files upload, and leading to actionable data, such as annotated cell types, individual gene expression or marker gene identification.

Tools and Strategies for Systematic Benchmarking of R Packages: A Case Study with QFeatures

Léopold Guyot and Laurent Gatto

As bioinformatics continues to evolve, it must keep pace with experimental techniques that generate increasingly large volumes of data. In this context, optimizing the performance of code and packages that handle these data becomes essential. This work presents the optimization efforts carried out on the QFeatures R/Bioconductor package, which is used for the analysis of quantitative proteomics data. We also highlight a set of tools and methods that are valuable for performance optimization, with a particular focus on VerR, an R package designed to create isolated and reproducible environments. These environments allow for the installation of specific package versions, enabling systematic benchmarking to assess the performance impact of different versions. As a result of these optimization efforts, we observed a 90% reduction in the runtime of a classical single-cell proteomics (scp) workflow and a 50% decrease in memory usage, demonstrating the significant impact of targeted optimizations.
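The general idea of comparing the runtime and memory use of two versions of the same code can be sketched in a few lines. The example below is not the tooling described in the abstract and does not involve R or isolated environments: it is a generic Python illustration, using only the standard library, in which two made-up functions (`row_sums_v1` and `row_sums_v2`) stand in for two releases of a package.

```python
import time
import tracemalloc


def benchmark(func, *args, repeats=5):
    """Return the best wall-clock time (s) and the peak traced memory (bytes) of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    # Memory is measured in a separate run, since tracing slows execution down.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": best, "peak_bytes": peak}


# Two hypothetical "versions" of the same function, standing in for two package releases.
def row_sums_v1(rows):
    totals = []
    for row in rows:
        total = 0
        for value in row:
            total += value
        totals.append(total)
    return totals


def row_sums_v2(rows):
    return [sum(row) for row in rows]


if __name__ == "__main__":
    data = [list(range(200)) for _ in range(2000)]
    # Check that both versions agree before comparing their performance.
    assert row_sums_v1(data) == row_sums_v2(data)
    for name, func in [("v1", row_sums_v1), ("v2", row_sums_v2)]:
        print(name, benchmark(func, data))
```

Checking that both versions return the same result before timing them matters as much as the timing itself: a faster version that changes the output is not an optimisation.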

CBIO’s EuroBioc2024 posters and talks

2025-06-09T00:00:00+02:00

The lab is preparing for EuroBioc2025 in Barcelona, and I realise that I forgot to post our contributions to EuroBioc2024 in Oxford. So here they are (with papers that were published in the meantime).

Differential Correlation Analysis and Biological Function Inference on Single Cell Proteomics

Author(s): Enes Sefa Ayar, Laurent Gatto

Proteins are the key molecules in executing biological functions within cells. They operate in cooperation with other proteins to carry out these functions as part of protein complexes, or biological pathways. Thus, the correlation among these proteins implies a functional interdependence, offering insights into both biological functions and mechanisms. Differential correlation analysis promises to infer these biological functions and even underlying mechanisms by identifying similar or different correlation patterns in groups of proteins across conditions (e.g. cell types, treatments). However, current approaches, particularly those developed for bulk measurements, may not be suitable for single-cell proteomics (SCP) datasets as they may overlook false positives and false negatives emerging due to batch effects or missing values.

We aim to investigate the most suitable approach for uncovering functional correlation in SCP datasets. We compared two approaches used in SCP [1, 2] and two other network-based methods [3, 4], commonly used in RNAseq studies. This benchmark involves comparing these methods across various SCP datasets from the scpdata package, each with different properties including sample size, protein coverage, and missing values. Thus far, our observations indicate the importance of addressing batch effect-driven correlations. Our benchmark assesses the methods based on biological relevance, statistical significance, and data simulations.

References

[1] Hu, M., Zhang, Y., Yuan, Y., Ma, W., Zheng, Y., Gu, Q., & Xie, X. S. (2023). Correlated protein modules revealing functional coordination of interacting proteins are detected by single-cell proteomics. The Journal of Physical Chemistry B, 127(27), 6006–6014.

[2] Khan, S., Conover, R., Asthagiri, A. R., & Slavov, N. (2023). Dynamics of Single-Cell Protein Covariation during Epithelial–Mesenchymal Transition.

[3] Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9(1).

[4] Song, W.-M., & Zhang, B. (2015). Multiscale embedded gene co-expression network analysis. PLOS Computational Biology, 11(11).

Bulk vs single-cell proteomics: is there a need for identification optimization?

Author(s): Guillaume Deflandre, Samuel Grégoire, Laurent Gatto

Single-cell proteomics (SCP) has emerged as a powerful tool for elucidating cellular heterogeneity, offering opportunities beyond traditional bulk sample analysis. However, the application of current peptide identification methods crafted for bulk samples may lead to false discoveries in SCP. Challenges such as reduced peak counts, lower peak intensities, and degraded signal-to-noise ratios (as identified by Boekweg et al. [1]) raise the question: do current peptide scoring methods in search engines perform adequately in the context of SCP? To address these limitations, we explore the effectiveness of search engines and rescoring tools with the use of the Bioconductor packages PSMatch and Spectra. Rescoring tools take advantage of as many mass spectrometry-based features as possible, such as spectral characteristics and retention time models, which can be particularly relevant to mitigate the poor quality of SCP spectra. We used MS²Rescore to generate new features, Mokapot to rescore the SCP peptides as well as the above-mentioned packages to assess the efficiency of rescoring tools and potentially improve current scoring methods in the context of SCP. Our findings demonstrate a significant increase in confidently identified peptides upon rescoring. In addition, we suggest a 4-step methodology to evaluate the usefulness of current and new potential features. Finally, our results shed light on the differences between bulk and single-cell samples whilst providing insights that can inform more accurate and reliable data interpretation in the context of SCP.

References

[1] Hannah Boekweg and Samuel H. Payne. Challenges and Opportunities for Single-cell Computational Proteomics. Molecular & Cellular Proteomics, 22(4):100518, April 2023. ISSN 15359476. doi: 10.1016/j.mcpro.2023.100518. URL https://linkinghub.elsevier.com/retrieve/pii/S1535947623000282.

From Cancer-Testis genes to Cancer-Testis enhancers

Author(s): Julie Devis, Axelle Loriot, Charles De Smet and Laurent Gatto

Cancer-Testis (CT) genes are normally expressed only in germ cells and not in healthy somatic tissues. However, they are aberrantly activated in many tumours. Many CT genes are regulated by methylation. Their promoters are highly methylated in all healthy somatic tissues and demethylated in germ cells. They are also demethylated in tumours in which they are activated. This is a consequence of the global demethylation process often observed in cancer. These characteristics give them clinical potential, as they produce cancer-specific antigens and can thus be used as targets for cancer immunotherapy. We have recently developed the CTexploreR Bioconductor package, an updated database for CT genes.

Promoters are not the only regulatory regions that can be affected by DNA methylation. It has been shown that many enhancers, which are activating distal regulatory regions, can be methylated. Their methylation can be altered in tumours, affecting the expression of their target genes. We hence wondered if we could find CT enhancers that would behave like CT gene promoters. We compared ENCODE cis-regulatory elements and whole genome bisulfite-seq data in somatic and germinal healthy tissues and in cancer to find enhancers that are active and demethylated exclusively in germ cells and tumours. We identified CT-like enhancer candidates that will be further defined.

An Open Software Development-based Ecosystem of R Packages for Proteomics Data Analysis

Author(s): Laurent Gatto and RforMassSpectrometry contributors

The RforMassSpectrometry (https://www.rformassspectrometry.org/) initiative aims to develop an efficient, scalable, and stable infrastructure for mass spectrometry (MS) based proteomics (Gatto et al. poster) and metabolomics (Rainer et al. poster) data analysis. As part of this initiative, a growing ecosystem of R software packages is being developed covering different aspects of metabolomics and proteomics data analysis. To avoid the aforementioned problems, community contributions are fostered, and open development, documentation and long-term support emphasised.

At the heart of the package ecosystem lies the Spectra package that provides the core infrastructure to handle, process and visualise MS data. Its design allows easy expansion to support existing and new file or data formats, including data representations with minimal memory footprint or remote data access. For proteomics data analysis, two packages in particular are dedicated to the analysis of quantitative and identification data. The PSMatch package handles and manages peptide identification data. It also provides functions to model and visualise peptide-protein relations to make informed decisions about shared peptide filtering. The package also provides functions to calculate and visualise MS2 fragment ions, in conjunction with the Spectra package. The QFeatures package is the workhorse for quantitative proteomics data. It builds on the familiar SummarizedExperiment and MultiAssayExperiment infrastructure and provides a familiar Bioconductor user experience to manage bulk and single-cell quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable way. These three packages rely on MsCoreUtils for efficient implementations of commonly used algorithms, designed to be re-used by other R packages.

In contrast to a monolithic software design, the RforMassSpectrometry ecosystem enables users to build customised, modular, and reproducible analysis workflows. Future proteomics-related development will focus on improved data structures and analysis methods, better support for third-party data import, and better interoperability with other open source software, including a direct integration with Python MS libraries.

Publication

Loriot, Axelle, Julie Devis, Laurent Gatto, and Charles De Smet. 2025. “A Survey of Human Cancer-Germline Genes: Linking X Chromosome Localization, DNA Methylation and Sex-Biased Expression in Early Embryos.” bioRxiv. https://doi.org/10.1101/2025.05.19.654804.

Mass spectrometry-based proteomics/metabolomics and Bioconductor: from the early days to 2024

Author(s): Laurent Gatto, Sebastian Gibb, Johannes Rainer

The Bioconductor project has always been best known for its state-of-the-art infrastructure for genomics data analysis and comprehension. Starting with packages for microarrays, and later RNA sequencing, transcriptomics has been the most visible part of the Bioconductor iceberg. Proteomics has been part of the project since its early days, with the PROcess package to process SELDI-TOF-MS data, which was cited/documented in the very first Bioconductor paper (2004) and monograph (2005). Proteomics and metabolomics have grown substantially since these early days, in terms of packages, community contributions, and user base, culminating in the R for Mass Spectrometry initiative. In this short talk, I will provide an overview of how the mass spectrometry-based proteomics and metabolomics infrastructure has evolved since the early days, and what the goals for the future are.

Comprehensive and standardised workflow for single-cell proteomics data analysis using scp and scplainer.

Author(s): Samuel Grégoire, Christophe Vanderaa and Laurent Gatto

Single cell proteomics (SCP) via mass spectrometry has become achievable thanks to technological advancements innovated by various research teams, resulting in a broad landscape of cutting-edge methodologies [1]. While this progress has enabled the measurement of thousands of proteins at the single cell resolution, it has also resulted in various complex and divergent analysis workflows. To efficiently tackle biologically relevant questions, the field of SCP must confront the challenges inherent in SCP data. SCP data are particularly prone to technical variations, batch effects, and missing values [2].

To address these challenges, our team has developed several tools packaged within the scp R/Bioconductor package. The latest addition is the scplainer approach, which offers a standardised approach grounded in linear modeling. scplainer provides key tools to extract meaningful insights from SCP data through variance analysis, differential abundance analysis and component analysis, while streamlining the visualisation of the results. Integrated into the scp package, scplainer leverages QFeatures and SingleCellExperiment infrastructures, providing a comprehensive interface with numerous data processing functions. In addition, we also developed scpdata, a package containing standardised and annotated single-cell proteomics data, which we are still actively extending.

In this work, we provide a comprehensive overview of SCP data processing using the scp package, starting from the output table generated by the search engine software through data processing, modeling and downstream analyses.

References

[1] Petrosius and Schoof (2023), Recent advances in the field of single-cell proteomics.

[2] Vanderaa and Gatto (2021), Replication of single-cell proteomics data reveals important computational challenges.

Publication

Grégoire, Samuel, Christophe Vanderaa, Sébastien Pyr dit Ruys, Christopher Kune, Gabriel Mazzucchelli, Didier Vertommen, and Laurent Gatto. 2024. “Standardized Workflow for Mass-Spectrometry-Based Single-Cell Proteomics Data Processing and Analysis Using the Scp Package.” In Methods in Molecular Biology, 177–220. Methods in Molecular Biology (Clifton, N.J.). New York, NY: Springer US (pre-print).

scpGUI and QFeaturesGUI: Graphical Interfaces for Single-Cell and Bulk Proteomics

Author(s): Léopold Guyot, Christophe Vanderaa, Laurent Gatto

In recent years, significant advancements have been made in the field of proteomics data analysis. However, the complexity of workflows involving programming languages such as R and Python can pose challenges for practitioners without any coding backgrounds. To address this issue, we introduce two user-friendly packages: scpGUI and QFeaturesGUI.

scpGUI is tailored for downstream visual analysis in single-cell proteomics, using the outcomes of the popular package scp. Developed using Shiny and built upon the elegant and efficient iSEE suite, scpGUI offers interactive data visualisations specifically crafted for single-cell proteomics downstream analysis. The app’s interactivity helps users in comprehending their results through various visualisations.

Similarly, QFeaturesGUI is designed for single-cell and bulk proteomics analysis, capitalising on the strengths of the QFeatures package. Providing a suite of Shiny apps, QFeaturesGUI offers comprehensive tools for data import and basic processing steps, simplifying pre-treatment of quantitative proteomics data. Its modular design ensures flexibility and adaptability to specific requirements. These apps also enhance transparency and facilitate replication by generating reproducible R code.

Together, scpGUI and QFeaturesGUI offer valuable support for proteomics data analysis. By combining user-friendly graphical interfaces with powerful back-end tools, they make advanced analysis techniques more accessible to the wider proteomics community.

Batch effect detection and visual quality control with CytoMDS, a Bioconductor package for low dimensional representation of distances between cytometry samples

Author(s): Philippe Hauchamps, Dan Lin, Laurent Gatto

Quality Control (QC) of samples is an essential preliminary step in cytometry data analysis. Notably, identification of potential batch effects and sample outliers is paramount, to avoid mistaking these effects for true biological signal in downstream analyses. However, this task can prove to be delicate and tedious, especially for datasets with many samples.

Here, we present CytoMDS, a Bioconductor package implementing a dedicated method for low dimensional representation of cytometry samples composed of marker expressions for up to millions of single cells. This method combines Earth Mover’s Distance (EMD) [1] for assessing dissimilarities between multidimensional distributions, and Multi Dimensional Scaling (MDS) [2] for low dimensional projection of distances. Some additional visual tools, both for projection quality diagnosis and for user interpretation of the projection axes, are also provided in the package.

We demonstrate the strengths and advantages of CytoMDS for QC of cytometry data on real biological datasets, revealing the presence of low quality samples, batch effects and biological signal between sample groups.
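To make the distance step described above more concrete, here is a minimal Python sketch of pairwise Earth Mover's Distances between samples. It is an illustration under stated assumptions, not CytoMDS code: each sample is a hypothetical dictionary mapping marker names to expression values, all samples are assumed to have the same number of cells, and the distance between two samples is taken as the sum of one-dimensional EMDs over markers. The marker names and values are made up.

```python
from itertools import combinations


def emd_1d(xs, ys):
    """1D Earth Mover's Distance between two equally sized samples.

    For equal sizes this is the mean absolute difference between
    the sorted values of the two samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def sample_distance(s1, s2):
    """Sum of per-marker 1D EMDs (an assumption of this sketch)."""
    return sum(emd_1d(s1[marker], s2[marker]) for marker in s1)


def distance_matrix(samples):
    """Symmetric matrix of pairwise sample distances."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = sample_distance(samples[i], samples[j])
    return d


# Toy data: samples A and B are similar, C is shifted (e.g. a batch effect).
samples = [
    {"CD3": [0.0, 1.0, 2.0], "CD4": [5.0, 6.0, 7.0]},  # A
    {"CD3": [0.5, 1.5, 2.5], "CD4": [5.0, 6.0, 7.0]},  # B
    {"CD3": [3.0, 4.0, 5.0], "CD4": [7.0, 8.0, 9.0]},  # C
]

for row in distance_matrix(samples):
    print(row)
# [0.0, 0.5, 5.0]
# [0.5, 0.0, 4.5]
# [5.0, 4.5, 0.0]
```

The resulting matrix is what a Multi Dimensional Scaling routine would take as input to place the samples in two dimensions. That projection step is not shown here. In this toy example, sample C already stands out by its large distances to A and B.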

References

[1] Haidong Yi and Natalie Stanley. 2022. “CytoEMD: Detecting and Visualizing between-Sample Variation in Relation to Phenotype with Earth Mover’s Distance.” In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 1–14. BCB ’22 28. New York, NY, USA: Association for Computing Machinery.

[2] Jan de Leeuw and Patrick Mair. 2009. “Multidimensional Scaling Using Majorization: SMACOF in R.” Journal of Statistical Software 31 (3): 1–30.

Publication

Hauchamps, Philippe, Simon Delandre, Stéphane T. Temmerman, Dan Lin, and Laurent Gatto. 2025. “Visual Quality Control with CytoMDS, a Bioconductor Package for Low Dimensional Representation of Cytometry Sample Distances.” Cytometry. Part A: The Journal of the International Society for Analytical Cytology, March. https://doi.org/10.1002/cyto.a.24921 (pre-print).

6 Questions We Should Ask Before Adopting a New Technology

2025-05-18T00:00:00+02:00

I stumbled across this message on mastodon, that directed me to 6 questions that one should ask before adopting a new technology. These questions were posed by Neil Postman, a writer and academic, quite some time ago. A reply in that same thread adds a link to a 1988 talk where he expands on them.

What is the problem that this new technology addresses?
Whose problem is it?
What problems do we create by solving this problem?
Which people and which institutions might be harmed by a technological solution?
What changes in language occur as the result of technological change?
Which people and which institutions will acquire economic and political power when this technology is adopted?

These questions and the whole talk are acutely relevant today, in the light of generative artificial intelligence, and how it is imposed on everyone.

Whenever I hear about advantages or usages of generative AI, I always think about the problem they supposedly address, whether they actually solve it rather than its symptoms, and what would be needed to actually solve the problem. And unsurprisingly, generative AI hardly ever solves any real problem when it comes to research or education. The latter is also addressed in Neil Postman’s talk above.

Here are two books that are on my reading list, that are relevant to the talk and topic:

The AI Con - How to Fight Big Tech’s Hype and Create the Future We Want, by Alex Hanna and Emily M. Bender
The Mechanic and the Luddite - A Ruthless Criticism of Technology and Capitalism, by Jathan Sadowski.

Here is another set of questions, specifically about AI/LLMs, that are proposed by Dr Gwen Varley in her Ethics of ChatGPT and AI video. The questions come towards the end of the video (31:18), but I suggest you listen to the whole thing.

Whenever we face a new technology, especially when branded as new and inevitable by those that sell it, we should ask ourselves:

Is it actually new? In what way, specifically?
- If it’s not new, who is trying to present it as novel, and why? What is the narrative? Whose technology are they attempting to rebrand, and why?
Who (if anyone) was asking for this technology?
- Is it a hammer in search of nails?
  - Does it have a discernible business model?
Who benefits from it? How do they benefit?
- (Profits? Political power? Shifts in social norms?)
What are the costs of the technology? Who bears those costs?
- Includes economic but also environmental and social costs.
What resources (including labour - whose?) are powering this technology? Are any of these resources being stolen or gained through exploitation?
How is the introduction of this technology similar to (or different from) historical examples of past technologies? What can we learn from this?
- Does the technology claim to be ‘neutral’ or ‘unbiased’?
  - Technology is a mirror (of the values of its creators)

On cats, farts and parrots.

2024-05-30T00:00:00+02:00

I was invited to contribute to a seminar/discussion on AI, and using language models in research. These are the notes I prepared for my short presentation. I am not presenting anything new, or original - I will merely be sharing what I consider to be the main take home messages from information I have been collecting since April 2023. I also realised that these don’t seem to be widely known among my immediate peers.

Who am I

I feel it is important to say a few words to put my talk into perspective:

I am a Computational Biologist, heading the CBIO lab at the de Duve Institute, UCLouvain. We occasionally use/develop DL in the CBIO lab as part of our research. I am not an expert in AI.
I have never used ChatGPT or any similar tools, and I’ll tell you why! I however actively follow discussions on AI and its impact on society.
I will focus on ChatGPT and similar LMs that are released by large and very powerful commercial entities for wide public consumption. I am not focusing on application of DL, LM, or AI in general in research.

Reference

If there’s one reference to remember, it is this seminal paper from March 2021, published in the Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21):

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell.

Emily Bender is a professor of computational linguistics at the University of Washington. Angelina McMillan-Major is/was a PhD student in her lab.
Timnit Gebru is one of the most well-known and respected Black female scientists working in AI. She was a co-lead of Google’s ethical AI research team. In December 2020, her employment with Google ended after Google management asked her to either withdraw the paper before publication, or remove the names of all the Google employees from the paper.
Margaret Mitchell was later also fired from Google.

More about the Stochastic parrot paper here.

When can I use ChatGPT?

Following Yves Deville and Christine Jacqmot’s recommendations (ChatGPT : Menace ou opportunité pour l’enseignement supérieur, March 2023):

Given that

LLMs have absolutely no notion of “true” or “false”, nor any understanding of what they are asked.

Use it if

You don’t care about the validity of the results.
You are an expert in the field.

Note: a lot has been said about ChatGPT’s “occasional” hallucinations (beware of the anthropomorphising word here). LLMs always hallucinate. It just so happens that sometimes what is made up is not wrong. I will come back to some of these points in the later stochastic parrot section.

At what cost?

Many have tested ChatGPT. Some (New AI tools much hyped but not much used, study says) possibly make regular use of the free and/or the paid version. It might be used for important or minor/mundane tasks. But at what cost?

Human cost

Human cost is real and current. It is not a potential science-fiction picture of AI vs humanity. Such a picture diminishes the current human cost of AI, as is force-fed by big tech.

Here are some investigations about worker exploitation at OpenAI and Google with respect to the curation of AI-generated content:

TIME: OpenAI Used Kenyan Workers on Less Than $2 Per Hour.
The Guardian: ‘It’s destroyed me completely’: Kenyan moderators decry toll of training of AI models.
Business Insider: Kenyan Workers Paid $2/hr Labeled Horrific Content for OpenAI.

and

Fortune: I’m paid $14 an hour to rate AI-generated Google search results. Subcontractors like me do key work but don’t get fair wages or benefits.

Note that this isn’t specific to ChatGPT. Similar worker exploitation has been documented for Meta/Facebook reviewers from the Global South.

Already marginalised communities suffer the highest human cost.

Environmental cost

Here are some relevant articles and illustrative quotes:

Nature Machine Intelligence: The carbon impact of artificial intelligence.
technologyreview.com: We’re getting a better idea of AI’s true carbon footprint.
Nature Climate Change: Aligning artificial intelligence with climate change mitigation.
nature.com: Generative AI’s environmental costs are soaring - and mostly secret.
nature.com: How to shrink AI’s ballooning carbon footprint.
The Guardian: The ugly truth behind ChatGPT: AI is guzzling resources at planet-eating rates.

Despite its name, the infrastructure used by the “cloud” accounts for more global greenhouse emissions than commercial flights. In 2018, for instance, the 5bn YouTube hits for the viral song Despacito used the same amount of energy it would take to heat 40,000 US homes annually.

Furthermore, while minerals such as lithium and cobalt are most commonly associated with batteries in the motor sector, they are also crucial for the batteries used in datacentres. The extraction process often involves significant water usage and can lead to pollution, undermining water security. The extraction of these minerals are also often linked to human rights violations and poor labour standards. Trying to achieve one climate goal of limiting our dependence on fossil fuels can compromise another goal, of ensuring everyone has a safe and accessible water supply.
The Guardian: As the AI industry booms, what toll will it take on the environment? (citing - Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model)

[Luccioni et al.] tallied the amount of energy used to train […] Bloom, on a supercomputer; the energy used to manufacture the supercomputer’s hardware and maintain its infrastructure; and the electricity used to run the program once it launched. They found that it generated about 50 metric tons of carbon dioxide emissions, the equivalent of an individual taking about 60 flights between London and New York.

By contrast, limited publicly available data suggests about 500 metric tonnes of CO2 were produced just in the training of ChatGPT’s GPT3 model 3 – the equivalent of over a million miles driven by average gasoline-powered cars, the researchers noted.

Even more unclear is the amount of water consumed in the creation and use of various AI models. Data centers use water in evaporative cooling systems to keep equipment from overheating. One non-peer-reviewed study, led by researchers at UC Riverside, estimates that training GPT3 in Microsoft’s state-of-the-art US data centers could potentially have consumed 700,000 liters of freshwater.
theconversation.com: AI has a large and growing carbon footprint, but there are potential solutions on the horizon.

Since the AI boom started in the early 2010s, the energy requirements of AI systems known as large language models (LLMs) – the type of technology that’s behind ChatGPT – have gone up by a factor of 300,000. With the increasing ubiquity and complexity of AI models, this trend is going to continue, potentially making AI a significant contributor of CO₂ emissions. In fact, our current estimates could be lower than AI’s actual carbon footprint due to a lack of standard and accurate techniques for measuring AI-related emissions.
tomshardware.com: AI may eventually consume a quarter of America’s power by 2030, warns Arm CEO.
bloomberg.com: Microsoft’s AI Investment Imperils Climate Goal As Emissions Jump 30%.

How ironic!!

“The company’s goal to be carbon negative by 2030 is harder to reach, but President Brad Smith says the good AI can do for the world will outweigh its environmental impact.”

Note that this is also relevant for other cloud services, such as video on demand (detail for Netflix here).

Already marginalised communities (will) suffer the highest environmental cost.

Intellectual property

Where does all that training data come from?

What about the credit and licensing of the text, voice and images of those who produced the data used for training?

Stochastic parrot

I’ll borrow here directly from the paper, to highlight specific issues with the vast amounts of data needed to train these large models, and the (absence of) meaning output by the models.

Unfathomable training data

Size doesn’t guarantee diversity: from initial participation, to data filtering, the data reflect the hegemonic viewpoint.
Data is static, but social views change.
Bias is encoded and amplified in the training data, in particular stereotypical associations and negative sentiment towards specific groups.
Large data and the lack of curation, documentation and accountability lead to a major documentation debt, that can’t be addressed after the fact.

Systematic bias against already marginalised communities.

Stochastic parrot

Coherence is in the eye of the beholder

There is no meaning, no model of the world, no intent to communicate in ChatGPT’s output.
Perceived “fluency” and “confidence” give the illusion of (implicit) meaning and expertise.
We tend to mistake the coherence of LLM outputs for meaningful text or expertise.

Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.

It is important to note that, in addition to highlighting the risks, the authors do propose paths forward for LM research and development.

AI contamination

AI-generated text is already ubiquitous online, and it is becoming more and more difficult to identify. How long until AI-generated (meaningless) text (including answers on Q&A sites) will be (or already is) re-used for training?

Outlets are terminating journalists’ contracts to replace them with AI, and independent writers are ‘competing’ against AI.

We have all faced AI chatbots in so-called help-desks. But AI chatbots are intruding into online communities where people are trying to connect with other humans.

Both of these responses were lies. That child does not exist and neither do the camera or air conditioner. The answers came from an artificial intelligence chatbot.

According to a Meta help page, Meta AI will respond to a post in a group if someone explicitly tags it or if someone “asks a question in a post and no one responds within an hour.”

These are prime examples of enshittification (from Wikipedia):

Enshittification is the pattern of decreasing quality observed in online services and products such as Amazon, Facebook, Google Search, Twitter, Bandcamp, Reddit, Uber, and Unity. The term was used by writer Cory Doctorow in November 2022, and the American Dialect Society selected it as its 2023 Word of the Year. Doctorow has also used the term platform decay to describe the same concept.

ChatGPT in research

Reproducibility? AlphaFold3 — why did Nature publish it without its code?

When AlphaFold2 was published, the full underlying code was made accessible to all researchers. But AlphaFold3 comes with ‘pseudocode’ — a detailed description of what the code can do and how it works.

[…] for AlphaFold2, the DeepMind team worked with the European Molecular Biology Laboratory’s European Bioinformatics Institute […] Now, DeepMind has partnered with Isomorphic Labs, a London-based drug-development company owned by Google’s parent, Alphabet. In addition to the non-availability of the full code, there are other restrictions on the use of the tool — for example, in drug development. There are also daily limits on the numbers of predictions that individual researchers can perform.
Science journals ban listing of ChatGPT as co-author on papers
Paper writing (paper mills) and reviews (ChatGPT is polluting/influencing peer review).

Who benefits from ChatGPT/AI?

AI, as a hyped-up surveillance business model, force-fed by big tech:

In search engines (Google’s “AI Overviews”). Not the users.
Use your social media photos, posts, info, … to train AI. Not the users.
Facial recognition. Not the citizens.
Microsoft Windows Recall. Not the employees.

Already marginalised communities likely to benefit the least. Privileged communities to benefit the most.

What about regulations?

In the light of what has been said so far, I think it is reasonable to wonder whether regulations shouldn’t be put in place, to address current and future impact and scope of the technologies put in place, their concrete risks and harms, and their implications in terms of systematic (private) data collection and use. Every major big tech company is investing vast amounts of money in AI technologies, data centres, and data collection. And they are demanding returns on these investments.

These same companies are actively lobbying to secure support for their vested interests. This becomes clear when reviewing their involvement in various working groups and how AI is framed and communicated to the public and various stakeholders.

Here are two examples, one very recent from the Guardian, and one that directly relates to the influence of Silicon Valley on academia:

OpenAI forms safety council as it trains latest artificial intelligence model: The safety committee is filled with company insiders, including Sam Altman, the OpenAI CEO, and its chairman, Bret Taylor, and four OpenAI technical and policy experts. It also includes the board members Adam D’Angelo, who is the CEO of Quora, and Nicole Seligman, a former Sony general counsel.
How Big Tech Manipulates Academia to Avoid Regulation: The discourse of “ethical AI” was aligned strategically with a Silicon Valley effort seeking to avoid legally enforceable restrictions of controversial technologies.

Conclusions

Despite some notable failures with ‘AI for public consumption’, one can’t ignore that there are also success stories, and possibly still untapped opportunities. But …

AI can be kind of useful, but I’m not sure that a “kind of useful” tool justifies the harm.

AI isn’t useless. But is it worth it?, Molly White

Update (2024-10-10)

There are many more relevant articles that could be added and referenced here, too many for me to keep up with. But the following Mastodon post by sneedy maccreedy and article it links to seem particularly relevant:

hinton getting the nobel is a good time to re-read @emilymbender ‘s excellent piece on so-called ‘AI safety’ and different take on hinton than you’re likely to see in the next few days

The article is Talking about a ‘schism’ is ahistorical by Emily M. Bender (also cited above), documenting the phony schism rhetoric of AI safety fantasy on one hand and very real AI ethics on the other.

LLL podcasts: What if we forgot about grades?

2024-04-24T00:00:00+02:00

Second post referencing a French-language podcast from the Louvain Learning Lab (the first one is here). This time, a bit of self-promotion:

Repensons l’évaluation… Et si on oubliait les notes? (Rethinking assessment… What if we forgot about grades?)

In this episode, we venture off the beaten track…

With my guests, we question the relationship between grades and learning. We open up the possibility of doing away with grades for as long as possible… in order to pursue a learning goal rather than a performance goal.

For this conversation, I had the pleasure of inviting Laurent Gatto, Professor at UCLouvain, as well as Pascal Wilhelm and Frank van den Berg from the University of Twente in the Netherlands.

Thanks to the Louvain Learning Lab for allowing us to devote time to this collection.

And thanks to Emilie Malcourant for preparing this podcast!

HUPO ECR Online Panel Discussion - Getting recognised for your work

2024-02-28T00:00:00+01:00

The HUPO Early Career Researcher (ECR) committee has organised a discussion panel on Getting recognised for your work and has asked me to participate - thank you! I am always keen on such events, organised by and for ECRs.

Introductions

The first part of the panel is a short 5-minute introduction of the panellists: Prof Stacy Malaker from Yale University, Dr Juan Antonio Vizcaino from the EMBL-EBI, and myself.

I prepared this career flow chart as a visual aid:

I earned my PhD in 2006, from the Free University of Brussels (ULB). My PhD work focused on the evaluation of different types of evolutionary genetic markers to study cetacean phylogeny.
During my PhD (probably around 2004 or so), when my work and interests shifted toward bioinformatics, I started a part-time degree in computer science at the University of Namur. That lasted until I left Belgium for the UK in 2010, after completing all my exams, but before finishing my master’s project - I never graduated.
After my PhD, I worked for 3 years in industry, in a small company (we probably were about 15 employees). I didn’t see much point in continuing in academia at that point, considering my experience so far and my personal situation. The environment and general atmosphere were very much like those of an academic lab, but with many more collaborations within the team, and clear and common objectives. This goal-oriented work environment was a very refreshing experience that has been influential for the next steps of my career. At some point, I felt I was starting to run in circles and got a chance to move back to academia, at the University of Cambridge nonetheless.
In 2010, I started a post-doctoral research associate (PDRA) position in the Cambridge Centre for Proteomics, working on mass spectrometry-based proteomics.
In 2013, I got promoted to senior research associate (SRA), which allowed me to earn some grants as main PI and develop a small research team.
In 2018, I joined the UCLouvain as a professor of bioinformatics. I teach in the faculty of pharmacy and biomedical sciences (FASB) and run the CBIO computational research group in the de Duve Institute.

An interesting fact is that I started working on DNA during my PhD, moved on with RNA in the private company, and since moving back to academia, I have been focusing on proteins: my career followed the main path of the central dogma of molecular biology.

To give more context to my career path, I also highlight some other activities and interests, that have guided and supported my academic activities.

I started to realise the importance of open and reproducible research around 2010, both with respect to the rigour of doing research, but also in the light of the (at times) oppressive and restrictive global research environment ECRs have to endure. The desire for others to benefit from my research by making it as open, collaborative and reproducible as possible, and being vocal about it, has followed me since then.
The Bioconductor project has been instrumental for me. It has allowed me over the years to meet and be influenced by outstanding scientists, and has offered an international environment in which I was able to grow and flourish. I published my first (now retired) Bioconductor package around 2007 (Bioconductor 2.2 and R 2.7), and many more followed. I am a Bioconductor package reviewer and a member of the European Bioconductor (EuroBioc) conference organisation committee (I was a local organiser for a handful of EuroBioc conferences in Cambridge, in 2019 in Brussels, and in 2023 in Ghent). I was until recently part of the Code of Conduct committee, am part of the social media working group, co-lead the Teaching committee, have been a member of the Technical Advisory Board since 2018, and co-created, in 2021, the European Bioconductor Society.

Recognition

The panellists were also asked to comment on how they have been recognised for their work, with a particular emphasis on their time as ECRs. This is of course very subjective, and I’m not sure if my answer will reflect how I have been recognised (assuming I have), or how I hope I have been. I will also try to look beyond papers, the obvious academic outputs - without those, there is little chance to get academic recognition. This is of course a major problem, as there’s much more to research than papers:

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and that complete set of instructions that generated the figures.

[Buckheit and Donoho 1995, after Claerbout]

I think I’m known for the computational development and applications in spatial (2010…) and single-cell proteomics (2018…), my efforts to produce open and reproducible research, open and collaborative software development, my R/Bioconductor contributions (some packages have been around since 2010) as well as my involvement in teaching, such as international workshops (for example the mythical - for me at least - Bioconductor CSAMA course workshop).

One noteworthy aspect of my publication strategy, that highlights my efforts for openness and reproducibility, is the workflow that typically starts with the release of the software (more often than not after review, on Bioconductor), then the publication of a pre-print with code to reproduce the analyses and, eventually, a peer-reviewed paper.

I think I have gained some reputation as someone having expertise in computational quantitative proteomics, including demonstrable technical skills, in addition to more standard scientific/academic output.

In terms of recognition, I suppose that invitations to give talks (for scientific outputs), teach at workshops (pedagogical and technical skills) and to submit papers are obvious goals. Being recognised for my open and collaborative contributions with a Bioconductor community award is one of my proudest moments. High on the list are also the many contributions that the MSnbase package benefited from - some of these indirectly initiated the collaborations that led to the creation of the R for Mass Spectrometry initiative.

But what matters the most, in my eyes, and what in the end is the most meaningful recognition, are the (shared) values that we promote with the research we do, and the intrinsic motivation that drives us.

Questions

We were also asked to prepare answers to three short questions. These have been pre-determined by the HUPO ECR to get things going and give us, the panellists, a chance to think about the comments we would like to make.

When hiring a new postdoctoral researcher for your group, what are the most important attributes for them to have on their CV? What do you look for other than publication history?

Here are a couple of things I look for, and that I consider absolutely essential, much more important than papers. Papers are only one of the attributes that will help me assess the following:

Do the candidate’s skills match the project’s needs?
What are concrete signs of mastery? I perform regular (and constructive) appraisals with the researchers in my group, and one question in that appraisal is “What do you want to become an expert in?”. In a CV, I want to find what the candidate is an expert in, what they can teach me/bring to the lab.
I also need to see public/open code, such as for example active Github/Gitlab profiles and repositories and contributions (to their own or others’ code bases).

And of course, last but not least, will the person be a good lab/team member? It is of course very difficult (and arguably subjective) to assess, but we (the lab) will be attentive to red flags pointing to the contrary. In case of doubt, I will invite the candidate on site if the interview (always with the whole group) was remote. I wouldn’t want to take any risks that could harm the cohesion and well-being of the group.

How would you recommend that ECRs promote their work other than research e.g., teaching, outreach, committee work? Is there anything they can do other than add a line on their CV?

Yes, ‘add lines’ to your CV, but not at all costs. Be pragmatic! No need to run or teach workshops several times a year to demonstrate that you have done some teaching. Don’t forget that your post-doc years should be the most productive research-wise of your career!
Promote your research, don’t be a vehicle for someone else’s research (typically your advisor’s), don’t limit yourself to merely doing it. Show how you go the extra mile - for example by delivering reproducible research.
Do things you like! Nothing beats motivation when it comes to convincing others that you are good at what you do.

How much does networking (either via social media or in-person meetings) play a role in promoting your work?

It is very important! Networking offers opportunities to learn, share, discuss and make yourself known. Networking can be hard though, so don’t be too hard on yourselves. It takes time.

Here’s a simple example illustrating the importance of building a network: I happily spare myself organising and running interviews when I can find a candidate in my direct or indirect network.

To build that network, there are of course in-person or remote conferences and workshops, and there may be social media (which might not be for everybody), but also Github issues, code and documentation contributions, and typo fixes. These large and small contributions are very concrete examples that address the first question above.

The Grid poster, in R

2024-02-18T00:00:00+01:00

The MuseeL is the UCLouvain University museum in Louvain-la-Neuve. Highly recommended. It’s located on the lively place des sciences, in a nice brutalist style building, formerly the science university library. If you ever spend some time in Louvain-la-Neuve, do spare a couple of hours to visit it.

As an academic and an ‘amis du musée’, I can get in for free, and sometimes enjoy the quiet and rather unique atmosphere to get some work done. The previous exhibition, named The Grid, was dedicated to the use of a grid in science. The poster and book of the exhibition, shown below, show a grid formed of smaller, slightly irregular squares. I thought this was a fun example to reproduce in R.

The first thing I need is to be able to draw squares. The plotSquare() function below plots one of width width, centred at positions x and y.

plotSquare <- function(x, y, width) {
    x1 <- x - (width / 2)
    y1 <- y - (width / 2)
    x2 <- x1 + width
    y2 <- y1 + width
    rect(x1, y1, x2, y2)
}

I want an nsq by nsq grid of squares. Below, I set that value to 10, to draw a total of 100 squares.

## Number of squares
nsq <- 10

I also want some jitter, i.e. some random displacements from a perfect 10 by 10 alignment, set by the amount variable.

## amount of square jittering
amount <- 1.2

Finally, I need to define how much space is dedicated to the border between the squares.

## border ratio
ratio <- 0.2

Assuming that the grid will have a width and a height of 100 (arbitrary) units, below I define the width sq_w of a square, considering the number of squares and the space that is dedicated to the border between squares. Once I have the width of a square, I can compute the width border_w of the border between two squares.

sq_w <- (100 / nsq) * (1 - ratio)
border_w <- (100 - (nsq * sq_w)) / (nsq + 1)
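As a quick sanity check of the two formulas above, here is a minimal sketch that plugs in the values used in this post (nsq = 10 and ratio = 0.2) and verifies that ten squares and eleven borders add up to the full width of 100 units. The variable total is only introduced here for illustration.

```r
## values used in this post
nsq <- 10
ratio <- 0.2

sq_w <- (100 / nsq) * (1 - ratio)
border_w <- (100 - (nsq * sq_w)) / (nsq + 1)

sq_w      ## 8 units per square
border_w  ## 20/11, i.e. about 1.82 units per border

## nsq squares and nsq + 1 borders fill the whole width
total <- nsq * sq_w + (nsq + 1) * border_w
total     ## 100
```

In other words, 80% of the width goes to the squares and the remaining 20 units are split into 11 equal borders.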

I can now compute the x and y position of my squares. Given that my final grid is a square itself, these x and y positions apply to rows and columns of squares.

pos <- seq(border_w, 100 - border_w,
           length.out = nsq)
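Since plotSquare() centres each square on the position it is given, the values in pos act as square centres. The sketch below, using the same parameters as above, inspects what that implies; the names centre_step, gap and left_edge are mine and are not part of the original script.

```r
nsq <- 10
ratio <- 0.2
sq_w <- (100 / nsq) * (1 - ratio)
border_w <- (100 - (nsq * sq_w)) / (nsq + 1)
pos <- seq(border_w, 100 - border_w, length.out = nsq)

## distance between two consecutive centres (about 10.71)
centre_step <- diff(pos)[1]

## space left between two neighbouring squares, before jittering
gap <- centre_step - sq_w

## left edge of the first square: slightly below zero
left_edge <- pos[1] - sq_w / 2
```

With these parameters the first square starts a little below zero, which is presumably why the plotting region defined further down starts at -2 rather than at 0.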

We can now produce the figure. I first define the margins of my plot with the par() function: the margins have width 1 and the outer margins 0. The plot() function doesn’t plot anything (type = "n"): no axes, no frame, no labels. It does however set up the coordinate system, ranging from -2 to 100, to accommodate my squares and borders.

par(mar = rep(1, 4), oma = rep(0, 4))
plot(-2:(100 - border_w + 2), -2:(100 - border_w + 2),
     type = "n", xaxt = "n", yaxt = "n",
     xlab = "", ylab = "",
     frame.plot = FALSE)

The last step is to place the squares. The x and y positions are symmetrical, i.e. defined by the pos variable above: the lines and columns are at pos[1], pos[2], …, respectively, and the squares are added line by line, starting with the line at pos[1]. A small amount of noise (defined by amount above) is added to the actual x and y positions by the jitter() function.

for (y in pos) {
    pos_x <- jitter(rep(y, nsq), amount = amount)
    pos_y <- jitter(pos, amount = amount)
    plotSquare(pos_x, pos_y, sq_w)
}
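To get a feel for what jitter() does in the loop above: when called with a positive amount, it adds uniform noise drawn between -amount and +amount to each value. The small sketch below checks this on a vector of identical positions; the variable names and the value 50 are only illustrative.

```r
set.seed(123)
amount <- 1.2

## 1000 identical positions, as in rep(y, nsq) above
x <- rep(50, 1000)
xj <- jitter(x, amount = amount)

## every displacement stays within [-amount, amount]
range(xj - x)
all(abs(xj - x) <= amount)
```

A larger amount therefore produces a messier grid, while a value close to (but above) 0 gives an almost perfect alignment.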

The final output (with set.seed(123)), with the parameters above, is shown here.

The full script is available here. The fun part is of course to play with the parameters, which is left as an exercise for the reader :-).
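To make playing with the parameters easier, the steps above could be wrapped into a single function. What follows is only a sketch: drawGrid() is a hypothetical helper that is not part of the original script, and, unlike the plot() call above, it computes the plotting limits from the parameters so that the squares still fit when nsq, ratio or amount change.

```r
plotSquare <- function(x, y, width) {
    x1 <- x - (width / 2)
    y1 <- y - (width / 2)
    rect(x1, y1, x1 + width, y1 + width)
}

## drawGrid() is a hypothetical wrapper around the steps of this post
drawGrid <- function(nsq = 10, amount = 1.2, ratio = 0.2, seed = 123) {
    set.seed(seed)
    sq_w <- (100 / nsq) * (1 - ratio)
    border_w <- (100 - (nsq * sq_w)) / (nsq + 1)
    pos <- seq(border_w, 100 - border_w, length.out = nsq)
    ## limits wide enough for the outermost squares and their jitter
    ## (this differs from the fixed -2 to 100 range used above)
    lim <- c(border_w - sq_w / 2 - amount,
             100 - border_w + sq_w / 2 + amount)
    par(mar = rep(1, 4), oma = rep(0, 4))
    plot(lim, lim,
         type = "n", xaxt = "n", yaxt = "n",
         xlab = "", ylab = "",
         frame.plot = FALSE)
    centres <- vector("list", nsq)
    for (i in seq_along(pos)) {
        pos_x <- jitter(rep(pos[i], nsq), amount = amount)
        pos_y <- jitter(pos, amount = amount)
        plotSquare(pos_x, pos_y, sq_w)
        centres[[i]] <- data.frame(x = pos_x, y = pos_y)
    }
    ## return the jittered centres invisibly, handy for checks
    invisible(do.call(rbind, centres))
}

## more, smaller and messier squares
centres <- drawGrid(nsq = 20, amount = 2, ratio = 0.3)
```

Calling drawGrid() without arguments uses the parameters of this post; changing seed gives a different realisation of the same grid.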

Update 2025-04-07:

I came across this Mastodon post with this elegant code chunk to generate a similar figure:

library(tidyverse)
crossing(x = 0:10, y = x) |>
    mutate(dx  =  rnorm(n(), 0, (y/20)^1.5),
           dy  =  rnorm(n(), 0, (y/20)^1.5)) |>
    ggplot() +
    geom_tile(aes(x = x+dx, y = y+dy, fill = y),
              colour = 'black', lwd = 2,
              width = 1, height = 1,
              alpha = 0.8, show.legend = FALSE) +
    scale_fill_gradient(high = '#9f025e', low = '#f9c929') +
    scale_y_reverse() + theme_void()

LLL podcasts: let’s rethink assessment

2024-02-17T00:00:00+01:00

For once, a post originally written in French, to draw your attention to the LLL podcasts. The LLL, or Louvain Learning Lab, supports everyone involved in education at UCLouvain in their teaching activities. On top of that, it’s a really great team!

Among their podcasts, there is one that looks at the assessment of students’ learning outcomes, produced and written by Emilie Malcourant, which I recommend.

Among the points discussed, there was the question

What would the ideal assessment be?

which prompted the following reflection.

Assessment must above all serve learning. To me, certifying (summative) assessment is a dead end, and I struggle to see how it is part of my teaching mission.

For me, an ideal assessment is a formality: an assessment that has no reason to exist, because the course’s instructors know that the students have mastered the material, so that the final assessment is merely a formality, and hence no longer necessary.

The goal of teaching would thus be to make certifying assessment irrelevant, to make sure that it is no longer pertinent, to make it impertinent.

Edit (2024-02-19): The emoticons in my simple chart above use the number 8 for the eyes, rather than the standard :, because the colon has a specific meaning in ditaa, the cool mini-language used to generate the figure, and these special characters can’t easily be escaped. But I just learnt that I was one character away from calling students an 8-E.