Schedule

time type author title
Wed. - Sep. 17, ’25
9:00 Opening Organizing committee Welcome to EuroBioC
9:15 Keynote Vince Carey A coherent ecosystem for genomic data science: 25 years of Bioconductor
9:45 Keynote Helena Crowell Colorectal cancer through the lens of whole transcriptome imaging
10:15 Break
11:00 Short talks
Vilhelm Suksi
There is increasing interest in the interplay of metabolism, pathologies and biological systems such as the immune system and the gut microbiome. Untargeted liquid chromatography-mass spectrometry (LC-MS) attracts many new practitioners in quantitative metabolomics research, largely due to its sensitivity and broad coverage of the metabolome: the small molecules in a biological sample. However, due to experimental reasons and the extensive data analysis, untargeted LC-MS metabolomics data analysis meets challenges with regard to quality and reproducibility. In an effort to meet these challenges, the notame R package was developed in parallel with an associated protocol article published in the “Metabolomics Data Processing and Data Analysis—Current Best Practices” special issue of the Metabolites journal. The focus is on a solid starting point for new practitioners. The main outcome is identifying interesting features for laborious downstream steps relating to biological context, such as metabolite identification and pathway analysis, which fall outside the purview of notame. To further promote quality and reproducibility, notame has now been substantially upgraded to be included in Bioconductor, a repository focused on high-quality open software for omics research.
Philippine Louail
Untargeted LC-MS/MS is a powerful approach for large-scale metabolomics studies, yet reproducible and efficient analysis of such data remains a major challenge. While R offers highly customizable workflows suited to diverse experimental and instrumental setups, the integration of specialized packages into coherent, scalable pipelines—especially for large cohort analyses—is often complex and fragmented. To address this gap, we present Metabonaut, an educational resource comprising a series of reproducible tutorials for untargeted LC-MS/MS metabolomics data analysis using R and Bioconductor. Built around a representative LC-MS/MS dataset, the tutorials demonstrate how to construct an end-to-end analysis workflow using tools such as xcms and other packages from the RforMassSpectrometry ecosystem. Each tutorial guides users step-by-step through the analysis process—from raw data preprocessing and feature detection to statistical analysis and annotation—emphasizing reproducibility, adaptability, and interoperability. As a case study, we include an analysis of human plasma samples comparing individuals with cardiovascular disease to healthy controls, illustrating quality control, normalization, and differential abundance analysis. Beyond core workflows, Metabonaut offers modules on data inspection and quality assessment, flexible alignment for integrating new data into existing preprocessed sets, and cross-language interoperability—highlighted through spectral annotation using Python’s matchms library. All tutorials are designed to be executable over time and can be used independently or combined into a comprehensive “super-vignette.” This work is supported by the European Union under the HORIZON-MSCA-2021 project 101073062: HUMAN – Harmonising and Unifying Blood Metabolic Analysis Networks.
Alexandre Segers
Technical advancements in mass spectrometry have enabled large-scale proteomics studies and even proteome profiling at the single cell level. However, the huge number of missing values and technical batch effects result in challenges for data analysis. This hinders a first key step in the data analysis workflow, i.e., dimensionality reduction, important for data exploration, visualization, and QC, as well as for downstream applications such as clustering cells. To this end, chained workflows are currently used that sequentially impute missing values, correct for batch effects, and perform conventional principal component analysis. However, their results depend on the order of the tools used in the workflow. Moreover, missingness is influenced by batch effects, so there is no guarantee that chained workflows lead to optimal, interpretable, or reliable results. We present omicsGMF, the first package for proteomics that integrates dimensionality reduction, batch correction and missing value imputation within a single framework. It builds on our sgdGMF framework, which uses a stochastic gradient descent for generalized matrix factorization and was previously used for dimensionality reduction on single cell RNA-sequencing data. omicsGMF is optimized for omics data and is here applied to a diverse set of bulk and single-cell proteomics data. omicsGMF can be easily used on SummarizedExperiment, SingleCellExperiment and QFeatures classes, and can model the data using different members of the exponential family, such as Gaussian, Poisson or negative binomial distributions. In this contribution we first show that omicsGMF improves the quality of dimensionality reduction and visualization, while simultaneously addressing batch effect removal and missing values. Second, omicsGMF can assist the user to select the optimal dimensionality by cross-validation, which improves downstream analysis. 
Furthermore, we illustrate how it can be used for imputation of missing values, resulting in better imputation performance than state-of-the-art imputation tools, which in turn leads to superior sensitivity and specificity in downstream differential abundance analyses. By providing an all-in-one solution for dimensionality reduction, batch correction and imputation that is highly interpretable, omicsGMF addresses a critical gap in proteomics data analysis for single cell and large-scale applications.
Geraldson T. Muluh
Understanding how microbial communities change over time is essential for studying ecological stability, disease progression, and treatment response. Longitudinal microbiome studies enable these insights but pose analytical challenges, including repeated measures, missing data, batch effects, and the complexity of temporal modeling. miaTime is a new R/Bioconductor package designed to directly address these challenges. miaTime extends the mia (Microbiome Analysis) framework with a standardized, scalable approach for managing, preprocessing, and analyzing temporal microbiome data. Like other mia tools, it builds on TreeSummarizedExperiment (TreeSE), an advanced data structure for storing microbiome feature tables linked to phylogenetic trees. The package implements a suite of analysis methods tailored for time-series microbiome studies, including measures of community stability, detection of bimodal abundance patterns, identification of short-term changes, and quantification of stepwise and baseline divergence. Designed for full interoperability within the R/Bioconductor ecosystem, miaTime advances longitudinal microbiome research across clinical, environmental, and systems biology applications.
Johannes Rainer
Mass spectrometry (MS) is a key technology used across multiple fields, including biomedical research and life sciences. Technological advancements result in increasingly large and complex data sets, and analyses must be tailored to the experimental and instrumental setups. Excellent software libraries for such data analysis are available in both R and Python, including R packages from the RforMassSpectrometry initiative and Python libraries like matchms, spectrum_utils, Pyteomics and pyOpenMS. Having partially complementary functionality, these software packages cover different aspects of MS-based proteomics or metabolomics data analysis. The reticulate R package provides an R interface to Python, enabling interoperability between the two programming languages. Here we present the SpectriPy R package, which builds upon reticulate and provides functionality to efficiently translate between R and Python MS data structures. It can convert between R’s Spectra::Spectra and Python’s matchms.Spectrum and spectrum_utils.spectrum.MsmsSpectrum objects and includes functionality to directly apply routines from the Python matchms library (spectral similarity, filtering, normalization, etc.) to MS data in R. SpectriPy hence enables and simplifies the integration of R and Python for MS data analysis, empowering data analysts to benefit from the full power of algorithms in both programming languages. Furthermore, software developers can reuse algorithms across languages rather than re-implementing them, enhancing efficiency and collaboration.
Florian Auer
RCX2 is an R package implementing the CX2 specification, the next-generation of the Cytoscape Exchange format designed to address limitations of its predecessor. CX2 introduces a simplified data model, optimized for memory-efficient processing of large-scale networks via enhanced streaming capabilities, and a more compact serialization to improve data transfer rates and reduce storage requirements. RCX2 provides a comprehensive programmatic interface within R for creating, modifying, serializing, and deserializing biological networks conforming to the CX2 standard. A critical design feature of the RCX and RCX2 packages is their inherent interoperability. The underlying data models facilitate direct, lossless conversion between each other, and as a consequence between the CX and CX2 formats. This deliberate compatibility ensures a smooth transition for users adopting the newer standard while maintaining the ability to interact with existing CX-formatted data and collaborate with researchers using the original specification. Although RCX and RCX2 employ distinct internal data handling strategies tailored to the nuances of each format, their synergistic relationship provides a unified and forward-compatible solution for the R-based biological network research community. Together, these packages offer a robust framework for managing biological network data across different versions of the Cytoscape Exchange format, future-proofing analytical pipelines and fostering seamless data exchange within the broader network biology landscape.
Lucas Beerland
Recent advancements in mass spectrometry-based single-cell proteomics (SCP) have enabled the characterization of cellular heterogeneity across conditions at unprecedented resolution. However, current SCP data analysis workflows still focus on comparing average protein abundances and overlook informative distributional changes in shape, such as differences in variability and/or modality, limiting the advantage of SCP over bulk proteomics. We therefore propose a novel statistical framework for SCP data to infer distributional differences between conditions. Our method builds on Lindsey’s method, which recasts density estimation as a Poisson regression problem, i.e., by fitting smooth histograms with a large number of equally spaced bins using a basis function expansion. In this contribution we enable interpretable inference on differences in the abundance distributions between conditions by using interactions between the spline basis and an experimental factor. We illustrate how our modelling approach can prioritize proteins that exhibit differential distributions across conditions using likelihood ratio tests and Wald tests. These tests assess the omnibus null hypothesis of a common density across conditions. Next, we develop tests on pairwise contrasts between the groupwise smoothers to infer regions with distributional differences between conditions. This approach also enabled us to provide our users with intuitive plots that visualize the density estimators in both conditions, highlight regions with differential distributions, and have a one-to-one relation to the models and hypothesis tests that we propose. Finally, we evaluate and illustrate our novel framework using simulation studies and real SCP experiments. These analyses show that our comprehensive distributional analysis framework is a first step to leverage the wealth of information in SCP data. 
It offers a novel perspective to study single-cell heterogeneity and to compare the protein abundance distribution in populations of single cells that differ in cell type, biological conditions, or treatment.
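As an illustration of the idea behind Lindsey's method mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. It bins the data into equally spaced bins, fits a Poisson log-linear model to the bin counts, and rescales the fitted counts into a density. The function name `lindsey_density` is hypothetical, and a plain polynomial basis is swapped in for the spline basis the abstract describes, purely to keep the example self-contained.

```python
import numpy as np

def lindsey_density(x, n_bins=50, degree=4, n_iter=100, tol=1e-10):
    """Sketch of Lindsey's method: density estimation via Poisson regression
    on histogram counts. Assumption: a polynomial basis stands in for splines."""
    counts, edges = np.histogram(x, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]

    # Standardise the bin midpoints so the polynomial basis is well conditioned.
    z = (mids - mids.mean()) / mids.std()
    X = np.vander(z, degree + 1, increasing=True)  # first column is the intercept

    # Iteratively reweighted least squares for a Poisson GLM with log link,
    # initialised from slightly shifted counts to avoid log(0).
    eta = np.log(counts + 0.5)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(eta)
        working = eta + (counts - mu) / mu      # working response
        XtWX = X.T @ (mu[:, None] * X)          # weights equal mu for Poisson
        XtWz = X.T @ (mu * working)
        beta_new = np.linalg.solve(XtWX, XtWz)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        if converged:
            break

    fitted = np.exp(eta)
    # Fitted counts sum to the sample size, so this rescaling integrates to 1.
    density = fitted / (counts.sum() * width)
    return mids, density
```

Comparing conditions, as the abstract describes, would amount to adding a group indicator and its interactions with the basis columns to the design matrix; that step is not shown here.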
12:30 Lunch
13:45 Keynote Susan Holmes Latent variables as the best medicine for heterogeneity
14:15 Poster pitches
14:45 Short talks
Jacopo Ronchi
MicroRNAs (miRNAs) are critical regulators of gene expression, implicated in nearly all cellular processes and frequently associated with pathological development. Given their broad functional significance, systematic and accurate characterization of miRNA dysregulation is essential to uncover the molecular underpinnings of disease. While high-throughput technologies such as microarrays and miRNA-Seq enable the quantification of small RNA transcripts, deciphering their functional roles in disease mechanisms remains a major challenge. This is partly due to the lack of standardized, integrative frameworks for miRNA-mRNA analysis and the widespread reliance on outdated or inappropriate methodologies, which often yield poorly reproducible results and limited insights into miRNA regulatory networks. Moreover, the absence of statistical models tailored for non-sample-matched datasets significantly hampers the exploitation of many existing miRNA datasets, further restricting their biological interpretability. To address these limitations, we present MIRit, an open-source, comprehensive R framework designed to support rigorous, state-of-the-art integrative analyses of miRNA and mRNA data. MIRit guides users through all critical steps, including differential expression analysis of miRNAs and mRNAs, and identification of miRNA-target pairs using ensemble-based prediction methods combined with validated interactions. It further integrates miRNA and mRNA expression profiles using statistically robust techniques such as partial correlation analysis, rotation gene set tests, and one-sided association tests. Beyond integration, MIRit offers a suite of tools for exploring dysregulated miRNA networks, including the identification of disease-associated variants affecting miRNA expression and the assessment of the functional impact of miRNA dysregulation.
Najla Abassi
Differential expression analysis (DEA) and functional enrichment analysis (FEA) are core steps in transcriptomic workflows, enabling researchers to detect and interpret biological differences between conditions. However, managing the outputs of these analyses across multiple contrasts becomes increasingly overwhelming, especially when the number of conditions (defined by complex experimental designs) and analysis tools increases. This challenge is further amplified in single-cell RNA-seq, where pseudo-bulk analyses generate numerous results tables across cell types and contrasts. As a result, efficient management, exploration, interpretation and reproducibility of these outputs become a significant challenge, even for experienced practitioners. To address these issues, we present DeeDeeExperiment, a new R/S4 class that extends the widely adopted SummarizedExperiment core Bioconductor object. DeeDeeExperiment provides a structured and consistent framework to organize DEA and FEA results alongside the core expression data and metadata, enabling users to retrieve and explore analysis outputs across multiple contrasts in a coherent and easy manner. DeeDeeExperiment is fully compatible with the Bioconductor ecosystem, promoting reproducibility and enabling seamless integration with existing popular visualization tools, such as GeneTonic and iSEE, while also being easy to plug into existing workflows that are based on “classical” SummarizedExperiment objects and their derivatives. DeeDeeExperiment is publicly available under the MIT license at https://github.com/imbeimainz/DeeDeeExperiment
Justine Leclerc
Multi-condition single-cell RNA sequencing reveals how gene expression varies across conditions within specific cell populations. Most current methods model these changes by fitting gene-wise generalized linear models to read counts, and then detect differential expression using statistical tests such as the likelihood-ratio test or the quasi-likelihood F-test. Each gene \(\times\) cell pair occurs only once in the dataset and is observed under a single condition. Predicting counterfactual expression levels (e.g., estimating how a cell would express genes under treatment even if observed under control) increases sample size by imputing expression values for all conditions. This might reduce false discovery rates and improve detection power, both key to efficient statistical testing. The method LEMUR addresses this imputation task using a type of PCA that learns for each cell a low-rank structure per condition. The flexibility of the model makes uncertainty more challenging to quantify, an open problem as stated by the authors of LEMUR. Our conformeR is a wrapper around the LEMUR R package that adds uncertainty quantification without altering the underlying model. conformeR constructs prediction intervals for LEMUR-imputed values with finite-sample coverage guarantees using conformal prediction, which relies on the assumption that cells from the same biological replicate and cell type are exchangeable. By inverting conformal prediction intervals, conformeR outputs for each gene \(\times\) cell pair a p-value that encodes the difference between the observed and predicted expression levels. We then adjust these p-values using the Benjamini-Hochberg procedure, leveraging the fact that conformal p-values satisfy the positive regression dependence on a subset condition. Finally, conformeR aggregates over cells from the same replicate and cell type to yield a single p-value per gene. We evaluate conformeR on scRNA-seq data from 864 patients with varying Covid-19 severity. 
The large cohort allows us to account for patient-level variability and demonstrate the robustness and power of conformeR.
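To make the two statistical ingredients named in this abstract concrete, here is a minimal generic sketch of a conformal p-value followed by Benjamini-Hochberg adjustment. It is the textbook form of both procedures and is not taken from conformeR or LEMUR; the function names and the choice of nonconformity score (for example an absolute residual between observed and predicted expression) are assumptions made for illustration.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Conformal p-values from nonconformity scores (larger = more unusual).
    Assumption: calibration and test scores are exchangeable under the null."""
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = cal.size
    # Number of calibration scores greater than or equal to each test score.
    n_ge = n - np.searchsorted(cal, test, side="left")
    return (1 + n_ge) / (n + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)
    return adj
```

A test score larger than every calibration score receives the smallest attainable p-value, 1 / (n + 1), so the size of the calibration set bounds the resolution of the test.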
Lizhong Chen
edgeR is an R/Bioconductor software package for differential analyses of sequencing data in the form of read counts for genes or genomic features. Over the past 15 years, edgeR has been a popular choice for statistical analysis of data from sequencing technologies such as RNA-seq or ChIP-seq. This year, we announce edgeR version 4 with expanded functionality and improved support for small counts and larger datasets. We introduced a new quasi-likelihood (QL) method in edgeR 4.0.0, using adjusted deviance statistics to yield very nearly unbiased quasi-dispersion estimators even for fitted values close to zero. We estimate only a constant NB-dispersion from the most highly expressed genes, which is fast even for larger datasets. We updated the empirical Bayes hyperparameter estimation in edgeR 4.4.0 to give improved performance when the residual degrees of freedom are unequal and possibly small. Recently, we introduced a new diffSplice method for alternative splicing analysis in edgeR 4.6.0. With this expanded functionality, we can apply edgeR to single-cell RNA-seq data analysis. For example, we can identify highly variable genes using the goodness-of-fit test for one sample or multiple samples. We can also find marker genes for the cell clusters using a one-vs-the-average-of-others approach instead of the pseudo-bulk approach, which only works for more than two samples. In addition, the new diffSplice method is designed for alternative splicing analysis by differential transcript usage or differential exon usage. We are continuing to work on the edgeR project to expand its functionality and introduce new statistical methods in the future, such as testing relative to a fold-change threshold and introducing sample weights to account for variations in sample quality.
Koen Van den Berge
Gene expression is the primary modality being studied to differentiate between biological cells. Contemporary single-cell studies simultaneously measure genome-wide transcription levels for thousands of individual cells in a single experiment. While the characterization of cell population differences has often occurred through differential expression analysis, tiny effect sizes become statistically significant when thousands of cells are available for each population, compromising biological interpretation. Moreover, these large studies have spurred the development of methods to infer gene regulatory networks (GRNs) directly from the data, and GRN databases are becoming more comprehensive. In this work, we propose a statistical model for gene expression measures and an inference method that leverage GRNs to deconvolve transcription factor (TF) activity from gene expression, by probabilistically assigning mRNA molecules to TFs. This shifts the paradigm from investigating gene expression differences to regulatory differences at the level of TF activity, aiding interpretation and allowing prioritization of a limited number of TFs responsible for significant contributions to the observed gene expression differences. The inferred TF activities result in intuitive prioritization of TFs in terms of the (difference in) estimated number of molecules they produce, in contrast to other widely-used methods relying on arbitrary enrichment scores. Our model allows the incorporation of prior information on the regulatory potential between each TF and target gene through prior distributions, and is able to deal with both repressing and activating interactions. We compare our approach to other TF activity estimation methods using simulation experiments and case studies.
Nicolò Gnoato
Cancer is a complex pathological condition that originates from the accumulation of genetic mutations, which can manifest as both point mutations in single nucleotides and structural modifications of the genome, such as copy number variations (CNVs). Evidence has demonstrated that CNVs significantly alter gene expression levels, disrupting normal cellular mechanisms and promoting uncontrolled cell proliferation. Tumour formation, resulting from this abnormal growth, ultimately damages surrounding tissues and impairs physiological functions. Therefore, characterising CNVs is crucial for investigating the mechanisms underlying tumorigenesis and tumour progression. Currently, cell classification largely relies on marker genes, an approach limited by marker selection biases and methodological efficiency, particularly in cancer cell detection. To overcome these limitations, we are developing an R package aimed at effectively stratifying normal and tumour cells using their CNV profiles inferred from single-cell RNA sequencing (scRNA-seq) data. Our approach integrates several metrics, including copy number burden (CNB), ploidy, CNV signature features, and homologous recombination deficiency (HRD), to robustly identify tumour cells. Given the known instability of cancer genomes, our strategy calculates CNV-based scores across the entire transcriptome and within specific genomic regions known to be associated with distinct tumour subtypes. This scoring enhances classification accuracy, providing greater confidence in distinguishing tumour from non-tumour cells. Ultimately, our methodology offers a refined approach to cancer cell identification and characterization through comprehensive CNV analysis, potentially advancing our understanding of tumour biology and informing therapeutic strategies.
Anna C E De Lima Tanada
Singular and individual gene approaches became obsolete with the emergence of omics technologies. For example, omics-focused studies of biological processes can paint a more holistic picture of complex diseases, such as cancer or neurodegenerative diseases. From this perspective, we developed our R package MOSClip, which was recently added to Bioconductor. MOSClip integrates multiple omics by implementing graph theory methods for topological analysis of biological pathways. It performs dimensionality reduction, which can be done at two levels: whole pathways or decomposition of pathways into modules. Then, MOSClip can either evaluate the impact of each pathway and omic on patients’ prognosis through survival analysis, or perform a two-class analysis. Additionally, MOSClip offers multiple graphical tools to aid in the visualization of the results. The latest released version of our R package supports bulk RNA-sequencing, methylation, mutation, and copy number variation data. A new addition to MOSClip is the support for ATAC-seq data, and we are further developing it to accept single-cell data. To illustrate and confirm the utility and performance of this package, we conducted multiple case studies using multi-omics data from different complex diseases. Specifically, we focused on the new functionalities of MOSClip, including the two-class analysis. MOSClip is a valuable tool for its ability to give powerful insights into complex diseases that could otherwise pass undetected. This is achieved thanks to the topological analysis and the built-in MOSClip graph functions, allowing quick and intuitive interpretation of the results. In summary, we showed that MOSClip can extract powerful information from complex and intricate data from omics integration, demonstrating its importance for researching several diseases.
Igor Cervenka
The European Genome-phenome Archive (EGA) offers secure, long-term storage and controlled access to personally identifiable genetic and phenotypic datasets. While the EGA’s website enables data submission, manual entry is error-prone and time-consuming for larger datasets. Crafting and validating the complex metadata payloads needed for dataset deposition remains a persistent bottleneck for many laboratories. We present Rega, an open-source R package that leverages the API provided by EGA, enabling programmatic interaction. Rega simplifies and systematizes metadata submission by coupling an intuitive, GEO-style Excel template with a robust, extensible R interface that (1) transforms: converts user-filled spreadsheets or in-memory R data frames into EGA-compliant JSON payloads; (2) validates: performs schema-aware checks on samples, experiments, datasets and analyses, flagging structural or semantic errors before transmission; and (3) uploads: leverages the EGA REST API with built-in retries and granular progress reporting, delivering a reduction in submission time compared to manual web-form entry. By abstracting away low-level API intricacies and unifying metadata management in a familiar spreadsheet-to-R paradigm, Rega lowers the barrier to data sharing, enhances reproducibility and encourages timely deposition of sensitive genomic resources. Rega is released under the Artistic license and is hosted on GitHub, with comprehensive vignettes.
16:30 Poster session
Thu. - Sep. 18, ’25
9:00 Keynote James Sharpe C3PO: Cell 3D Positioning by Optical encoding and its application to spatial transcriptomics
9:30 Short talks
Charlotte Soneson
Single molecule footprinting is an increasingly used assay to study gene regulation and chromatin biology using enzymatic modification of accessible DNA, followed by detection of modifications by sequencing. The resulting data provide genome-wide, rich measurements of accessibility and DNA modifications at near-base pair resolution and relatively low cost. However, the representation of such data is not yet standardised and only a few tools have been specifically developed to handle it. Here we present footprintR, a new R package that provides a framework for representing and analysing single molecule footprinting data. Read-level and summarized data can be imported from standard file types and are stored together in a single SummarizedExperiment container, using the newly developed NaMatrix from the SparseArray Bioconductor package for efficient representation and computation. We apply footprintR to genome-wide 6mA and 5mC footprinting data obtained using nanopore long read sequencing and illustrate how it can be used to address various biological questions, including footprint scoring, nucleosome placement and measurement of nucleosome spacing, as well as detection of differentially methylated or accessible regions. footprintR also contains flexible and powerful visualisation functionalities. Single molecule footprinting is becoming an increasingly important part of the epigenetics and gene regulation research toolbox. We hope that footprintR with its efficient data representation and analysis functions will facilitate the analysis of such data for R users.
Ning Shen
MeDIP-seq is an enrichment-based DNA methylation profiling technique that measures the abundance of methylated DNA. While this technique offers efficiency advantages over direct methylation profiling, it does not provide absolute quantification of DNA methylation necessary for cell type deconvolution. We introduce decemedip, a Bayesian hierarchical model for cell type deconvolution of methylated sequencing data that leverages reference atlases derived from direct methylation profiling. We demonstrate its accuracy and robustness through simulation studies and validation on cross-platform measurements, and highlight its utility in identifying tissue-specific and cancer-associated methylation signatures using MeDIP-seq profiling of patient-derived xenografts and cell-free DNA. decemedip is available at .
Jiayi Wang
The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) has become widely adopted for assessing chromatin accessibility due to its speed, simplicity, and low input requirements. These advantages make ATAC-seq particularly well-suited for identifying transcription factors (TFs) that mediate regulatory changes between cell types or states. However, technical biases inherent to ATAC-seq data can complicate downstream analyses, especially the accurate identification of differentially active TFs. A recent benchmarking study identified a variation of chromVAR as the preferred method for this task, but we observed that it suffers from an incomplete and computationally intensive bias correction strategy and overlooks the footprint characteristics of TF binding. To address these limitations, we developed two alternative approaches: the weighted and insertion models. The weighted model corrects GC-content and fragment length biases in ATAC-seq data by assigning weights to fragments, followed by cyclic loess normalization to calculate peak weights for enrichment bias correction. The resulting unbiased peak accessibility matrix is then used for differential motif accessibility analysis with limma-voom. Unlike chromVAR, which generates motif-level scores, the unbiased peak-level accessibility count matrix produced by the weighted model enables broader downstream analyses, such as peak-level differential accessibility analysis and bias-corrected coverage profiles for visualization. We therefore envision that the weighted model will have broader applicability to bulk epigenetic data, providing unbiased information for a variety of downstream analyses at multiple levels. The insertion model enhances differential motif accessibility analysis by incorporating Tn5 footprint patterns around motif sites, assigning weighted insertion scores per motif match. 
Bias correction is performed using chromVAR-like background peaks, and differential motif accessibility is assessed using z-scores and limma. Each of these two models offers enhanced statistical performance, with the weighted model effectively correcting various technical biases. As a future direction, we aim to integrate the weighted and insertion models into a unified framework that leverages both fragment-level bias correction and motif-specific footprint signals for more robust differential motif accessibility analysis.
10:15 Break
11:00 Short talks
Maria Doyle
For over 20 years, Bioconductor (https://bioconductor.org/) has provided an open-source ecosystem for reproducible genomic data analysis, supporting over 2,000 R-based packages with 1M+ annual unique downloads worldwide. However, access to bioinformatics training remains a challenge in Africa, limiting the adoption of these tools. To bridge this gap, Bioconductor is expanding its training initiatives and community collaborations across the continent. Through the Bioconductor Carpentry program, we have trained 31 instructors in the last two years, developed structured training materials on R/Bioconductor, RNA-seq and single-cell analysis, and delivered 28+ workshops globally. To address the growing demand for in-person training, we have partnered with local bioinformatics leaders to deliver on-site workshops in East and West Africa. In March 2025, we held a highly successful week-long workshop in Nairobi, Kenya, in collaboration with the International Institute of Tropical Agriculture (IITA) and international experts. The course provided hands-on training in genomic data analysis and RNA-seq workflows, enabling participants to apply Bioconductor tools to their own datasets (more details: training.bioconductor.org/workshops/2025-03-Nairobi). We plan to run a similar course in West Africa in late 2025 / early 2026. We aim to connect with and support the African bioinformatics community by:
- Expanding instructor networks to increase local training capacity.
- Identifying needs and gaps in bioinformatics training and tools to ensure our initiatives are needs-based.
- Developing region-specific curricula for One Health research, including pathogen and agricultural genomics.
- Collaborating with African bioinformatics institutions to scale training opportunities.
- Providing opportunities to get involved in developing training materials in local languages.
This presentation will highlight progress to date, lessons learned, and future opportunities, inviting researchers and educators from around the world to join Bioconductor’s growing global community and contribute to bioinformatics education.
Luca Marconato
The rapid growth and increasing diversity of spatial profiling technologies, combined with the demand for accelerated innovation cycles, have created a fragmented landscape characterized by incompatible file formats and isolated analysis methods. This fragmentation significantly hinders scientific reproducibility. To address these challenges, scverse, in collaboration with napari and researchers from the Open Microscopy Environment (OME), developed the SpatialData framework. This framework provides a language-agnostic, standardized storage format derived from the OME-Zarr format and OME-NGFF specification, specifically designed to maximize interoperability for spatial omics. Additionally, the SpatialData framework includes a suite of Python libraries tailored for high-performance processing and flexible visualization of spatial omics data. SpatialData aims to ease interoperability of analysis and visualization tools via its standardized storage format. In this talk, I will share recent developments of the SpatialData framework, emphasizing efforts aimed at extending interoperability beyond Python. I will present preliminary results from ongoing collaborations with the Bioconductor and bioimaging (napari, OME) communities, highlighting key challenges, implemented solutions, and our vision for fully interoperable, cross-language analyses of spatial omics data. These efforts aim to leverage the unique strengths of each programming language and their respective analysis communities. Furthermore, I will stress the critical importance of adopting open, reusable standards to mitigate redundancy and fragmentation. Specifically, I will discuss ongoing efforts within the bioimaging community to develop a canonical parser for OME-NGFF and OME-Zarr formats in Python, which will significantly enhance interoperability and facilitate the development of sustainable, maintainable software solutions.
Similar efforts within Bioconductor have the potential to significantly broaden accessibility, foster community collaboration, and unlock powerful, integrative spatial omics analyses across diverse programming environments.
Peiying Cai
Background: Spatially resolved transcriptomics (SRT) technologies measure gene expression while preserving spatial context. Several methods have been developed to identify spatially variable genes (SVGs), i.e., genes whose expression profiles vary across tissue. As multi-sample, multi-condition SRT datasets (e.g., healthy vs. diseased, different treatments, or time points) become more common, it can be of interest to identify differences between conditions, such as genes with different abundances or splicing patterns. However, current methods face several limitations: (i) lack of support for multiple samples and conditions; (ii) high computational demands; and (iii) inability to test for spatial changes of gene expression patterns across groups.
Methodology: Here, we present an extension of our previous DESpace framework for identifying genes with differential spatial patterns (DSP) across conditions in multi-sample settings. The method takes as input pre-computed spatial domains, i.e., regions of neighboring spots with similar expression profiles, usually identified by spatially resolved clustering tools. We then pseudo-bulk gene expression within each domain and sample, and use edgeR with spatial clusters and conditions as covariates. Intuitively, if the expression is significantly associated with the interaction terms between spatial clusters and condition, the spatial gene expression structure varies between conditions, hence indicating DSP. Key strengths of our method include: (i) modeling of multiple samples across multiple conditions, reducing uncertainty from individual samples and identifying genes with DSP across experimental conditions; (ii) region-specific testing, allowing investigation of the mRNA abundance changes between conditions in areas of particular interest; and (iii) computational efficiency and compatibility with diverse SRT platforms.
Benchmarking: We benchmarked our method against scran’s findMarkers, Seurat’s FindMarkers (with and without pseudo-bulk aggregation), and spatialLIBD, using semi-simulated datasets generated from two real SRT datasets, with four known spatial expression patterns. In simulations, our approach exhibits well-calibrated false discovery rates and higher true positive rates, while on real datasets, our framework identified more condition-associated genes than the competitors.
Availability: Our approach is available via Bioconductor: https://bioconductor.org/packages/DESpace. A pre-print (in preparation) will also follow in the coming weeks.
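The pseudo-bulk step described in the methodology can be sketched as follows. This is a minimal illustration in plain Python, assuming per-spot counts stored in dictionaries; the function name, data layout, and toy values are hypothetical and do not reflect DESpace's actual implementation, which works on Bioconductor objects and passes the aggregated counts to edgeR.

```python
from collections import defaultdict


def pseudo_bulk(counts, spot_sample, spot_domain):
    """Sum per-spot gene counts within each (sample, domain) pair.

    counts:      dict mapping spot id -> dict of gene -> count
    spot_sample: dict mapping spot id -> sample label
    spot_domain: dict mapping spot id -> spatial domain label
    """
    totals = defaultdict(lambda: defaultdict(int))
    for spot, genes in counts.items():
        key = (spot_sample[spot], spot_domain[spot])
        for gene, n in genes.items():
            totals[key][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in totals.items()}


# Toy data (hypothetical): four spots, two samples, two domains.
counts = {
    "s1": {"GeneA": 3, "GeneB": 0},
    "s2": {"GeneA": 2, "GeneB": 1},
    "s3": {"GeneA": 0, "GeneB": 5},
    "s4": {"GeneA": 1, "GeneB": 4},
}
spot_sample = {"s1": "sample1", "s2": "sample1", "s3": "sample1", "s4": "sample2"}
spot_domain = {"s1": "domainX", "s2": "domainX", "s3": "domainY", "s4": "domainY"}

bulk = pseudo_bulk(counts, spot_sample, spot_domain)
for key in sorted(bulk):
    print(key, bulk[key])
```

Each resulting (sample, domain) profile would then act as one observation in a count model with domain, condition, and their interaction as covariates.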
Martin Emons
Spatial omics technologies are generating vast amounts of spatial data, often with complex experimental designs, and the need for tested tools to quantify spatial features from these datasets is becoming increasingly important. There are many standard tools to quantify different aspects of spatial data (e.g. cellular colocalisation). Most of these tools do not provide a way to compare multiple samples across conditions with statistical models. The existing approaches that allow for such a comparison across samples focus on comparing scalar metrics. Our Bioconductor package, spatialFDA, takes a different approach by using spatial metrics that quantify so-called \(r\)-neighbourhoods with increasing radii \(r\), thereby constructing the spatial metric as a function of the radius \(r\). Instead of compressing these functions into scalar values, spatialFDA compares the entire functions between conditions using generalised additive mixed models with functional responses. This estimation framework makes it possible to account for the complex correlational structure of spatial omics datasets. The estimated coefficients of these models are themselves functions over the user-defined radius range, which allows the scale of the differential co-localisation effect between conditions to be interpreted. The methodology of spatialFDA was extended and improved to reflect the characteristics of spatial statistics functions. This entailed improved estimation of random errors, optimising the identifiability of the model and specifying suitable link functions. We show that spatialFDA can be used to find biologically meaningful differential co-localisation of cell types in different biological examples. We have developed simulation scenarios to investigate the performance of spatialFDA in a controlled setting and compare it to competing methods.
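To make the idea of a spatial metric as a function of the radius concrete, the sketch below counts, for each radius, the mean number of cells of one type within that distance of cells of another type. It is a deliberately simplified assumption: there is no edge correction or intensity normalisation, and the function name and coordinates are hypothetical rather than taken from spatialFDA.

```python
import math


def neighbourhood_curve(points_a, points_b, radii):
    """Mean number of type-B points within distance r of each type-A point.

    Returns one value per radius, i.e. the metric as a function of r.
    """
    curve = []
    for r in radii:
        total = 0
        for ax, ay in points_a:
            total += sum(
                1 for bx, by in points_b
                if math.hypot(ax - bx, ay - by) <= r
            )
        curve.append(total / len(points_a))
    return curve


# Toy coordinates (hypothetical) for two cell types in one sample.
points_a = [(0.0, 0.0), (10.0, 0.0)]
points_b = [(1.0, 0.0), (0.0, 2.0), (10.0, 3.0)]
radii = [1.0, 2.0, 3.0]

curve = neighbourhood_curve(points_a, points_b, radii)
print(curve)
```

Computing one such curve per sample yields functional observations that can be compared between conditions across the whole radius range instead of at a single summary value.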
Ilaria Billato
Histopathological images provide unparalleled insights into tissue architecture, cellular morphology, and tumor spatial organization. While these images are routinely used in cancer research and diagnosis, their computational analysis typically requires specialized software outside the R/Bioconductor ecosystem, the primary environment for other high-throughput omics analyses. This disconnection creates a significant barrier to integrative multi-modal analyses. To address this challenge, we aim to develop standardized workflows for processing raw histopathological images and making extracted features available in R, along with a comprehensive repository of pre-processed features from The Cancer Genome Atlas (TCGA) images. We developed a comprehensive image analysis pipeline that includes color normalization, tissue segmentation, tiling, and feature extraction at the nuclei and region-of-interest (ROI) levels. The pipeline leverages state-of-the-art deep learning models, including HoVer-Net for nuclear segmentation and classification, and outputs standardized data structures compatible with R/Bioconductor’s existing analytical frameworks. We validated this workflow by processing the complete collection of diagnostic H&E-stained whole slide images (WSIs) from TCGA. We established a standardized histopathology image analysis pipeline and implemented it on 11,765 diagnostic images from 9,640 cases across 25 cancer types. Extracted nuclei-level features included location, shape, texture, and classification (e.g., benign, neoplastic, stromal and necrotic), while ROI-level features captured tissue organization, cellular composition, and spatial relationships. We further demonstrated the use case of multi-modal integration across extracted image features, genomics, and transcriptomics data.
Finally, to enhance the usability of this resource, we developed ImageTCGA, a Shiny application for interactive exploration, filtering, and visualization of the extracted features with the original images. Our standardized workflow and feature repository bridge a critical gap between histopathological image analysis and multi-omics integration. By providing robust tools within the R/Bioconductor environment, along with pre-computed features from TCGA, we enable researchers to readily incorporate spatial and morphological insights into their analyses. ImageTCGA facilitates hypothesis generation and validation across diverse cancer types, potentially accelerating biomarker discovery and advancing our understanding of tumor-microenvironment interactions. Future work will focus on expanding the repository to include additional cancer image collections and developing new analytical methods for multi-modal data integration.
Maryna Chepeleva
Interpreting spatial multi-omics data in terms of biologically meaningful functional programs remains a key challenge in computational biology. We present recent developments and use cases for extracting such signals using our R package, consICA, which includes a signal deconvolution method followed by functional annotation. Originally developed for robust extraction of mixed signals from multi-omics data, consICA has now been extended to support spatial analysis. We applied consICA to the whole tissue slide in a spatial transcriptomics dataset. The resulting spatially resolved independent components revealed biologically interpretable gradients of transcriptional activity, including angiogenesis, immune response, and cell death programs. In parallel, we used ssGSEA on the same spatial matrices, treating spatial spots as samples. This enables local estimation of pathway activities at single-spot resolution, providing 2D tissue maps of metabolic and regulatory programs. In addition to spatial data, we demonstrate applications to multi-omics cancer datasets, including metabolome-proteome data from glioblastoma. This analysis uncovers coordinated functional signals across data modalities and supports the generation of interpretable biological hypotheses.
Liyang Fei
In cancer and stem cell research, comparing the clonal composition of pools of cells before and after perturbations can reveal important features such as treatment-resistance or multipotent clones. New clonal tracking technologies now allow us to track progeny cells back to their original progenitor cell using unique DNA barcodes that are passed from each original cell to its offspring. The barcode count for each sample indicates the number of progeny cells derived from the original cell. However, there is a lack of bioinformatic tools for robust statistical analysis of barcode count data from bulk sequencing of DNA barcodes. To address this, we introduce barbieQ, a Bioconductor R package designed for analysing barcode count data across groups of samples. barbieQ is built upon the widely used SummarizedExperiment data structure, making it interoperable with low-level processing tools. It provides novel functions for data cleaning, summarising, and visualising barcode count data, and implements two statistical tests: 1) Differential barcode proportion, which identifies barcodes with significantly changing proportions in one condition compared to another. 2) Differential barcode occurrence, which identifies barcodes more frequently seen in the samples of one condition compared to another. barbieQ can handle complex experimental designs using regression models to test a factor of interest while taking into account other variables. Both tests maintain a type I error rate below 5% under the null hypothesis. Notably, in analysis of a real world dataset of monkey blood stem cell differentiation, barbieQ accurately identified novel NK cell-specific clones whilst avoiding less specific barcodes, in contrast to the original analysis based on heuristic cutoffs and visualisations commonly used in the field. Overall, barbieQ is a powerful Bioconductor package with greater power to detect biologically meaningful changes compared to traditional methods.
The package is available for download in the development version of Bioconductor using BiocManager::install("barbieQ").
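The two quantities behind the tests named above, per-sample barcode proportion and barcode occurrence, can be sketched as below. This only illustrates the summary values, assuming counts held in nested dictionaries; the function names, the `min_count` threshold, and the toy data are hypothetical, and the regression-based testing that barbieQ performs is not reproduced here.

```python
def barcode_proportions(counts):
    """Convert raw barcode counts to within-sample proportions.

    counts: dict mapping sample -> dict of barcode -> count
    """
    props = {}
    for sample, barcodes in counts.items():
        total = sum(barcodes.values())
        props[sample] = {
            bc: (n / total if total else 0.0) for bc, n in barcodes.items()
        }
    return props


def barcode_occurrence(counts, min_count=1):
    """Flag each barcode as present (True) or absent (False) per sample.

    min_count is a hypothetical detection threshold.
    """
    return {
        sample: {bc: n >= min_count for bc, n in barcodes.items()}
        for sample, barcodes in counts.items()
    }


# Toy counts (hypothetical): one sample before and one after a perturbation.
counts = {
    "before": {"bc1": 50, "bc2": 50, "bc3": 0},
    "after": {"bc1": 90, "bc2": 10, "bc3": 0},
}

props = barcode_proportions(counts)
occurrence = barcode_occurrence(counts)
print(props["after"]["bc1"], occurrence["before"]["bc3"])
```

With several samples per condition, proportions and presence flags like these would be the responses modelled against the factor of interest and any other covariates.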
12:30 Lunch
13:45 Keynote Noelia Ferruz Controllable protein design with language models
14:15 Poster pitches
14:45 Workshops
Yixing Dong & Ellis Patrick
Spatial omics technologies have advanced rapidly over the past five years, spanning both sequencing-based and imaging-based platforms, with applications across diverse tissues such as the brain and cancer. However, despite these technological leaps, many current analysis pipelines have been adapted from single-cell RNA-seq workflows without sufficient tailoring to the unique spatial context. As a result, a comprehensive, statistically-informed framework for spatial transcriptomics analysis remains largely absent. To address this gap, we have collaboratively developed an open-source online book that serves as the most comprehensive Bioconductor-based resource for spatial transcriptomics to date (https://lmweber.org/OSTA/). This book offers step-by-step tutorials built around publicly available datasets, covering a broad spectrum of spatial technologies and resolutions, from Visium and Visium HD to Xenium and CosMx. We showcase workflows that incorporate spatial statistical methods for pattern detection, modeling spatial variability, and integrating multimodal data. Our resource also emphasizes reproducibility and interoperability between R and Python data classes and tools. In this workshop, we will present newly developed workflow chapters that establish various avenues of spatial omics data analysis. We will also demonstrate complete pipelines, from raw data ingestion to spatial visualization and biological interpretation, designed to be both user-friendly and extensible to emerging technologies. By promoting open standards and cross-language compatibility, our work aims to equip the spatial omics community with reproducible, robust, and accessible analytical tools.
Tuomas Borman
Because of the complex and high-dimensional nature of microbiome data, machine learning and other computational approaches have become an instrumental part of the researcher’s toolkit in this rapidly evolving field. There is an increasing need to develop robust and reproducible methods that take into account current and future trends in microbiome research such as multi-omics, expanding datasets, and longitudinal study designs. Previous solutions in microbiomics have fallen short in addressing these modern requirements, particularly in terms of scalability and data integration. To meet these challenges, the research community has extended the widely adopted SummarizedExperiment data container to TreeSummarizedExperiment, enabling support for microbiome-specific data structures such as hierarchical relationships. The framework is further enriched with the mia (Microbiome Analysis) package family and packages by independent developers, providing methods for common data operations, visualization, and advanced analytical approaches. This session will showcase the latest advances in microbiome data science within Bioconductor, focusing on the mia framework through a practical case study. We will also present Orchestrating Microbiome Analysis with Bioconductor, a freely available online book designed to promote best practices and facilitate adoption of the ecosystem. Together, these resources form a foundation for reproducible, scalable, and transparent microbiome data science, and they continue to evolve through active community contributions.
Christophe Vanderaa
Mass spectrometry (MS) has become a method of choice for exploring the proteome landscape that drives cellular functions. While technological advancements have significantly increased the sensitivity of MS instruments, obtaining reliable statistical results from these data remains a challenging and often tedious task. Many researchers continue to rely on ad-hoc analysis workflows due to a lack of clear guidelines, which can lead to violations of key statistical assumptions. In this workshop, we will offer a hands-on introduction to the msqrob2 package that provides a set of rigorously validated and benchmarked statistical workflows for MS-based proteomics. These workflows are built on the QFeatures framework for data processing. We will begin by familiarising participants with the input data format and the QFeatures data structure. From there, we will walk through the minimal data processing steps required prior to statistical modelling, explaining when and why each step is necessary. Next, we’ll explore the sources of variation inherent in proteomics data, highlighting their hierarchical structure and demonstrating how linear mixed models can properly account for these complexities. The modelling process will be carried out using msqrob2, which offers additional advantages such as robust and stabilised parameter estimation. Finally, we will demonstrate how to translate biological questions into hypothesis tests and how to prioritise proteomic markers that change in response to a condition of interest. Depending on the progress of the group, we will also briefly explore the emerging field of single-cell proteomics, discussing the additional challenges posed by these data. This workshop is designed for proteomics researchers who want to learn how to analyse their data using reproducible and statistically sound workflows, as well as for omics data analysts interested in expanding their skill set to include proteomics.
Alexandru Mahmoud
Since the adoption of Galaxy as the underlying platform for serving Bioconductor Workshops, the process for authoring, wrapping, and deploying workshops has evolved. The recent direct collaboration effort with the Galaxy Project further introduces new changes in the automation available to workshop authors, while also bringing better integration and new avenues of interoperating with Galaxy. This workshop by a Bioconductor Core Team member aims to 1) introduce users to the easiest path for authoring material, empowering more users to create Bioconductor Workshops, and 2) introduce users to new features available in the Galaxy platform, especially ways in which Bioconductor users can now benefit from data persistence as well as interoperability with thousands of tools and workflows available from Galaxy. This workshop does not require any prior knowledge, and aims to be useful to users, current workshop authors, and prospective new authors alike.
Federico Marini
iSEE (Interactive SummarizedExperiment Explorer) is a Bioconductor software package that provides a powerful and extendable multi-purpose visual interface for exploring data stored in a SummarizedExperiment object, using the R/shiny framework. Given the widespread adoption of SummarizedExperiment and its derivatives (SingleCellExperiment, SpatialExperiment, …) throughout the Bioconductor ecosystem and the smooth interoperability from other main frameworks (Seurat, Scanpy/AnnData), this package can be a ubiquitous companion across all the main steps of efficient analysis workflows, from the initial quality control all the way down to deploying data to accompany publications. In this workshop, we will provide an overview of its main functionality, displaying how the most common tasks in data exploration and interpretation can be achieved within this package, which delivers an efficient combination of interactivity and reproducibility. Attendees will be able to learn hands-on through a collection of vignettes that compose a masterclass-like full course on iSEE and its related packages (iSEEu, iSEEde, iSEEpathways, iSEEindex, and more). We aim to empower users in a “from zero to hero” format to explore in depth a wide spectrum of datasets, providing a free, natural, efficient, and customizable solution to achieve this within the Bioconductor project, and ultimately extract valuable insight from omics datasets in interdisciplinary settings.
16:30 Poster session
Fri. - Sep. 19, ’25
9:00 Keynote Jacques Serizay Enhancing genomic research with community-driven flexible software
9:30 Flash talks
10:30 Break
11:15 BoF sessions
12:45 Closing Organizing committee

Posters

(In alphabetical order.)

author title
Wed. - Sep. 17, ’25
Ahmed Salah
Radiation biology research increasingly relies on transcriptomic profiling to uncover the molecular mechanisms underlying cellular responses to ionizing radiation. However, the diversity of radiation types and their intensities, each with distinct energy deposition patterns and biological effects, poses a challenge for integrative analysis. To address this, we present DoReMiTra, a comprehensive and curated data package aggregating transcriptomic datasets from multiple experiments, either performed by our group or published by others, involving X-ray, gamma ray, and neutron exposures. DoReMiTra harmonizes data from publicly available high-throughput studies, standardizing metadata related to radiation dose, time post-exposure, biological model, and platform technology. It includes gene expression matrices, accompanying experimental metadata, and annotations that facilitate comparative analysis across radiation types and conditions. This data package not only streamlines access to a curated transcriptomic landscape of radiation effects, but is also designed for seamless integration with state-of-the-art R/Bioconductor analysis workflows. We demonstrate the utility of DoReMiTra through exploratory analyses identifying shared and radiation-specific gene expression signatures, dose-response patterns, and pathway enrichments. By enabling side-by-side comparisons of transcriptomic responses across radiation qualities, DoReMiTra serves as a valuable resource for radiobiology researchers and a key tool for developing biomarkers of radiation exposure in the context of radiotherapy and biodosimetry, encouraging reproducibility and meta-analyses in radiation genomics research.
Aitor Moruno-Cuenca
Motivation: Disease models serve as fundamental tools in drug discovery and early-stage drug development. However, these models are not a perfect reflection of human disease, and selecting a suitable model can be challenging. Although quantitative computational methods exist to inform this selection, approaches for molecular validation of pathophysiological resemblance to human conditions remain very limited at single-cell resolution, which can be critical for model selection. Quantifying the resemblance of disease models to the human condition with single-cell technologies in an explainable, integrative, and generalizable manner remains a significant challenge. Results: We present singIST, a computational method for comparative single-cell transcriptomics analysis between disease models and human conditions. singIST provides explainable quantitative measures of disease model similarity to the human condition at both pathway and cell type levels, highlighting the importance of each gene in the latter. These measures account for orthology, cell type presence in the disease model, cell type and gene importance in the human condition, and gene changes in the disease model measured as fold change. This is achieved within a unifying framework that controls for the intrinsic complexities of single-cell data. We tested our method using three well-characterized murine models of moderate-to-severe Atopic Dermatitis, demonstrating its ability to recapitulate established biological knowledge while generating novel hypotheses through pathway-level analysis. Availability and implementation: R library at https://github.com/DataScienceRD-Almirall/singIST and implementation code at https://github.com/DataScienceRD-Almirall/singIST_paper_results
Alex Cecchetto
As omics technologies continue to advance, there is a growing need for statistical methodologies that can effectively characterize gene expression at increasingly high resolution. Single-cell datasets, with their high dimensionality and heterogeneity, offer significant opportunities for uncovering novel biological insights but also pose analytical challenges due to their complex noise structures. Matrix factorization techniques have emerged as powerful tools for addressing these challenges, enabling dimensionality reduction and latent structure discovery. Generalized Matrix Factorization (GMF) models extend classical approaches by supporting a broader range of data types, making them particularly well-suited for high-dimensional single-cell data. The advent of spatially-resolved transcriptomics further expands the potential of these methods by preserving spatial context, which is critical for understanding tissue organization and cellular interactions. In this work, we explore the application of GMF models to spatial transcriptomics data, with a focus on leveraging spatial information to enhance interpretability and support the development of flexible, scalable tools for omics analysis.
Ali Mostafa Anwar
-Motivation- Differential expression analysis plays a vital role in omics research, as it enables the precise identification of features that are specifically associated with different phenotypes. This process is critical for uncovering biological differences between conditions, such as disease versus healthy states. In proteomics, several statistical methods have been used, ranging from simple t-tests to more sophisticated methods like limma and ROTS. -Results- In this study, we developed LimROTS, a hybrid method that integrates the linear model and the empirical Bayes method from the limma framework with Reproducibility-Optimized Statistics from ROTS by introducing a novel moderated ranking statistic, for robust and flexible analysis of proteomics data. We validated its performance using twenty-one proteomics gold standard spike-in datasets with different protein mixtures, MS instruments, and techniques for benchmarking. This hybrid approach improves accuracy and reproducibility in proteomics. Additionally, it could potentially be applied to other omics such as transcriptomics and metabolomics, making LimROTS a powerful tool for high-dimensional omics data analysis. -Availability and Implementation- LimROTS has been implemented as an R/Bioconductor package, available at https://bioconductor.org/packages/LimROTS/.
Anastasiia Horlova
Understanding how genetic networks enable animals to develop and maintain complex tissues is a central goal of biological research. The Drosophila midgut is a well-established and important model system that has enabled many discoveries that generalize to other animals, including humans. The gut has remarkable plasticity, high self-renewal, and is prominently exposed to environmental factors. Our research investigates how intestinal stem cells (ISCs) contribute to the generation of differentiated cell types in the Drosophila midgut. We use single-cell transcriptomic data from in vivo CRISPR gene knockdown experiments that specifically affect Drosophila melanogaster ISCs, in their native context within healthy adult animals. Our goal is to systematically examine the consequences of single gene perturbations on cellular behaviour. The dataset comprises scRNA-seq transcriptomic profiles of 427,549 cells from 61 replicate experiments of 28 unique genetic perturbations. In our analysis workflow we preprocess and filter the data to eliminate low-quality cells. We pool the wild type samples and integrate them via the Harmony algorithm. We use clustering and marker gene expression to annotate cell types. We annotate perturbed cells using a reference-based k-nearest neighbors classifier. We examine differential gene expression via pseudo-bulk DESeq2 analysis, and evaluate changes in cell-type proportions via compositional analysis. We use lineage inference to identify transcription factors involved in differentiation. Gene set enrichment analysis will help identify pathways associated with ISC differentiation. Shared pathways, transcription factors, and gene expression changes across perturbations will be analyzed to reveal underlying genetic network dynamics. We use dimensionality reduction and visual summaries (e.g., UMAPs, heatmaps, volcano plots) to support biological interpretation of the data.
Ultimately, this work aims to offer novel insights into genetic network dynamics, advancing our understanding of cellular differentiation and organ development.
Angelo Velle
In recent years, an increasing number of new techniques for omics data acquisition have been developed, so today we have access to large datasets containing different omics such as gene expression, methylation, copy number variation and miRNA expression data. Most of the standard approaches for omics data modeling rely on the comparison of just one data type among different groups of samples. Since the different omics are biologically related, it is fundamental to statistically detect their interplay. To do that, we need statistical models that take into account all the omics and try to detect the key molecular players involved. To meet this need, in this work we provide an easy-to-use R package for omics data integration. gINTomics is designed to detect the association between the expression of a target and of its regulators while taking into account their genomic modifications such as copy number variations and methylation. For RNA sequencing data, the counts will be fitted using a negative binomial model, while in the case of microarray or other types of data, a linear model will be applied. The gINTomics visualizer allows the visualization of the results for all the integrations performed. It is divided into four sections: Genomic Integration for the results regarding copy number variations and methylation; Transcription Integration for those regarding transcriptional networks (Transcription Factors and miRNA); Class Comparison for highlighting differences among classes defined by the user; Complete Integration for a comprehensive table with all the available results and a Circos plot for the visualization of different integrations. We are currently working on the development of new workflows to exploit gINTomics’ integration with single cell and spatial transcriptomics data through the generation of pseudo-bulk data. Moreover, we are developing new functions to integrate the omics at the single cell level.
We tested our package on 385 TCGA ovarian cancer samples. The analysis revealed that 88% of genes are regulated by copy number variation (CNV), while 38% are influenced by methylation. A total of 4,230 genes were found to be simultaneously regulated by both mechanisms, including SCYL3, FGR, and GCLC, which are involved in tumor progression and drug resistance. Key transcription factors such as TAL1, FOXP3, and GATA1 were identified, influencing the tumor microenvironment and immune response. The comparison between short- and long-term survivors highlighted molecular differences, with oncogenes amplified and tumor suppressors lost in poor prognosis cases. Finally, prognostic microRNAs were identified, including miR-34b/c in short-term survivors and miR-206 in long-term survivors. Our package provides solid multi-omics integration models coupled with a powerful Shiny app for results visualization.
Anna Bortolato
Genome-wide screenings across various omic technologies enhance the discovery of aberrations that synergistically contribute to the onset and progression of pathological processes. Integrating these heterogeneous data types remains challenging due to their high dimensionality and diverse features. MOSClip R package, recently made available on Bioconductor, implements a statistical method that leverages pathway topology for pathway analysis within a multi-omic framework. It supports the integration of multiple omics—such as gene expression, methylation, mutation, and copy number variation—through various dimensionality reduction strategies. Exploiting graph theory, analyses can be conducted at the level of entire pathways or pathway modules, identified through graph decomposition, allowing for a more detailed examination of affected blocks of genes within large pathways. MOSClip supports two main types of analysis: assessing the survival association of pathways or modules, and conducting two-class comparisons. Initially developed for bulk data, the two-class analysis is currently being extended to support multi-omic single-cell datasets. The package also provides multiple graphical tools to facilitate the visualization and interpretation of results. Developed with a modular structure in mind, MOSClip enables users to handle complex and large-scale datasets offering a range of customizable options, including flexible dimensionality reduction methods and statistical tests, aiming to efficiently deliver results with the best interpretability.
Antonino Zito
-Introduction- Missing values (MVs) reduce the completeness of biological datasets. Compared to other omics, proteomic data are more severely affected by MVs. Values can be missing at random (MAR) or not at random (MNAR). Both MAR and MNAR data points impact key data processing steps, including batch effect correction, clustering, and differential expression tests. Ultimately, MVs reduce discovery power. Removing MVs yields a fully complete dataset, but at the cost of losing potentially valuable information. Ideally, MVs should be accurately recovered, i.e., “imputed”. How MVs emerge in proteomics is not well understood. Likewise, it is still unclear which methods and stages are best for MV recovery. -Methods- To contribute to addressing these questions, we undertook a combinatorial approach. We used public proteomic datasets to: 1) perform a comparison with RNA-seq data; 2) assess the type of MVs; 3) inject MVs into complete real data and compare distinct imputation algorithms. To this end, we employed R meta-packages, including MSnbase, pcaMethods and NAguideR, to conveniently integrate single-value, global-structure and local-similarity-based imputation methods. Tested methods included SVD, LLS, BPCA, MsImpute, RF, KNN, QRILC, MLE, MinDet, MinProb; 4) improve the existing SVD-based imputation to achieve an enhanced algorithm exhibiting superior performance. -Preliminary data- Comparison of RNA-seq and proteomic data revealed substantially different data distributions. While RNA-seq data typically follow a negative binomial distribution populating the 0-15 range of values (log2), proteomic intensities may range from a minimum detected intensity of 20 up to 40 (log2). Proteomic data may harbor a gap between 0 and the minimum detected intensity. These undetected features (peptides or proteins) are reported as “NA” (MVs) rather than zero.
Owing to this distribution gap, reporting MVs as zero would skew fold-changes and other statistical measures for low intensities. Next, we assessed and compared the performance of distinct imputation algorithms. We injected MAR and MNAR MVs in complete real data and performed imputation. This approach allowed us to compare imputed points against the original (complete) values. We observed that both matrix factorization (SVD, NMF) and regression-based methods (RF, LLS, MLE) can model MAR and MNAR MVs. This challenges previous reports that model-based methods are only applicable to MAR values. Overall, variable behavior is observed across methods. RF and BPCA were often the best in terms of accuracy and error, but also among the slowest methods. Interestingly, svdImpute() provided a good balance in terms of accuracy, robustness and computational efficiency. It is also capable of imputing both MAR and MNAR MVs, scaling well to large sample sizes. Thus, we aimed to further improve the current svdImpute(). In svdImpute() the MVs are set to zero at the first iteration. PCA is often performed on centered data with zero mean; however, the imputed values should not be expected to have zero mean. This might contribute to a reduced performance of svdImpute(). To address this problem, we modified the way svdImpute() handles MVs at the first iteration. We also added irlba(). Altogether, our modified algorithm version, “svdImpute2”, exhibited a 40% improvement in computation speed when compared to svd() or prcomp(). -Novel aspect- svdImpute2() is more robust, and exhibits a lower error rate and runtime than svdImpute(). It scales very well on large datasets. -Conflict of interest- The presented work has been conceived and conducted at BigOmics Analytics. All authors are employees of BigOmics Analytics.
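The effect of the first-iteration fill on an SVD-style imputation can be illustrated with a small, self-contained Python sketch. It is not the svdImpute() or svdImpute2() code: the rank-1 power iteration, the column-mean starting value, the function names and the toy matrix are all assumptions made for illustration, since the abstract does not state how the first iteration was modified.

```python
import math

def rank1_approx(m, n_power=50):
    """Best rank-1 approximation of m (list of rows), via power iteration."""
    n_rows, n_cols = len(m), len(m[0])
    v = [1.0] * n_cols
    for _ in range(n_power):
        u = [sum(m[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(m[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]          # leading right singular vector
    u = [sum(m[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

def svd_impute(m, init="zero", n_rounds=1):
    """Fill None entries, then refine them with rank-1 reconstructions.

    init="zero" starts missing entries at 0; init="mean" (an assumed
    alternative) starts them at the mean of the observed column values.
    Assumes every column has at least one observed value.
    """
    n_rows, n_cols = len(m), len(m[0])
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if m[i][j] is None]
    col_means = []
    for j in range(n_cols):
        observed_col = [m[i][j] for i in range(n_rows) if m[i][j] is not None]
        col_means.append(sum(observed_col) / len(observed_col))
    filled = [[(0.0 if init == "zero" else col_means[j]) if x is None else float(x)
               for j, x in enumerate(row)] for row in m]
    for _ in range(n_rounds):
        approx = rank1_approx(filled)
        for i, j in missing:               # only missing cells are updated
            filled[i][j] = approx[i][j]
    return filled

# Toy rank-1 matrix with values far from zero; one value is hidden.
a, b = [5, 6, 7, 8], [4, 5, 6]
truth = [[x * y for y in b] for x in a]
observed = [row[:] for row in truth]
observed[0][0] = None

for init in ("zero", "mean"):
    estimate = svd_impute(observed, init=init, n_rounds=1)[0][0]
    print(init, "absolute error after one round:", abs(estimate - truth[0][0]))
```

On this toy matrix the mean-initialized run is closer to the hidden value after one round, and both runs converge to it when `n_rounds` is increased; real proteomic data are noisier and of higher rank, so the sketch only shows why a zero start is a poor fit for intensities that sit far from zero.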
Artür Manukyan
The growing number of spatial omic technologies has created a demand for computational tools capable of storing, analyzing, and integrating spatial datasets with multiple modalities and diverse spatial resolutions. Meanwhile, image processing is becoming an integral part of analyzing spatial data readouts where image registration and alignment of tissue sections is essential for accurate spatially aware data integration. Hence, there is a need for computational platforms that process and analyze images of microanatomical tissue structures as well as those that integrate datasets across distinct spatial resolutions and microscopy images. To this end, we have developed VoltRon, an R package for end-to-end spatial omics analysis with comprehensive image processing capabilities and a novel spatial data framework that supports a large selection of distinct spatial resolutions. To connect and integrate spatial multi-omic profiles across adjacent/serial tissue sections, VoltRon incorporates scalable computer vision workflows for automated synchronization of spatial coordinates and images. VoltRon accounts for spatial organization of tissue blocks (samples), layers (sections) and assays given a multi-resolution and agnostic collection of spatial data readouts. Our tool offers a unique data structure that accommodates data readouts with many levels of spatial resolution (i.e., multi-resolution). These include spots, single cells, and molecules as well as regions of interest (ROIs) and even image tiles/pixels that are often ignored in currently available spatial omic analysis platforms. Hence, VoltRon (i) is capable of bulk-like analysis of ROI-based spatial datasets, (ii) paves the way for pixel-level analysis of images and (iii) provides downstream analysis of spatially-resolved subcellular observations.
VoltRon performs information transfer across tissue sections and multiple modalities, which we demonstrate by integrating the pathology driven analysis onto spatially localised SARS-CoV-2 viral RNAs within infected lung tissue using image co-registration of histological images. VoltRon is currently available on GitHub (https://github.com/BIMSBbioinfo/VoltRon) and more information on the VoltRon framework can be found at https://bioinformatics.mdc-berlin.de/VoltRon.
Bernat Gel
Single-cell transcriptomics technologies allow us to explore cellular identities and states at scale, giving us an unprecedented view of tissue composition and cellular diversity. Single-cell data, however, is complex, incomplete and noisy. There are technical and technological limitations affecting our view of each individual cell, such as limited sequencing capacity or failure to completely isolate individual cells prior to their analysis. Most of the technical noise and limitations are identified and managed through the extensive quality control steps often included in single-cell transcriptomics data analysis pipelines. This data is also affected by true biological variation, reflecting real differences in cellular state and identity. One example of such variation is the cell cycle, which might affect the global transcriptional profile of a cell and strongly impact our analysis. There are multiple approaches for the identification of cycling cells, but their management depends on the aims of our analyses, since their analysis will inform on the state of the original tissue but will limit our capacity to identify more subtle expression changes. Another source of cellular gene expression variation is cellular stress. This is an important source of variation because it might represent the state of the original tissue or might be the result of sample handling along the path from tissue resection to analysis, forcing us to detect it and take it into account in our analysis. For this, we developed scYoga, an R package for the identification of cellular stress in single-cell transcriptomics data. The core functionality of the package is based on the detection of the combined overexpression of sets of genes associated with the different cellular stresses.
It can produce a score for each cell and stress type and generate useful diagnostics plots linking stress to any variable (samples, cell types, clusters…) or to their position in dimensionality reduction plots. scYoga leverages the SingleCellExperiment container and the interoperability of the Bioconductor single-cell analysis ecosystem to provide a simple functionality that is easy to incorporate into existing data analysis pipelines.
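A per-cell score for the combined overexpression of a gene set can be sketched in a few lines of Python. This is a generic illustration, not scYoga's scoring method: averaging per-gene z-scores is an assumed choice, and the gene names, expression values and the function name `gene_set_scores` are made up for the example.

```python
from statistics import mean, pstdev

def gene_set_scores(expr, gene_set):
    """Average per-gene z-score across a gene set, one score per cell.

    expr maps a gene name to a list of expression values (one per cell).
    Genes absent from expr are ignored; constant genes contribute 0.
    """
    used = [g for g in gene_set if g in expr]
    if not used:
        raise ValueError("none of the genes in the set are present")
    n_cells = len(expr[used[0]])
    scores = [0.0] * n_cells
    for gene in used:
        values = expr[gene]
        mu, sd = mean(values), pstdev(values)
        for cell, x in enumerate(values):
            scores[cell] += (x - mu) / sd if sd > 0 else 0.0
    return [s / len(used) for s in scores]

# Hypothetical log-expression values for four cells; cell 2 is "stressed".
expr = {
    "FOS":    [1.0, 1.2, 6.0, 0.9],
    "JUN":    [0.8, 1.1, 5.5, 1.0],
    "HSPA1A": [0.2, 0.1, 4.0, 0.3],
    "ACTB":   [9.0, 9.1, 8.9, 9.2],   # not part of the set, ignored
}
stress_set = ["FOS", "JUN", "HSPA1A"]  # illustrative set only
scores = gene_set_scores(expr, stress_set)
top_cell = max(range(len(scores)), key=lambda c: scores[c])
print("highest stress score in cell", top_cell)
```

Scores of this kind can then be summarised per sample, cell type or cluster, which is the sort of diagnostic the abstract describes.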
Claire Rioualen
Bioconductor packages are currently annotated using the biocViews controlled vocabulary (Carey et al., 2024), structured as a graph with nearly 500 terms, of which 175 are meant for software annotation. In order to ensure the consistency of the annotations, an automated validation is performed by BiocCheck upon submission of a new package. This ensures that packages include valid biocViews terms and meet the minimum requirement of at least two non-top-level terms. However, biocViews lacks the formal structure of an ontology and is specific to the Bioconductor project, thereby limiting both the semantic search capabilities and the interoperability of the packages. EDAM (Ison et al., 2013) is a domain ontology specialised in data analysis and management concepts in biosciences, making it a good candidate for the annotation of Bioconductor software packages. Furthermore, EDAM is a key component for the Findability, Accessibility, Interoperability, and Reusability (FAIR principles) within ELIXIR resources, including the ELIXIR Research Software Ecosystem (RSEc) (Ienasescu et al., 2023) and the bio.tools registry (Ison et al., 2019). bio.tools holds a collection of more than 30,000 entries, curated and annotated using EDAM terms, and provides enhanced search capacities through an API for easy programmatic access to the database. In order to increase the visibility and interoperability of Bioconductor packages, as well as the coverage and validity of Bioconductor in the RSEc, we aim to use the EDAM ontology to describe Bioconductor package metadata, and to automate their integration into the bio.tools registry. Our first results towards these objectives include mapping the biocViews vocabulary to EDAM, initiating the curation of a reference set of packages with manual annotations to be used as a gold standard, integrating Bioconductor metadata into the RSEc with automated updates, and prototyping a tool for automated EDAM term suggestions.
Together, these achievements establish a strong foundation for further integration and refinement of Bioconductor package descriptions. They will also improve the integration of these packages within the Galaxy toolshed through the bio.tools registry of software. Beyond advancing the FAIRification of resources for life science research, this initiative builds a collaborative bridge between the Bioconductor, ELIXIR and Galaxy communities.
Dany Mukesha
Alzheimer’s Disease (AD) and Dementia with Lewy Bodies (DLB) are clinically similar, complicating diagnosis. This study uses targeted serum metabolomics and machine learning to identify biomarkers distinguishing AD from DLB. Serum samples from 55 AD, 14 DLB, and 52 healthy controls (HC) were analyzed via mass spectrometry. Machine learning models (Lasso, Random Forest, XGBoost) were trained, incorporating APOE genotypes obtained with an optimized PCR-based genotyping method. AD patients had a higher prevalence of APOE e4/e4 genotypes than DLB patients and controls (p<0.001). Lipid dysregulations, particularly in phosphatidylcholines, were common across groups. Lasso identified 63 key metabolites. Including APOE improved classification: AUC rose from 0.78 to 0.81 for AD vs DLB, and from 0.78 to 0.83 for AD vs HC. DLB vs HC remained at 0.78. These findings support metabolomics and APOE genotyping as tools for distinguishing AD from DLB. Validation in larger cohorts is planned to develop a non-invasive screening toolset to reduce AD and DLB misdiagnosis.
Dewy Nijhof
Autism Spectrum Disorder (ASD) and Attention-Deficit/Hyperactivity Disorder (ADHD) are two prevalent neurodevelopmental disorders (NDDs) that are frequently comorbid, posing significant challenges for diagnosis and treatment. Despite their distinct clinical profiles, they share several overlapping symptoms, suggesting common underlying genetic mechanisms. However, comprehensive research exploring this genetic overlap is still lacking. In our ongoing study we have assembled lists of genes linked to both disorders with curated evidence trails for each. These lists currently contain approximately 1800 genes linked to ASD and 500 genes linked to ADHD. Since synaptic dysfunction is a key feature of both disorders, we are particularly interested in proteins expressed at the neuronal synapse, for which we have a comprehensive model based on over 50 published proteomic studies. The resulting network model integrates both the postsynaptic (4,817 proteins, 27,788 interactions) and presynaptic (2,221 proteins, 8,678 interactions) proteomes. Using the Bioconductor packages BioNAR and SynaptomeDB, we have reconstructed a molecular network (protein-protein interaction) model of the synaptic proteome on which we have overlaid the genetic evidence lines for ASD and ADHD. We can then use these models to probe the molecular mechanisms common to both disorders (comorbidity) as well as identify molecular clusters more specifically associated with either disorder. Our findings aim to inform future research into disorder-specific and overlapping mechanisms and may have implications for biomarker discovery or therapeutic targeting.
Edoardo Filippi
The ever-increasing adoption of different high-throughput omics technologies has led to an abundance of data across various biological layers, including transcriptomics, proteomics, and metabolomics, and such scenarios are becoming increasingly common in many life and medical science applications. Integrating and visualizing these complex multi-omics datasets is crucial for a more comprehensive biological understanding, but remains a significant challenge due to data complexity and heterogeneity. We are developing a tool capable of integrating and visualizing multi-omics data in an efficient and compelling manner, facilitating the exploration of relationships between different biological features and their multiple layers, such as genes, proteins, and metabolites. The goal is to allow the user to easily perform a drill-down analysis on multiple omics layers at the same time, with a focus on an easy and effective interpretation of results that often encompass multiple contrasts. This tool will also interface with different integration methods (e.g. MOFA, moClusters), and additionally enable the integration of results at the level of features and pathways. While still delivering support for reproducible analyses, we intend to have interactive and in-depth exploration (via the Shiny framework) as a key approach to achieve this. We will use our tool to analyze transcriptomics, proteomics, and metabolomics data from patients with systemic lupus erythematosus (SLE), a complex autoimmune disease marked by highly individualized clinical trajectories. This will serve to deepen the understanding of its pathogenesis, identify potential biomarkers, and, in particular, support the development of more personalized therapeutic strategies.
Guillaume Deflandre
Loading, exploring and analysing the resulting Peptide-Spectrum Matches (PSMs) from a database search in Mass Spectrometry (MS)-based proteomics can be time-consuming. PSMatch is an R/Bioconductor package designed to handle this process by offering functionalities to streamline exploration and visualisation of PSM data. It provides functions to load PSM data from mzId or tabular files, generate theoretical fragment ions, model peptide-protein relations and facilitate visualisations. Recent developments in PSMatch have focused on extending these functionalities to support post-translational modifications, enabling more accurate characterisation of modified peptides. Effort in identifying modified peptides is needed, since these peptides are expected to constitute a significant proportion of unidentified spectra. In fields such as single-cell proteomics or metaproteomics, where identification rates are much lower than in bulk approaches, this becomes even more prominent. Enabling users to benefit from a powerful and flexible R ecosystem to further explore these unidentified spectra is therefore paramount. PSMatch is part of the R for Mass Spectrometry initiative, which develops an open and collaborative ecosystem for MS-based proteomics and metabolomics, offering efficient, scalable, and stable infrastructure.
Himanshu Saraswat
Applying bioC packages to identify disease relevant genes and pathways in multi-case families with a complex disease. Himanshu Saraswat1, James C. Slimmer1, Jac C. Charlesworth1, Bruce V. Taylor1, Jessica L. Fletcher1, Kathryn P. Burdon1, Nicholas B. Blackburn1. 1. Menzies Institute for Medical Research, University of Tasmania, Hobart, TAS. Multiple sclerosis (MS) is a complex, chronic autoimmune, neurodegenerative, central nervous system disease influenced by genetics. Around half the risk of developing MS is due to genetics, and large case-control studies have accounted for part of this risk through common genetic variation but have been limited in their ability to identify rare genetic variants associated with MS. Approximately 10-15% of people with MS report a family history. While infrequent, multi-case MS families do occur. Genome sequencing in multi-case MS families is ideal for identifying rare variants that may be associated with MS risk in a family. We hypothesised that rare, likely deleterious, genetic variants segregating with MS contribute to the higher incidence of MS in six multi-case families. To identify these variants and genes that may contribute to MS in these families we combined variants prioritised from genome sequence data with Bioconductor packages. Genome sequence data was analysed with the SAREK nextflow pipeline, along with ‘bcftools’, ‘GATK’, and ‘R’. The resulting variant calls were annotated and filtered to identify rare, likely deleterious variants that segregated with MS in each family. Bioconductor packages and webtools including ‘enrichViewNet’, ‘goseq’, ‘STRING’, and ‘GeneMANIA’ were used to analyse the resulting gene lists from each family to identify overrepresented biological pathways across families. Within each family, the number of variants prioritised for further analysis ranged from nine to 224. The number of variants depended on the size of the family studied.
Across all six families, a total of 712 variants in 680 genes were identified. Narrowing down which specific variants may increase the risk for MS among the identified variants is challenging. At the gene level compelling candidates were identified in each family, with known biology supporting a putative role in MS. There was also an overlap at the gene level across families, with variants in 21 genes found in two or more families. Using these gene lists we first examined genes with protein coding variants. We used Bioconductor packages to identify overlapping biological pathways in these gene lists across families. An overrepresentation of genes with rare, likely deleterious variants was identified in the following three gene ontology (GO) molecular function terms: cytoskeletal protein binding (GO:0008092, P=0.00157), microfilament motor activity (GO:0000146, P=0.0162), actin binding (GO:0003779, P=0.010526). The analysis was then extended to include genes that had non-coding variants. Non-coding variants may also contribute to disease risk, but their functional effect may be more difficult to predict initially. This increased the number of overrepresented GO terms identified to 90, and also identified terms in phenotype and pathway ontologies. These included neurogenesis (GO:0022008, P=6.205e-08), somatodendritic compartment (GO:0036477, P=2.81e-04), cytoskeletal protein binding (GO:0008092, P=5.99e-04), and spinal dysraphism (HP:0010301, P=2.65e-02). Through the application of Bioconductor packages this analysis has identified evidence that suggests genes involved in pathways relevant to neuronal function are involved in MS risk. This expands our understanding of the genetic architecture of MS risk.
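The idea behind an overrepresentation P value for a GO term can be sketched with a plain hypergeometric tail in Python. This is a generic illustration and not the authors' pipeline: packages such as goseq apply their own corrections, and the universe size, term size and overlap used below are hypothetical (only the 680 genes come from the abstract).

```python
from math import comb

def ora_p_value(universe, in_term, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `in_term` are annotated to the term."""
    total = comb(universe, selected)
    upper = min(in_term, selected)
    tail = sum(
        comb(in_term, k) * comb(universe - in_term, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 background genes, a term with 200 genes,
# 680 prioritised genes, 20 of which fall in the term.
p = ora_p_value(universe=20000, in_term=200, selected=680, overlap=20)
print(f"one-sided overrepresentation P = {p:.3g}")
```

When many terms are tested, as with the 90 GO terms mentioned above, the raw P values would normally be adjusted for multiple testing.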
Hiranyamaya Dash
MotifPeeker provides a novel benchmarking framework for epigenomic datasets where no established “gold standard” reference exists, using transcription factor motif enrichment as a principal evaluation metric. Traditional approaches often require comparison against a reference dataset—typically ChIP-seq data—under the unsubstantiated assumption that older methods inherently produce more reliable results. Our approach eliminates this requirement by instead measuring the biological relevance of peak calls through motif occurrence analysis. With minimal input parameters, users can analyse and compare multiple epigenomic datasets in a single function call, receiving a comprehensive and intuitive HTML report detailing relative performance metrics. MotifPeeker seamlessly integrates with existing Bioconductor workflows and supports various peak file formats, making it accessible to researchers with limited computational experience. This tool addresses a critical need in the epigenomics community for objective comparison methods as new techniques like CUT&Tag and TIP-Seq continue to emerge, enabling more robust experimental design and interpretation of results across diverse epigenomic profiling methods.
Jasper Spitzer
To visualise transcriptomic and other highly variable data, the heatmap has become one of the default approaches. While different options exist, some of them already within the Bioconductor community, none is as easy to use and workflow-agnostic as this solution: here, we present simpleHM, an R package to facilitate heatmap visualisations, leveraging the popular “tidyverse” packages to produce plots that are easy to create and easy to customise. In contrast to other approaches, this package delivers a ggplot object, allowing for infinite tweaking and customisation without compromising on clustering, dendrograms or annotations.
Jiaqi Ni
Background and Objectives: The diet-microbiota-gut-brain axis is emerging as a promising research area for preventing neurodegenerative disorders. We examined associations between nut consumption, cognitive function changes over 6 years, and gut microbiota composition in older adults at risk of cognitive decline. Material and methods: This prospective study included participants from the PREDIMED-Plus trial with overweight/obesity and metabolic syndrome who provided baseline stool samples, dietary information, and at least one follow-up cognitive assessment. Nut consumption, assessed at baseline using a validated food frequency questionnaire, was categorized as ≤1, 1–3, 3–7, and >7 servings/week. Cognitive function was evaluated at baseline, 2, 4, and 6 years using a comprehensive battery of neuropsychological tests. Gut microbiota composition was profiled via 16S rRNA amplicon sequencing. Multivariable linear mixed-effects regression and linear regression models adjusting for potential confounders were used. Results: Among 747 participants (mean age at baseline 65±5 years, 48% women) included in the final analysis, those consuming 3–7 servings of nuts/week showed significantly slower declines in global cognitive function compared to those consuming ≤1 serving/week (4 years: β[95%CI]=0.170[0.022,0.319], p=0.024; 6 years: 0.176[0.020,0.331], 0.027). Nut consumption was associated with higher gut microbial diversity (Shannon index 3–7 vs ≤1 serving/week: 0.211[0.008,0.414], 0.042) and modestly distinct microbial patterns (p=0.047). Thirteen taxa, including Lachnospiraceae UCG-004 and Roseburia, were associated with nut consumption, with Lachnospiraceae UCG-004 further associated with positive changes in global cognitive function (2 years: 0.020[0.004,0.036], q=0.050) and slower decline in attention (6 years: 0.042[0.020,0.064], q=0.001). Conclusions: Moderate nut consumption was associated with both cognitive preservation and favorable gut microbiota composition.
Further longitudinal studies and clinical trials are warranted.
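The Shannon index used above as the diversity measure can be computed from taxon counts as follows. This is a minimal Python sketch with made-up counts; the natural logarithm is an assumption, since the abstract does not state the log base.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical 16S read counts per taxon for two samples.
even_sample = [250, 250, 250, 250]      # four equally abundant taxa
skewed_sample = [970, 10, 10, 10]       # one dominant taxon
print(shannon_index(even_sample), shannon_index(skewed_sample))
```

An even community scores higher than one dominated by a single taxon, which is the sense in which a positive coefficient for the Shannon index indicates higher diversity.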
Kateřina Matějková
Accurate characterization of splice junctions is critical for understanding the role of alternative mRNA splicing and for interpretation of DNA variants that cause aberrant splicing events. Here, we present safRa, an R-based computational tool that annotates alternative/aberrant splice junctions and quantifies their relative abundance compared to the corresponding canonical wild-type mRNA. Built using the syntax and logic of the tidyomics ecosystem, safRa provides a streamlined interface compatible with the tidyverse and the Bioconductor framework, supporting readable code and seamless integration into bioinformatics workflows. The tool accepts splice junction coordinates and RNA-seq coverage data and supports custom transcript annotations, providing flexibility beyond default references such as MANE Select. A core feature of safRa is its ability to identify wild-type counterparts for alternative junctions and compute their relative usage, offering biologically anchored quantification. It further annotates each splice junction with genomic context, coding potential, exon involvement, and predicts functional outcomes through nonsense-mediated decay (NMD) analysis. The implementation of safRa as a Bioconductor package provides an accessible, up-to-date solution that eliminates technological barriers for investigators in research and clinical diagnostics without extensive programming experience. Aligned with ACMG/AMP recommendations from the ClinGen SVI Splicing Subgroup, safRa supports evidence-based variant interpretation. By delivering detailed and accessible splicing analysis, it facilitates RNA-seq interpretation in both research and diagnostic settings.
Marta Sevilla Porras
Uniparental disomies (UPDs) are rare chromosomal abnormalities in which an individual inherits both copies of a chromosome, or a segment thereof, from a single parent. This genetic anomaly can contribute to disease through recessive mutations or disruptions in genomic imprinting. Despite their clinical relevance, UPDs remain underexplored due to limitations in current detection methods applied to next-generation sequencing (NGS) data. We introduce UPDhmm, a novel bioinformatics tool designed to detect UPD events using NGS data from trios (proband and both parents). By employing a Hidden Markov Model (HMM) to distinguish standard Mendelian inheritance from UPD patterns, UPDhmm enables the accurate identification, classification, and localization of UPD events. Benchmark evaluations using simulated data from both exome and whole-genome sequencing datasets demonstrated that UPDhmm outperforms existing tools, such as UPDio and AltAFplotter, particularly in detecting smaller UPD events while minimizing false positives and accurately classifying UPD subtypes (isodisomy and heterodisomy). A key advantage of UPDhmm is its ability to model both Mendelian inheritance and UPD patterns with enhanced sensitivity, particularly in consanguineous datasets where runs of homozygosity (ROH) can confound detection. By reducing reliance on ROH, UPDhmm provides more accurate results. Its capacity to analyze both exome and whole-genome sequencing data makes it highly applicable in clinical contexts. Additionally, UPDhmm precisely identifies the genomic coordinates of UPD events, a critical feature often lacking in alternative methods. UPDhmm’s efficiency was further validated through its application to real datasets, including data from the Simons Simplex Collection (SSC), a large autism spectrum disorder (ASD) cohort. Two notable UPD events exceeding 10 Mb were exclusively detected in affected children. 
One case revealed a paternal isodisomy on chromosome 8, overlapping genes CHD7 and VPS13B, both linked to neurodevelopmental disorders. Another case identified a maternal heterodisomy on chromosome 22, involving the ASD-related gene SHANK3 and several imprinted genes. These findings highlight UPDhmm’s potential for discovering clinically relevant UPD events. UPDhmm is available as an R/Bioconductor package, providing accessibility for researchers and clinicians working with NGS datasets. Its performance on both simulated and real datasets underscores its value for uncovering rare genetic variations with clinical significance in undiagnosed rare disease patients, making it a powerful tool for research and diagnostics.
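How a Hidden Markov Model can localise a stretch of non-Mendelian inheritance along a chromosome can be sketched with a toy Viterbi decoder in Python. This is not UPDhmm's model: the two states, the two observation classes and every probability below are invented for illustration, whereas the package itself also classifies UPD subtypes.

```python
import math

STATES = ("normal", "upd")
# All probabilities are hypothetical values chosen for the toy example.
START = {"normal": 0.99, "upd": 0.01}
TRANS = {
    "normal": {"normal": 0.95, "upd": 0.05},
    "upd": {"normal": 0.05, "upd": 0.95},
}
EMIT = {
    "normal": {"mendelian": 0.99, "violation": 0.01},
    "upd": {"mendelian": 0.30, "violation": 0.70},
}

def viterbi(observations):
    """Return the most likely state path for a sequence of observations."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Trio genotypes along a chromosome, reduced to whether each variant is
# consistent with Mendelian inheritance or violates it.
calls = ["mendelian"] * 4 + ["violation"] * 4 + ["mendelian"] * 4
print(viterbi(calls))
```

The decoded path switches to the "upd" state only for the run of violations, so the boundaries of the block give approximate coordinates for the event; isolated violations, for example from genotyping errors, are less likely to trigger a switch because entering and leaving the state is penalised.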
Qiao Yang
Intrinsic molecular subtyping of breast cancer into categories such as Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like is fundamental for understanding tumor biology and guiding personalized treatment strategies. However, research implementations of these subtyping methods are fragmented across various tools, leading to inconsistencies and reduced reproducibility. We present BreastSubtypeR, an R/Bioconductor package that integrates ten established subtyping approaches—including the original PAM50 method—into a unified BS_Multi function. A key feature, the AUTO mode, dynamically selects the most appropriate set of methods based on cohort-specific estrogen receptor (ER)/HER2 distributions, thereby minimizing assumption violations. An Entrez ID–based gene mapping routine ensures robust cross-platform compatibility. For users without programming expertise, iBreastSubtypeR, a local R Shiny application, provides an intuitive graphical user interface encapsulating the core functionalities. By consolidating multiple intrinsic subtyping approaches within the Bioconductor ecosystem, BreastSubtypeR enhances consistency, reproducibility, and accessibility in breast cancer research. Its cohort-adaptive AUTO mode and optimized gene mapping facilitate robust analyses across diverse datasets, while the Shiny interface broadens usability among researchers with varying levels of computational proficiency. BreastSubtypeR and its companion Shiny application, iBreastSubtypeR, are freely available through Bioconductor (https://bioconductor.org/packages/release/bioc/html/BreastSubtypeR.html). The complete source code is also hosted on GitHub at https://github.com/JohanHartmanGroupBioteam/BreastSubtypeR under the GPLv3 license. Comprehensive documentation and an example dataset are provided to facilitate user adoption.
Roger Olivella
We present ribomsqc, an automated quality control (QC) pipeline tailored for ribonucleoside analysis by mass spectrometry. Designed to address the challenges of manual analysis of QC samples in large-scale experiments, ribomsqc relies on the Bioconductor R package MSnbase for robust handling of raw MS data, enabling consistent monitoring of instrument performance. Built with Nextflow and aligned with nf-core standards, the pipeline integrates chromatogram-based metrics and MultiQC reports for comprehensive visualization of key analytical figures. ribomsqc offers a scalable and reproducible solution for QC in post-transcriptional RNA modification research, making it particularly relevant to the Bioconductor community.
Sara Baldinelli
Through the use of a heat-inducible CRISPR/Cas9 system, we can perturb genes otherwise essential during development in an adult context. Scaling up this experimental approach is the central tool of the DECODE project, founded with the aim of systematically mapping context-dependent genetic networks across dynamic tissues in vivo. Drosophila melanogaster serves as our model system, in which we specifically perturb only cells expressing intestinal stem cell markers. By integrating single-cell transcriptomics and high-resolution imaging of thousands of conditional knockouts, DECODE aims to resolve how genetic networks dynamically adapt their topology across cell types and external stimuli. To achieve organism-scale profiling, we utilize single-cell combinatorial indexing RNA sequencing (sci-RNA-seq), a split-pool barcoding strategy that exponentially scales throughput while reducing costs to ~1 cent per cell. With the transition from whole-cell to single-nucleus profiling, this method is able to avoid challenges in cell dissociation and minimize tissue-specific biases. Even though nuclear RNA lacks cytoplasmic transcripts, nuclei isolation enables universal application across tissues, including RNase-rich adult samples, and reduces dissociation-induced stress artifacts compared to scRNA-seq. Despite the mentioned loss of cytoplasmic transcripts, snRNA-seq retains sufficient biological signal to characterize population heterogeneity and often achieves a higher signal-to-noise ratio by avoiding sequencing of abundant mitochondrial and ribosomal transcripts. A careful optimization of sci-RNA-seq is required to mitigate artifacts like index hopping and ensure high-quality data. QC metrics such as UMI counts, gene detection, and duplication rates are lower compared to droplet-based methods but are sufficient for robust population characterization when improved. Indeed, a recent optimization to sci-RNA-seq (sci-RNA-seq3) has enhanced sensitivity. 
Understanding the challenges and limitations of sci-RNA-seq remains an active area of investigation and, through this work, we aim to further evaluate its performance and benchmark it against gold-standard methods.
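To illustrate why split-pool barcoding scales throughput exponentially, the short sketch below counts the barcode combinations available after several rounds of indexing. The plate size (96 wells) and the number of rounds (3) are illustrative assumptions, not parameters reported in the abstract, and the function name is hypothetical.

```python
def barcode_combinations(wells_per_round, rounds):
    """Number of distinct barcode combinations after split-pool indexing.

    Each round assigns one of `wells_per_round` barcodes to every nucleus,
    so the combined barcode space grows as wells_per_round ** rounds.
    """
    if wells_per_round < 1 or rounds < 1:
        raise ValueError("wells_per_round and rounds must be positive")
    return wells_per_round ** rounds


# Assumed setup: three rounds of indexing in 96-well plates.
for n_rounds in (1, 2, 3):
    print(n_rounds, barcode_combinations(96, n_rounds))
```

Each additional round multiplies the barcode space by the number of wells, which is why adding a round is far cheaper than adding wells.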
Sara Potente
Introduction: Shallow Whole Genome Sequencing (sWGS) has become a cost-effective method for genomic analysis, particularly for identifying copy number alterations (CNAs). However, the lack of standardized pipelines for sWGS data analysis hampers the robustness and reproducibility of results. To address this gap, we have developed SAMURAI (Shallow Analysis of copy nuMber alterations Using a Reproducible And Integrated bioinformatics pipeline), a modular, reproducible pipeline tailored for sWGS CNA analysis. Methods: SAMURAI was built in compliance with nf-core community guidelines and is fully containerized to ensure reproducibility and scalability across computing environments. The pipeline comprises modular blocks for data pre-processing, copy number analysis, and reporting. Users can specify the sample type, either solid tumor or liquid biopsy (e.g., circulating tumor DNA), with tailored tools integrated for each context. Results: The pre-processing module standardizes input preparation before branching into CNA detection, which uses a suite of state-of-the-art algorithms and custom scripts. SAMURAI includes many R/Bioconductor packages (QDNAseq [1], ASCAT.sc [2], CINSignatureQuantification [3], ichorCNA [4], Maftools [5]), harmonizing outputs for streamlined downstream analysis. A final report provides detailed results useful for data interpretation. Validation with both simulated and real-world clinical datasets showed high concordance with ground truth CNAs and consistent performance across multiple conditions, demonstrating its robustness and reliability. Discussion: SAMURAI enables efficient and reproducible CNA detection from sWGS data, particularly in cancer genomics. Its modular structure and R/Bioconductor core make it accessible and customizable, providing researchers with a reliable framework for both exploratory and clinical genomic analysis.
Conclusion: SAMURAI offers a reliable and versatile pipeline for analyzing CNAs from sWGS data. Its adherence to community guidelines and containerized software ensures reproducibility and scalability, making it a helpful tool for researchers in diverse environments and potentially contributing to advancements in precision medicine. References 1. Scheinin I, Sie D, Bengtsson H, et al. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Res. 2014; 24:2022–2032 2. VanLoo P. ASCAT.sc. 2021; 3. Drews RM, Hernando B, Tarabichi M, et al. A pan-cancer compendium of chromosomal instability. Nature 2022; 606:976–983 4. Adalsteinsson VA, Ha G, Freeman SS, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat. Commun. 2017; 8:1324 5. Mayakonda A, Lin D-C, Assenov Y, et al. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 2018; 28:1747–1756
Stefania Pirrotta
Mitochondria are highly dynamic organelles essential for energy production, biosynthetic processes, and cellular signaling. They actively process biological information, sensing and responding to both internal and external cues. Through complex physical interactions and diffusion mechanisms within cellular networks, mitochondria integrate diverse signals to finely modulate cellular functions and overall physiology. Consequently, impairments in mitochondrial function can result in a wide spectrum of phenotypic manifestations. High-throughput transcriptomic technologies are capable of capturing these mitochondria-related changes. However, conventional pathway analysis tools based on general-purpose databases often fall short in detecting specific mitochondrial alterations. This limitation is largely due to the broad nature of pathways, where mitochondrial components typically represent only a minor fraction of the overall signaling cascades. To address this gap, we developed mitology, an R package specifically designed for the analysis of mitochondrial activity through transcriptomic data. We began by compiling a comprehensive list of genes encoding proteins localized to mitochondria, drawing from specialized resources such as MitoCarta3.0, IMPI, MSeqDR, and Gene Ontology entries whose terms contain the stem “mitochondri-”. Building on this curated gene set, mitology offers ready-to-use implementations of MitoCarta3.0 pathways and introduces a restructured view of general pathway databases like Reactome and Gene Ontology, reorganized to highlight mitochondria-related processes. These mitochondria-focused gene sets serve as the foundation for pathway analysis, enabling single-sample evaluations that are both targeted and sensitive to mitochondrial signals.
Additionally, mitology is compatible with a range of transcriptomic data types - from classical bulk RNA-seq to cutting-edge single-cell and spatial transcriptomics - allowing researchers to explore mitochondrial dynamics across diverse biological contexts. Further, it includes several graphical functions for easy interpretation and informative visualization of results. As such, mitology stands out as a versatile and powerful tool for in-depth investigation of mitochondrial function through gene expression data.
Thu. - Sep. 18, ’25
Adrian Hernandez-Cacho
The gut microbiota plays a potential role in the pathophysiology of depression through the gut-brain axis. This cross-sectional study in 400 participants from the PREDIMED-Plus study investigates the interplay between gut microbiota and depression using a multi-omics approach. Depression was defined as antidepressant use or high Beck Depression Inventory-II scores. Gut microbiota was characterized by 16S rRNA sequencing, and faecal metabolites were analysed via liquid chromatography-tandem mass spectrometry. Participants with depression exhibited significant differences in gut microbial composition and metabolic profiles. Differentially abundant taxa included Acidaminococcus, Christensenellaceae R-7 group, and Megasphaera, among others. Metabolomic analysis revealed 15 significantly altered metabolites, primarily lipids, organic acids, and benzenoids, some of which correlated with gut microbial features. This study highlights the interplay between the gut microbiota and depression, paving the way for future research to determine whether gut microbiota influences depression pathophysiology or reflects changes associated with depression.
Alina Jenn
Single-cell transcriptomics has become a valuable approach to investigating tissues with heterogeneous cell populations. One of the fundamental steps in the analysis workflow is data annotation, where each cell is assigned informative labels about its type, state or identity. Manually reviewing single-cell RNA sequencing data and annotating based on the individual gene expression profiles is slow and laborious, yet this is still regarded as the gold standard in the field. Most automated annotation workflows follow one of three main strategies: reference-based, marker-based, or model-based. Still, the multiplicity of tools and options available makes it challenging for users to become proficient at running them and to efficiently combine their outputs for downstream steps. To address this lack of a standardized annotation tool, we are currently developing scBaptism, a tool that unifies suitable annotation programs, leveraging a single framework (SingleCellExperiment) to store and integrate the relevant cell label information. Additionally, we provide visualizations that deliver insights by directly comparing results from different tools, showing similarities and differences among tools, methods, and references (e.g. inconsistencies across various methods, variations across annotations/references). scBaptism can be an ideal companion for defining appropriately discrete groups and clusters of cells, which will be fundamental for interpreting the data after differential state analyses (differential expression, differential abundance) are performed. With our tool, single-cell annotation will be more accessible and require less manual input while providing the researcher with more comprehensive results through one software interface.
Anastasiya Boersch
Over the past decade, flow cytometry technologies have advanced significantly, now enabling the simultaneous measurement of up to 50 parameters per cell and offering great analytical potential. However, this increase in dimensionality also presents substantial challenges, especially when working with large numbers of samples, making the traditional analysis tool FlowJo less suited for handling such complexity. In contrast, Bioconductor provides multiple R packages and workflows that support comprehensive flow cytometry data analysis, including automated data-driven gating and rich visualization capabilities. Nonetheless, the heterogeneity of data structures and often limited documentation across packages can hinder seamless integration and usability. To address these challenges, we developed an R workflow that streamlines key steps of flow cytometry data analysis and ensures the reproducibility of the results: compensation (optional), transformation, quality control, batch correction (optional), gating, and the extraction of gating statistics. This workflow is designed to run locally, making advanced cytometric analysis more accessible and customizable for researchers.
Andrea Mock
Tumor-draining lymph nodes (TDLNs) are key sites of immune regulation and early metastasis in breast cancer, yet their role in predicting therapy response in hormone receptor-positive (HR+) subtypes remains poorly characterized. Leveraging Imaging Mass Cytometry (IMC), we analyzed spatially-resolved protein expression across 67 HR+ breast cancer patients (28 Luminal A, 39 Luminal B), profiling 35 markers in primary tumors, metastatic, and non-metastatic lymph nodes. To address the computational demands of single-cell spatial data, we applied geometric sketching to enhance representation of rare immune cell types and accelerate downstream analysis. Louvain clustering and kNN-based label transfer enabled scalable and accurate cell-type annotation. Using Bioconductor-compatible tools, including CATALYST, cytomapper, BiocNeighbors and sketchR, we identified spatially enriched immune cell subsets and cellular neighborhoods predictive of recurrence. Notably, B cells, regulatory T cells, and cytotoxic T cells were associated with recurrence risk, while specific spatial niches were predictive of therapy response. Our findings highlight the prognostic value of spatial immune architecture in TDLNs and set the stage for benchmarking against other deep learning-based predictive models. This work underscores the utility of Bioconductor-driven spatial analysis for biomarker discovery in translational oncology.
Axel Klenk
GSVA (https://bioconductor.org/packages/GSVA) is an R/Bioconductor package that enables pathway-centric analyses of data produced by high-throughput molecular profiling technologies. The interpretation of biological findings from such data is one of the cornerstones of biomedical research, and GSVA facilitates that goal by performing a conceptually simple but powerful change in the functional unit of analysis, from genes to gene sets. Here we describe our efforts to adapt GSVA to data produced at single-cell and spatial resolution, increasing its robustness and scalability, and improving the user interface and documentation. By using pathways instead of gene-centric features, this new version of GSVA helps improve exploratory data analysis and supports the development of lower-dimensional statistical and machine learning models on data from single-cell and spatial transcriptomics experiments.
Charlotte Soneson
Following the Bioconductor package guidelines, all packages submitted for distribution via the project are required to have several layers of documentation, including manual pages for functions as well as long-form vignettes outlining how the package can be used to perform analyses. However, in order to take advantage of the rich information provided in these documentation sources, a user must first locate the right package. Given the large number of packages available in Bioconductor (more than 2,300 in the current release), this is not always an easy task, especially for a user who may have just started analysing a certain type of data. Recently, the Bioconductor Training Committee initiated a collection of Bioconductor ‘How To’ documents, building further on a set of task-specific sections available for many years in the GenomicRanges vignette. These How To documents are short and aimed at describing how to perform very specific tasks, such as reading paired-end reads from a bam file, reading mass spectrometry data, or retrieving a gene model, using one or multiple Bioconductor packages. Each document shows an example, and points the reader to other resources (e.g., package vignettes) where more information and further details are provided. The growing collection of How To documents is available online at https://bioconductor.github.io/BiocHowTo/articles/index.html. With this poster, we will describe the initiative and a call for contributions from the community.
Charlotte Soneson
The Bioconductor project currently distributes more than 2,300 software packages. While some of these, especially those focusing on fundamental infrastructure and functionality, are written by the Bioconductor core team, most are contributed by community members. An explicit goal of Bioconductor is that packages are intended and expected to work together in a concerted way. A consequence of this is that any new package must adhere to a set of guidelines to enable a seamless interaction with the existing ecosystem. An extensive contributor’s guide (https://contributions.bioconductor.org/) is available to help developers navigate these guidelines. In this poster, we provide summarized, accessible guidance for prospective package contributors in the form of eleven quick tips for writing a Bioconductor package. The tips focus specifically on the aspects of the contributor guidelines that, in our experience, are most helpful to keep in mind throughout the development process in order to enable a smooth submission experience.
Dario Righelli
The rapid expansion of single-cell RNA sequencing (scRNA-seq) technologies has increased the need for robust and scalable clustering evaluation methods. To address these challenges, we developed robin2, an optimized version of our R package robin. It introduces enhanced computational efficiency, support for high-dimensional datasets, and harmonious integration with R’s base functionalities for robust network analysis. Our tests demonstrate that robin2 reduces computational time by ninefold on large-scale scRNA-seq datasets using parallel processing with 12 cores. It offers improved functionality for clustering stability validation and enables systematic evaluation of community detection algorithms across various resolutions and pipelines. The application of robin2 to the Tabula Muris dataset confirmed its capability to identify biologically meaningful cell subpopulations with high statistical significance. The robin2 package is freely available on CRAN at https://CRAN.R-project.org/package=robin. Comprehensive documentation and a detailed analysis vignette are accessible on GitHub at https://github.com/drighelli/scrobinv2.
Diego M. Fernandez-Aroca
Introduction: Pathological cardiac hypertrophy is widely linked to lifestyle and environmental factors, and is a risk factor for life-threatening conditions such as heart failure. GWAS population studies have clearly implicated non-coding regulatory genomic regions in the development of cardiac hypertrophy, and a number of mechanistic investigations suggest a role for epigenetic mechanisms as well as DNA damage in mediating hypertrophy. However, the identity and activity of epigenomic regions deployed during this pathological process have not been studied in detail, in particular cell-type-specific epigenetic changes. In this ongoing project, we aim to study the interplay between DNA damage and epigenomic dysregulation associated with the development of cardiac hypertrophy, using established in vivo and in vitro models. Methods: We have applied a cardiac hypertrophy induction protocol to a transgenic mouse reporter model (α-MHC-mCherry), chronically treated with Angiotensin II (Ang-II, 1 µg/kg/min) over 14 days. Induction of cardiac hypertrophy was measured by analysing heart weight, induction of hypertrophy-related gene markers, and cell size in histological sections. In this model, we performed: i) single-cell RNAseq and ii) epigenomic profiling of H3K27ac in mouse cardiomyocytes (CM) and non-cardiomyocytes (non-CM) by ChIP sequencing, iii) epigenomic profiling of H3K27ac in iPSC-CMs by CUT&Tag. As a complementary approach, we are employing an in vitro model of human iPSCs differentiated to cardiomyocytes (iPSC-CM) to study the impact of DNA damage modulation on the epigenome during cardiac hypertrophy, using endothelin-I (ET-1) and pharmacological inhibition approaches. iPSC-CM cell size changes are evaluated by automated high-content microscopy and automated image analysis (ilastik + CellProfiler). Results: Mice treated with Ang-II show a clear induction of hypertrophy-related gene markers as well as an increase in cell size.
Single-cell RNAseq data show a cell-type-specific transcriptomic response to Ang-II. Consistently, differential binding analysis showed 766 differentially H3K27ac-bound sites in CMs and 2061 in non-CMs, most of them closely related to cardiac hypertrophy biological processes, as shown by Gene Ontology enrichment analysis. In vitro, treatment of iPSC-CMs with ET-1 leads to a clear induction of hypertrophy-related gene markers as well as to an increase in cell size. We detected induction of DNA damage and, interestingly, pharmacological inhibition of ATM reverses CM hypertrophy induced by ET-1. The H3K27ac changes observed by CUT&Tag during induction of cardiac hypertrophy are partially reversed after inhibition of ATM. Conclusions: We have validated Angiotensin-II-infused mice and endothelin-I-treated iPSC-CMs for the study of cardiac hypertrophy at the molecular and cellular level. Induction of cardiac hypertrophy through Angiotensin-II induces broad (epi)genomic dysregulation in different cardiac cell types. Further ongoing epigenomic profiling in iPSC-CMs after modulation of DNA response will deliver novel insights into the role of DNA damage in this pathology and the underlying epigenomic dysregulation.
Elena Zuin
Single-cell datasets often include samples from multiple laboratories and conditions, leading to complex batch effects. This unwanted technical variation overlaps with biological effects of interest and confounds downstream analyses. A key challenge in the study of single-cell data is to align various datasets while correctly preserving biological variation. Batch effect correction is an important step in single-cell data analysis when creating a new atlas or identifying cell types, in particular rare cell types. Existing methods are based on different mathematical approaches and, unfortunately, produce highly dissimilar results. To gain a deeper understanding of their strengths and weaknesses, we performed a benchmark study using a selection of the most popular and novel approaches available in the R and Python ecosystems. We evaluated the performance of these methods in terms of their ability to remove batch effects while preserving biological signals. Initially, we based our assessment on popular metrics such as Local Inverse Simpson’s Index, Average Silhouette Width, and Adjusted Rand Index. However, these metrics frequently yielded inconsistent results, hindering our ability to identify the best-performing methods. To address this limitation, we propose the Wasserstein distance as an alternative evaluation metric, providing a global summary of batch differences in a reduced dimensionality space. Lastly, we created an R package that contains a common interface for single-cell batch correction methods in R and Python environments to facilitate further studies.
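As a rough illustration of the proposed metric, the sketch below computes a one-dimensional Wasserstein-1 distance between two batches along each axis of a reduced-dimensionality embedding and averages the per-axis values. The abstract does not state how the distance is computed in practice, so the per-axis averaging, the function names, and the toy coordinates are assumptions made for illustration only.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1D empirical distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical cumulative distribution functions of the two samples.
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(xs + ys)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def batch_distance(batch_a, batch_b):
    """Average per-axis Wasserstein-1 distance between two embedded batches.

    Each batch is a list of cells; each cell is a tuple of coordinates
    (for example principal components). Averaging over axes is an
    illustrative simplification, not the method used in the study.
    """
    n_dims = len(batch_a[0])
    per_axis = [
        wasserstein_1d([cell[d] for cell in batch_a],
                       [cell[d] for cell in batch_b])
        for d in range(n_dims)
    ]
    return sum(per_axis) / n_dims


# Toy 2D embeddings: batch_b is batch_a shifted by 1 along the first axis.
batch_a = [(0.0, 0.0), (1.0, 0.0)]
batch_b = [(1.0, 0.0), (2.0, 0.0)]
print(batch_distance(batch_a, batch_b))  # 0.5
```

A value near zero after correction would indicate that the batches overlap in the embedding, while a large value points to residual batch separation.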
Eliana Ibrahimi
The gut microbiome is increasingly recognized as a key modulator in the early pathogenesis of cardiovascular diseases such as ischemic heart disease (IHD), offering both diagnostic potential and mechanistic insights before clinical symptom onset. Translating this promise into clinical tools requires advanced computational methods capable of extracting meaningful patterns from complex and noisy microbiome data. In this study, we compare three supervised machine learning algorithms, Lasso-regularized Generalized Linear Models (GLM), Random Forests (RF), and Fuzzy Forests (FF), for distinguishing IHD patients from healthy controls, using gut microbiome profiles obtained from the Metacardis project (metacardis.net). The dataset included 375 patients with IHD and 275 healthy controls. Preprocessing included normalization, filtering of low-abundance taxa, and stratified train and test splitting. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), with FF outperforming RF and Lasso-GLM in models classifying IHD from healthy subjects (AUC = 0.83 vs. 0.74 and 0.71, respectively). The results from the FF algorithm are consistent with previously reported studies, which emphasize significant alterations in the abundance of various microbial taxa in IHD patients. Taxa such as Prevotella, Bacteroides, and Ruminococcus were among those exhibiting marked differences in abundance between IHD patients and healthy controls, suggesting their potential role in the disease’s etiology as previously reported. The ability of FF to identify key microbial taxa contributing to this separation provides valuable insights into potential biomarkers for early diagnosis and therapeutic targets. The superior performance of FF highlights its robustness in handling uncertainty and high dimensionality inherent in microbiome data, making it a promising tool for early IHD prediction and microbiome-related biomarker discovery.
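For readers less familiar with the evaluation metric, the following minimal sketch computes the AUC directly from its probabilistic definition: the chance that a randomly chosen case receives a higher score than a randomly chosen control. The function name, labels, and scores are hypothetical toy values and are unrelated to the Metacardis data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: 1 for cases (e.g. IHD), 0 for controls.
    scores: predicted probabilities or decision values, same order as labels.
    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example: two cases and two controls.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases and controls.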
Enes Sefa Ayar
Proteins are the key molecules in executing biological functions, and they cooperate as part of protein complexes or biological pathways. Correlation in their abundance suggests functional interdependence, offering insights into biological functions. Thus, identifying biologically meaningful protein groups (modules) is a critical step in understanding cellular processes. While many module detection methods exist, they were developed for bulk or transcriptomic data and rely on the assumption that gene expression levels can identify functionally related protein groups. Single-cell proteomics (SCP) quantifies protein levels at single-cell resolution, eliminating this assumption and offering a more accurate view of functional protein relationships. Moreover, SCP preserves cellular heterogeneity, enabling the discovery of dynamic and context-specific protein modules that are often masked in bulk measurements. Despite these advantages, existing module detection methods may not be well suited for SCP data, which presents unique challenges such as batch effects and missing values. Moreover, these methods differ in several respects, for instance, whether they incorporate prior biological knowledge, whether they allow overlapping modules, and to what extent they use differential correlation analysis. Parameter choices further influence the identified modules, often leading to arbitrary decisions. However, all share a more critical limitation: they generate modules even when applied to random data. It is therefore essential to distinguish biologically relevant modules from artifacts. In this work, we systematically evaluate module detection methods on SCP datasets. Our assessment framework integrates (1) internal clustering metrics to evaluate compactness and separation, (2) external validation against known biological annotations, and (3) network-based analyses incorporating protein-protein interaction data to enhance biological interpretability.
Our findings reveal notable differences in the biological relevance of the identified modules and offer practical recommendations for selecting and validating module detection approaches for single-cell proteomics. Furthermore, we propose strategies for addressing missing values and batch effects, thereby improving the accuracy and reliability of module detection.
Fjona Lami
Understanding how the gut microbiome influences obesity is a growing area of interest, but analyzing complex, high-dimensional microbiome datasets can be challenging. Machine learning (ML) models are powerful tools for finding patterns in microbiome data, yet they often act as “black boxes,” making it hard to understand why certain predictions are made. In this study, we applied explainable artificial intelligence (XAI) techniques using curatedMetagenomicData (via Bioconductor/R) from individuals with varying body mass index (BMI) to study the microbial drivers behind obesity predictions. To ensure reliable input for machine learning models, we applied several preprocessing steps. Rare taxa present in fewer than 10% of the samples were filtered out to minimize noise. To address the compositional nature of microbiome data, feature counts were transformed using the centered log-ratio (CLR) transformation after the addition of a small pseudo-count. Samples with missing BMI values or incomplete metadata were excluded. This preprocessing pipeline helped standardize the data and reduce technical variability, ensuring that downstream machine learning models could focus on biologically relevant patterns. Several classification models, such as random forests, gradient boosting, and predomics, were trained. We then used SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations) to interpret the models’ outputs. These tools helped identify key bacterial genera that were consistently associated with obesity. Models achieved high predictive accuracy while offering biologically meaningful insights into the gut–obesity connection. Our findings highlight the potential of interpretable machine learning in guiding future microbiome-based strategies for obesity prevention and treatment.
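The two preprocessing steps described above, prevalence filtering and the CLR transformation, can be sketched as follows. The pseudo-count of 0.5, the function names, and the toy count table are assumptions for illustration; the abstract only states that a small pseudo-count was added and that taxa present in fewer than 10% of samples were removed.

```python
import math


def filter_rare_taxa(counts, min_prevalence=0.10):
    """Drop taxa detected in fewer than `min_prevalence` of the samples.

    counts: list of samples, each a list of taxon counts in the same order.
    Returns the filtered table and the indices of the retained taxa.
    """
    n_samples = len(counts)
    n_taxa = len(counts[0])
    keep = [
        j for j in range(n_taxa)
        if sum(1 for row in counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    return [[row[j] for j in keep] for row in counts], keep


def clr(sample, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    Each value becomes log(count + pseudocount) minus the mean of those
    logs, i.e. the log of the ratio to the sample's geometric mean.
    """
    logs = [math.log(c + pseudocount) for c in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Toy table: 4 samples x 3 taxa; the second taxon is never observed.
counts = [
    [10, 0, 5],
    [20, 0, 0],
    [30, 0, 5],
    [40, 0, 0],
]
filtered, kept = filter_rare_taxa(counts)
transformed = [clr(row) for row in filtered]
print(kept)
print(transformed[0])
```

CLR values within a sample sum to zero by construction, which is a quick sanity check after the transformation.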
Florian Auer
The RCX package provides an adaptation of the Cytoscape Exchange (CX) format, facilitating the creation, manipulation, and interoperability of biological networks within the R statistical computing environment. The RCX package now offers the capability to directly construct biological networks from data stored in standard tabular structures within R, significantly enhancing its utility. It now enables the instantiation of network objects from data frames by interpreting information pertaining to nodes, edges, or both simultaneously, accommodating their respective attributes. This complements the name-value approach inherent in the base CX specification and significantly improves the practicality and integrability of the RCX package. Beyond data representation, RCX has been extended to provide improved support for the visual representation of networks as defined within the CX format’s visual properties aspect. Accurate interpretation and application of these properties are crucial for consistent visualization across platforms such as Cytoscape, NDEx, and within RCX itself. Recognizing the complexity of manually constructing these specifications in R, the updated RCX package incorporates higher-level functions that encapsulate the precise structure and valid value domains of Cytoscape visual styles. This abstraction layer minimizes the potential for errors and ensures consistent network rendering. Furthermore, RCy3, an R package providing an interface to remotely control Cytoscape for programmatic network visualization, analysis, and exploration, also leverages the visual properties. Consequently, the implemented functionalities in RCX for handling these visual properties directly enhance the usability of the RCy3 package, streamlining the programmatic control of Cytoscape’s visualization capabilities. These advancements establish RCX as a more rigorous and efficient tool for the R-driven biological network analysis community.
Francesca A L Marino
ADARs (Adenosine Deaminases Acting on RNA) are enzymes that play a crucial role in RNA editing, specifically converting adenosine to inosine (which is interpreted by the cell as guanosine) within double-stranded RNA molecules. This co-transcriptional modification can alter the coding sequence of mRNAs or regulate gene expression and is involved in fundamental processes such as neural development and immune response. To allow a quantitative comparison of ADAR1 activity across RNA-seq samples, the Alu Editing Index (AEI) was developed by Roth et al. (Nature Methods, 2019). The AEI measures ADAR1 activity in Alu elements as the ratio of A-to-G mismatches to the total number of adenosines in Alu elements. The AEI therefore has to be computed from the raw reads. This limits its usability when raw reads are not available or when 3’ end RNA-seq libraries are used, including single-cell RNA-seq and spatial transcriptomic data. To overcome these issues, we developed a tool to globally quantify ADAR activity (not only at Alu sites) starting from the expression matrix. We created four transcriptional ADAR signatures (Neuronal Human, Neuronal Mouse, Inflammation Human, Cancer related) leveraging several bulk RNA-Seq datasets with knock-out (or knock-down) of ADAR1, ADAR2 or both in different conditions (neuronal development, mouse neurons, cancer cell related, interferon stimulation). The signatures were defined as the regulons extracted from networks reconstructed from the gene expression matrices, where a regulon is the set of first neighbors of an ADAR gene. ADAR signatures are calculated with the R package viper. The robustness of the signatures was tested by applying them to validation datasets from bulk, single-cell and spatial transcriptomics. The results suggest that the developed signatures can be employed to obtain reliable ADAR activity scores specific to each biological context.
Hence, we developed an R package that will be available on GitHub and submitted to Bioconductor. Its functions compute ADAR activity scores from an expression matrix of bulk, single-cell or spatial transcriptomics datasets and visualize the results based on the previously described signatures.
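For orientation, the Alu Editing Index mentioned above reduces to a simple ratio. The following is a minimal sketch, not the published implementation: it assumes that per-site read counts supporting A and G at genomic adenosines inside Alu elements are already available, that the denominator is the total coverage at those sites (A plus G reads), and that the result is scaled to a percentage. The function name and the counts are hypothetical.

```python
def alu_editing_index(site_counts):
    """Return the editing index (in percent) from per-site (A, G) read counts.

    site_counts: iterable of (a_reads, g_reads) tuples, one per genomic
    adenosine located in an Alu element (hypothetical input layout).
    """
    site_counts = list(site_counts)
    total_g = sum(g for _, g in site_counts)             # A-to-G mismatches
    total_cov = sum(a + g for a, g in site_counts)       # coverage of adenosines
    if total_cov == 0:
        raise ValueError("no coverage at adenosine sites")
    return 100.0 * total_g / total_cov


# Made-up counts for three adenosine sites: (reads with A, reads with G)
counts = [(95, 5), (180, 20), (50, 0)]
aei = alu_editing_index(counts)
print(f"AEI = {aei:.2f}%")
```

Because the counts come from aligned reads, this quantity cannot be derived from an expression matrix alone, which is the limitation the signature-based approach is meant to work around.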
Ivo Kwee
Multi-omics biomarker identification is crucial for advancing biomedical research and personalized medicine. Computational genomic approaches are increasingly used to screen large biological datasets and identify features capable of classifying or predicting phenotypes. While traditionally focused on single-omics data, integration of multi-omics data enhances biomarker selection by capturing additional layers of biological variation. Matrix factorization methods have emerged as a powerful tool for multi-omics analysis, as they can learn latent factors that capture significant heterogeneity across different data types. However, each of the current multi-omics factorization algorithms presents its own strengths and weaknesses. Distinct approaches may result in different biomarker sets, therefore causing loss of potentially valuable information. Ultimately, reconciling different biomarker sets identified by distinct methods is difficult and error-prone. To contribute to addressing these questions, we undertook a combinatorial approach. We used a TCGA multi-omics breast cancer dataset of 150 samples comprising transcriptomics, proteomics, and microRNA profiles. We utilized this dataset to: 1) perform multi-omics factorization using 10 distinct algorithms or variants thereof, including DIABLO, PCA (principal component analysis), MOFA (multi-omics factor analysis), NMF (non-negative matrix factorization), WGCNA (weighted gene correlation network analysis), SGCCA (sparse generalized CCA), SGCCDA (sparse generalized canonical correlation discriminant analysis), RGCCA (regularized generalized CCA), RGCCDA (regularized generalized canonical correlation discriminant analysis), MCIA (multiple co-inertia analysis); 2) combine the results of all methods into a variable importance measure to identify a robust set of biomarkers; 3) compare the distinct methods to assess their concordance. All methods were able to predict a set of biomarkers.
However, as anticipated, non-overlapping biomarker features between methods were often observed. It remains challenging to determine the optimal criteria for selecting the most appropriate factorization approach. For example, distinct data types or data modalities may require tailored approaches. We compared the methods and found that PCA, MOFA, and NMF are more similar to each other than to the other methods. This can be explained by algorithm similarity. For instance, both PCA and MOFA attempt to explain the maximum variance with a small set of components or factors created as approximate linear combinations of the original variables from each data modality. Also, as expected, canonical correlation analysis (CCA) methods, including SGCCA, RGCCA, and SGCCDA, were highly correlated with each other. We also found that DIABLO, a widely used supervised learning factorization method, was highly correlated with SGCCDA. Correlation analysis also revealed significant divergence between methods. For instance, both PCA and MOFA were weakly correlated with MCIA (multiple co-inertia analysis). We computed a variable importance score for each method. An aggregated score is then calculated as the cumulative rank of the variable importances of the different algorithms. To define a robust set of biomarkers, we select the best predictive features as those with the highest cumulative ranks. Combining classifications from multiple multi-omics integration methods delivers more robust biomarkers, avoiding the risk of information loss from single methods.
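The rank aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the method names are taken from the abstract, but the feature names and importance values are made up, importances are assumed to be non-negative with larger meaning more important, and ties are given the average rank (the abstract does not specify tie handling).

```python
def rank_ascending(scores):
    """Average ranks (1 = smallest score); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def cumulative_rank(importances):
    """Sum, per feature, the within-method ranks of its importance.

    importances: dict mapping method name -> dict of feature -> importance.
    All methods are assumed to score the same set of features.
    """
    features = sorted(next(iter(importances.values())))
    totals = {f: 0.0 for f in features}
    for per_feature in importances.values():
        ranks = rank_ascending([per_feature[f] for f in features])
        for f, r in zip(features, ranks):
            totals[f] += r
    return totals


# Hypothetical importances for three features from three methods
importances = {
    "PCA": {"geneA": 0.9, "geneB": 0.2, "miR1": 0.5},
    "MOFA": {"geneA": 0.7, "geneB": 0.1, "miR1": 0.8},
    "DIABLO": {"geneA": 0.6, "geneB": 0.3, "miR1": 0.4},
}
totals = cumulative_rank(importances)
top = sorted(totals, key=totals.get, reverse=True)
print(top)  # features ordered from highest to lowest cumulative rank
```

Ranking within each method before summing puts the methods on a common scale, so a method whose raw importances happen to be numerically larger does not dominate the aggregate.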
Janet Piñero
Unlock the full breadth of disease-related biomedical knowledge with disgenet2r, an R package that provides seamless access to the DISGENET knowledge platform—integrating over 30 million associations among genes, genetic variants, diseases, and drugs (Piñero et al., 2021, 2019). Designed for researchers in disease genomics, drug discovery, and precision medicine, disgenet2r enables programmatic access, visualization, and advanced exploration of DISGENET data within the R environment. Built for reproducibility, disgenet2r offers a rich set of functions to explore multiple perspectives of DISGENET data—starting from genes, variants, diseases, or drugs—and harnessing a wide range of metrics and attributes available in the platform to rank and filter associations effectively according to the use case. With disgenet2r, users can investigate the genetic landscape of diseases, prioritize candidate targets, and craft hypothesis-driven queries to uncover novel therapeutic opportunities. The package supports a variety of visualization techniques—including networks, heatmaps, and more—facilitating the interpretation of complex disease mechanisms and polygenic interactions. By leveraging DISGENET’s harmonized, AI-enhanced knowledge base, disgenet2r accelerates translational research by uncovering hidden connections critical for early-stage drug development and precision medicine. In this poster/presentation, we will showcase practical use cases that demonstrate how disgenet2r streamlines disease annotation and target prioritization, and how it can be seamlessly integrated into broader biomedical research pipelines.
Kang Wang
The accurate extraction of tumor cell-specific gene expression profiles (GEP) from bulk RNA-seq or microarray data is crucial for understanding the metabolic phenotypes of cancer cells. PureMeta, an innovative bioinformatics tool, addresses this need by providing a robust methodology for isolating tumor cell GEP and generating metabolic phenotypes across seven key metabolic pathways. In this study, we applied PureMeta (https://github.com/WangKang-Leo/PureMeta) to bulk RNA-seq data from breast cancer samples, successfully extracting tumor cell GEP and identifying metabolic states (upregulated, neutral, downregulated) for each pathway. Our results demonstrate the tool’s efficacy in distinguishing tumor-specific metabolic alterations, which are often masked in bulk data analysis. We will present a comprehensive workflow for using PureMeta, including data preprocessing, GEP extraction, and metabolic phenotype generation. Additionally, we will discuss the implications of our findings for cancer research and potential applications in personalized medicine. Our findings underscore the importance of tumor-specific analysis in cancer genomics and highlight PureMeta as a valuable resource for researchers aiming to elucidate the metabolic landscape of tumor cells.
Laura Masatti
Ovarian cancer, especially the high-grade serous subtype (HGSOC), poses a serious challenge in women’s health due to its aggressive nature and high mortality rate. Early detection remains difficult, as most cases are diagnosed at an advanced stage owing to the lack of effective screening tools. This late diagnosis contributes to poor outcomes and complicates treatment strategies. A key factor in its severity is the extensive heterogeneity of both tumor cells and their surrounding microenvironment, combined with their intricate interplay. The major bottleneck in understanding the interactions and mechanisms of tumor microenvironment (TME) cells lies in the accurate identification of specific cell types and subtypes. Despite advances in sequencing technologies, the lack of integrated multi-omics data, such as single-cell transcriptomics paired with spatial transcriptomics, limits the accuracy of cell identification. Accurate cell identification also remains a challenge because of limitations in current annotation methods: manual annotation offers expert insight but is slow and resource-intensive; reference-based tools like SingleR and CIBERSORTx automate the process but rely on complete, context-specific reference datasets, which are often lacking in cases like HGSOC; and marker-based methods allow targeted identification but suffer from reduced accuracy due to overlapping or variable marker expression. Our HGSOC dataset consists of 17 single-cell RNA sequencing (scRNA-seq) samples collected from different patients and anatomical sites, allowing us to explore the heterogeneity of cell composition in relation to metastatic location. Following a comprehensive benchmarking of annotation tools, we selected the strategy that most effectively captured the extensive cellular diversity in our samples, emphasizing methods supported by strong and consistent evidence.
To distinguish tumour from normal cells, we first relied on copy number variation (CNV) profiles inferred from scRNA-seq data. Using an R package currently under development in our lab, we calculated CNV-based scores across the transcriptome, improving classification accuracy and confidence in identifying tumour versus non-tumour cells. Next, we applied a curated panel of biomarkers tailored to the cellular landscape of HGSOC. Leveraging the AUCell package, we performed an initial classification of normal cells into stromal and immune compartments. Subsequently, we refined cell type annotation by manually selecting non-overlapping gene markers representative of key cell populations in the TME. These markers were then used with the SCINA package to assign specific identities to the normal cells. Finally, to validate and further characterize the annotations, we used the signifinder package to assess functional signatures related to cancer metabolism and cellular mechanisms, providing deeper insights into cell behavior within the TME. To achieve our primary objective—identifying a robust set of gene markers for both known and previously uncharacterized cell types and subtypes—we begin with a thorough cell-type annotation. In the second phase of our analysis, we employ unsupervised clustering to identify distinct sub-clusters within the major cell populations. This approach allows us to detect cellular heterogeneity that may reflect unique behaviors, such as adaptation to specific metastatic sites or emergence after initial chemotherapy. By analyzing differentially expressed genes within each cluster, we aim to statistically validate and define meaningful subpopulations, ultimately associating significant marker genes with each distinct cluster. We aim for our pipeline to establish a clear, reproducible framework for cell-type annotation—one that is both biologically grounded and statistically robust. 
By promoting methodological transparency and precision, we hope to contribute a scalable and accurate strategy for the annotation of complex single-cell datasets.
Laureano Tomas-Daza
Capture Hi-C (CHi-C) has emerged as a powerful technique for exploring long-range chromatin interactions, particularly for investigating gene regulation. However, existing computational tools often focus on specific analytical steps (e.g., interaction calling or differential analysis), lacking the flexibility to support comprehensive data handling workflows tailored to the structure of capture-based protocols. While Bioconductor packages like GenomicInteractions provide useful infrastructure for chromatin contact data, they are not optimized for CHi-C pipelines. Conversely, specialized tools such as CHiCAGO and Chicdiff offer focused statistical frameworks but limited support for integration or downstream analysis. To fill this gap, we present HiCaptuRe, a Bioconductor-compatible R package for streamlined processing, standardization, and integration of Capture Hi-C and related interaction datasets. Designed with capture-specific workflows in mind, HiCaptuRe distinguishes between baited and non-baited fragments and supports multiple input formats (including native CHiCAGO outputs, BEDPE, and track files for genome browsers), allowing flexibility across a range of interaction capture technologies (e.g., HiChIP, PCHi-C[1], liCHi-C[2]). HiCaptuRe’s preprocessing module resolves duplicated interactions, enforces consistent formatting, and stores results in a structured HiCaptuRe object. This object maintains interaction data alongside detailed metadata, ensuring transparency and reproducibility throughout the analysis pipeline. The package enables integrative analysis through two complementary approaches:
- interactionsByBaits supports bait-centered workflows (e.g., gene promoter CHi-C), facilitating integration with transcriptomic data such as RNA-seq.
- interactionsByRegions allows region-centered exploration, particularly suited for linking interaction data with epigenomic profiles (e.g., ATAC-seq, CUT&RUN).
These dual strategies enable both hypothesis-driven and exploratory multi-omics analyses, making HiCaptuRe a valuable tool for researchers studying transcriptional regulation and non-coding genome function. HiCaptuRe also offers functionality for comparative analysis, enabling users to identify condition-specific chromatin contacts, a critical feature for dissecting dynamic processes like cellular differentiation or disease progression. To ensure scalability, HiCaptuRe integrates data.table-based data handling and includes caching mechanisms to improve runtime performance. The package’s modular design makes it easily extensible and adaptable to custom workflows within the Bioconductor ecosystem. To broaden accessibility, a Shiny web application is in development, designed for non-programming users. This graphical interface allows wet-lab scientists to upload, explore, and analyze CHi-C datasets without writing code, bridging the gap between data generation and biological interpretation. In summary, HiCaptuRe complements the Bioconductor suite by offering a capture-aware, integration-ready, and user-friendly solution for analyzing chromatin interaction data. By combining preprocessing, standardization, and multi-omics integration within a coherent R framework, HiCaptuRe facilitates deeper insights into 3D genome organization and its role in gene regulation and disease. The package is open-source, actively developed, and open to community contributions and suggestions. HiCaptuRe is currently being prepared for official submission to Bioconductor, where it will continue to evolve as part of the collaborative, reproducibility-focused ecosystem supporting genomic data analysis in R.
[1] Javierre, B. M. et al. Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell 167, 1369–1384.e19 (2016).
[2] Tomás-Daza, L. et al. Low input capture Hi-C (liCHi-C) identifies promoter-enhancer interactions at high-resolution. Nature Communications 14, 268 (2023).
Laurent Gatto
A frequent problem with scientific research software is the lack of support, maintenance and further development. In particular, development by a single researcher can easily result in orphaned and dysfunctional software packages, especially if combined with poor documentation, missing unit tests or lack of adherence to open software development standards. The RforMassSpectrometry (https://www.rformassspectrometry.org/) initiative aims to develop an efficient, scalable, and stable infrastructure for mass spectrometry (MS) based proteomics and metabolomics data analysis. As part of this initiative, a growing ecosystem of R software packages is being developed covering different aspects of metabolomics and proteomics data analysis. To avoid the aforementioned problems, community contributions are fostered, and open development, documentation and long-term support are emphasised. At the heart of the package ecosystem lies the Spectra package, which provides the core infrastructure to handle, process and visualise MS data. Its design allows easy expansion to support existing and new file or data formats, including data representations with minimal memory footprint or remote data access. For proteomics data analysis, two packages in particular are dedicated to the analysis of quantitative and identification data. The PSMatch package handles and manages peptide identification data. It also provides functions to model and visualise peptide-protein relations to make informed decisions about shared peptide filtering. The package also provides functions to calculate and visualise MS2 fragment ions, in conjunction with the Spectra package. The QFeatures package is the workhorse for quantitative proteomics data.
It builds on the SummarizedExperiment and MultiAssayExperiment infrastructure and provides a familiar Bioconductor user experience to manage bulk and single-cell quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable way. For metabolomics data analysis, xcms is one of the core software packages for the required preprocessing of LC-MS data. This Bioconductor package was recently updated to reuse the R for Mass Spectrometry infrastructure, which now also enables the analysis of very large and/or remote data. In addition, this integration simplifies complete analysis workflows, which can include functionality from the MsFeatures package for compounding and from the MetaboAnnotation package for the annotation of untargeted metabolomics experiments. Public annotation resources can be easily accessed through packages such as MsBackendMassbank, MsBackendMsp or CompoundDb, the latter also allowing users to create and manage lab-specific compound databases. These packages rely on the MsCoreUtils and MetaboCoreUtils packages for efficient implementations of commonly used algorithms, designed to be re-used by other R packages. In contrast to a monolithic software design, the R for Mass Spectrometry ecosystem enables users to build customised, modular, and reproducible analysis workflows. Future proteomics- and metabolomics-related development will focus on improved data structures and analysis methods, better support for third-party data import, and better interoperability with other open source software.
Léopold Guyot
As bioinformatics continues to evolve, it must keep pace with experimental techniques that generate increasingly large volumes of data. In this context, optimizing the performance of code and packages that handle these data becomes essential. This work presents the optimization efforts carried out on the QFeatures R/Bioconductor package, which is used for the analysis of quantitative proteomics data. We also highlight a set of tools and methods that are valuable for performance optimization, with a particular focus on VerR, an R package designed to create isolated and reproducible environments (https://leopoldguyot.github.io/VerR/). These environments allow for the installation of specific package versions, enabling systematic benchmarking to assess the performance impact of different versions. As a result of these optimization efforts, we observed a 90% reduction in the runtime of a classical single-cell proteomics (scp) workflow and a 50% decrease in memory usage, demonstrating the significant impact of targeted optimizations.
Molka Anaghim Ftouhi
The e-OMIX project (https://www.eomix.be) aims to lower the barrier of entry into omics research by providing an interactive platform where users can perform analyses and store the resulting data and metadata without the need for advanced coding skills. e-OMIX is developed under the AGPL 3 license as an Angular/Java-based web app, making use of several innovative technologies. Pre-built pipelines are implemented from nf-core, a repository of publicly available workflows, maximizing their reproducibility and ease of use. Result matrices are stored in a database optimized for fast querying of large datasets (TileDB) and can be exported as several object types, notably Bioconductor’s SingleCellExperiment (SCE), AnnData, or Seurat, while metadata are stored as per-sample individual documents in a document-oriented database (CouchDB). To increase the interoperability of metadata, e-OMIX also offers the possibility to manage and export them using Fast Healthcare Interoperability Resources (FHIR), a widely used standard in healthcare and clinical research. Finally, data visualization is made possible by the iSEE R/Bioconductor package. As a first use case, we demonstrate the end-to-end execution of a single-cell RNA-seq pipeline, starting from the upload of metadata and raw files and leading to actionable data, such as annotated cell types, individual gene expression or marker gene identification.
Rike Hanssen
Bioconductor provides a rich ecosystem of R packages for bioinformatics, yet deploying these tools at scale and ensuring reproducibility across compute environments can present considerable challenges. Managing package dependencies, configuring execution environments, and transitioning between automated pipelines and exploratory analysis often require significant manual effort and technical expertise. These barriers can limit the accessibility and reproducibility of R-based bioinformatics tools in large-scale research settings. To address these challenges, we present a set of integrated, free, and open-source solutions (Nextflow, Seqera Containers) that streamline the use of Bioconductor throughout the analysis lifecycle. A web-based container builder (Seqera Containers) allows researchers to generate Docker images for any Bioconductor package—or combination of packages—with just a few clicks. This approach eliminates manual dependency resolution and ensures consistent execution environments across pipelines, collaborators, and compute infrastructure. We show how Nextflow pipelines use these containers in scalable, reproducible workflows. Nextflow supports embedded R scripts as first-class workflow steps, and built-in container support enables each R-based process to run reproducibly across HPC, cloud, or hybrid systems. To support interactive and exploratory analysis, we additionally provide ‘Studios’ through our commercial offering ‘Seqera Platform’ (available with free academic licensing): fully reproducible workspaces for running RStudio or JupyterLab notebooks in the cloud. These environments are containerized and co-located with pipeline data, allowing researchers to move seamlessly between automated workflows and hands-on exploration. By combining user-friendly container generation, scalable workflow integration, and reproducible interactive environments, we aim to reduce friction for researchers working with Bioconductor at scale. 
Seqera, the creators of Nextflow, develop and maintain these tools to enable reproducible, scalable, and accessible data science in the life sciences.
Stefania Pirrotta
Understanding cancer mechanisms, defining subtypes, predicting prognosis, and assessing therapy efficacy are crucial in cancer research. While bulk gene-expression signatures have played a key role, advances in single-cell RNA sequencing and spatial transcriptomics highlight tumor heterogeneity, requiring computational tools for precise characterization. To address this, we developed signifinder, an R Bioconductor package that streamlines the collection and application of cancer transcriptional signatures across bulk, single-cell, and spatial transcriptomics data. Leveraging publicly available signatures, users can assess tumor characteristics, therapy responses, and tumor microenvironment features. Indeed, cancer gene expression research has generated numerous transcriptional signatures that aid in tumor classification, prognosis, and therapeutic decisions. These signatures, composed of specific gene sets, provide insights into tumor biology and interactions with the tumor microenvironment. However, tumor heterogeneity presents challenges, requiring tools that define cancer cell states within high-resolution transcriptomic data. Advancements in single-cell RNA sequencing and spatial transcriptomics have revealed tumors as complex cellular mosaics shaped by spatial patterns and clonal variations. Computational tools are needed to analyze gene expression modules and derive cell-specific scores. While pan-cancer analyses confirm multiple, non-mutually exclusive cancer cell states, a comprehensive catalog of gene modules remains elusive, limiting genomic data interpretation. Despite extensive research, bulk transcriptional signatures face reproducibility and standardization challenges, with few adopted in clinical practice. Open-source implementations are lacking, restricting usability. To bridge this gap, we developed signifinder, a package that implements public transcriptional signatures across transcriptomic data types. 
It ensures interoperability within Bioconductor workflows, enabling systematic signature evaluation. The package integrates bulk, single-cell, and spatial transcriptomic signatures, compiling a collection of cancer gene signatures, and enables intra-tumor heterogeneity analysis. It also includes several visualization functions for easy interpretation of results. Future development will expand its catalog of signatures. By enhancing high-resolution transcriptomic data interpretation, signifinder aids in detecting intra-tumor variability and advancing cancer research.
Yosra Berrouayel Dahour
Identifying transcription factors (TFs) that regulate co-expressed gene sets is a critical step in interpreting differential gene expression and uncovering underlying biological mechanisms. TFEA.ChIP is a key tool for TF enrichment analysis, using publicly available ChIP-seq data. The original version used GeneHancer to link TF binding sites to target genes, a necessary yet challenging aspect of enrichment analysis. It then used ChIP-seq data from ReMap2022 to identify TFs bound to these regions, highlighting potential master regulators. However, this approach is limited by its dependence on predicted interactions, with minimal grounding in raw experimental data and lacking the resolution needed for tissue- or cell type-specific contexts. To address these limitations, we present a major update to TFEA.ChIP. By incorporating cell type–specific associations from the rE2G ENCODE resource, the tool now captures regulatory patterns across diverse cellular contexts, enhancing biological relevance. The rE2G database is built on high-resolution experimental data, including CRISPR and Hi-C, providing a robust foundation for mapping regulatory interactions. Additionally, the updated version filters out TFs not expressed in specific cell types, reducing noise and providing more accurate results. Benchmarking against gene expression profile datasets demonstrates that the updated TFEA.ChIP outperforms the original in accuracy and predictive power. In summary, this upgraded version of TFEA.ChIP offers a more precise, context-aware framework for TF enrichment analysis, facilitating deeper exploration of the transcriptional networks underlying complex cellular responses.