Sviluppo di metodi efficienti e scalabili per l’analisi di dati omici in biologia del cancro / Billato, Ilaria. - (2026 Feb 20).

Sviluppo di metodi efficienti e scalabili per l’analisi di dati omici in biologia del cancro

BILLATO, ILARIA
2026

Abstract

High-throughput technologies have propelled biology into the big data era. Single-cell RNA sequencing now produces datasets with millions of cells, while digital pathology generates whole-slide images containing billions of pixels. These advances enable unprecedented discovery but create a computational paradox: data are generated faster than they can be processed, and standard workflows often fail to scale. Efficient algorithms and integrative strategies are therefore essential for analyzing massive, heterogeneous datasets. This PhD thesis addresses these challenges through two complementary aims. First, we benchmark different Singular Value Decomposition (SVD) algorithms for Principal Component Analysis (PCA), a key dimensionality-reduction step in single-cell transcriptomics. Classical PCA becomes prohibitively slow and memory-intensive as data size increases. To overcome these limitations, we evaluate state-of-the-art algorithms and out-of-memory data formats across complete single-cell workflows. The benchmark compares Seurat, OSCA/Bioconductor, and scrapper in R with Scanpy and GPU-enabled frameworks such as rapids_singlecell in Python, which leverage GPU acceleration to reduce runtime and memory usage on datasets with millions of cells. These analyses quantify performance trade-offs and provide reproducible guidance for selecting optimal pipelines for large-scale single-cell studies. Second, we focus on digital pathology, where histopathological images reveal tissue architecture, cellular morphology, and tumor spatial organization. We processed 11,765 H&E-stained images from 32 TCGA cancer types, using deep learning (HoVer-Net) to extract nuclei-level features and Prov-GigaPath to extract slide-level embeddings.
To bridge the gap between image analysis and the R/Bioconductor ecosystem, we released three packages: TCIAAPI, HistoImageR, and imageTCGA, a Shiny application for interactive exploration, filtering, and visualization of extracted features alongside the original images. By combining scalable computation with cross-modal integration, this work improves the efficiency of single-cell analysis and supports precision medicine through clinically relevant molecular-morphological associations.
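To make the first aim concrete, the following is a minimal sketch of the kind of trade-off the benchmark examines: PCA computed from a full SVD versus PCA computed from a randomized, truncated SVD. It is an illustration only and is not the thesis code or the implementation used by any of the packages named above. The function names (`pca_exact`, `pca_randomized`), the parameters `oversample` and `n_iter`, and the synthetic cells-by-genes matrix are all assumptions introduced here.

```python
import numpy as np


def pca_exact(X, k):
    """Top-k PCA scores and singular values from a full thin SVD."""
    Xc = X - X.mean(axis=0)  # center each gene (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], s[:k]


def pca_randomized(X, k, oversample=10, n_iter=4, seed=0):
    """Top-k PCA scores and singular values from a randomized SVD.

    Hypothetical helper: projects the centered matrix onto a small random
    subspace, refines that subspace with a few power iterations, and then
    runs an SVD on the much smaller projected matrix.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    width = k + oversample  # a few extra columns improve accuracy

    # Random projection followed by orthonormalization.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((Xc.shape[1], width)))

    # Power iterations with re-orthonormalization for numerical stability.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)

    B = Q.T @ Xc  # small (width x genes) matrix
    Ub, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub  # map back to the original cell space
    return U[:, :k] * s[:k], s[:k]


# Synthetic example: 500 "cells" x 200 "genes", rank-5 signal plus noise.
rng = np.random.default_rng(1)
X = 3.0 * rng.standard_normal((500, 5)) @ rng.standard_normal((5, 200))
X += 0.1 * rng.standard_normal((500, 200))

scores_exact, s_exact = pca_exact(X, k=5)
scores_rand, s_rand = pca_randomized(X, k=5)

# Largest relative difference between the two sets of singular values.
print(np.max(np.abs(s_exact - s_rand) / s_exact))
```

The randomized variant only ever decomposes a matrix with `k + oversample` rows, which is why approaches of this kind tend to scale better when only a few dozen components are needed. Individual components may differ in sign between the two functions, so comparisons are best made on singular values or on absolute scores.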
Development of efficient and scalable methods for omic data analyses in cancer biology
20-feb-2026
Files in this item:
File: tesi_definitiva_Ilaria_Billato.pdf
Access: open access
Description: final thesis document
Type: Doctoral thesis
Size: 8.9 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3594627