High-throughput technologies have propelled biology into the big data era. Single-cell RNA sequencing now produces datasets with millions of cells, while digital pathology generates whole-slide images containing billions of pixels. These advances enable unprecedented discovery but create a computational paradox: data are generated faster than they can be processed, and standard workflows often fail to scale. Efficient algorithms and integrative strategies are therefore essential for analyzing massive, heterogeneous datasets. This PhD thesis addresses these challenges through two complementary aims. First, we benchmark different Singular Value Decomposition (SVD) algorithms for Principal Component Analysis (PCA), a key dimensionality-reduction step in single-cell transcriptomics. Classical PCA becomes prohibitively slow and memory-intensive as data size increases. To overcome these limitations, we evaluate state-of-the-art algorithms and out-of-memory data formats across complete single-cell workflows. The benchmark compares Seurat, OSCA/Bioconductor, and scrapper in R with Scanpy and GPU-enabled frameworks such as rapids_singlecell in Python, the latter leveraging GPU acceleration to reduce runtime and memory usage on datasets with millions of cells. These analyses quantify performance trade-offs and provide reproducible guidance for selecting optimal pipelines for large-scale single-cell studies. Second, we focus on digital pathology, where histopathological images reveal tissue architecture, cellular morphology, and tumor spatial organization. We processed 11,765 H&E-stained images from 32 TCGA cancer types, using a deep learning model (HoVer-Net) to extract nuclei-level features and Prov-GigaPath to extract slide-level embeddings.
To bridge the gap between image analysis and the R/Bioconductor ecosystem, we released three packages: TCIAAPI, HistoImageR, and imageTCGA, the last of which is a Shiny application for interactive exploration, filtering, and visualization of extracted features alongside the original images. By combining scalable computation with cross-modal integration, this work improves the efficiency of single-cell analysis and supports precision medicine through clinically relevant molecular-morphological associations.
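To illustrate the PCA step that the first aim benchmarks, the following is a minimal sketch that contrasts a full SVD with a randomized truncated SVD on a small synthetic matrix. It is not the implementation used by Seurat, Scanpy, or any other framework named above; the function names `pca_exact` and `pca_randomized`, the parameters `oversample` and `n_iter`, and the synthetic data are assumptions made purely for illustration, and a dense NumPy array stands in for what would normally be a sparse or on-disk count matrix.

```python
import numpy as np

def pca_exact(X, k):
    """PCA scores from a full SVD of the column-centered matrix (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], s[:k]

def pca_randomized(X, k, oversample=10, n_iter=4, seed=0):
    """PCA scores from a randomized range finder with power iterations (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Project onto a random subspace slightly larger than k.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((Xc.shape[1], k + oversample)))
    # Power iterations sharpen the captured subspace; QR keeps it numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    # SVD of the small projected matrix instead of the full one.
    Ub, s, _ = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    U = Q @ Ub
    return U[:, :k] * s[:k], s[:k]

# Synthetic "cells x genes" matrix: rank-5 signal plus weak noise (illustrative only).
rng = np.random.default_rng(42)
signal = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 200)) * 3.0
X = signal + 0.1 * rng.standard_normal((500, 200))

scores_exact, s_exact = pca_exact(X, k=5)
scores_rand, s_rand = pca_randomized(X, k=5)
```

The randomized variant only ever decomposes a matrix with `k + oversample` rows, which is the general reason truncated solvers tend to scale better than a full decomposition; actual runtime and memory behavior depend on the implementation and data format.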
Sviluppo di metodi efficienti e scalabili per l’analisi di dati omici in biologia del cancro / Billato, Ilaria. - (2026 Feb 20).
Sviluppo di metodi efficienti e scalabili per l’analisi di dati omici in biologia del cancro
BILLATO, ILARIA
2026
| File | Size | Format | |
|---|---|---|---|
| tesi_definitiva_Ilaria_Billato.pdf (open access). Description: final thesis document. Type: Doctoral thesis | 8.9 MB | Adobe PDF | View/Open |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.




