Evaluation of differential abundance and network inference methods for microbiota sequencing data / Cappellato, Marco. - (2023 Mar 17).

Evaluation of differential abundance and network inference methods for microbiota sequencing data

CAPPELLATO, MARCO
2023

Abstract

The microbiome comprises all the genetic material within a microbiota (the microorganisms inhabiting an ecological niche). These microorganisms form true ecosystems that live in dynamic equilibrium, including within the human body, where they contribute to our health. Indeed, the microbiota performs numerous important functions for the whole organism: it protects against the growth of pathogens, regulates metabolism, protects the cardiovascular system, and eliminates toxins. Efficient and cost-effective high-throughput DNA sequencing techniques have advanced the study of these complex microbial systems, leading to important findings in different fields. Standard bioinformatics preprocessing pipelines derive from the sequence reads the so-called abundance matrix, which describes the abundance of each taxon in each analyzed sample. Different downstream methodologies are then used to mine information from abundance data. Several characteristics of the abundance matrix, such as discrete values, high sparsity, compositionality, and heteroscedasticity, make downstream analyses challenging. Therefore, bioinformatics methods that account for the nature of abundance data have recently been developed. Among others, differential abundance (DA) analysis looks for statistically significant differences in taxa abundances between classes of samples, whereas network inference analysis estimates the complex network of interactions established between taxa within the microbiota ecosystem. However, there is still no consensus about the best approaches to use. In the literature, methods initially developed for differential expression analysis of RNA-seq data were used for DA analysis. Subsequently, more specific methods for metataxonomic and metagenomic data were proposed. However, although researchers rely on benchmarking studies to decide which method to choose, these studies are difficult to compare because they use different performance metrics and simulation frameworks.
On the other hand, although several network inference methods have been developed specifically for microbiome sequencing data in recent years, benchmarking studies are completely missing from the literature. In this thesis, a benchmark of DA analysis methods is proposed, which makes use of a reliable and easily extendable simulation framework based on an already published 16S abundance data simulator. Moreover, compared with other published comparative studies, the methods’ performance is evaluated with a greater number of metrics, namely False Positive Rate, False Discovery Rate, Recall, Precision-Recall (PR) curve, partial Area Under the PR curve, and computational burden. Furthermore, scenarios and covariates not yet investigated by other studies are considered, such as the combined effect of sample size, percentage of DA taxa, sequencing depth, fold change, variability of taxa, use of a threshold to avoid or allow low-abundance DA taxa, different approaches to dealing with zero entries, normalization, and the presence of different ecological niches. As regards network inference, after an extensive literature review that provides a fairly complete overview of the different approaches, a novel simulation framework based on metabolite-mediated taxa-taxa interactions is proposed. The simulator outputs the gold-standard interaction network and the count table data to be used to infer it, thus providing a valuable tool for benchmarking methods across different experimental scenarios. Finally, a case study on the analysis of the microbiota signature of the upper respiratory tract in patients with SARS-CoV-2 is illustrated.
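
As an illustration only, three of the metrics named in the abstract (False Positive Rate, False Discovery Rate, and Recall) can be computed from a simulated ground truth and a method's DA calls as sketched below. This uses the standard textbook definitions and is not taken from the thesis. The function name `da_metrics`, its arguments, and the toy data are hypothetical.

```python
def da_metrics(truth, called):
    """Compute FPR, FDR and Recall for one set of DA calls.

    truth  -- list of booleans, True if the taxon is truly DA (known from simulation)
    called -- list of booleans, True if the method flagged the taxon as DA
    Both names are hypothetical; the definitions are the standard ones.
    """
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t and c)          # true positives
    fp = sum(1 for t, c in pairs if not t and c)      # false positives
    fn = sum(1 for t, c in pairs if t and not c)      # false negatives
    tn = sum(1 for t, c in pairs if not t and not c)  # true negatives

    # Guard against empty denominators (e.g. a method that calls nothing).
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"FPR": fpr, "FDR": fdr, "Recall": recall}


# Toy example: 6 taxa, 2 truly DA; the method flags one true and one false taxon.
truth = [True, True, False, False, False, False]
called = [True, False, True, False, False, False]
print(da_metrics(truth, called))  # {'FPR': 0.25, 'FDR': 0.5, 'Recall': 0.5}
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch. A real benchmark might instead report such cases as undefined.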
Files in this item:
File: tesi_Marco_Cappellato.pdf
Access: embargoed until 16 March 2026
Description: tesi_Marco_Cappellato
Type: Doctoral thesis
Size: 31.39 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3473646