Statistical Learning Techniques for Causal Structure Discovery and Effect Estimation
SIMIONATO, DARIO
2024
Abstract
The increasing availability of data and the decreasing cost of computation have sparked the data revolution we are experiencing today, with machine learning and artificial intelligence methods influencing our daily lives more and more. This trend has also reached several scientific fields, where complex and massive data analyses now allow researchers to test elaborate hypotheses and to speed up discoveries. A significant drawback of traditional machine learning approaches, however, is that they discover only correlations between variables, which do not always reflect the true causal mechanisms of the phenomenon under study and may therefore lead to misleading conclusions. In light of these obstacles, the field of causality has gained significant traction thanks to its natural ability to answer two fundamental questions for knowledge discovery from data. The first is how to select the important variables among a pool of observed ones, as large datasets comprising multiple, heterogeneous measurements are often collected without any prior knowledge of the importance of each feature. The second is how those variables influence each other, as this helps in understanding the evolution of the scenario under study. Both questions can be answered within the causal framework: causal discovery algorithms aim to recover cause-and-effect relationships among variables, from which it is possible to identify the ones that are important for the task at hand, while effect estimation techniques quantify how modifying a feature (or treatment) in the real world influences the other variables, allowing us to better understand the system under study.

A further common issue in data analysis is the reporting of false discoveries, that is, results that arise by chance without reflecting causal effects or other relationships in the data. This problem is especially relevant when performing large analyses comprising multiple hypotheses, and it is critical in high-stakes fields such as finance or medicine. One way to address it is to adopt techniques designed to bound the Family-Wise Error Rate (FWER), that is, the probability of returning at least one false discovery in output, below a user-defined threshold. In this thesis we develop two causality methods with rigorous guarantees on the FWER: the first addresses a causal discovery problem, while the second involves effect estimation and its application to cancer data.

In the first part of the thesis we focus on a subtask of causal discovery, local causal discovery, which, given a target variable and a candidate set of variables, aims to select a subset of the candidates with specific causal or statistical relationships to the target. In particular, local causal discovery focuses on inferring two sets of variables: the Parent-Children (PC) set, composed of the variables that are direct causes or direct consequences of the target, and the Markov boundary (MB) of the target, the minimal set of variables achieving the highest predictive performance for the target. We present the first two algorithms for local causal discovery that bound the FWER of their output, a guarantee that is needed because inferring the PC and MB sets requires performing multiple independence tests on the data.
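To illustrate why FWER control matters when many independence tests are performed, the following minimal sketch (all names and constants are illustrative; this is not one of the algorithms developed in the thesis) tests independence between a target and many irrelevant candidate variables and estimates how often at least one false discovery is reported, with and without a per-test correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_candidates = 200, 50
alpha, n_repeats = 0.05, 500

uncorrected_hits = 0
bonferroni_hits = 0
for _ in range(n_repeats):
    # Target and candidates are sampled independently, so every rejected
    # independence hypothesis is, by construction, a false discovery.
    target = rng.normal(size=n_samples)
    candidates = rng.normal(size=(n_samples, n_candidates))
    pvals = np.array([stats.pearsonr(candidates[:, j], target)[1]
                      for j in range(n_candidates)])
    uncorrected_hits += (pvals < alpha).any()
    bonferroni_hits += (pvals < alpha / n_candidates).any()

# The uncorrected FWER far exceeds alpha; the corrected one stays below it.
print(f"Estimated FWER, per-test alpha:        {uncorrected_hits / n_repeats:.2f}")
print(f"Estimated FWER, Bonferroni correction: {bonferroni_hits / n_repeats:.2f}")
```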
We prove that state-of-the-art algorithms cannot be adapted to this task, as they rely on untestable and unrealistic assumptions on the statistical power of the independence tests used for the discovery, while our algorithms come with provable guarantees on their results and require fewer assumptions. We successfully control the FWER either by exploiting the well-known Bonferroni correction for multiple hypothesis testing or by implementing data-dependent bounds based on Rademacher averages, a tool commonly used to measure the complexity of a family of functions. To the best of our knowledge, our work is the first to introduce the use of Rademacher averages in (local) causal discovery.
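As a rough illustration of the tool underlying the data-dependent bounds, the sketch below approximates the empirical Rademacher average of a finite family of functions by Monte Carlo. It is a simplified example under one common definition (supremum without the absolute value), with illustrative names and constants, not the estimator used in the thesis.

```python
import numpy as np

def empirical_rademacher_average(function_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average.

    function_values: array of shape (n_functions, n_samples), where entry
    (k, i) is f_k(x_i), a finite family of functions evaluated on the sample.
    """
    rng = np.random.default_rng(seed)
    n_functions, n_samples = function_values.shape
    # Rademacher variables: independent and uniform on {-1, +1}.
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n_samples))
    # For each draw, take the supremum over the family of the sigma-weighted
    # empirical average, then average over the draws.
    correlations = sigmas @ function_values.T / n_samples  # (n_draws, n_functions)
    return correlations.max(axis=1).mean()

# Example: indicator functions f_t(x) = 1[x <= t] over a grid of thresholds t,
# evaluated on a random sample (purely illustrative data).
rng = np.random.default_rng(1)
sample = rng.normal(size=100)
thresholds = np.linspace(-2, 2, 25)
values = (sample[None, :] <= thresholds[:, None]).astype(float)
print(f"Estimated empirical Rademacher average: "
      f"{empirical_rademacher_average(values):.3f}")
```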