Metodi di permutazione per test multipli su dati ad alta dimensionalità

Vesely, Anna

We consider the problem of testing multiple hypotheses in high-dimensional settings, arguing that more tools are needed to support an exploratory approach, where researchers may test many subsets of hypotheses and make a selection post hoc. We focus on resampling-based methods, which rely on minimal assumptions and tend to be more powerful than parametric approaches, especially in the presence of multiple hypotheses. In this framework, we provide two general and flexible procedures: a method to make confidence statements on the proportion of true discoveries (TDP), and a method to make inference on predictor variables in linear regression.

First, we propose a general closed testing procedure for sum-based global tests. It provides lower confidence bounds for the TDP, simultaneously over all subsets of hypotheses; these simultaneous inferences come for free, i.e., without any adjustment of the alpha-level, whenever a global test is used. Our method allows for an exploratory approach, as simultaneity ensures control of the TDP even when the subset of interest is selected post hoc. It adapts to the unknown joint distribution of the data through permutation testing. Any sum test may be employed, depending on the desired power properties. We present an iterative shortcut for the closed testing procedure, based on the branch and bound algorithm, which converges to the full closed testing results, often after a few iterations; even if it is stopped early, it controls the TDP. The feasibility of the method for high-dimensional data is illustrated on brain imaging data; we then compare the properties of different choices for the sum test through simulations.

Subsequently, we propose a multiple testing method for hypotheses on coefficients in high-dimensional linear regression. It allows the construction of asymptotically valid resampling-based tests for any subset of hypotheses, which can be used in closed testing procedures, including the above-mentioned shortcut. The approach is presented in two ways: an exact method, and an approximate method that is less computationally intensive. We show that, to build test statistics for any set of hypotheses, it is sufficient to define test statistics for individual hypotheses, relying on a variable selection procedure, and then combine these through a suitable function. The resulting method is extremely flexible, allowing different selection procedures and several combining functions. The performance of the proposed exact and approximate methods is illustrated through simulations.

In the context of multiple testing on high-dimensional data, new tools are needed to support an exploratory approach, in which researchers can test several subsets of hypotheses and select the set of interest post hoc. In this manuscript we focus on permutation tests, which require minimal assumptions and are generally more powerful than parametric approaches, especially when multiple hypotheses are considered. We propose two general and flexible methods: one to give a confidence set for the proportion of true positives (true discovery proportion, TDP), and one to make inference on the predictors in linear regression. First, we propose a procedure based on closed testing for global tests defined through sums. It makes it possible to compute lower confidence bounds for the TDP, simultaneously over all subsets of hypotheses. For any global test, these simultaneous inferences are available without adjusting the significance level. The proposed method allows an exploratory approach, since the simultaneity of the confidence bounds controls the TDP even when the set of interest is selected post hoc. Moreover, the method adapts to the distribution of the data through permutations. Any sum-based test can be used, depending on the desired properties. The method is presented as an iterative shortcut for the closed testing procedure, which exploits a branch and bound algorithm and converges to the closed testing results, often after a few iterations. The procedure controls the TDP even if it is stopped before reaching convergence. We show that the method is suitable for high-dimensional data by analyzing brain images, and then compare the properties of different global tests through simulations. Subsequently, we propose a procedure for testing multiple hypotheses on the coefficients of a high-dimensional linear regression. The method constructs asymptotically valid permutation tests for every subset of hypotheses. These tests can then be used within closed testing approaches, including the shortcut defined earlier. We propose the method in two versions: an exact one, and an approximation that requires less computation time and less memory. We show that, to compute test statistics for any set of hypotheses, it is sufficient to define statistics for the individual hypotheses, relying on a variable selection procedure; these statistics are then combined through functions with certain characteristics. The result is an extremely flexible method that allows different selection procedures and different combining functions. We illustrate the behavior of the exact and approximate methods through simulations.
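
The closed testing idea described in the abstract can be sketched by brute force for a handful of hypotheses. The Python sketch below is not the thesis's implementation. It enumerates all subsets instead of using the branch and bound shortcut, and it assumes a one-sample setting with random sign flips and a plain one-sided sum of standardized statistics as the sum test. All function names are hypothetical.

```python
import itertools
import numpy as np


def sign_flip_statistics(X, n_perm=200, seed=0):
    """Per-hypothesis statistics under random sign flips of the rows of X.

    Returns an (n_perm, m) matrix; row 0 is the identity (observed data).
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    flips = rng.choice([-1.0, 1.0], size=(n_perm, n))
    flips[0] = 1.0  # keep the observed data as the first transformation
    # Standardized column sums; the denominator is invariant to sign flips.
    return (flips @ X) / np.sqrt((X ** 2).sum(axis=0))


def locally_rejected(G, subset, alpha=0.05):
    """One-sided permutation sum test of the intersection hypothesis."""
    if len(subset) == 0:
        return False  # the empty intersection is never rejected
    sums = G[:, list(subset)].sum(axis=1)
    # Reject when the permutation p-value (share of transformations with a
    # sum at least as large as the observed one) is at most alpha.
    return np.count_nonzero(sums >= sums[0]) <= alpha * len(sums)


def true_discovery_lower_bound(G, S, alpha=0.05):
    """(1 - alpha) lower bound on the number of false hypotheses in S.

    Brute force over all 2^m subsets, so only usable for small m.
    """
    m = G.shape[1]
    S = set(S)
    max_overlap = 0
    for size in range(1, m + 1):
        for J in itertools.combinations(range(m), size):
            if not locally_rejected(G, J, alpha):
                # J may consist of true hypotheses only, so its overlap
                # with S cannot be counted as discoveries.
                max_overlap = max(max_overlap, len(S.intersection(J)))
    return len(S) - max_overlap


# Usage on simulated data (assumed setup): signal in the first two of
# five variables, then a bound for a subset that could be chosen post hoc.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X[:, :2] += 1.0
G = sign_flip_statistics(X)
S = [0, 1, 2]
d = true_discovery_lower_bound(G, S)
print(f"at least {d} of {len(S)} selected hypotheses are false; TDP >= {d / len(S):.2f}")
```

Because the same sign flips are reused for every subset and the bound holds simultaneously over all subsets, S may be chosen after looking at the data. The enumeration costs 2^m local tests, which is why a shortcut is needed in high dimensions.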

Metodi di permutazione per test multipli su dati ad alta dimensionalità / Vesely, Anna. - (2022 May 05).

	Title in English
	
				Resampling-based methods for multiple testing on high-dimensional data
			
	Year of defense
	
				5 May 2022
			
	Appears in the categories:
	
				08.01 - Doctoral Thesis UNIPD (Legal Deposit)

Files in this item:

File	Size	Format
tesi_definitiva_Anna_Vesely.pdf (open access; type: doctoral thesis; license: other)	1.13 MB	Adobe PDF