Bayesian inference for tensor factorization models / Russo, Massimiliano. - (2019).

Bayesian inference for tensor factorization models

Russo, Massimiliano
2019

Abstract

Multivariate categorical data are routinely collected in many fields, including epidemiology, biology, and sociology. Popular models for such data include log-linear and tensor factorization models, the latter having the advantage of flexibly characterizing the dependence structure underlying the data. Within this framework, this thesis aims to provide novel approaches for defining compact representations of dependence structures and to introduce new inference possibilities for tensor factorization approaches. We introduce a new class of GROuped Tensor (GROT) factorizations, which compress the data better than the standard Parafac approach, using relatively few components to represent the joint probability mass function of the data. While popular Parafac factorizations mix together independent components, GROT mixes together grouped factorizations, which is equivalent to replacing the vector arms in Parafac with low-dimensional tensor arms. We adopt a Bayesian approach to inference with Dirichlet priors on the mixing weights and arm components, obtaining a combined low-rank and sparse structure while facilitating efficient posterior computation via Markov chain Monte Carlo.

Motivated by an application to malaria risk assessment, we also introduce a novel multivariate generalization of mixed membership models, which allows identification of correlated profiles related to different domains corresponding to separate groups of variables. As a case study we consider the Machadinho settlement project in Brazil, with the aim of defining survey-based environmental and behavioral risk profiles and studying their interaction and evolution. To this end, we show that using correlated multiple membership vectors leads to interpretable inference that requires fewer profiles than standard formulations, while inducing a more compact representation of the population-level model. We propose a novel multivariate logistic normal distribution for the membership vectors, which makes it easy to introduce auxiliary information into the membership profiles through a multivariate latent logistic regression. A Bayesian approach to inference, relying on Pólya-gamma data augmentation, facilitates efficient posterior computation via Markov chain Monte Carlo. The proposed approach is shown to outperform the classical mixed membership model both in simulations and in the malaria diffusion application.
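
To make the Parafac representation mentioned in the abstract concrete, the following is a minimal sketch, not taken from the thesis, of a joint probability mass function written as a mixture of independent (product-multinomial) components. It covers only the standard Parafac form, not GROT. The function names, the number of components `H`, and the category counts in `levels` are hypothetical choices made for illustration, and the weights and arms are simply drawn at random rather than inferred from data.

```python
from itertools import product
import random


def random_simplex(k, rng):
    """Random probability vector of length k (normalized Gamma(1, 1) draws)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def parafac_pmf(weights, arms):
    """Joint pmf as a mixture of product-multinomial components.

    weights[h]    : mixing weight of component h
    arms[h][j][c] : P(y_j = c | component h), the "vector arm" for variable j
    Returns a dict mapping each cell (c_1, ..., c_p) to its probability.
    """
    levels = [len(arm) for arm in arms[0]]
    pmf = {}
    for cell in product(*(range(d) for d in levels)):
        prob = 0.0
        for w, comp in zip(weights, arms):
            term = w
            for j, c in enumerate(cell):
                term *= comp[j][c]  # variables are independent within a component
            prob += term
        pmf[cell] = prob
    return pmf


rng = random.Random(0)
levels = [2, 3, 2]  # hypothetical: three variables with 2, 3 and 2 categories
H = 2               # hypothetical number of mixture components

weights = random_simplex(H, rng)
arms = [[random_simplex(d, rng) for d in levels] for _ in range(H)]
pmf = parafac_pmf(weights, arms)

n_cells = len(pmf)                                  # cells in the contingency table
n_free = (H - 1) + H * sum(d - 1 for d in levels)   # free parameters in the factorization
print(n_cells, n_free, round(sum(pmf.values()), 6))
```

In this toy setting the table has 12 cells while the factorization uses 9 free parameters; the gap widens quickly as the number of variables grows, which is the sense in which such factorizations compress the joint distribution.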
Multivariate categorical data; Contingency tables; Tensor factorizations; Admixture model; Multivariate logistic normal distribution; Latent Dirichlet allocation
Files in this item:
Thesis.pdf (open access). Type: doctoral thesis. License: free access. Size: 3.82 MB. Format: Adobe PDF.
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3426830