Topic Modeling algorithms help unveil the latent thematic structure from large document collections. Previous works showed that traditional approaches could be less effective when applied to short texts, e.g., tweets; however, that can be mitigated by assuming that each document is about a single topic, as done in Twitter-LDA. In this work, we relax this assumption and propose a new model where a document can be about single or multiple topics. Our model allows the generation of diverse types of descriptors from latent topics, e.g., words and hashtags, similarly to Hashtag-LDA. Moreover, words/hashtags can be generated from topics or a background/global distribution. The proposed model is modular, and our goal is to tailor it to collections that can be heterogeneous both in the presence of single or multiple-topic documents and in the adoption of diverse topic representations.

A Modular Approach to Topic Modeling for Heterogeneous Documents

Di Buccio E.
2022

Abstract

Topic Modeling algorithms help unveil the latent thematic structure from large document collections. Previous works showed that traditional approaches could be less effective when applied to short texts, e.g., tweets; however, that can be mitigated by assuming that each document is about a single topic, as done in Twitter-LDA. In this work, we relax this assumption and propose a new model where a document can be about single or multiple topics. Our model allows the generation of diverse types of descriptors from latent topics, e.g., words and hashtags, similarly to Hashtag-LDA. Moreover, words/hashtags can be generated from topics or a background/global distribution. The proposed model is modular, and our goal is to tailor it to collections that can be heterogeneous both in the presence of single or multiple-topic documents and in the adoption of diverse topic representations.
2022
CEUR Workshop Proceedings
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3457706
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact