Lost in Noise: When cleaning up clouds the picture. Fuzzy topic modeling and robust low-rank decomposition

Antonio Calcagnì; Andrea Sciandra; Arjuna Tuzzi
2025

Abstract

This preliminary study assesses the impact of noise-removal techniques, such as Principal Component Pursuit (PCP), applied to the document-term matrix before topic modeling. Specifically, fuzzy Latent Semantic Analysis (fLSA) is applied to a benchmark dataset of Air France customer reviews to evaluate how two input representations, the standard term-frequency matrix and its low-rank approximation obtained by robust low-rank decomposition, affect topic coherence and interpretability. Initial results indicate that, while fLSA effectively extracts meaningful topics, noise removal via PCP introduces distortions that alter the topic structure.
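
To make the preprocessing step concrete, the following is a minimal sketch of PCP, which splits a matrix M into a low-rank part L and a sparse part S, here solved with a standard augmented-Lagrangian iteration. It is an illustration only and not the authors' implementation: the function names (`shrink`, `svd_shrink`, `pcp`), the default values of `lam` and `mu`, and the synthetic toy matrix are all assumptions, and no Air France data is used.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_shrink(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Approximately split M into low-rank L and sparse S with M ~ L + S.

    lam and mu default to commonly used heuristics; these are assumed
    defaults, not settings reported in the paper.
    """
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    total = np.abs(M).sum()
    if total == 0.0:
        # Nothing to decompose: both parts are zero.
        return np.zeros_like(M), np.zeros_like(M)
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * total)

    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    norm_M = np.linalg.norm(M)

    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)  # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)      # sparse update
        R = M - L - S                             # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S


# Synthetic stand-in for a document-term matrix (20 documents, 30 terms):
# a rank-one count pattern plus a few isolated large entries.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.integers(1, 5, size=20),
                    rng.integers(1, 5, size=30)).astype(float)
noise = np.zeros_like(low_rank)
noise.flat[rng.choice(low_rank.size, size=15, replace=False)] = 10.0
M = low_rank + noise

L, S = pcp(M)
```

In a pipeline like the one the abstract describes, `L` would play the role of the denoised input passed to the topic model in place of `M`. Note that `L` is generally neither integer-valued nor guaranteed non-negative, which is one plausible way such preprocessing could change the topics a downstream model recovers.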
BOOK OF SHORT PAPERS
IES 2025 - Innovation & Society: Statistics and Data Science for Evaluation and Quality
ISBN 978 88 5495 849 4

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3556032