Lost in Noise: When cleaning up clouds the picture. Fuzzy topic modeling and robust low-rank decomposition
Antonio Calcagnì; Andrea Sciandra; Arjuna Tuzzi
2025
Abstract
This preliminary study assesses how noise-removal techniques, such as Principal Component Pursuit (PCP), affect the document-term matrix when applied before topic modeling. Specifically, fuzzy Latent Semantic Analysis (fLSA) is applied to a benchmark dataset of Air France customer reviews to evaluate how two input representations, the standard term-frequency matrix and its low-rank approximation obtained via robust low-rank decomposition, affect topic coherence and interpretability. Initial results indicate that, while fLSA effectively extracts meaningful topics, noise removal via PCP introduces distortions that alter the topic structure.