Competing Sets of Predictors in an Authorship Attribution Task: Most Frequent Words, Large Language Models and Correspondence Analysis
Andrea Sciandra; Arjuna Tuzzi
2025
Abstract
This study aims to assess the performance of different feature sets in an authorship attribution test using a large corpus of 76 contemporary Italian popular mystery novels by 16 authors. The feature sets include the dimensions derived from a large language model, the most frequent words, and the coordinates of correspondence analysis. Our analysis compares and contrasts the results obtained through these different vector representations in machine learning classification tasks. Although transformers have been shown to outperform other alternatives in previous works, in this case, correspondence analysis proves to be the winner of the challenge. The results support the hypothesis that specialized large corpora require tailor-made representations.

File | Access | Description | Type | License | Size | Format |
---|---|---|---|---|---|---|
2025_SciaTuz_SIS2024_Bari_con copertina.pdf | Restricted access | Editorial | Published (Publisher's Version of Record) | Private access - not public | 1.51 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.