Does the Century Matter? Machine Learning Methods to Attribute Historical Periods in an Italian Literary Corpus

Cortelazzo, Michele A.; Gatti, Franco; Mikros, Georgios K.; Tuzzi, Arjuna

doi:10.1515/9783110763560-003

This study analyses an Italian literary corpus from a diachronic perspective using machine learning (ML) methods. Working with a collection of texts written between the 16th and the 21st century, we apply a well-known, robust ML algorithm, Random Forest (RF), to see how the texts are classified under four different partitions, which represent periodizations theorized by four Italian literature scholars. The corpus used to train the algorithm includes 420 Italian texts: 100 from the 16th century, 27 from the 17th, 57 from the 18th, 100 from the 19th, 100 from the 20th, and 36 from the 21st. To vectorize the texts, we used the Author’s Multilevel N-gram Profile (AMNP) (Mikros and Perifanos, 2013; Cortelazzo, Mikros, and Tuzzi, 2018), a document representation method that takes into account a diverse set of linguistic features, i.e., n-grams of increasing length (unigrams, bigrams, trigrams) and of increasing level (character, word). Each text was split into chunks of 2000 words, and the chunks were then transformed into AMNP vectors. The results show high classification accuracy with the Random Forest algorithm: precision across the four periodizations ranged from a minimum of 89% for the partition based on Migliorini's theories to a maximum of 97% for the partition based on Cella's. Looking at the misclassified cases, particularly under Migliorini's partition, it is interesting to note that when Random Forest assigns a text chunk to the wrong century, the error is usually of ±1 century.
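The chunking and vectorization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_into_chunks`, `ngrams`, `multilevel_profile`), whitespace tokenization, dropping a trailing chunk shorter than 2000 words, and relative-frequency weighting are all assumptions made for illustration, and any selection of the most frequent n-grams is omitted.

```python
from collections import Counter


def split_into_chunks(text, chunk_size=2000):
    """Split a text into consecutive chunks of `chunk_size` words.

    Assumption: tokens are whitespace-separated and a trailing chunk
    shorter than `chunk_size` is dropped.
    """
    words = text.split()
    last_start = len(words) - chunk_size + 1
    return [words[i:i + chunk_size] for i in range(0, last_start, chunk_size)]


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def multilevel_profile(words, max_n=3):
    """Build a simplified multilevel n-gram profile for one chunk.

    Two levels (character, word) are combined with n-gram lengths
    1..max_n. Values are relative frequencies within each
    (level, length) group, which is an assumed weighting.
    """
    profile = {}
    chars = " ".join(words)  # character level is computed on the joined chunk
    for level, seq in (("char", chars), ("word", words)):
        for n in range(1, max_n + 1):
            counts = Counter(ngrams(seq, n))
            total = sum(counts.values())
            for gram, count in counts.items():
                profile[(level, n, gram)] = count / total
    return profile
```

Each chunk's profile would then be mapped onto a shared feature space and passed, together with the period label of its source text, to a Random Forest classifier; that step is left out here because the abstract does not specify the configuration used.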