Digital formats and data visualization are key aspects of the creation of a multilingual corpus. Nonetheless, they have received considerably less attention than other important factors, such as the organization of the workflow and the selection of the tagset. In this contribution we show how these two apparently separate aspects are inextricably intertwined and how we approached them in the MICLE project (Micro Cues for Language Evolution, ANR/DFG) from the standpoint of inclusiveness. More specifically, we show how including multiple PoS tagsets (UD, UPENN, PRESTO) in the same corpus by means of conversion scripts makes the data more accessible and improves the organization of the workflow. Furthermore, we show how adopting the XML-TEI format for the final version of the data provides enough flexibility to accommodate all the different PoS tags and the various kinds of syntactic information (encoded, in turn, in the dependency-based UD format and the constituency-based UPENN format). This has a clear payoff in terms of the comparability of the data from the two languages of the corpus, Old French and Old Venetian, as we show in the last section, where we compare the results of an ongoing investigation into the phenomenon of Infinitival Inversion and its relationship with the Verb Second word-order constraint.
Challenges of a multilingual corpus (Old French/Old Venetian): The example of the MICLE project
Francesco Pinzin
Writing – Original Draft Preparation
2024




