Challenges of a multilingual corpus (Old French/Old Venetian): The example of the MICLE project

Francesco Pinzin
Writing – Original Draft Preparation
2024

Abstract

Digital formats and data visualization are key aspects of building a multilingual corpus. Nonetheless, they have received considerably less attention than other important factors, such as the organization of the workflow and the selection of the tagset. In this contribution we show how these two apparently separate aspects are inextricably intertwined and how we approached them in the MICLE project (Micro Cues for Language Evolution, ANR/DFG) from the standpoint of inclusiveness. More specifically, we show how including multiple PoS tagsets (UD, UPENN, PRESTO) in the same corpus by means of conversion scripts makes the data easier to use and improves the organization of the workflow. Furthermore, we show how adopting the XML-TEI format for the final version of the data provides enough flexibility to accommodate all the different PoS tags and the various kinds of syntactic information (encoded in the dependency-based UD format and the constituency-based UPENN format). This has a clear payoff for the comparability of the data from the two languages of the corpus, Old French and Old Venetian, as we show in the last section, where we compare the results of an ongoing investigation of Infinitival Inversion and its relationship with the Verb Second word-order constraint.
Published in: VENEZIA E LA FRANCIA TRA MEDIOEVO ED ETÀ MODERNA. Similitudini, specificità, interrelazioni (2024)
ISBN: 979-12-5496-036-3
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3552520