Linguistic research with large-scale corpora
GESUATO, SARA
2008
Abstract
Nowadays a corpus is typically a large collection of text excerpts, representing a range of registers and genres, available in electronic form, marked up for contextual details of production and reception, and tagged for parts of speech. Access to such a database allows a linguist to search for occurrences of a given linguistic form in a specifiable co-text (i.e. lexico-syntactic environment, text type and/or language variety) so as to uncover its patterns of use (i.e. combinatorial options and restrictions). Recent research projects on the catenative construction “going to be V-ing”, English near-synonyms, and denotationally symmetrical gender-marked terms show that electronic corpora can be used to determine in part the meaning of emerging syntactic structures, to trace the semantic space preferentially occupied by a given term and its degree of overlap with that of a neighbouring term, and to reveal the covert cultural assumptions of concepts conveyed through seemingly ideologically neutral terms. Corpus data is valuable in two respects: on the one hand, it makes it easier to distinguish what is possible in a language from what is actually (and frequently) attested; on the other, it offers good opportunities for replication studies, which are the basis for progress in science. However, corpus data also has limitations (e.g. limited representativeness), so the researcher needs to exercise caution in classifying, interpreting and evaluating it, and in generalizing from it to the language as a whole.
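The kind of search described above (finding a form in a specifiable co-text and comparing its distribution across text types) can be illustrated with a minimal keyword-in-context sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy sentences and register labels are invented, the names `TOY_CORPUS`, `PATTERN`, `concordance` and `hits_by_register` are hypothetical, and a surface regular expression stands in for the part-of-speech tagging that a real corpus would provide.

```python
import re
from collections import Counter

# Invented toy sample standing in for a real corpus; each entry is
# (register label, text). Sentences and labels are illustrative only.
TOY_CORPUS = [
    ("conversation", "I think she is going to be working late tonight."),
    ("conversation", "We are going to be leaving soon, so hurry up."),
    ("news", "The minister is going to be meeting union leaders on Friday."),
    ("news", "Prices are going to rise again next year."),
    ("fiction", "He was going to be a doctor, or so he said."),
]

# Rough surface pattern for "going to be V-ing". It treats any word ending
# in -ing as a participle, so it would also match nouns such as "king";
# a POS-tagged corpus would allow a more precise query.
PATTERN = re.compile(r"\bgoing to be (\w+ing)\b", re.IGNORECASE)


def concordance(corpus, pattern, width=25):
    """Return (register, left context, match, right context) for each hit."""
    hits = []
    for register, text in corpus:
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((register, left, m.group(0), right))
    return hits


def hits_by_register(hits):
    """Count how many hits fall in each register."""
    return Counter(register for register, *_ in hits)


if __name__ == "__main__":
    results = concordance(TOY_CORPUS, PATTERN)
    for register, left, match, right in results:
        print(f"{register:>12} | {left:>25} [{match}] {right}")
    print(dict(hits_by_register(results)))
```

On a sample this small the counts say nothing about English; the sketch only shows the mechanics of retrieving a form with its co-text and tallying it by text type, which is also where the representativeness caveat applies.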