A Probabilistic Model for Stemmer Generation

Bacchin, Michela; Ferro, Nicola; Melucci, Massimo

Managing textual resources and providing full-text search over them has become a relevant issue for database management systems as well. Stemming is part of both the indexing and the searching process when dealing with textual resources. In this paper we present a language-independent probabilistic model that can automatically generate stemmers for several different languages. The variety of word forms can prevent a match between the end user’s words and the document words even when they refer to the same concept, and this mismatch degrades retrieval performance. Stemmers can improve retrieval effectiveness, but designing and implementing them requires considerable effort. The proposed model describes the mutual reinforcement relationship between stems and derivations and then provides a probabilistic interpretation of it. A series of experiments shows that the stemmers generated by the probabilistic model are as effective as those based on linguistic knowledge.
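
The abstract does not specify how the mutual reinforcement between stems and derivations is computed. As a rough illustration only, the idea can be sketched as a HITS-style iteration over all prefix/suffix splits of a vocabulary: a prefix is a good stem candidate if it combines with good derivations, and a suffix is a good derivation candidate if it attaches to good stems. The function name `mutual_reinforcement_stems`, the scoring rule, and the choice to allow an empty suffix are assumptions made for this sketch, not the model defined in the paper, and it omits the probabilistic interpretation entirely.

```python
from collections import defaultdict


def mutual_reinforcement_stems(words, iterations=20):
    """Pick one stem per word with a HITS-style iteration (illustrative only)."""
    vocab = set(words)

    # Candidate splits: cut each word at every position into a non-empty
    # prefix (candidate stem) and a suffix (candidate derivation).
    # Assumption: the empty suffix is allowed, so a word can be its own stem.
    splits = {w: [(w[:i], w[i:]) for i in range(1, len(w) + 1)] for w in vocab}

    # Bipartite graph linking each prefix to the suffixes it occurs with.
    prefix_links = defaultdict(set)
    suffix_links = defaultdict(set)
    for pairs in splits.values():
        for prefix, suffix in pairs:
            prefix_links[prefix].add(suffix)
            suffix_links[suffix].add(prefix)

    prefix_score = {p: 1.0 for p in prefix_links}
    suffix_score = {s: 1.0 for s in suffix_links}

    for _ in range(iterations):
        # A prefix is reinforced by the scores of the suffixes it combines with.
        raw = {p: sum(suffix_score[s] for s in ss) for p, ss in prefix_links.items()}
        total = sum(raw.values())
        prefix_score = {p: v / total for p, v in raw.items()}

        # A suffix is reinforced by the scores of the prefixes it attaches to.
        raw = {s: sum(prefix_score[p] for p in ps) for s, ps in suffix_links.items()}
        total = sum(raw.values())
        suffix_score = {s: v / total for s, v in raw.items()}

    # For each word, keep the prefix of the split whose two halves score best.
    return {
        w: max(pairs, key=lambda ps: prefix_score[ps[0]] * suffix_score[ps[1]])[0]
        for w, pairs in splits.items()
    }
```

Calling `mutual_reinforcement_stems(["walk", "walks", "walked", "talk", "talks"])` returns a dictionary mapping each word to one of its prefixes. The stems chosen depend on the vocabulary and on the assumed scoring rule, so the output of this sketch should be inspected rather than taken as representative of the stemmers evaluated in the paper.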