Words that are, by some measure, over- or underrepresented in the context of larger se- quences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detections, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and signi cance thereof. Here we take the global approach of annotating the suf x tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary lter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree an- notations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of signi cance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Speci cally, we show that the expected value and variance of all substrings in a given sequence of sym- bols can be computed and stored in (optimal) 2 overall worst-case, log expected time and space. The 2 time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidates over- or underrepresented words are restricted to the words that end at internal nodes of a compact suf x tree, as opposed to the 2 pos- sible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.
Efficient Detection of Unusual Words
APOSTOLICO, ALBERTO;
2000
Abstract
Words that are, by some measure, over- or underrepresented in the context of larger se- quences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detections, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and signi cance thereof. Here we take the global approach of annotating the suf x tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary lter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree an- notations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of signi cance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Speci cally, we show that the expected value and variance of all substrings in a given sequence of sym- bols can be computed and stored in (optimal) 2 overall worst-case, log expected time and space. The 2 time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidates over- or underrepresented words are restricted to the words that end at internal nodes of a compact suf x tree, as opposed to the 2 pos- sible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.