The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let () denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of ()=()/. While it is known that, for increasing n, () approaches a limit , the exact value of this limit is unknown, for any ≥2. This paper presents an upper bound to based on the exact computation of some () and a lower bound to based on combinatorial arguments on edit scripts. Statistical estimates of () are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are 0.1742≤2≤0.3693 while the obtained estimate is 2≈0.2888; for a quaternary alphabet, 0.3598≤4≤0.6318 and 4≈0.5180. These values are more accurate than those previously published.
Bounds and Estimates on the Average Edit Distance
Schimd M.;Bilardi G.
2019
Abstract
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let () denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of ()=()/. While it is known that, for increasing n, () approaches a limit , the exact value of this limit is unknown, for any ≥2. This paper presents an upper bound to based on the exact computation of some () and a lower bound to based on combinatorial arguments on edit scripts. Statistical estimates of () are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are 0.1742≤2≤0.3693 while the obtained estimate is 2≈0.2888; for a quaternary alphabet, 0.3598≤4≤0.6318 and 4≈0.5180. These values are more accurate than those previously published.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.