GRosSo: Mining statistically robust patterns from a sequence of datasets
Tonon A.;Vandin F.
2020
Abstract
Pattern mining is a fundamental data mining task with applications in several domains. In this work, we consider the scenario in which we have a sequence of datasets generated by potentially different underlying generative processes, and we study the problem of mining statistically robust patterns: patterns whose probabilities of appearing in transactions drawn from these generative processes satisfy well-defined conditions. These conditions define the patterns of interest by describing how their probabilities evolve across the datasets in the sequence; for example, the probabilities may increase, decrease, or remain stable. Because the data are stochastic, the exact set of statistically robust patterns cannot be identified by analyzing a sequence of samples (i.e., the datasets) taken from the generative processes, and one has to resort to approximations. We therefore propose GRosSo, an algorithm that finds a rigorous approximation of the statistically robust patterns that, with high probability, contains no false positives. We apply our framework to the mining of statistically robust sequential patterns. Our extensive evaluation on pseudo-artificial and real data shows that GRosSo provides high-quality approximations for this problem.