1. Detection of Sequential Outliers using a Variable Length Markov Model
- Author
-
C. Low-Kam, M. Teisseire, A. Laurent, Institut de Mathématiques et de Modélisation de Montpellier (I3M), Centre National de la Recherche Scientifique (CNRS)-Université Montpellier 2 - Sciences et Techniques (UM2)-Université de Montpellier (UM), Fouille de données environnementales (TATOO), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
- Subjects
Computer science ,Suffix tree ,Markov process ,information criterion ,suffix tree ,computer.software_genre ,Information theory ,Markov model ,01 natural sciences ,law.invention ,010104 statistics & probability ,03 medical and health sciences ,symbols.namesake ,Entropy (classical thermodynamics) ,sequential databases ,law ,[MATH.MATH-ST]Mathematics [math]/Statistics [math.ST] ,Entropy (information theory) ,sort ,Outliers ,0101 mathematics ,Entropy (energy dispersal) ,Entropy (arrow of time) ,030304 developmental biology ,information theory ,0303 health sciences ,[INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB] ,Entropy (statistical thermodynamics) ,concentration inequality ,Tree (data structure) ,Outlier ,symbols ,Data mining ,computer - Abstract
International audience; The problem of mining for outliers in sequential datasets is crucial to forward appropriate analysis of data. Therefore, many approaches for the discovery of such anomalies have been proposed. However, most of them use a sample of known typical sequences to build the model. Besides, they remain greedy in terms of memory usage. In this paper we propose an extension of one such approach, based on a Probabilistic Suffix Tree and on a measure of similarity. We add a pruning criterion which reduces the size of the tree while improving the model, and a sharp inequality for the concentration of the measure of similarity, to better sort the outliers. We prove the feasability of our approach through a set of experiments over a protein database.
- Published
- 2008