1. Fast Frequent Patterns Mining by Multiple Sampling With Tight Guarantee Under Bayesian Statistics
- Author
-
Zhongjie Zhang and Jian Huang
- Subjects
Computer science ,Sampling (statistics) ,Value (computer science) ,Computer Science Applications ,Human-Computer Interaction ,Bayesian statistics ,Set (abstract data type) ,Distribution (mathematics) ,Control and Systems Engineering ,Sample size determination ,Probability of error ,Multiple sampling ,Electrical and Electronic Engineering ,Algorithm ,Software ,Information Systems - Abstract
Sampling from large dataset is commonly used in the frequent patterns (FPs) mining. To tightly and theoretically guarantee the quality of the FPs obtained from samples, current methods theoretically stabilize the supports of all the patterns in random samples, despite only FPs do matter, so they always overestimate the sample size. We propose an algorithm called multiple sampling-based FPs mining (MSFP). The MSFP first generates the set of approximate frequent items (AFI), and uses the AFI to form the set of approximate FPs without supports ( AFP*), where it does not stabilize the value of any item's or pattern's support, but only stabilizes the relationship ≥ or < between the support and the minimum support, so the MSFP can use small samples to successively obtain the AFI and AFP*, and can successively prune the patterns not contained by the AFI and not in the AFP*. Then, the MSFP introduces the Bayesian statistics to only stabilize the values of supports of AFP*'s patterns. If a pattern's support in the original dataset is unknown, the MSFP regards it as random, and keeps updating its distribution by its approximations obtained from the samples taken in the progressive sampling, so the error probability can be bound better. Furthermore, to reduce the I/O processes in the progressive sampling, the MSFP stores a large enough random sample in memory in advance. The experiments show that the MSFP is reliable and efficient.
- Published
- 2023