A formal framework for database sampling
- Authors
David A. Bell, Jesus Bisbal, and Jane Grimson
- Subjects
Spatiotemporal database, Computer science, View, Database schema, Information Systems (Database Management), Database design, Database tuning, Database testing, Computer Science Applications, Database theory, Data mining, Software, Information Systems, Database model
- Abstract
Database sampling is commonly used in applications such as data mining and approximate query evaluation to strike a compromise between the accuracy of the results and the computational cost of producing them. The authors have recently proposed using database sampling to populate a prototype database, that is, a database used to support the development of data-intensive applications. Existing methods for constructing prototype databases typically populate the resulting database with synthetic data values. A more realistic approach is to sample an existing database so that the resulting sample satisfies a predefined set of integrity constraints. The resulting database, with domain-relevant data values and semantics, is expected to better support the software development process. This paper presents a formal study of database sampling. A denotational semantics description of database sampling is discussed first, and the paper then characterises the types of integrity constraints that must be considered during sampling. Lastly, the sampling strategy presented here is applied to improving the data quality of a (legacy) database: in this context, database sampling is used to incrementally identify the set of tuples that cause inconsistencies in the database and that should therefore be addressed by the data cleaning process.
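The abstract only sketches the central idea of sampling under integrity constraints. As a loose illustration of that idea, and not the authors' formal framework, the hypothetical Python sketch below draws a random sample from a child relation and then closes it over a foreign-key reference so that referential integrity holds within the sample; the relation names (`orders`, `customers`), the function `sample_with_referential_integrity`, and its parameters are illustrative assumptions.

```python
import random

def sample_with_referential_integrity(orders, customers, sample_size, seed=0):
    """Draw a random sample of `orders` and close it over its customer foreign keys.

    orders:    list of dicts, each with a 'customer_id' field (child relation)
    customers: dict mapping customer id -> customer tuple (parent relation)
    """
    rng = random.Random(seed)
    sampled_orders = rng.sample(orders, min(sample_size, len(orders)))
    # Closure step: pull in every referenced parent tuple so the foreign-key
    # constraint orders.customer_id -> customers.id holds within the sample.
    sampled_customers = {o["customer_id"]: customers[o["customer_id"]]
                         for o in sampled_orders}
    return sampled_orders, list(sampled_customers.values())

if __name__ == "__main__":
    customers = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Grace"}}
    orders = [{"id": i, "customer_id": c} for i, c in enumerate([1, 1, 2, 2, 1])]
    sample, parents = sample_with_referential_integrity(orders, customers, sample_size=3)
    print(sample)   # three sampled order tuples
    print(parents)  # only the customers those orders reference
```

The same closure idea extends to other inclusion dependencies; which classes of constraints require this kind of propagation during sampling is precisely what the paper characterises.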
- Published
- 2005