Measuring and Facilitating Data Repeatability in Web Science
- Source :
- Datenbank-Spektrum. 19:117-126
- Publication Year :
- 2019
- Publisher :
- Springer Science and Business Media LLC, 2019.
Abstract
- Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as “partial data repeatability”, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify which parts of the data have changed and by how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments.
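The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of SHA-256, the whitespace normalization, the comment IDs, and the function names `fingerprint`, `fingerprint_dataset`, and `compare` are all assumptions made for illustration. The idea is that only the fingerprints need to be published, and a re-scraped dataset can then be checked against them without access to the original text.

```python
from __future__ import annotations

import hashlib


def fingerprint(text: str) -> str:
    """Hash one comment. SHA-256 and whitespace normalization are assumptions."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_dataset(comments: dict[str, str]) -> dict[str, str]:
    """Map each (hypothetical) comment ID to the fingerprint of its text."""
    return {cid: fingerprint(text) for cid, text in comments.items()}


def compare(old_fp: dict[str, str], new_fp: dict[str, str]) -> dict:
    """Compare published fingerprints with those of a re-scraped dataset."""
    old_ids, new_ids = set(old_fp), set(new_fp)
    changed = {cid for cid in old_ids & new_ids if old_fp[cid] != new_fp[cid]}
    removed = old_ids - new_ids
    added = new_ids - old_ids
    # Share of the original comments that were edited or are no longer available.
    share = (len(changed) + len(removed)) / len(old_fp) if old_fp else 0.0
    return {"changed": changed, "removed": removed, "added": added,
            "changed_share": share}


# Toy data standing in for an original scrape and a later re-scrape.
original = {
    "c1": "Great article!",
    "c2": "I disagree.",
    "c3": "Deleted later.",
    "c4": "Thanks  for sharing",
}
rescraped = {
    "c1": "Great article!",
    "c2": "I strongly disagree.",
    "c4": "Thanks for sharing",
    "c5": "A new comment.",
}

report = compare(fingerprint_dataset(original), fingerprint_dataset(rescraped))
print(report)
```

In this toy example, `c2` counts as changed, `c3` as removed, and `c5` as added, while `c4` is treated as unchanged because the assumed normalization collapses whitespace before hashing.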
- Subjects :
- Computer science
Character (computing)
Process (engineering)
05 social sciences
02 engineering and technology
Repeatability
computer.software_genre
World Wide Web
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Web science
Privacy law
0509 other social sciences
050904 information & library sciences
computer
Web scraping
Details
- ISSN :
- 1610-1995 and 1618-2162
- Volume :
- 19
- Database :
- OpenAIRE
- Journal :
- Datenbank-Spektrum
- Accession number :
- edsair.doi...........1962b16ad019ac4a18a7a97b742cfab3
- Full Text :
- https://doi.org/10.1007/s13222-019-00316-9