Automatic Versioning of Time Series Datasets: a FAIR Algorithmic Approach
- Publication Year : 2022
- Publisher : Zenodo, 2022.
Abstract
- As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails keeping track of each version of a given dataset, from the original to the latest. However, the standard terms for determining and including versioning information in a dataset's metadata remain ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for automatic versioning of time series datasets, based on parameters from two dimensionality reduction techniques, namely Principal Component Analysis and Autoencoders. Specifically, we systematically detect and measure similarities (information distances) between datasets via dimensionality reduction, encode them as distinct versions, and then automatically generate provenance metadata through a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results show that the proposed version similarity metrics are robust (\(s^{(0,1)} = 1\)) to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g. converting absolute values to a percentage of a second variable) yield minimal similarity values (\(s^{(0,1)} < 0.75\)). Our code and datasets are openly available to enable reproducibility.
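The abstract describes measuring information distances between dataset versions via dimensionality reduction and then deciding whether two snapshots count as the same or different versions. The sketch below illustrates that general idea with PCA (computed through NumPy's SVD): it compares the principal axes of two dataset versions with a cosine-based score in \([0, 1]\). The function names, the synthetic data, and the scoring rule are illustrative assumptions; they are not the paper's actual \(s^{(0,1)}\) metric, whose definition is not reproduced in this record.

```python
import numpy as np

def pca_components(X, k=2):
    """Top-k principal axes of X (rows of Vt are unit vectors)."""
    Xc = X - X.mean(axis=0)  # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def version_similarity(X_a, X_b, k=2):
    """Illustrative similarity score in [0, 1]: mean absolute cosine
    similarity between corresponding principal axes of two dataset
    versions. A hypothetical stand-in for the paper's s^(0,1) metric."""
    Va = pca_components(X_a, k)
    Vb = pca_components(X_b, k)
    # Rows are unit-length, so the row-wise dot product is the cosine;
    # abs() ignores the arbitrary sign of each principal axis.
    return float(np.abs(np.sum(Va * Vb, axis=1)).mean())

# Synthetic "version 0": columns with clearly separated variances,
# so the principal axes are stable under resampling.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
# "Version 1": half the rows removed; the underlying structure is kept,
# so the score stays near 1, consistent with the robustness the
# abstract reports for row removal.
subset = base[::2]
score = version_similarity(base, subset, k=2)
```

A versioning service along the lines described in the abstract could then threshold such a score to decide whether to mint a new version identifier and emit the corresponding DCAT provenance metadata.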
- Subjects : paper-presentation
Details
- Database : OpenAIRE
- Accession number : edsair.doi.dedup.....4b0e1aea017f76e4e408b4ed00058d98
- Full Text : https://doi.org/10.5281/zenodo.7158371