Multi-level and relevance-based parallel clustering of massive data streams in smart manufacturing

Authors :: Devis Bianchini
Valeria De Antonellis
Ada Bagozi
Source :: Information Sciences. 577:805-823
Publication Year :: 2021
Publisher :: Elsevier BV, 2021.
Abstract: Parallel implementations of incremental clustering have been proposed to improve the performance of data stream processing in smart factories and to enable real-time anomaly detection, remote diagnosis, and condition-based monitoring of Cyber-Physical Systems. Incremental clustering algorithms iteratively extract and update over time clusters of data points (often denoted as micro-clusters) whose maximum number is bounded. However, controlling the cost of the computational resources used on the distributed architecture remains challenging, and it is necessary for sustainable processing of massive data streams. In this paper, we present a multi-level parallelization approach for clustering massive data streams based on a horizontal scaling platform for Big Data processing. In particular, the following levels are considered: (i) a first parallelization level is based on a multi-dimensional model with exploration facets, used to perform a first, coarse-grained partition of data streams according to a divide-and-conquer strategy; (ii) a second parallelization level is based on a buffering mechanism that splits the data stream into portions of data points on which processing is performed in parallel; (iii) a third parallelization level is defined over the set of micro-clusters that are generated and change over time. The approach is conceived for anomaly detection in smart manufacturing, where the concept of data relevance, defined in terms of distance from critical conditions of the monitored systems, is used to force stronger parallelization (and therefore higher resource usage) only when necessary, that is, when approaching critical conditions. The scalability and efficiency of the approach are evaluated using a real dataset in a smart factory scenario.
In particular, experiments demonstrated that when the maximum number of allowed micro-clusters decreases and the buffer size increases, parallelization based on buffering does not ensure good scalability. Additionally, as the number of features (that is, the complexity of the data stream) increases, parallelization based on buffering may present scalability issues. These results motivate tuning the different parallelization levels according to the approach proposed in this paper.
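
Illustrative sketch (not part of the catalog record and not the authors' implementation): the three levels named in the abstract can be pictured roughly as below. Everything here is an assumption made for illustration: the function names, the relevance formula (inverse distance to a critical point), the greedy micro-cluster update, and the use of Python threads as a stand-in for the horizontal scaling platform.

```python
import math
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def relevance(point, critical):
    # Hypothetical relevance score: approaches 1 as the point nears the critical condition.
    return 1.0 / (1.0 + math.dist(point, critical))


def partition_by_facet(records):
    # Level 1 (assumed form): coarse partition of (facet, point) records by facet value.
    parts = defaultdict(list)
    for facet, point in records:
        parts[facet].append(point)
    return dict(parts)


def split_into_buffers(points, buffer_size):
    # Level 2 (assumed form): cut one partition into fixed-size buffers.
    return [points[i:i + buffer_size] for i in range(0, len(points), buffer_size)]


def cluster_buffer(points, max_micro_clusters, radius):
    # Level 3 stand-in: greedy micro-clusters stored as [count, linear_sum].
    clusters = []
    for p in points:
        best, best_d = None, float("inf")
        for c in clusters:
            centroid = [s / c[0] for s in c[1]]
            d = math.dist(p, centroid)
            if d < best_d:
                best, best_d = c, d
        # Absorb into the nearest cluster if it is close enough or the bound is reached.
        if best is not None and (best_d <= radius or len(clusters) >= max_micro_clusters):
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], p)]
        else:
            clusters.append([1, list(p)])
    return clusters


def choose_workers(points, critical, threshold, max_workers):
    # Relevance-based tuning: use more workers only near critical conditions.
    top = max(relevance(p, critical) for p in points)
    return max_workers if top >= threshold else 1


def process(records, critical, buffer_size=2, max_micro_clusters=3,
            radius=1.0, threshold=0.3, max_workers=4):
    result = {}
    for facet, points in partition_by_facet(records).items():
        buffers = split_into_buffers(points, buffer_size)
        workers = choose_workers(points, critical, threshold, max_workers)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            per_buffer = list(pool.map(
                lambda b: cluster_buffer(b, max_micro_clusters, radius), buffers))
        result[facet] = (workers, per_buffer)
    return result
```

Under these assumptions, a partition whose points lie far from the critical condition is processed with a single worker, while a partition close to it is given the full worker budget, which mirrors the idea of paying for stronger parallelization only when necessary.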

Subjects :: Data stream
Information Systems and Management
Apache Spark
Computer science
Data stream mining
Parallel clustering
Distributed computing
Anomaly detection
Big data
Partition (database)
Computer Science Applications
Theoretical Computer Science
Set (abstract data type)
Data point
Artificial Intelligence
Control and Systems Engineering
Scalability
Cluster analysis
Software

Details

ISSN :: 0020-0255
Volume :: 577
Database :: OpenAIRE
Journal :: Information Sciences
Accession number :: edsair.doi.dedup.....20ebbd0b815e1aafe5975aba9c08578a
