1. A Distributed Execution Pipeline for Clustering Trajectories Based on a Fuzzy Similarity Relation
- Author
-
Soufiane Maguerra, Azedine Boulmakoul, Lamia Karim, and Hassan Badir
- Subjects
Apache Spark ,big data ,distributed computing ,clustering analysis ,fuzzy relational clustering approaches ,partitioning strategy ,Industrial engineering. Management engineering ,T55.4-60.8 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
The proliferation of indoor and outdoor tracking devices has led to a vast amount of spatial data. Each object can be described by several trajectories that, once analysed, can yield to significant knowledge. In particular, pattern analysis by clustering generic trajectories can give insight into objects sharing the same patterns. Still, sequential clustering approaches fail to handle large volumes of data. Hence, the necessity of distributed systems to be able to infer knowledge in a trivial time interval. In this paper, we detail an efficient, scalable and distributed execution pipeline for clustering raw trajectories. The clustering is achieved via a fuzzy similarity relation obtained by the transitive closure of a proximity relation. Moreover, the pipeline is integrated in Spark, implemented in Scala and leverages the Core and Graphx libraries making use of Resilient Distributed Datasets (RDD) and graph processing. Furthermore, a new simple, but very efficient, partitioning logic has been deployed in Spark and integrated into the execution process. The objective behind this logic is to equally distribute the load among all executors by considering the complexity of the data. In particular, resolving the load balancing issue has reduced the conventional execution time in an important manner. Evaluation and performance of the whole distributed process has been analysed by handling the Geolife project’s GPS trajectory dataset.
- Published
- 2019
- Full Text
- View/download PDF