CASS: A distributed network clustering algorithm based on structure similarity for large-scale network
- Source :
- PLoS ONE, Vol 13, Iss 10, p e0203670 (2018), PLoS ONE
- Publication Year :
- 2018
- Publisher :
- Public Library of Science (PLoS), 2018.
Abstract
- As networks grow, analyzing large-scale network data is becoming increasingly important, and network clustering algorithms are a useful tool for that analysis. Most active research on network clustering targets single-machine rather than parallel environments. However, such algorithms cannot analyze large-scale network data because of memory limits. As a solution, we propose a network clustering algorithm for large-scale network data that runs on Apache Spark, changing the paradigm of the conventional clustering algorithm so that it runs efficiently in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. Evaluating the proposed algorithm by average normalized cut, we confirmed that it can analyze diverse large-scale network datasets, including biological, co-authorship, internet topology, and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the comparison algorithms while using less memory. Furthermore, we confirm the effect of the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that the clusters found by the proposed algorithm can represent biologically meaningful functions.
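The abstract evaluates clusters by average normalized cut but does not define it here. As a minimal sketch, assuming the common definition (for each cluster, the number of edges leaving it divided by the sum of the degrees of its nodes, averaged over clusters; lower is better), it could be computed as follows. The function name `average_normalized_cut` and the input layout are illustrative assumptions, and the paper's exact formula may differ.

```python
from collections import defaultdict


def average_normalized_cut(edges, clusters):
    """Average normalized cut of a clustering (assumed definition, see lead-in).

    edges:    iterable of (u, v) pairs, one per undirected edge
    clusters: dict mapping each node to its cluster id
    """
    cut = defaultdict(int)     # edges crossing out of each cluster
    volume = defaultdict(int)  # sum of node degrees inside each cluster
    for u, v in edges:
        cu, cv = clusters[u], clusters[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu != cv:
            cut[cu] += 1
            cut[cv] += 1
    # Clusters whose nodes have no edges never appear in `volume` and are skipped.
    scores = [cut[c] / volume[c] for c in volume]
    return sum(scores) / len(scores) if scores else 0.0


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
clusters = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
score = average_normalized_cut(edges, clusters)  # each cluster: cut 1, volume 7, so 1/7
```

This single-machine version only illustrates the metric; it says nothing about how the proposed algorithm distributes the computation on Apache Spark.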
- Subjects :
- Optimization
Proteomics
0301 basic medicine
Computer and Information Sciences
Computer science
Social Sciences
lcsh:Medicine
Research and Analysis Methods
Internet topology
Biochemistry
Social Networking
Computer Communication Networks
Clustering Algorithms
03 medical and health sciences
0302 clinical medicine
Sociology
Similarity (network science)
Spark (mathematics)
Cluster Analysis
Data Mining
Cluster analysis
lcsh:Science
Selection (genetic algorithm)
Data Processing
Multidisciplinary
Computers
Applied Mathematics
Simulation and Modeling
lcsh:R
Biology and Life Sciences
Bloom filter
030104 developmental biology
Social Networks
Physical Sciences
Scalability
Protein Interaction Networks
lcsh:Q
Information Technology
Algorithm
Algorithms
Mathematics
Network Analysis
030217 neurology & neurosurgery
Research Article
Network analysis
Details
- Language :
- English
- ISSN :
- 1932-6203
- Volume :
- 13
- Issue :
- 10
- Database :
- OpenAIRE
- Journal :
- PLoS ONE
- Accession number :
- edsair.doi.dedup.....754db292fb07747ff54d3bc05b8c002d