562 results for "Complete-linkage clustering"
Search Results
2. Robust Mean-Variance Portfolio Selection with Ward and Complete Linkage Clustering Algorithm
- Author
-
Rosadi Dedi, Abdurakhman Abdurakhman, and Gubu La
- Subjects
Economics and Econometrics ,Applied Mathematics ,Statistics ,Mean variance ,Portfolio ,Complete-linkage clustering ,Selection (genetic algorithm) ,Computer Science Applications ,Mathematics - Published
- 2020
3. A study of speaker clustering for speaker attribution in large telephone conversation datasets.
- Author
-
Ghaemmaghami, Houman, Dean, David, Sridharan, Sridha, and van Leeuwen, David A.
- Subjects
- *
VOICEPRINTS , *TELEPHONE calls , *HIERARCHICAL clustering (Cluster analysis) , *AUTOMATIC speech recognition , *FACTOR analysis - Abstract
This paper proposes the task of speaker attribution as speaker diarization followed by speaker linking. The aim of attribution is to identify and label common speakers across multiple recordings. To do this, it is necessary to first carry out diarization to obtain speaker-homogeneous segments from each recording. Speaker linking can then be conducted to link common speaker identities across multiple inter-session recordings. This process can be extremely inefficient using the traditional agglomerative cluster merging and retraining commonly employed in diarization. We thus propose an attribution system using complete-linkage clustering (CLC) without model retraining. We show that on top of the efficiency gained through elimination of the retraining phase, greater accuracy is achieved by utilizing the farthest-neighbor criterion inherent to CLC for both diarization and linking. We first evaluate the use of CLC against an agglomerative clustering (AC) without retraining approach, traditional agglomerative clustering with retraining (ACR) and single-linkage clustering (SLC) for speaker linking. We show that CLC provides a relative improvement of 20%, 29% and 39% in attribution error rate (AER) over the three said approaches, respectively. We then propose a diarization system using CLC and show that it outperforms AC, ACR and SLC with relative improvements of 32%, 50% and 70% in diarization error rate (DER), respectively. In our work, we employ the cross-likelihood ratio (CLR) as the model comparison metric for clustering and investigate its robustness as a stopping criterion for attribution. [ABSTRACT FROM AUTHOR]
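The farthest-neighbor merging and stopping criterion described above can be sketched as follows. This is a minimal illustration of agglomerative complete-linkage clustering with a threshold and no model retraining; in the paper's setting the pairwise scores would be cross-likelihood ratios between speaker models, whereas here a plain toy distance matrix stands in for them.

```python
# Minimal sketch of agglomerative complete-linkage clustering (CLC) with a
# stopping threshold: the distance between two clusters is the farthest-
# neighbor (maximum) pairwise distance, and merging stops once the smallest
# such distance exceeds the threshold. No per-merge model retraining is needed.

def complete_linkage(dist, threshold):
    """Merge clusters while the smallest complete-linkage distance <= threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Farthest-neighbor criterion: max over all cross-cluster pairs.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:          # stopping criterion
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups {0, 1} and {2, 3}, far from each other.
dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0]]
print(complete_linkage(dist, threshold=2))  # → [[0, 1], [2, 3]]
```

Because every member of a cluster must be within the threshold of every member of the other cluster before a merge, chains of loosely linked recordings cannot form, which is the property the paper exploits.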
- Published
- 2016
4. Breast Cancer Risk Prediction Using Different Clustering Techniques
- Author
-
Mounita Ghosh, M. Raihan, Laboni Akter, Ferdib-Al-Islam, Md. Mohsin Sarker Raihan, and Nasif Alvi
- Subjects
Breast cancer ,Principal component analysis ,Radial basis function kernel ,Statistics ,medicine ,k-means clustering ,Unsupervised learning ,Cluster analysis ,medicine.disease ,Complete-linkage clustering ,Mathematics ,Hierarchical clustering - Abstract
Breast cancer is one of the most well-known diseases with a high death rate among women. It is a non-communicable disease seen in numerous women all over the world. With early diagnosis of this disease, the survival rate rises from 56% to over 86%. In this analysis, several unsupervised learning techniques were used together with the kernel techniques of Principal Component Analysis (PCA). K-Means and several hierarchical clustering techniques with different linkages such as ward, complete, and average were applied, and the highest accuracy of 70.91% was obtained from hierarchical clustering with average linkage. K-Means performed better in Recall and F1-score than the Ward and complete linkage clustering techniques. The Specificity, Precision, Recall, and F1-score showed satisfactory performance for average linkage, with values of 60%, 70.58%, 80%, and 75% respectively.
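The evaluation metrics reported above (specificity, precision, recall, F1-score) all derive from a binary confusion matrix; a quick sketch of how they are computed is shown below. The counts are made-up illustrative values, not the paper's data.

```python
# Sketch of the standard clustering/classification evaluation metrics from a
# binary confusion matrix: tp/fp/tn/fn are true/false positives/negatives.

def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical counts for illustration only.
p, r, s, f1 = metrics(tp=40, fp=10, tn=30, fn=20)
print(round(p, 2), round(r, 2), round(s, 2), round(f1, 2))  # → 0.8 0.67 0.75 0.73
```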
- Published
- 2021
5. An Efficient Algorithm for Complete Linkage Clustering with a Merging Threshold
- Author
-
Tapas Kumar Ballabh, Amlan Chakrabarti, and Payel Banerjee
- Subjects
Set (abstract data type) ,ComputingMethodologies_PATTERNRECOGNITION ,Triangle inequality ,Computer science ,Pairwise comparison ,Data mining ,Cluster analysis ,computer.software_genre ,Complete-linkage clustering ,computer ,Complete linkage ,Field (computer science) ,Hierarchical clustering - Abstract
In recent years, one of the serious challenges faced by experts in the field of data science is dealing with the gigantic volume of data piling up at high speed. Apart from collecting this avalanche of data, another major problem is extracting useful information from it. Clustering is a highly powerful data mining tool capable of finding hidden information in a totally unlabelled dataset. Complete Linkage Clustering is a distance-based hierarchical clustering algorithm, well known for producing highly compact clusters. Because of its high convergence time, the algorithm is unsuitable for large datasets, and hence our paper proposes a preclustering method that not only reduces the convergence time of the algorithm but also makes it suitable for partial clustering of streaming data. The proposed preclustering algorithm uses the triangle inequality to take a clustering decision without comparing a pattern with all the members of a cluster, unlike the classical Complete Linkage algorithm. The preclusters are then subjected to an efficient Complete Linkage algorithm to obtain the final set of compact clusters in a relatively shorter time than the existing variants, where the pairwise distances between all the patterns are required for the clustering process.
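The triangle-inequality shortcut described above can be sketched as follows. This is a simplified leader-style precluster, not the paper's exact algorithm: each precluster keeps a leader and a radius, and since d(p, member) <= d(p, leader) + radius for every member, the bound alone can guarantee that a new pattern is within the merging threshold of all members, with no member-by-member comparison.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def precluster(points, T):
    """Leader-style preclustering: a point joins a cluster when the triangle-
    inequality upper bound d(p, leader) + radius <= T guarantees its
    complete-linkage distance to every member is within the threshold T."""
    clusters = []   # each: {'leader': point, 'radius': float, 'members': [...]}
    for p in points:
        placed = False
        for c in clusters:
            d = dist(p, c['leader'])
            if d + c['radius'] <= T:     # bound on the farthest member
                c['members'].append(p)
                c['radius'] = max(c['radius'], d)
                placed = True
                break
        if not placed:
            clusters.append({'leader': p, 'radius': 0.0, 'members': [p]})
    return clusters

pts = [(0, 0), (0.5, 0), (10, 10), (10.4, 10)]
cs = precluster(pts, T=2.0)
print([c['members'] for c in cs])  # → [[(0, 0), (0.5, 0)], [(10, 10), (10.4, 10)]]
```

Each clustering decision costs one distance computation per precluster rather than one per member, which is where the speedup over classical complete linkage comes from.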
- Published
- 2020
6. An Improved Method Multi-View Group Recommender System (IMVGRS)
- Author
-
Mir Mohsen Pedram, Maryam Sadeghi, and Seyyed Amir Asghari
- Subjects
Information retrieval ,Group (mathematics) ,Computer science ,020206 networking & telecommunications ,Improved method ,02 engineering and technology ,Recommender system ,Complete-linkage clustering ,Complete linkage ,Singular value decomposition ,0202 electrical engineering, electronic engineering, information engineering ,Decomposition (computer science) ,020201 artificial intelligence & image processing ,Dimension (data warehouse) - Abstract
Today, one of users' problems on the web is finding the information they want within a massive amount of data. Recommender systems aid users in making decisions and choosing suitable items by personalizing content according to their interests. In the past, most research was done on individual recommender systems, but attention has now turned to group recommender systems. For this reason, this paper seeks to improve a group recommender system. In this article, an Improved Multi-View Group Recommender System (IMVGRS) is proposed. This multi-view group recommender system makes recommendations to a group of users from two standpoints: user preferences (ratings) and social connection (trust). First, the dimension of the data is reduced with the Singular-Value Decomposition (SVD) method. Second, the system is clustered with the complete linkage method. Experimental results show the effectiveness of the proposed improved method.
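The first stage of the pipeline described above, truncated SVD followed by distance computations for complete-linkage clustering, can be sketched as follows. The rating matrix and the rank k are illustrative assumptions, not the paper's data.

```python
import numpy as np

# Sketch of the two-stage pipeline: reduce a user-item rating matrix with a
# truncated SVD, then measure complete-linkage (farthest-pair) distances
# between user groups in the reduced space.

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
users = U[:, :k] * s[:k]              # each row: one user in the reduced space

def clc_dist(A, B):
    """Complete-linkage distance between two groups of rows."""
    return max(np.linalg.norm(a - b) for a in A for b in B)

within = clc_dist(users[:2], users[:2])    # users 0 and 1 rate alike
between = clc_dist(users[:2], users[2:])   # groups {0, 1} vs {2, 3}
print(within < between)  # → True
```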
- Published
- 2020
7. Hierarchical Clustering Given Confidence Intervals of Metric Distances
- Author
-
Alejandro Ribeiro and Weiyu Huang
- Subjects
Social and Information Networks (cs.SI) ,FOS: Computer and information sciences ,Discrete mathematics ,Physics - Physics and Society ,Fuzzy clustering ,Theoretical computer science ,Single-linkage clustering ,FOS: Physical sciences ,Computer Science - Social and Information Networks ,020206 networking & telecommunications ,Physics and Society (physics.soc-ph) ,02 engineering and technology ,Complete-linkage clustering ,Hierarchical clustering ,Metric space ,020204 information systems ,Signal Processing ,Metric (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Cluster analysis ,k-medians clustering ,Mathematics - Abstract
This paper considers metric spaces in which the exact dissimilarities between pairs of points are unknown but known to belong to some interval. The goal is to study methods for the determination of hierarchical clusters, i.e., a family of nested partitions indexed by a resolution parameter, induced from the given distance intervals of the dissimilarities. Our construction of hierarchical clustering methods is based on defining admissible methods to be those methods that satisfy the axioms of value—nodes in a metric space with two nodes are clustered together at the convex combination of the upper and lower bounds determined by a parameter—and transformation—when both distance bounds are reduced, the output may become more clustered but not less. Two admissible methods are constructed and are shown to provide universal bounds in the space of admissible methods. Practical implications are explored by clustering moving points via snapshots and by clustering coauthorship networks representing collaboration between researchers from different communities. The proposed clustering methods succeed in identifying underlying hierarchical clustering structures via the maximum and minimum distances in all snapshots, as well as in differentiating collaboration patterns in journal publications between different research communities based on bounds of network distances.
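A minimal sketch of the idea is given below, under simplifying assumptions of my own: the parameterized dissimilarity is the convex combination of the lower and upper bound matrices, and clusters at a given resolution are read off as connected components of the graph linking pairs within that resolution (single linkage). The paper's admissible methods are defined axiomatically and need not coincide with this construction; the matrices are toy values.

```python
# Given lower/upper bounds (L, U) on each pairwise dissimilarity, form
# D = (1 - t) * L + t * U for t in [0, 1], then cluster at resolution delta
# as connected components of the graph with edges where D <= delta.

def clusters_at(L, U, t, delta):
    n = len(L)
    D = [[(1 - t) * L[i][j] + t * U[i][j] for j in range(n)] for i in range(n)]
    parent = list(range(n))            # union-find for connected components
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= delta:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

L = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
U = [[0, 3, 6], [3, 0, 6], [6, 6, 0]]
# With t = 0 (optimistic bounds) nodes 0 and 1 merge at resolution 2;
# with t = 1 (pessimistic bounds) they do not.
print(clusters_at(L, U, t=0, delta=2))  # → [[0, 1], [2]]
print(clusters_at(L, U, t=1, delta=2))  # → [[0], [1], [2]]
```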
- Published
- 2018
8. Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy
- Author
-
Yu Fang, Liu Yaohui, and Ma Zhengming
- Subjects
DBSCAN ,Clustering high-dimensional data ,Information Systems and Management ,Fuzzy clustering ,Computer science ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Management Information Systems ,k-nearest neighbors algorithm ,Artificial Intelligence ,CURE data clustering algorithm ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,Constrained clustering ,020206 networking & telecommunications ,Graph ,ComputingMethodologies_PATTERNRECOGNITION ,Data stream clustering ,Canopy clustering algorithm ,020201 artificial intelligence & image processing ,Data mining ,computer ,Algorithm ,Software - Abstract
Recently, a density peaks based clustering algorithm (dubbed DPC) was proposed to group data by setting up a decision graph and quickly finding cluster centers from it. It is simple but efficient, since it is non-iterative and needs few parameters. However, improper selection of its cutoff-distance parameter dc leads to the wrong selection of initial cluster centers, which DPC cannot correct in the subsequent assignment process. Furthermore, in some cases, even when a proper value of dc is set, initial cluster centers are still difficult to select from the decision graph. To overcome these defects, an adaptive clustering algorithm (named ADPC-KNN) is proposed in this paper. We introduce the idea of K-nearest neighbors to compute the global parameter dc and the local density ρi of each point, apply a new approach to select initial cluster centers automatically, and finally aggregate clusters if they are density-reachable. ADPC-KNN requires only one parameter and the clustering is automatic. Experiments on synthetic and real-world data show that the proposed clustering algorithm often outperforms DBSCAN, DPC, K-Means++, Expectation Maximization (EM) and single-link.
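The K-nearest-neighbor density idea can be sketched as below. The exact formulas in ADPC-KNN may differ; as an illustrative assumption, the local density of a point is taken here as the inverse of the mean distance to its K nearest neighbors, so points in tight neighborhoods get high density and isolated points get low density.

```python
import math

def knn_density(points, K):
    """Local density of each point = 1 / (mean distance to its K nearest
    neighbors). An illustrative stand-in for the kNN-based density in the
    abstract, not the paper's exact definition."""
    dens = []
    for i, p in enumerate(points):
        ds = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dens.append(1.0 / (sum(ds[:K]) / K))
    return dens

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]   # one tight group plus an outlier
d = knn_density(pts, K=2)
print(d.index(min(d)))  # → 3  (the isolated point has the lowest density)
```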
- Published
- 2017
9. Cluster evolution analysis: Identification and detection of similar clusters and migration patterns
- Author
-
Roy Gelbard and Roni Ramon-Gonen
- Subjects
Computer science ,05 social sciences ,General Engineering ,02 engineering and technology ,computer.software_genre ,Stability (probability) ,Complete-linkage clustering ,Computer Science Applications ,Artificial Intelligence ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,Cluster (physics) ,050211 marketing ,020201 artificial intelligence & image processing ,Point (geometry) ,Data mining ,Cluster analysis ,computer - Abstract
- A model for temporal cluster analysis that reflects behavior patterns over time.
- Detecting migration between clusters, and the appearance and disappearance of clusters.
- Two visual tools were developed to display the entire data in a single graph.
- The model's functioning is illustrated in the complex setting of corporate bonds.

Cluster analysis often addresses a specific point in time, ignoring previous cluster analysis products. The present study proposes a model entitled Cluster Evolution Analysis (CEA) that addresses three phenomena likely to occur over time: (1) changes in the number of clusters; (2) changes in cluster characteristics; (3) between-cluster migration of objects. To achieve this goal, two new techniques are implemented: to find similarities between clusters at different points in time, we used the moving average of cluster centroids technique, and to detect prominent migration patterns we used the clustering of clusters technique. The research introduces two new visual tools displaying all the clusters over the entire time period under study in a single graph. The model was tested on five-year trade data of corporate bonds (2010–2014). The results obtained by the CEA model were checked and validated against the bond rating report issued periodically by the local bond rating company. The results proved the model capable of identifying repeated clusters at various points in time, and of detecting patterns that predict prospective loss of value, as well as patterns that indicate stability and preservation of value over time.
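The centroid-matching idea can be sketched as follows: a cluster at the current time step is matched to the earlier cluster whose smoothed (moving-average) centroid is nearest. The smoothing constant, coordinates and function names are illustrative assumptions, not the paper's data or exact method.

```python
def smooth(history, alpha=0.5):
    """Exponential moving average over a centroid's history (oldest first)."""
    ema = history[0]
    for c in history[1:]:
        ema = tuple(alpha * x + (1 - alpha) * e for x, e in zip(c, ema))
    return ema

def match_clusters(prev_centroids, new_centroids):
    """For each new centroid, return the index of the nearest previous one."""
    matches = []
    for c in new_centroids:
        dists = [sum((a - b) ** 2 for a, b in zip(c, p)) for p in prev_centroids]
        matches.append(dists.index(min(dists)))
    return matches

# Two clusters tracked over two snapshots, then matched at the third.
prev = [smooth([(0, 0), (0.4, 0)]), smooth([(5, 5), (5.2, 5)])]
print(match_clusters(prev, [(0.3, 0), (5.1, 5.1)]))  # → [0, 1]
```

Migration of objects between matched clusters can then be read off by comparing cluster memberships across the matched pairs.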
- Published
- 2017
10. Natural neighbor-based clustering algorithm with local representatives
- Author
-
Lijun Yang, Jinlong Huang, Dongdong Cheng, Quanwang Wu, and Qingsheng Zhu
- Subjects
Clustering high-dimensional data ,Information Systems and Management ,Fuzzy clustering ,Computer science ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Management Information Systems ,Artificial Intelligence ,CURE data clustering algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,business.industry ,Constrained clustering ,Pattern recognition ,Manifold ,Data set ,Data stream clustering ,Canopy clustering algorithm ,020201 artificial intelligence & image processing ,Artificial intelligence ,Data mining ,business ,computer ,Software - Abstract
Clustering by identifying cluster centers is important for detecting patterns in a data set. However, many center-based clustering algorithms cannot process data sets containing non-spherical clusters. In this paper, we propose a novel clustering algorithm called NaNLORE, based on natural neighbors and local representatives. Natural neighbor is a new neighbor concept, introduced here to compute local density and find local representatives, which are points with locally maximal density. We first find local representatives and then select cluster centers from among them. The density-adaptive distance is introduced to measure the distance between local representatives, which helps solve the problem of clustering data sets with complex manifold structure. Cluster centers are characterized by higher density than their neighbors and by a relatively large density-adaptive distance from any local representative of higher density. In experiments, we compare NaNLORE with existing algorithms on synthetic and real data sets. Results show that NaNLORE performs better than existing algorithms, especially on clustering non-spherical and manifold data.
- Published
- 2017
11. Fat node leading tree for data stream clustering with density peaks
- Author
-
Ji Xu, Tianrui Li, Guoyin Wang, Weihui Deng, and Guanglei Gou
- Subjects
Data stream ,DBSCAN ,Clustering high-dimensional data ,Information Systems and Management ,Fuzzy clustering ,Computer science ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Management Information Systems ,Biclustering ,Artificial Intelligence ,CURE data clustering algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,Determining the number of clusters in a data set ,Data stream clustering ,Canopy clustering algorithm ,FLAME clustering ,Affinity propagation ,020201 artificial intelligence & image processing ,Data mining ,Algorithm ,computer ,Software - Abstract
Detecting clusters of arbitrary shape and constantly delivering results for newly arrived items are two critical challenges in the study of data stream clustering. However, existing clustering methods cannot deal with these two problems simultaneously. In this paper, we employ the density peaks based clustering (DPClust) algorithm to construct a leading tree (LT) and further transform it into a fat node leading tree (FNLT) in a granular computing way. FNLT is a novel interpretable synopsis of the current state of the data stream for clustering. New incoming data is blended into the evolving FNLT structure quickly, so the clustering result for the incoming data can be delivered on the fly. During the interval between the delivery of the clustering results and the arrival of new data, the FNLT with blended data is granulated into a new FNLT with a constant number of fat nodes. The FNLT of the current data stream is maintained in real time by this Blending-Granulating-Fading mechanism. At the same time, change points are detected using the partial order relation between each pair of cluster centers and martingale theory. Compared to several state-of-the-art clustering methods, the presented model shows promising accuracy and efficiency.
- Published
- 2017
12. Enhancing point symmetry-based distance for data clustering
- Author
-
Sriparna Saha
- Subjects
Mathematical optimization ,Fuzzy clustering ,Correlation clustering ,Single-linkage clustering ,02 engineering and technology ,Complete-linkage clustering ,Theoretical Computer Science ,Determining the number of clusters in a data set ,CURE data clustering algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Geometry and Topology ,Cluster analysis ,Algorithm ,Software ,k-medians clustering ,Mathematics - Abstract
In this paper, a new point symmetry-based similarity measure is first proposed which satisfies the closure and symmetry properties of a distance function. The desirable properties of the new distance are explained in detail. Thereafter, a new clustering algorithm based on the search capability of genetic algorithms is developed, in which the newly developed point symmetry-based distance is used for cluster assignment. Points are allocated to clusters in such a way that the closure property is satisfied. The proposed genetic algorithm with point symmetry distance (GAnPS) clustering algorithm can determine symmetrically shaped clusters of any size or convexity. The effectiveness of the proposed GAnPS clustering technique in identifying the proper partitioning is shown on twenty-one data sets with various characteristics. The performance of GAnPS is compared with an existing symmetry-based genetic clustering technique, GAPS, and with three popular, well-known clustering techniques: K-means, expectation maximization and the average linkage algorithm. One part of the paper shows the utility of the proposed clustering technique for partitioning a remote sensing satellite image. The last part of the paper develops some automatic clustering techniques using the newly proposed symmetry-based distance.
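A point-symmetry distance in the spirit described above can be sketched as follows; the paper's exact definition may differ. The reflection of x about a candidate center c is x* = 2c − x; if the cluster really is symmetric about c, some data point lies near x*, so the distance from x* to its nearest neighbors is small.

```python
import math

def ps_distance(x, c, points, knear=2):
    """Illustrative point-symmetry distance of x w.r.t. candidate center c:
    mean distance from the reflected point x* = 2c - x to its knear nearest
    data points, scaled by the Euclidean distance from x to c."""
    star = tuple(2 * ci - xi for xi, ci in zip(x, c))
    ds = sorted(math.dist(star, p) for p in points)
    sym = sum(ds[:knear]) / knear          # symmetry term
    return sym * math.dist(x, c)           # scaled by Euclidean distance

# A ring-like set symmetric about the origin: the true center scores lower.
pts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(ps_distance((1, 0), (0, 0), pts) < ps_distance((1, 0), (2, 2), pts))  # → True
```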
- Published
- 2017
13. Clustering Efficiency Comparison of Outliers Data in Data Mining
- Author
-
Natthawan Phonchan
- Subjects
Chebyshev distance ,complete-linkage clustering ,Manhattan distance ,single-linkage clustering ,Euclidean distance ,k-mean clustering ,average-linkage clustering ,outlier - Abstract
Thai Journal of Science and Technology, 9, 5, 589-602
- Published
- 2020
14. Efficient Sequential and Parallel Algorithms for Incremental Record Linkage Using Complete Linkage Clustering
- Author
-
Abdullah Baihan and Sanguthevar Rajasekaran
- Subjects
Linkage (software) ,Speedup ,Computer science ,Chaining ,Parallel algorithm ,Data mining ,Cluster analysis ,computer.software_genre ,Complete-linkage clustering ,computer ,Record linkage ,Complete linkage - Abstract
In the biomedical domain, record linkage is considered a crucial problem. When the number of records is very large, existing algorithms for record linkage take too much time. Often, we have to link a small set of new records with a large set of old records. This can be done by putting the old and new records together and performing a linkage on all the records; clearly, this calls for an enormous amount of time. An alternative is to develop algorithms that perform linkage incrementally. We refer to any such algorithm as an Incremental Record Linkage (IRL) algorithm. In this paper we present an efficient IRL algorithm. In addition to taking large amounts of time, existing algorithms may also suffer from a chaining problem and hence introduce errors in linking. As has been observed in the literature, this chaining problem can be solved by performing clustering under complete linkage. The IRL algorithm we present in this paper employs complete linkage and is called the Incremental Record Linking Algorithm using Complete Linkage (IRLA-CL). We propose sequential and parallel versions of this algorithm. IRLA-CL can handle any number of datasets; in contrast, many existing algorithms can only link two datasets at a time. Our algorithm outperforms previous algorithms and offers a state-of-the-art solution to the IRL problem. Our algorithms have been tested on millions of synthetic and real records; they outperform the best-known RLA-CL algorithm when the number of new records is up to around 20% of the total number of old records, and achieve a very nearly linear speedup in parallel.
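The incremental complete-linkage rule can be sketched as follows: a new record joins an existing cluster only if it is similar enough to every member, which is precisely what prevents the chaining problem of single linkage. The similarity function and threshold here are illustrative assumptions, not the paper's.

```python
def similar(a, b, threshold=1):
    """Toy record similarity: two records match if their fields differ in at
    most `threshold` positions (a stand-in for a real comparison function)."""
    return sum(x != y for x, y in zip(a, b)) <= threshold

def incremental_link(clusters, new_records):
    """Link new records against existing clusters under complete linkage:
    a record joins only if it matches EVERY member of the cluster."""
    for rec in new_records:
        for cluster in clusters:
            if all(similar(rec, member) for member in cluster):  # complete linkage
                cluster.append(rec)
                break
        else:
            clusters.append([rec])        # no cluster accepts it: start a new one
    return clusters

old = [[("john", "smith", "1970")], [("jane", "doe", "1980")]]
new = [("john", "smith", "1971"), ("jon", "smyth", "1975")]
print(incremental_link(old, new))
```

Only the new records are compared against cluster members, so the old-versus-old comparisons never need to be redone, which is the point of the incremental formulation.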
- Published
- 2019
15. Border-Peeling Clustering
- Author
-
Nadav Bar, Hadar Averbuch-Elor, and Daniel Cohen-Or
- Subjects
FOS: Computer and information sciences ,DBSCAN ,Computer Vision and Pattern Recognition (cs.CV) ,Applied Mathematics ,Single-linkage clustering ,Correlation clustering ,Computer Science - Computer Vision and Pattern Recognition ,OPTICS algorithm ,02 engineering and technology ,Complete-linkage clustering ,Convolutional neural network ,Computational Theory and Mathematics ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,Cluster (physics) ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Cluster analysis ,Algorithm ,Software ,Mathematics - Abstract
In this paper, we present a novel non-parametric clustering technique. Our technique is based on the notion that each latent cluster is comprised of layers that surround its core, where the external layers, or border points, implicitly separate the clusters. Unlike previous techniques, such as DBSCAN, where the cores of the clusters are defined directly by their densities, here the latent cores are revealed by a progressive peeling of the border points. Analyzing the density of the local neighborhoods allows identifying the border points and associating them with points of inner layers. We show that the peeling process adapts to the local densities and characteristics to successfully separate adjacent clusters (of possibly different densities). We extensively tested our technique on large sets of labeled data, including high-dimensional datasets of deep features that were trained by a convolutional neural network. We show that our technique is competitive with other state-of-the-art non-parametric methods using a fixed set of parameters throughout the experiments.
- Published
- 2019
16. APLIKASI DAERAH RAWAN PENYAKIT MENGGUNAKAN ALGORITMA COMPLETE-LINKAGE CLUSTERING (An Application for Disease-Prone Areas Using the Complete-Linkage Clustering Algorithm)
- Author
-
Uun Amalia Ramadhani, Natalis Ransi, and Jumadil Nangi
- Subjects
Complete-Linkage Clustering ,Diarrhea ,Tuberculosis (TB) ,Health Service ,Dengue Fever - Abstract
The Kendari City Health Office is an agency whose tasks include controlling the spread of particular diseases. To carry out its duties and functions, the Health Office ideally must be able to divide areas into those that are and are not prone to disease. To facilitate this division, this study builds an application implemented with the complete-linkage clustering method, forming clusters from data on the number of disease cases in 10 sub-districts of Kendari City from 2012 to 2016. The identified diseases are DHF (dengue fever), diarrhea, and TB. Complete-linkage clustering is a cluster analysis method that builds a hierarchy of data groups: it first groups the two or more objects with the closest similarity, then proceeds to the object with the next-closest similarity. The results showed that the areas prone to DHF are the Kadia and Wua-Wua sub-districts with an index value of 47.80; the area prone to diarrhea is the Puuwatu sub-district with an index value of 1181.40; and the area susceptible to TB is the West Kendari sub-district with an index value of 92.40. The system can group and map disease-prone areas in the city of Kendari using complete-linkage clustering.
- Published
- 2019
17. Preventive Maintenance Operations Scheduling Based on Eigenvalue and Clustering Methods
- Author
-
Abdelhakim Abdelhadi
- Subjects
Job shop scheduling ,Computer science ,Scheduling (production processes) ,Manufacturing operations ,Incidence matrix ,Data mining ,Cluster analysis ,computer.software_genre ,Preventive maintenance ,computer ,Complete-linkage clustering ,Hierarchical clustering - Abstract
Maintenance is a vital part of any manufacturing operation, especially under today's intense global competitive pressure, where companies look for any source of advantage. This research paper introduces a new approach to grouping maintainable machines into virtual cells in order to conduct the required maintenance tasks. After developing a machine-failure incidence matrix, the proposed approach works in two stages. In the first stage, an eigenvector is used to develop a similarity matrix that identifies the relative weight of the relation between failures and machines. In the second stage, agglomerative hierarchical methods are applied to the similarity matrix: a well-known clustering algorithm, complete linkage clustering, is used to develop machine cells and assign failures to the most suitable machine cells. The proposed mathematical approach gives the designer the flexibility to select the number of cells and to find the level of similarity between machine cells or failure types.
- Published
- 2019
18. Water Age Clustering for Water Distribution Systems
- Author
-
Avi Ostfeld and Elad Salomons
- Subjects
Mathematical optimization ,010504 meteorology & atmospheric sciences ,Node (networking) ,Computation ,Single-linkage clustering ,General Medicine ,010501 environmental sciences ,01 natural sciences ,Complete-linkage clustering ,Numbering ,Cluster (physics) ,Cluster analysis ,Algorithm ,k-medians clustering ,0105 earth and related environmental sciences ,Mathematics - Abstract
This work presents an algorithm for clustering water distribution systems by water age. The objective is to cluster a distribution system into water age sub-zones such that water age variability is minimized within each cluster. The algorithm stages are: (1) water age computation for each system node; (2) kick-off with a number of clusters equal to the number of nodes (i.e., each node initially acts as a cluster); (3) search for the two connected (by a link) clusters which have the smallest absolute water age difference, combine them into a single cluster, and characterize the merged cluster's water age as the weighted arithmetic mean of the two clusters; and (4) repeat step 3 until all nodes are lumped into a single cluster (i.e., the entire water distribution system). The algorithm thus spans all possible clusterings, from the total number of system nodes down to one cluster holding the entire system layout. The model is demonstrated, through a cluster-number trade-off, on a mid-size water distribution system.
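Steps 2 and 3 above can be sketched as the following greedy merge, stopped here at a chosen cluster count for brevity (the full algorithm continues until one cluster remains). The ages and link topology are toy values, not the paper's network.

```python
def age_clustering(ages, links, stop_at):
    """Repeatedly merge the two link-connected clusters with the smallest
    absolute water-age difference; the merged cluster's age is the
    size-weighted arithmetic mean of the two."""
    clusters = [{'nodes': [i], 'age': a} for i, a in enumerate(ages)]
    edges = set(links)
    def connected(a, b):
        return any((i, j) in edges or (j, i) in edges
                   for i in clusters[a]['nodes'] for j in clusters[b]['nodes'])
    while len(clusters) > stop_at:
        best = min((abs(clusters[a]['age'] - clusters[b]['age']), a, b)
                   for a in range(len(clusters))
                   for b in range(a + 1, len(clusters)) if connected(a, b))
        _, a, b = best
        na, nb = len(clusters[a]['nodes']), len(clusters[b]['nodes'])
        clusters[a]['age'] = (na * clusters[a]['age'] + nb * clusters[b]['age']) / (na + nb)
        clusters[a]['nodes'] += clusters[b]['nodes']
        del clusters[b]
    return clusters

# A small line network: ages 1h, 1.2h, 5h, 5.1h along links 0-1-2-3.
result = age_clustering([1.0, 1.2, 5.0, 5.1], [(0, 1), (1, 2), (2, 3)], stop_at=2)
print([c['nodes'] for c in result])  # → [[0, 1], [2, 3]]
```

The connectivity check ensures that only hydraulically linked sub-zones merge, so each cluster remains a contiguous region of the network.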
- Published
- 2017
19. DenPEHC: Density peak based efficient hierarchical clustering
- Author
-
Ji Xu, Guoyin Wang, and Weihui Deng
- Subjects
0209 industrial biotechnology ,Information Systems and Management ,Fuzzy clustering ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Computer Science Applications ,Theoretical Computer Science ,Hierarchical clustering ,ComputingMethodologies_PATTERNRECOGNITION ,020901 industrial engineering & automation ,Artificial Intelligence ,Control and Systems Engineering ,CURE data clustering algorithm ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,Cluster analysis ,computer ,Algorithm ,Software ,k-medians clustering ,Mathematics - Abstract
Existing hierarchical clustering algorithms involve a flat clustering component and an additional agglomerative or divisive procedure. This paper presents a density peak based hierarchical clustering method (DenPEHC), which directly generates clusters on each possible clustering layer, and introduces a grid granulation framework to enable DenPEHC to cluster large-scale and high-dimensional (LSHD) datasets. This study consists of three parts: (1) utilizing the distribution of the parameter γ, defined as the product of the local density ρ and the minimal distance δ to data points of higher density in "clustering by fast search and find of density peaks" (DPClust), together with a linear fitting approach, to select clustering centers, with the clustering hierarchy decided by finding the "stairs" in the γ curve; (2) analyzing the leading tree (in which each node except the root is led by its parent to join the same cluster) as an intermediate result of DPClust, and constructing the clustering hierarchy efficiently based on the tree; and (3) designing a framework to enable DenPEHC to cluster LSHD datasets when a large number of attributes can be grouped by their semantics. The proposed method builds the clustering hierarchy by simply disconnecting the center points from their parents, with a linear computational complexity O(m), where m is the number of clusters. Experiments on synthetic and real datasets show that the proposed method has promising efficiency, accuracy and robustness compared to state-of-the-art methods.
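The quantity γ = ρ · δ can be sketched as below: ρ counts neighbors within the cutoff distance dc, δ is the distance to the nearest point of higher density, and points with large γ stand out as candidate centers. The data, dc, and the simple cutoff-count density are illustrative assumptions (DPClust variants also use kernel densities).

```python
import math

def gamma_scores(points, dc):
    """gamma_i = rho_i * delta_i, where rho_i counts neighbors within dc and
    delta_i is the distance to the nearest higher-density point (for the
    globally densest points, the maximum distance instead)."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))  # density-peak case
    return [r * dl for r, dl in zip(rho, delta)]

# Two dense groups plus one isolated point.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (9, 0)]
g = gamma_scores(pts, dc=0.5)
print(g[6])  # → 0.0  (the isolated point is never selected as a center)
```

Ranking the γ values and looking for a large gap ("stair") in the sorted curve is how the number of layers and centers is chosen in the method above.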
- Published
- 2016
20. Effectively clustering by finding density backbone based-on kNN
- Author
-
Xiaoyun Chen, Longjie Li, Lina Pan, Bo Wang, Mei Chen, and Jianjun Cheng
- Subjects
Computer science ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Facial recognition system ,Variable (computer science) ,Artificial Intelligence ,020204 information systems ,Face (geometry) ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Cluster (physics) ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Data mining ,Cluster analysis ,computer ,Software - Abstract
Clustering plays an important role in discovering underlying patterns in data according to the similarities among points. Many advanced algorithms struggle when cluster densities, shapes and sizes vary. In this paper, we propose a simple but effective clustering algorithm, CLUB. First, CLUB finds initial clusters based on mutual k nearest neighbours. Next, taking the initial clusters as input, it identifies the density backbones of the clusters based on k nearest neighbours. Then, it yields final clusters by assigning each unlabelled point to the cluster to which its nearest higher-density neighbour belongs. To comprehensively demonstrate the performance of CLUB, we benchmark it against six baselines, including three classical and three state-of-the-art methods, on nine two-dimensional datasets of various sizes containing clusters with various shapes and densities, as well as on seven widely used multi-dimensional datasets. In addition, we use the Olivetti Face dataset to illustrate the effectiveness of our method on face recognition. Experimental results indicate that CLUB outperforms the six compared algorithms in most cases. Highlights:
- CLUB can easily find clusters with various densities, shapes and sizes.
- A new density computing method is presented.
- A novel cluster-backbone identification method is proposed.
- Comprehensive experiments are performed to verify the performance of CLUB.
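The mutual k-nearest-neighbour relation that seeds the initial clusters is easy to illustrate. This is a sketch on toy data; `mutual_knn_pairs` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def mutual_knn_pairs(X, k):
    """Return pairs (i, j) that are MUTUAL k-nearest neighbours: j is among
    i's k nearest points AND i is among j's k nearest points. CLUB-style
    methods grow initial clusters from such pairs."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbour
    knn = np.argsort(D, axis=1)[:, :k]        # indices of the k nearest per point
    neigh = [set(row) for row in knn]
    return {(i, j) for i in range(len(X)) for j in neigh[i]
            if i < j and i in neigh[j]}

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
pairs = mutual_knn_pairs(X, k=1)
```

With `k=1` each point's single nearest neighbour is its blob-mate, so the two blobs yield exactly two mutual pairs.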
- Published
- 2016
- Full Text
- View/download PDF
21. A new index for clustering validation with overlapped clusters
- Author
-
Diego H. Milone, Georgina Stegmayer, and D.N. Campo
- Subjects
EXTERNAL VALIDATION ,Fuzzy clustering ,Single-linkage clustering ,Correlation clustering ,INGENIERÍAS Y TECNOLOGÍAS ,02 engineering and technology ,computer.software_genre ,CLUSTER PERTURBATION ,Complete-linkage clustering ,VALIDATION INDEX ,Similarity (network science) ,Artificial Intelligence ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,Ingeniería Eléctrica, Ingeniería Electrónica e Ingeniería de la Información ,Mathematics ,Ingeniería de Sistemas y Comunicaciones ,General Engineering ,Probabilistic logic ,OVERLAPPED CLUSTERS ,Computer Science Applications ,Data analysis ,020201 artificial intelligence & image processing ,Data mining ,computer - Abstract
External validation indexes allow similarities between two clustering solutions to be quantified. With classical external indexes, it is possible to quantify how similar two disjoint clustering solutions are, where each object can belong to only a single cluster. In practical applications, however, it is common for an object to have more than one label and thus belong to overlapped clusters; for example, subjects may belong to multiple communities in social networks. In this study, we propose a new index based on an intuitive probabilistic approach that is applicable to overlapped clusters. Given the recent remarkable increase in the analysis of data with naturally overlapped clusters, this new index allows clustering algorithms to be compared correctly. After presenting the new index, experiments with artificial and real datasets are shown and analyzed. Results over a real social network are also presented and discussed. The results indicate that the new index can correctly measure the similarity between two partitions of the dataset when there are different levels of overlap in the analyzed clusters.
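To see the kind of comparison at stake, here is a pair-counting sketch: a Rand-style co-membership agreement that tolerates objects with multiple labels. This is an illustration only, not the probabilistic index proposed in the paper:

```python
from itertools import combinations

def comembership_agreement(clustering_a, clustering_b, n):
    """Fraction of object pairs on which two (possibly overlapping) clusterings
    agree about co-membership. Each clustering is a list of sets of object ids
    in 0..n-1; an object may appear in several sets."""
    def co(clustering):
        # all unordered pairs that share at least one cluster
        return {frozenset(p) for c in clustering
                for p in combinations(sorted(c), 2)}
    ca, cb = co(clustering_a), co(clustering_b)
    total = n * (n - 1) // 2
    agree = total - len(ca ^ cb)     # pairs on which the solutions disagree
    return agree / total

a = [{0, 1, 2}, {2, 3}]              # object 2 belongs to two clusters
b = [{0, 1, 2}, {2, 3}]
```

Identical overlapped solutions score 1.0, and the score drops as co-membership decisions diverge.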
- Published
- 2016
- Full Text
- View/download PDF
22. A study of speaker clustering for speaker attribution in large telephone conversation datasets
- Author
-
Houman Ghaemmaghami, Sridha Sridharan, David A. van Leeuwen, and David Dean
- Subjects
business.industry ,Computer science ,Speech recognition ,media_common.quotation_subject ,020206 networking & telecommunications ,Pattern recognition ,02 engineering and technology ,Complete-linkage clustering ,Theoretical Computer Science ,Hierarchical clustering ,Human-Computer Interaction ,Speaker diarisation ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Robustness (computer science) ,Fundamental attribution error ,0202 electrical engineering, electronic engineering, information engineering ,Conversation ,Artificial intelligence ,0305 other medical science ,business ,Cluster analysis ,Attribution ,Software ,media_common - Abstract
- Large-dataset speaker clustering is more efficient using linkage clustering (O(n log n)).
- The need for cluster merging and retraining is eliminated through linkage clustering.
- Complete-linkage speaker clustering outperforms common retraining-based clustering.
- Complete linkage with the cross-likelihood ratio gives a robust stopping criterion.
- The robustness of the clustering stopping criterion is evaluated on varying datasets.
This paper proposes the task of speaker attribution as speaker diarization followed by speaker linking. The aim of attribution is to identify and label common speakers across multiple recordings. To do this, it is necessary to first carry out diarization to obtain speaker-homogeneous segments from each recording. Speaker linking can then be conducted to link common speaker identities across multiple inter-session recordings. This process can be extremely inefficient using the traditional agglomerative cluster merging and retraining commonly employed in diarization. We thus propose an attribution system using complete-linkage clustering (CLC) without model retraining. We show that on top of the efficiency gained through elimination of the retraining phase, greater accuracy is achieved by utilizing the farthest-neighbor criterion inherent to CLC for both diarization and linking. We first evaluate the use of CLC against an agglomerative clustering (AC) without retraining approach, traditional agglomerative clustering with retraining (ACR) and single-linkage clustering (SLC) for speaker linking. We show that CLC provides a relative improvement of 20%, 29% and 39% in attribution error rate (AER) over the three said approaches, respectively. We then propose a diarization system using CLC and show that it outperforms AC, ACR and SLC with relative improvements of 32%, 50% and 70% in diarization error rate (DER), respectively. In our work, we employ the cross-likelihood ratio (CLR) as the model comparison metric for clustering and investigate its robustness as a stopping criterion for attribution.
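The farthest-neighbour merging plus a global stopping threshold can be reproduced with standard tooling. In the sketch below, toy Euclidean distances stand in for the paper's CLR scores, and the threshold 1.0 is an arbitrary assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Complete-linkage clustering over a precomputed pairwise distance matrix,
# stopped by a global distance threshold (the farthest-neighbour criterion:
# two clusters merge only if ALL cross-pairs are within the threshold).
points = np.array([[0.0], [0.2], [5.0], [5.1], [9.0]])
D = np.abs(points - points.T)                       # pairwise distances
Z = linkage(squareform(D, checks=False), method='complete')
labels = fcluster(Z, t=1.0, criterion='distance')   # stop merging above 1.0
```

The two tight pairs merge and the outlier at 9.0 stays alone, giving three clusters without any model retraining step.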
- Published
- 2016
- Full Text
- View/download PDF
23. Online frame-based clustering with unknown number of clusters
- Author
-
Nasser Kehtarnavaz and Fatemeh Saki
- Subjects
DBSCAN ,Clustering high-dimensional data ,Fuzzy clustering ,Computer science ,Correlation clustering ,Single-linkage clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Artificial Intelligence ,CURE data clustering algorithm ,020204 information systems ,Consensus clustering ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,Brown clustering ,Constrained clustering ,Determining the number of clusters in a data set ,Support vector machine ,ComputingMethodologies_PATTERNRECOGNITION ,Data stream clustering ,Signal Processing ,Canopy clustering algorithm ,FLAME clustering ,Affinity propagation ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Data mining ,computer ,Software - Abstract
This paper presents an online frame-based clustering algorithm (OFC) for unsupervised classification applications in which data are received in a streaming manner and the number of clusters is unknown. The algorithm consists of several steps: density-based outlier removal, new cluster generation, and cluster update. It is designed for applications in which data samples are received online in frames. Frames are first passed through an outlier removal step to generate denoised frames with consistent data samples during transition times between clusters. A classification step is then applied to determine whether frames belong to any of the existing clusters. When a frame does not match any existing cluster and certain criteria are met, a new cluster is created in real time and on the fly by using support vector domain descriptors. Experiments involving four synthetic and two real datasets show the performance of the introduced clustering algorithm in terms of cluster purity and normalized mutual information. Comparison results with similar clustering algorithms designed for streaming data are also reported, exhibiting the effectiveness of the introduced algorithm. Highlights:
- Online frame-based clustering without any knowledge of the number of clusters.
- For applications in which samples of a class appear in streaming frames.
- Superior to existing algorithms applicable to online frame-based clustering.
- Published
- 2016
- Full Text
- View/download PDF
24. Adaptive fuzzy clustering by fast search and find of density peaks
- Author
-
Hussain Dawood, Rongfang Bie, Shanshan Ruan, Rashid Mehmood, and Yunchuan Sun
- Subjects
0301 basic medicine ,Clustering high-dimensional data ,Fuzzy clustering ,Computer science ,Correlation clustering ,Single-linkage clustering ,02 engineering and technology ,Management Science and Operations Research ,computer.software_genre ,Fuzzy logic ,Complete-linkage clustering ,03 medical and health sciences ,CURE data clustering algorithm ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,Computer Science Applications ,Hierarchical clustering ,ComputingMethodologies_PATTERNRECOGNITION ,030104 developmental biology ,Data stream clustering ,Hardware and Architecture ,FLAME clustering ,020201 artificial intelligence & image processing ,Data mining ,computer - Abstract
Clustering by fast search and find of density peaks (CFSFDP) clusters data by finding density peaks. CFSFDP rests on two assumptions: a cluster center is a point of higher density than its surrounding neighbors, and it lies at a large distance from other cluster centers. Based on these assumptions, CFSFDP offers a heuristic approach, known as the decision graph, for manually selecting cluster centers. Manual selection of cluster centers is a major limitation of CFSFDP in intelligent data analysis. In this paper, we propose a fuzzy-CFSFDP method for effectively selecting the cluster centers in an adaptive manner. It uses fuzzy rules, based on the aforementioned assumptions, for the selection of cluster centers. We performed a number of experiments on nine synthetic clustering datasets and compared the resulting clusters with those of state-of-the-art methods. The clustering results and comparisons on synthetic data validate the robustness and effectiveness of the proposed fuzzy-CFSFDP method.
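The move from a manual decision graph to automatic selection can be sketched with a crude rule: min-max normalize ρ and δ and flag points whose product is an outlier. The paper's actual fuzzy rules differ, and the arrays below are made-up decision-graph values, not real data:

```python
import numpy as np

# rho: local density; delta: distance to the nearest denser point,
# as produced by the CFSFDP decision graph (hypothetical values).
rho   = np.array([9.0, 8.5, 1.0, 9.2, 8.8, 1.2])
delta = np.array([6.0, 0.3, 0.2, 7.0, 0.4, 0.3])

# min-max normalize both axes of the decision graph
r = (rho - rho.min()) / (rho.max() - rho.min())
d = (delta - delta.min()) / (delta.max() - delta.min())

# a point is a center candidate when BOTH are large; flag outliers of r*d
score = r * d
centers = np.where(score > score.mean() + score.std())[0]
```

Only the two points that are simultaneously dense and far from any denser point (indices 0 and 3) survive the threshold, replacing the manual eyeballing of the decision graph.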
- Published
- 2016
- Full Text
- View/download PDF
25. Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data
- Author
-
András Király, Ágnes Vathy-Fogarassy, and János Abonyi
- Subjects
Clustering high-dimensional data ,Fuzzy clustering ,Theoretical computer science ,Logic ,business.industry ,Correlation clustering ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,Complete-linkage clustering ,Medoid ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,FLAME clustering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Cluster analysis ,computer ,k-medians clustering ,Mathematics - Abstract
Clustering high-dimensional data and identifying central nodes in a graph are complex and computationally expensive tasks. We utilize the k-nn graph of high-dimensional data as an efficient representation of the hidden structure of the clustering problem. Initial cluster centers are determined by graph centrality measures, and are then fine-tuned by minimizing fuzzy-weighted geodesic distances. The shortest-path based representation parallels the concept of transitive closure; therefore, our algorithm can cluster networks, or even more complex and abstract objects, based on their partially known pairwise similarities. The algorithm is shown to be effective in identifying senior researchers in a co-author network, central cities in topographical data, and clusters of documents represented by high-dimensional feature vectors.
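The geodesic-distance substitution at the core of this approach — a Euclidean-weighted k-nn graph followed by shortest paths — can be sketched as follows (toy data; `k` and the helper name are assumptions):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(X, k):
    """Approximate geodesic distances: build a symmetric k-nn graph weighted
    by Euclidean distance, then run shortest paths over it. Only the distance
    substitution step of the method, not the c-medoid clustering itself."""
    D = cdist(X, X)
    n = len(X)
    W = np.full((n, n), np.inf)                 # inf = no edge
    nn = np.argsort(D, axis=1)[:, 1:k + 1]      # skip self at column 0
    for i in range(n):
        W[i, nn[i]] = D[i, nn[i]]
    W = np.minimum(W, W.T)                      # symmetrize the graph
    return shortest_path(W, method='D', directed=False)

# points along a line: geodesic distance accumulates through neighbours
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
G = geodesic_distances(X, k=1)
```

The endpoints are a straight-line distance 3.0 apart, and the geodesic agrees here because the path is forced through the intermediate neighbours; on curved manifolds the two distances diverge, which is exactly what the fuzzy c-medoid objective exploits.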
- Published
- 2016
- Full Text
- View/download PDF
26. Application of Product Moment Correlation and Complete Linkage Clustering Methods in Analyzing the Results of the Lecturer Questionnaire
- Author
-
Harliyus Agustian
- Subjects
Moment (mathematics) ,Variable (computer science) ,symbols.namesake ,Distance matrix ,Product (mathematics) ,Statistics ,symbols ,Object (computer science) ,Cluster analysis ,Complete-linkage clustering ,Pearson product-moment correlation coefficient ,Mathematics - Abstract
Questionnaires help a lecturer improve the teaching process in the classroom by revealing deficiencies in the teaching that has taken place. Raw questionnaire results, however, do not show which variables a lecturer must improve on an item-by-item basis, so the variables in the questionnaire results need to be grouped. A standard clustering approach cannot directly group a variable as an object to be clustered. This study clusters questionnaire variables using complete-linkage clustering, computing the distance matrix with the product-moment correlation between each pair of questionnaire variables; the result is a set of optimal clusters, each with its own membership of questionnaire variables. Clustering questionnaire variables works well when product-moment correlation is applied to compute the distance matrix, and the resulting clusters show the questionnaire components a lecturer must improve.
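The pipeline described — product-moment correlation as the distance basis, complete linkage over the variables rather than the respondents — looks roughly like this. The answers are synthetic, and the `1 - |r|` transform is one common choice; the paper's exact transform may differ:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Synthetic questionnaire: 100 respondents, 4 variables; variables 0-1 and
# 2-3 are near-duplicates, so they should cluster together.
rng = np.random.default_rng(0)
base1 = rng.normal(size=100)
base2 = rng.normal(size=100)
answers = np.column_stack([base1, base1 + 0.01 * rng.normal(size=100),
                           base2, base2 + 0.01 * rng.normal(size=100)])

R = np.corrcoef(answers, rowvar=False)   # product-moment (Pearson) correlations
D = 1.0 - np.abs(R)                      # strongly correlated variables are "close"
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method='complete')
groups = fcluster(Z, t=0.5, criterion='distance')
```

The strongly correlated variable pairs land in the same cluster while the unrelated pairs stay apart, which is the grouping a lecturer would read off to find related questionnaire components.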
- Published
- 2018
- Full Text
- View/download PDF
27. Determining The Senior High School Major Using Agglomerative Hierarchial Clustering Algorithm
- Author
-
Heru Agus Santoso, Junta Zeniarja, Ayu Pertiwi, Adhitya Nugraha, Ardytha Luthfiarta, and Mahendra Arista Harum Perdana
- Subjects
Fuzzy clustering ,business.industry ,Computer science ,Single-linkage clustering ,Linkage (mechanical) ,Machine learning ,computer.software_genre ,Complete-linkage clustering ,Field (computer science) ,law.invention ,Hierarchical clustering ,law ,ComputingMilieux_COMPUTERSANDEDUCATION ,Artificial intelligence ,business ,Cluster analysis ,computer ,Selection (genetic algorithm) - Abstract
Determining the senior high school major is still a dilemma for some junior high school students. The selection of a high school major must be tailored to students' interests, talents and academic skills so that they can develop better competencies, attitudes and academic skills in the new environment. Selecting the appropriate major shapes students' interests and abilities in exploring a field of science, making it easier for them later to enter the university they hope for, in accordance with their current interests and abilities; this clearly benefits students preparing for their future. Clustering is a well-known technique in the data mining process. Its core concept is to group a number of data objects into one or more groups, where each group contains data that are very similar to each other. There are two types of grouping methods: hierarchical clustering and partitioning. Hierarchical clustering comes in several variants, namely complete linkage clustering, single linkage clustering, average linkage clustering and centroid linkage clustering, while the partitioning methods include k-means clustering and fuzzy k-means clustering. In this study, the authors applied and analyzed the agglomerative hierarchical clustering technique on data from students of SMP Negeri 2 Purwodadi to classify students by their respective interests and skills so as to fit the selection of a high school major. In the implementation, the authors use 5 attributes from pre-processing as the variables for the experimental data processing. This study developed a prototype application implementing the agglomerative hierarchical clustering algorithm, used to visualize the data processing and help students determine their high school major. From the various experiments that have been carried out, the application has shown good results.
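The four hierarchical variants named above differ only in the inter-cluster distance they minimize; with common tooling they are a one-line switch. The data below is a synthetic stand-in, since the student records are not public:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# two well-separated groups of students, each described by 5 pre-processed
# attributes (a made-up proxy for the interest/skill variables in the paper)
students = np.vstack([rng.normal(0.0, 0.1, size=(10, 5)),
                      rng.normal(3.0, 0.1, size=(10, 5))])

# complete / single / average / centroid linkage differ only in `method`
results = {m: fcluster(linkage(students, method=m), t=2, criterion="maxclust")
           for m in ("complete", "single", "average", "centroid")}
```

On data this cleanly separated, every variant recovers the same two groups; the variants only disagree on noisier, chained, or unevenly sized clusters.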
- Published
- 2018
- Full Text
- View/download PDF
28. Applying clustering analysis for discovering time series heterogeneity using Saint Petersburg morbidity rate as an illustration
- Author
-
Kseniya Yu. Staroverova and Vladimir M. Bure
- Subjects
Clustering high-dimensional data ,Control and Optimization ,Fuzzy clustering ,General Computer Science ,Applied Mathematics ,Single-linkage clustering ,Correlation clustering ,computer.software_genre ,Complete-linkage clustering ,Geography ,CURE data clustering algorithm ,Consensus clustering ,Data mining ,Cluster analysis ,computer - Abstract
Clustering is one of the machine learning approaches to unsupervised learning. Its task is to explore the structure of data by assigning a set of objects to groups in such a way that objects belonging to the same group are more similar to each other than objects drawn from different groups. Determining the number of clusters in a dataset, searching for stable clusters, and selecting the dissimilarity measure and algorithm are significant tasks of cluster analysis. Multidimensional clustering is often used when an object is characterized by a vector, and a dissimilarity measure or distance is selected with respect to the purpose and features of the task at hand. But fields such as economics, geology, medicine and sociology are often represented by time series. Time series are random processes, not random vectors; it is therefore important to construct a similarity (or dissimilarity) measure that takes into consideration that the data are time-dependent. Research on the morbidity rate of Saint Petersburg from 1999 to 2014 and clustering of its 18 districts is conducted, using several different similarity measures. Another interesting aspect is the clustering of multidimensional time series, for which there are two approaches: the first splits a multidimensional time series into several univariate time series, while the second considers it as a whole unit, preserving the influence of interdependence in the data. The research is carried out using the TSclust and tseries packages in R, with missing algorithms implemented there. As a result of clustering the Saint Petersburg districts with several similarity measures, three stable clusters are found, but seven districts do not belong to any cluster.
- Published
- 2016
- Full Text
- View/download PDF
29. Clusters of Genetic-based Attributes Selection of Cancer Data
- Author
-
Vijaya Sri Kompalli and K. Usha Rani
- Subjects
Fuzzy clustering ,Computer science ,Fuzzy C-Means ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Fuzzy logic ,Coupling ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Cluster analysis ,k-medians clustering ,General Environmental Science ,Genetic Algorithm ,Fitness function ,Determining the number of clusters in a data set ,ComputingMethodologies_PATTERNRECOGNITION ,Cluster ,Cohesion ,General Earth and Planetary Sciences ,FLAME clustering ,020201 artificial intelligence & image processing ,Data mining ,computer - Abstract
Clustering data simplifies the task of data analysis and results in better disease diagnosis. The well-established K-Means algorithm hard-computes clusters, so the data may be assigned to a specific cluster with little account taken of the coupling between clusters. Soft computing methods are widely used in the medical field, as it contains fuzzy-natured data. A soft computing approach to clustering called Fuzzy C-Means (FCM) deals with coupling: FCM soft-computes the clusters, determining them by the probability of membership in each cluster, and the probability function used determines the extent of coupling among the clusters. To achieve computational efficiency and binding of features, genetic evaluation is introduced. Genetic-based features with high cohesion are identified using fitness function values, and the coupling of the clusters is then computed using K-Means clustering in one trial and FCM in another. Analysis of coupling and cohesion is performed on the Wisconsin Breast Cancer Dataset, and the nature of the cluster formations is observed with respect to coupling and cohesion.
- Published
- 2016
- Full Text
- View/download PDF
30. Representative points clustering algorithm based on density factor and relevant degree
- Author
-
Di Wu, Long Sheng, and Jiadong Ren
- Subjects
DBSCAN ,Single-linkage clustering ,Correlation clustering ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Artificial Intelligence ,CURE data clustering algorithm ,Nearest-neighbor chain algorithm ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Data mining ,Cluster analysis ,Algorithm ,computer ,Software ,k-medians clustering ,Mathematics - Abstract
Most existing clustering algorithms are seriously affected by noisy data and high time cost. In this paper, building on the CURE algorithm, we propose a representative-points clustering algorithm based on density factor and relevant degree, called RPCDR. The definitions of density factor and relevant degree are presented. A primary representative point whose density factor is less than a prescribed threshold is deleted directly, and new representative points can be reselected from the non-representative points of the corresponding cluster. Moreover, the representative points of each cluster are modeled using the K-nearest-neighbor method. Relevant degree is computed by comprehensively considering the correlations of objects within a cluster and between different clusters, and it is then used to judge whether two clusters should merge. Theoretical and experimental results and analysis show that RPCDR achieves better clustering accuracy and execution efficiency.
- Published
- 2015
- Full Text
- View/download PDF
31. A Fuzzy Approach for Text Mining
- Author
-
Deepa B. Patil and Yashwant V. Dongre
- Subjects
Fuzzy clustering ,business.industry ,Computer Science::Information Retrieval ,Single-linkage clustering ,Correlation clustering ,Pattern recognition ,computer.software_genre ,Complete-linkage clustering ,ComputingMethodologies_PATTERNRECOGNITION ,CURE data clustering algorithm ,FLAME clustering ,Artificial intelligence ,Data mining ,Cluster analysis ,business ,computer ,k-medians clustering ,Mathematics - Abstract
Document clustering is an integral and important part of text mining. There are two types of clustering, namely hard clustering and soft clustering. In hard clustering, a data item belongs to only one cluster, whereas in soft clustering a data point may fall into more than one cluster. Soft clustering thus leads to fuzzy clustering, wherein each data point is associated with a membership function that expresses the degree to which it belongs to each cluster. Accuracy is desired in information retrieval, and it can be achieved by fuzzy clustering. In the work presented here, a fuzzy approach for text classification is used to classify documents into appropriate clusters using the Fuzzy C-Means (FCM) clustering algorithm. The Enron email dataset is used for experimental purposes: using FCM, emails are classified into different clusters, and the results are compared with the output produced by the k-means clustering algorithm. The comparative study showed that the fuzzy clusters are more appropriate than hard clusters.
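A minimal Fuzzy C-Means loop shows the soft-membership idea on toy points rather than documents; this is a sketch of the standard FCM update equations, not the configuration used in the paper:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=50, seed=0):
    """Minimal Fuzzy C-Means: alternate centroid and membership updates.
    Returns the membership matrix U (n x c); each row sums to 1, so an item
    can belong partially to several clusters."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]        # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / (d ** p * (d ** -p).sum(axis=1, keepdims=True))
    return U

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U = fcm(X, c=2)
```

Each row of `U` sums to one, and near-duplicate points end up with near-identical membership profiles, which is the graded assignment that hard k-means cannot express.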
- Published
- 2015
- Full Text
- View/download PDF
32. A novel validity index with dynamic cut-off for determining true clusters
- Author
-
M. S. Bhargavi and Sahana D. Gowda
- Subjects
Fuzzy clustering ,Correlation clustering ,Single-linkage clustering ,k-means clustering ,computer.software_genre ,Complete-linkage clustering ,Determining the number of clusters in a data set ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,Signal Processing ,Computer Vision and Pattern Recognition ,Data mining ,Cluster analysis ,computer ,Software ,k-medians clustering ,Mathematics - Abstract
In a multi-surveillance environment, voluminous data is generated over a period of time, and data analysis for summarization and conclusion requires efficient clustering. Clustering, an unsupervised way of learning about data, aims at defining clusters, and validation of the clusters formed indicates their trueness. In this paper, a novel validation technique with dynamic termination of the clustering process is proposed to obtain true clusters. In the validation process, the validity index is based on both the global cluster proximity relationship and the local proximity relationship. The validity index is computed for validating the available clusters using the within-cluster sum of squares, between-cluster sum of squares, total sum of squares, intra-cluster distances and inter-cluster distances. The ratio between two consecutive validity indices is the extent of variation, which specifies the cut-off point; the cut-off terminates the clustering process dynamically, indicating the number of clusters and validating the clusters obtained. The proposed method is tested on several real and synthetic datasets, and comparisons with existing methods demonstrate its efficiency in detecting true clusters. Highlights:
- A novel validation technique is proposed to obtain true clusters.
- The method dynamically terminates clustering at the true conception of clusters.
- Global and local proximity relationships of clusters are considered for validation.
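The sum-of-squares building blocks of such an index are simple to compute. The ratio below (between-cluster SS over total SS) is just one such term for illustration, not the full proposed index:

```python
import numpy as np

def ss_ratio(X, labels):
    """Between-cluster SS as a fraction of total SS. Values near 1 indicate
    tight, well-separated clusters; values near 0 indicate mixed clusters."""
    overall = X.mean(axis=0)
    tss = ((X - overall) ** 2).sum()                       # total SS
    wss = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in np.unique(labels))                  # within-cluster SS
    return (tss - wss) / tss                               # between-SS fraction

X = np.array([[0.0], [0.2], [10.0], [10.2]])
good = ss_ratio(X, np.array([0, 0, 1, 1]))   # the true grouping
bad  = ss_ratio(X, np.array([0, 1, 0, 1]))   # a mixed grouping
```

The true grouping scores near 1 and the mixed one near 0, which is the kind of contrast a validity index tracks across candidate partitions before deciding where to cut.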
- Published
- 2015
- Full Text
- View/download PDF
33. Feature selection for clustering using instance-based learning by exploring the nearest and farthest neighbors
- Author
-
Chien-Hsing Chen
- Subjects
Information Systems and Management ,business.industry ,Feature selection ,Pattern recognition ,Mutual information ,computer.software_genre ,Complete-linkage clustering ,Computer Science Applications ,Theoretical Computer Science ,k-nearest neighbors algorithm ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,Control and Systems Engineering ,Salience (neuroscience) ,Salient ,Data mining ,Artificial intelligence ,Instance-based learning ,Cluster analysis ,business ,computer ,Software ,Mathematics - Abstract
Feature selection for clustering is an active research topic and is used to identify salient features that are helpful for data clustering. While partitioning a dataset into clusters, a data instance and its nearest neighbors will belong to the same cluster, and this instance and its farthest neighbors will belong to different clusters. We propose a new Feature Selection method to identify salient features that are useful for maintaining the instance's Nearest neighbors and Farthest neighbors (referred to here as FSNF). In particular, FSNF uses the mutual information criterion to estimate feature salience by considering maintainability. Experiments on benchmark datasets demonstrate the effectiveness of FSNF within the context of cluster analysis.
- Published
- 2015
- Full Text
- View/download PDF
34. Relaxing Weighted Clustering Algorithm for Reduction of Clusters and Cluster Head
- Author
-
Vijayanand Kumar
- Subjects
ComputingMethodologies_PATTERNRECOGNITION ,Fuzzy clustering ,CURE data clustering algorithm ,Correlation clustering ,Single-linkage clustering ,Canopy clustering algorithm ,Data mining ,computer.software_genre ,Cluster analysis ,Complete-linkage clustering ,computer ,k-medians clustering ,Mathematics - Abstract
Ad hoc clustering in a MANET divides mobile nodes into different virtual groups, with geographically adjacent nodes allocated to the same cluster according to rules specific to the network. In this paper, we modify a weight-based clustering algorithm to improve performance in this wireless setting. With a slight improvement to the existing weight-based clustering algorithm [1], a reduction in the number of clusters as well as cluster heads can be observed experimentally. The proposed algorithm relaxes the weight criterion for isolated cluster-head nodes that are in range of another cluster head, and reconsiders in-range nodes that have already participated in cluster formation. Reconsidering participating nodes in cluster formation can help create better clusters, form better routes, and conserve cluster-head energy.
- Published
- 2015
- Full Text
- View/download PDF
35. Microbial species delineation using whole genome sequences
- Author
-
Nikos C. Kyrpides, Supratim Mukherjee, Konstantinos T. Konstantinidis, Amrita Pati, Kostas Mavrommatis, Neha Varghese, and Natalia Ivanova
- Subjects
Genetics ,Genetic diversity ,Bacteria ,Computational Biology ,Genomics ,Computational biology ,Biology ,Classification ,Archaea ,Listeria monocytogenes ,Genome ,Complete-linkage clustering ,Genome, Archaeal ,Phylogenetics ,Species classification ,Cluster Analysis ,Taxonomy (biology) ,Algorithms ,Genome, Bacterial ,Phylogeny ,Global biodiversity - Abstract
Increased sequencing of microbial genomes has revealed that prevailing prokaryotic species assignments can be inconsistent with whole genome information for a significant number of species. The long-standing need for a systematic and scalable species assignment technique can be met by the genome-wide Average Nucleotide Identity (gANI) metric, which is widely acknowledged as a robust measure of genomic relatedness. In this work, we demonstrate that the combination of gANI and the alignment fraction (AF) between two genomes accurately reflects their genomic relatedness. We introduce an efficient implementation of AF,gANI and discuss its successful application to 86.5M genome pairs between 13,151 prokaryotic genomes assigned to 3032 species. Subsequently, by comparing the genome clusters obtained from complete linkage clustering of these pairs to existing taxonomy, we observed that nearly 18% of all prokaryotic species suffer from anomalies in species definition. Our results can be used to explore central questions such as whether microorganisms form a continuum of genetic diversity or distinct species represented by distinct genetic signatures. We propose that this precise and objective AF,gANI-based species definition: the MiSI (Microbial Species Identifier) method, be used to address previous inconsistencies in species classification and as the primary guide for new taxonomic species assignment, supplemented by the traditional polyphasic approach, as required.
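The complete-linkage step used above can be sketched naively. This is an illustrative toy, not the MiSI implementation: the cutoff value and the matrix are invented, and distances stand in for 1 − similarity (for real genomes the gANI/AF thresholds of the paper would apply). Complete linkage merges two clusters only while their *farthest* cross-cluster pair stays under the cutoff, which is why it suits species delineation: every genome pair inside a cluster satisfies the threshold.

```python
# Naive complete-linkage agglomeration over a pairwise distance matrix
# (illustrative sketch; cutoff and data are assumptions, not MiSI values).
def complete_linkage(dist, cutoff):
    clusters = [[i] for i in range(len(dist))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < cutoff and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]

# distances = 1 - similarity; one tight pair and two loosely related genomes
D = [[0.0, 0.02, 0.40, 0.45],
     [0.02, 0.0, 0.42, 0.44],
     [0.40, 0.42, 0.0, 0.30],
     [0.45, 0.44, 0.30, 0.0]]
print(complete_linkage(D, cutoff=0.1))  # [[0, 1], [2], [3]]
```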
- Published
- 2015
- Full Text
- View/download PDF
36. Visual hierarchical cluster structure: A refined co-association matrix based visual assessment of cluster tendency
- Author
-
Jingsheng Lei, Caiming Zhong, and Xiaodong Yue
- Subjects
Fuzzy clustering ,business.industry ,Single-linkage clustering ,Correlation clustering ,Pattern recognition ,Complete-linkage clustering ,Spectral clustering ,Hierarchical clustering ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,CURE data clustering algorithm ,Signal Processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,Cluster analysis ,Software ,Mathematics - Abstract
We employ a refined and transformed co-association matrix as the input of VAT. An efficient path-based similarity algorithm is presented, with time complexity O(N²). A simple approach to analyze D* and obtain the clustering is designed. A visual hierarchical cluster structure can be presented. A hierarchical clustering algorithm, such as Single-linkage, can depict the hierarchical relationship of clusters, but its clustering quality mainly depends on the similarity measure used. Visual assessment of cluster tendency (VAT) reorders a similarity matrix to reveal the cluster structure of a data set, and a VAT-based clustering discovers clusters by image segmentation techniques. Although VAT can visually present the cluster structure, its performance also relies on the similarity matrix employed. In this paper, we take a refined co-association matrix, originally used in ensemble clustering, as an initial similarity matrix, transform it by a path-based measure, and then apply it to VAT. The final clustering is achieved by directly analyzing the transformed and reordered similarity matrix. The proposed method can deal with data sets having complex cluster structures and reveal the relationship of clusters hierarchically. The experimental results on synthetic and real data sets demonstrate the above-mentioned properties.
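The VAT reordering itself is simple to sketch. The version below is an assumed simplification (a Prim-style ordering; function name and data are invented): start from the object involved in the largest dissimilarity, then repeatedly append the unvisited object closest to any already-ordered one. On the reordered matrix, clustered data show up as dark diagonal blocks.

```python
# Minimal VAT-style reordering sketch (assumed simplification of VAT).
def vat_order(D):
    n = len(D)
    # start from the row containing the globally largest dissimilarity
    start = max(range(n), key=lambda i: max(D[i]))
    order, rest = [start], set(range(n)) - {start}
    while rest:
        j = min(rest, key=lambda r: min(D[r][o] for o in order))
        order.append(j)
        rest.remove(j)
    return order

# a scrambled dissimilarity matrix: objects 1 and 3 form a tight pair
D = [[0.0, 0.40, 0.30, 0.42],
     [0.40, 0.0, 0.45, 0.02],
     [0.30, 0.45, 0.0, 0.44],
     [0.42, 0.02, 0.44, 0.0]]
print(vat_order(D))  # the tight pair (1, 3) becomes adjacent in the ordering
```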
- Published
- 2015
- Full Text
- View/download PDF
37. Agglomerative joint clustering of metabolic data with spike at zero: A Bayesian perspective
- Author
-
Vahid Partovi Nia and Mostafa Ghannad-Rezaie
- Subjects
0301 basic medicine ,Statistics and Probability ,Dendrogram ,General Medicine ,computer.software_genre ,01 natural sciences ,Complete-linkage clustering ,Data matrix (multivariate statistics) ,Data modeling ,Hierarchical clustering ,010104 statistics & probability ,03 medical and health sciences ,Tree (data structure) ,030104 developmental biology ,Data mining ,0101 mathematics ,Statistics, Probability and Uncertainty ,Cluster analysis ,computer ,Row ,Mathematics - Abstract
In many biological applications, for example high-dimensional metabolic data, the measurements consist of several continuous measurements of subjects or tissues over multiple attributes or metabolites. Measurement values are put in a matrix with subjects in rows and attributes in columns. The analysis of such data requires grouping subjects and attributes to provide a primitive guide toward data modeling. A common approach is to group subjects and attributes separately, and construct a two-dimensional dendrogram tree, once on rows and then on columns. This simple approach provides a grouping visualization through two separate trees, which is difficult to interpret jointly. When a joint grouping of rows and columns is of interest, it is more natural to partition the data matrix directly. Our suggestion is to build a dendrogram on the matrix directly, thus generalizing the two-dimensional dendrogram tree to a three-dimensional forest. The contribution of this research to the statistical analysis of metabolic data is threefold. First, a novel spike-and-slab model in various hierarchies is proposed to identify discriminant rows and columns. Second, an agglomerative approach is suggested to organize joint clusters. Third, a new visualization tool is invented to demonstrate the collection of joint clusters. The new method is motivated by gas chromatography mass spectrometry (GCMS) metabolic data, but can be applied to other continuous measurements with the spike-at-zero property.
- Published
- 2015
- Full Text
- View/download PDF
38. Efficient agglomerative hierarchical clustering
- Author
-
Andy Song, Athman Bouguettaya, Xumin Liu, Qi Yu, and Xiangmin Zhou
- Subjects
Brown clustering ,business.industry ,Correlation clustering ,Single-linkage clustering ,General Engineering ,Pattern recognition ,computer.software_genre ,Complete-linkage clustering ,Computer Science Applications ,Hierarchical clustering ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,CURE data clustering algorithm ,Artificial intelligence ,Data mining ,Hierarchical clustering of networks ,business ,Cluster analysis ,computer ,Mathematics - Abstract
An efficient hybrid hierarchical clustering method based on the agglomerative approach is proposed. It performs consistently with different distance measures. It performs consistently on data with different distributions and sizes. Hierarchical clustering is of great importance in data analytics, especially because of the exponential growth of real-world data. Often these data are unlabelled and little prior domain knowledge is available. One challenge in handling these huge data collections is the computational cost. In this paper, we aim to improve efficiency by introducing a set of agglomerative hierarchical clustering methods. Instead of building cluster hierarchies from raw data points, our approach builds a hierarchy based on a group of centroids. These centroids represent groups of adjacent points in the data space. With this approach, feature extraction or dimensionality reduction is not required. To evaluate our approach, we conducted a comprehensive experimental study. We tested the approach with different clustering methods (i.e., UPGMA and SLINK), data distributions (i.e., normal and uniform), and distance measures (i.e., Euclidean and Canberra). The experimental results indicate that, using the centroid-based approach, computational cost can be significantly reduced without compromising clustering performance. The performance of this approach is relatively consistent regardless of the variation in settings, i.e., clustering methods, data distributions, and distance measures.
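The centroid idea can be sketched in a few lines. The details here are hypothetical (the paper does not specify a grid; `grid_centroids`, the cell size, and the data are inventions): snap points to a coarse grid, average each occupied cell into a centroid, and run the hierarchical clustering on the far fewer centroids instead of the raw points.

```python
# Hedged sketch of centroid-based pre-aggregation (hypothetical details):
# each occupied grid cell is replaced by the mean of its points.
def grid_centroids(points, cell):
    cells = {}
    for x, y in points:
        cells.setdefault((round(x / cell), round(y / cell)), []).append((x, y))
    return [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            for g in cells.values()]

raw = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)]
print(grid_centroids(raw, 1.0))  # two centroids replace four points
```

Any linkage method (UPGMA, SLINK, or the complete linkage of this collection's theme) can then be applied to the centroid set.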
- Published
- 2015
- Full Text
- View/download PDF
39. Hierarchical clustering with planar segments as prototypes
- Author
-
Marian Kotas and Jacek M. Leski
- Subjects
Fuzzy clustering ,Brown clustering ,business.industry ,Correlation clustering ,Single-linkage clustering ,Constrained clustering ,Pattern recognition ,Complete-linkage clustering ,Hierarchical clustering ,Artificial Intelligence ,Signal Processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,Cluster analysis ,Software ,Mathematics - Abstract
Clustering methods divide a set of observations into groups in such a way that members of the same group are more similar to one another than to members of other groups. One well-known clustering method is hierarchical agglomerative clustering. For data with different properties, different clustering methods are favorable. If the data possess a locally linear form, the application of planar (or hyperplanar) prototypes should be advantageous. However, although a clustering method using planar prototypes, based on criterion minimization, is known, it has a crucial drawback: the infinite extent of such prototypes can result in the addition of very distant data points to a cluster. Such distant points can differ considerably from the majority within a cluster. The goal of this work is to overcome this problem by developing a hierarchical agglomerative clustering method that uses prototypes confined to segments of hyperplanes. In the experimental part, we show that for data with a locally linear form this method is highly competitive with the method of switching regression models (an accuracy improvement of 24%) as well as with other well-known clustering methods (an accuracy improvement of 16%).
- Published
- 2015
- Full Text
- View/download PDF
40. A simple statistics-based nearest neighbor cluster detection algorithm
- Author
-
José-A. Nieves-Vázquez, Gonzalo Urcid, and Gerhard X. Ritter
- Subjects
business.industry ,Single-linkage clustering ,Pattern recognition ,Complete-linkage clustering ,k-nearest neighbors algorithm ,Data set ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial Intelligence ,Nearest-neighbor chain algorithm ,Signal Processing ,Statistics ,Outlier ,Computer Vision and Pattern Recognition ,Artificial intelligence ,Cluster analysis ,business ,Algorithm ,Software ,k-medians clustering ,Mathematics - Abstract
We propose a new method for autonomously finding clusters in spatial data. The proposed method belongs to the so-called nearest neighbor approaches for finding clusters. It is a repetitive technique which produces changing averages and deviations of nearest neighbor distance parameters and results in a final set of clusters. The proposed technique is capable of eliminating background noise and outliers, and of detecting clusters with different densities in a given data set. Using a wide variety of data sets, we demonstrate that the proposed cluster seeking algorithm performs at least as well as various other currently popular algorithms and in several cases surpasses them in performance. Highlights: A new clustering algorithm based on simple statistics and lattice metrics is given. Mathematical rationale is explained in detail and theorem proofs are provided. Performance classification of the SSNN algorithm is illustrated with 2D datasets. Jain's benchmark dataset is used to show the SSNN cluster finding capability. High-dimensional image patterns are included as additional clustering examples.
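The statistics-of-nearest-neighbor-distances idea can be illustrated with a minimal sketch. This is not the paper's SSNN algorithm: the single-pass rule, the threshold multiplier, and the data below are assumptions. Points whose nearest-neighbor distance is far above the average nearest-neighbor distance are flagged as background noise or outliers.

```python
import statistics

# Simplified statistics-based sketch (NOT the paper's SSNN algorithm):
# flag points whose nearest-neighbor distance exceeds mean + k * stdev
# of all nearest-neighbor distances.
def flag_outliers(points, k=1.5):
    def d(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nn = [min(d(p, q) for q in points if q is not p) for p in points]
    cut = statistics.mean(nn) + k * statistics.pstdev(nn)
    return [i for i, v in enumerate(nn) if v > cut]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (50, 50)]
print(flag_outliers(pts))  # the distant point (index 4) is flagged
```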
- Published
- 2015
- Full Text
- View/download PDF
41. Clustering Based on Ant Colony Optimization and Relative Neighborhood (C-ACORN)
- Author
-
Ankit Vij, Padmavati Khandnor, and Parika Jhanji
- Subjects
DBSCAN ,Computer science ,Ant colony optimization algorithms ,010401 analytical chemistry ,Rand index ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Complete-linkage clustering ,0104 chemical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Data mining ,Noise (video) ,Cluster analysis ,computer ,Wireless sensor network - Abstract
Wireless sensor networks have emerged as a powerful and rapidly growing technology. The availability and low maintenance of small, inexpensive, fault-tolerant, self-configuring, self-reliant, easily deployable sensor nodes has made them useful in several critical areas such as military, healthcare, industrial process control, security and surveillance, and smart homes. But wireless sensor networks face the challenges of conserving energy and increasing network lifetime. Clustering is the best-known solution to this problem. In this paper, an ant colony optimization and relative-neighborhood-based clustering algorithm is proposed, which uses graph-based techniques to form neighborhoods for the ants. The algorithm is evaluated on seven datasets using cluster validity indices: Dunn's index (DI), modified Dunn's index (MDI), and Rand index (RI). The results are compared with existing clustering techniques, namely density-based spatial clustering of applications with noise (DBSCAN) and complete linkage clustering, on the quality of the solutions produced. The proposed algorithm generates good clustering results and is able to detect all the target clusters efficiently.
- Published
- 2018
- Full Text
- View/download PDF
42. Text Clustering Based on a Divide and Merge Strategy
- Author
-
Yong Shi and Man Yuan
- Subjects
Clustering high-dimensional data ,DBSCAN ,Fuzzy clustering ,Computer science ,k-means ,Feature vector ,Correlation clustering ,Single-linkage clustering ,Conceptual clustering ,computer.software_genre ,Complete-linkage clustering ,Biclustering ,CURE data clustering algorithm ,Consensus clustering ,feature extension ,Cluster analysis ,k-medians clustering ,General Environmental Science ,Brown clustering ,business.industry ,Constrained clustering ,k-means clustering ,Pattern recognition ,Document clustering ,Text clustering ,Determining the number of clusters in a data set ,Data stream clustering ,Canopy clustering algorithm ,General Earth and Planetary Sciences ,Affinity propagation ,FLAME clustering ,Data mining ,Artificial intelligence ,business ,computer - Abstract
A text clustering algorithm is proposed to overcome the sensitivity of division-based clustering methods to the estimated class number. Complex features, including synonyms and co-occurring words, are extracted to build a feature space containing more semantic information. The divide and merge strategy then helps the iteration converge to a reasonable cluster number. Experimental results showed that the dynamically updated center number prevents deterioration of the clustering result when k deviates from the real class number. When k is too small or too large, the difference between the clustering results of FC-DM and k-means is more pronounced, and FC-DM also outperformed other benchmark algorithms.
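The merge half of a divide-and-merge strategy can be sketched as follows. This is a toy, not the paper's FC-DM algorithm: the threshold `tau`, the merge rule, and the data are assumptions. After over-splitting, clusters whose centroids lie closer than `tau` are merged, so an overestimated k shrinks back toward the real class number.

```python
# Toy merge step of a divide-and-merge strategy (hypothetical thresholds,
# NOT the paper's FC-DM): repeatedly merge clusters with close centroids.
def merge_close(clusters, tau):
    def centroid(c):
        return [sum(x) / len(c) for x in zip(*c)]
    changed = True
    while changed:
        changed = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = centroid(clusters[a]), centroid(clusters[b])
                if sum((u - v) ** 2 for u, v in zip(ca, cb)) ** 0.5 < tau:
                    clusters[a] += clusters.pop(b)
                    changed = True
                    break
            if changed:
                break
    return clusters

parts = [[(0.0, 0.0)], [(0.2, 0.0)], [(9.0, 9.0)]]
print(merge_close(parts, tau=1.0))  # the first two pieces merge into one cluster
```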
- Published
- 2015
- Full Text
- View/download PDF
43. An information theoretic approach to hierarchical clustering combination
- Author
-
Mohammad Rahmati, Elaheh Rashedi, and Abdolreza Mirzaei
- Subjects
Fuzzy clustering ,Brown clustering ,Cognitive Neuroscience ,Single-linkage clustering ,computer.software_genre ,Complete-linkage clustering ,Computer Science Applications ,Hierarchical clustering ,Artificial Intelligence ,Consensus clustering ,Data mining ,Hierarchical clustering of networks ,Cluster analysis ,computer ,Mathematics - Abstract
In hierarchical clustering, a set of patterns is partitioned into a sequence of groups represented as a dendrogram. The dendrogram is a tree representation in which each node is associated with the merging of two (or more) partitions, and hence each partition is nested into the next partition. Hierarchical representation has properties that are useful for visualization and interpretation of clustering results. On one hand, different hierarchical clustering algorithms usually produce different dendrograms. On the other hand, clustering combination methods have received considerable interest in recent years and yield superior results for clustering problems. This paper proposes a framework for combining various hierarchical clustering results which preserves the structural contents of the input hierarchies. In this method, a description matrix is first created for each hierarchy; the description matrices of the input hierarchies are then aggregated to form a consensus matrix from which the final hierarchy is derived. In this framework, we use two new measures for aggregating the description matrices, namely the Renyi and Jensen–Shannon divergences. The experimental and comparative analysis of our proposed framework shows the effectiveness of these two aggregators in hierarchical clustering combination.
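The aggregation step can be sketched in its simplest form. This sketch uses a plain element-wise average rather than the paper's Renyi or Jensen–Shannon divergence aggregators, and assumes each hierarchy has already been encoded as a matrix of pairwise merge levels (one description matrix per input dendrogram).

```python
# Sketch of description-matrix aggregation (plain average stands in for the
# paper's divergence-based aggregators): combine per-hierarchy matrices of
# pairwise merge levels into one consensus matrix.
def consensus(matrices):
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices)
             for j in range(n)] for i in range(n)]

m1 = [[0, 1], [1, 0]]   # pair merges at level 1 in hierarchy 1
m2 = [[0, 3], [3, 0]]   # pair merges at level 3 in hierarchy 2
print(consensus([m1, m2]))  # [[0.0, 2.0], [2.0, 0.0]]
```

A final hierarchy would then be derived from the consensus matrix, e.g., by a standard agglomerative pass.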
- Published
- 2015
- Full Text
- View/download PDF
44. Utilizing cluster quality in hierarchical clustering for analogy-based software effort estimation
- Author
-
Jacky Keung and Jack H. C. Wu
- Subjects
business.industry ,Heuristic ,Computer science ,Computation ,Single-linkage clustering ,020206 networking & telecommunications ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Hierarchical clustering ,Set (abstract data type) ,Software ,0202 electrical engineering, electronic engineering, information engineering ,Data mining ,Cluster analysis ,business ,computer - Abstract
Analogy-based software effort estimation is one of the most popular estimation methods. It is built upon the principle of case-based reasoning (CBR), using the k most similar projects completed in the past; the determination of the k value is therefore crucial to prediction performance. Much research has used a single, fixed k value in experiments, yet it is known that dynamically allocated k values produce optimized performance. This paper proposes a technique based on hierarchical clustering to produce a range for k through various cluster quality criteria. We find that complete linkage clustering is more suitable for large datasets, while single linkage clustering is suitable for small datasets. The method searches for optimized k values using the proposed heuristic optimization technique, which has the advantages of easy computation and of being optimized for the dataset under investigation. Datasets from the PROMISE repository have been used to evaluate the proposed technique. The results of the experiments show that the proposed method is able to determine an optimized set of k values for analogy-based prediction, and to give estimates that outperform traditional models based on a fixed k value. The implication is significant in that the analogy-based model is optimized according to the dataset being used, without the need to ask an expert to determine a single, fixed k value.
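The idea of deriving k from cluster structure rather than fixing it can be sketched on a toy 1-D example. Everything here is hypothetical (the `cluster_sizes` helper, the chaining threshold, the "project size" feature): cluster the historical projects, and let the size of the cluster a new project falls into bound the number of analogues k.

```python
# Hedged toy (hypothetical data and helpers, not the paper's method):
# single-linkage-style chaining groups 1-D project features; the size of
# the target project's cluster supplies k instead of a fixed value.
def cluster_sizes(points, tau):
    clusters = []
    for p in points:
        hit = [c for c in clusters if any(abs(p - q) < tau for q in c)]
        merged = [p] + [q for c in hit for q in c]
        clusters = [c for c in clusters if c not in hit] + [merged]
    return clusters

efforts = [9.0, 10.0, 10.5, 30.0, 31.0]   # 1-D "project size" feature
groups = cluster_sizes(efforts, tau=2.0)
k = len(next(c for c in groups if 10.0 in c))
print(k)  # the project near 10 gets k = 3 analogues
```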
- Published
- 2017
- Full Text
- View/download PDF
45. A Clustering-Oriented Closeness Measure Based on Neighborhood Chain and Its Application in the Clustering Ensemble Framework Based on the Fusion of Different Closeness Measures
- Author
-
Shaoyi Liang and Deqiang Han
- Subjects
Fuzzy clustering ,Computer science ,Feature vector ,neighborhood chain ,Single-linkage clustering ,Closeness ,Correlation clustering ,02 engineering and technology ,lcsh:Chemical technology ,Biochemistry ,Complete-linkage clustering ,Article ,Analytical Chemistry ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,lcsh:TP1-1185 ,closeness measure ,Electrical and Electronic Engineering ,Cluster analysis ,Instrumentation ,business.industry ,clustering ensemble ,Pattern recognition ,Random walk closeness centrality ,clustering ,geometric distance ,Atomic and Molecular Physics, and Optics ,020201 artificial intelligence & image processing ,Artificial intelligence ,business - Abstract
Closeness measures are crucial to clustering methods. In most traditional clustering methods, the closeness between data points or clusters is measured by geometric distance alone. These metrics quantify closeness based only on the concerned data points’ positions in the feature space, and they can cause problems when dealing with clustering tasks that have arbitrary cluster shapes and different cluster densities. In this paper, we first propose a novel Closeness Measure between data points based on the Neighborhood Chain (CMNC). Instead of using geometric distances alone, CMNC measures the closeness between data points by quantifying the difficulty for one data point to reach another through a chain of neighbors. Furthermore, based on CMNC, we also propose a clustering ensemble framework that combines CMNC and geometric-distance-based closeness measures in order to exploit the advantages of both. In this framework, the “bad data points” that are hard to cluster correctly are identified; then different closeness measures are applied to different types of data points to obtain unified clustering results. With the fusion of different closeness measures, the framework achieves not only better clustering results on complicated clustering tasks, but also higher efficiency.
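A chain-based closeness in the CMNC spirit can be sketched with a minimax-hop distance. This is not the paper's exact definition (CMNC counts neighborhood-chain difficulty; the formulation below, the Floyd-Warshall-with-max trick, and the data are substitutions): two points are "chain-close" if one can reach the other through a path whose longest single hop is small, even when the direct geometric distance is large.

```python
# Sketch of chain-based closeness (minimax path distance stands in for the
# paper's neighborhood-chain definition): d[i][j] is the smallest possible
# "longest hop" over all paths from i to j.
def minimax_hop(points):
    n = len(points)
    d = [[sum((u - v) ** 2 for u, v in zip(points[i], points[j])) ** 0.5
          for j in range(n)] for i in range(n)]
    # Floyd-Warshall variant: relax with max(hop, hop) instead of sum
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d

# an elongated chain: points 0 and 3 are far apart geometrically but chain-close
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 2.5)]
d = minimax_hop(pts)
print(d[0][3], d[0][4])  # 1.0 along the chain vs 2.5 direct
```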
- Published
- 2017
- Full Text
- View/download PDF
46. A divisive clustering method for functional data with special consideration of outliers
- Author
-
Marcela Svarc and Ana Justel
- Subjects
Statistics and Probability ,HIERARCHICAL CLUSTERING ,Matemáticas ,Functional boxplot ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Complete-linkage clustering ,Matemática Pura ,010104 statistics & probability ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Cluster analysis ,Mathematics ,business.industry ,Applied Mathematics ,Pattern recognition ,Computer Science Applications ,Hierarchical clustering ,FUNCTIONAL BOXPLOT ,Outlier ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer ,GAP STATISTIC ,CIENCIAS NATURALES Y EXACTAS - Abstract
This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions, and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated. The number of clusters is estimated via the gap statistic. The functions are assigned to the new clusters by combining the k-means algorithm with the use of functional boxplots to identify functions that have been incorrectly classified because of their atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into the clusters, and guidelines that may be used for interpreting the clusters. A simulation study using synthetic data and tests of the performance of the DivClusFD method on real data sets indicate that this method is able to classify functions accurately. Affiliations: Ana Justel, Universidad Autónoma de Madrid, Spain; Universidad Carlos III de Madrid, Spain. Marcela Svarc, Universidad de San Andrés, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina.
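The subregion search can be illustrated with a toy. This is a deliberate simplification (the spread criterion, the grid, and the curves are invented; DivClusFD uses the gap statistic and derivatives): evaluate each curve at a few fixed points and pick the point where the values spread out most, i.e., where clusters are easiest to separate.

```python
# Toy subregion search (hypothetical criterion, NOT the DivClusFD gap
# statistic): pick the evaluation point with the largest spread of values.
def best_split_point(curves, grid):
    def spread(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    return max(range(len(grid)), key=lambda t: spread([c[t] for c in curves]))

grid = [0.0, 0.5, 1.0]
curves = [
    [0.0, 1.0, 0.0],   # peaks in the middle
    [0.1, 0.9, 0.1],
    [0.0, 0.0, 0.0],   # flat
    [0.1, 0.1, 0.0],
]
t = best_split_point(curves, grid)
print(grid[t])  # the mid-point separates the two shapes best
```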
- Published
- 2017
- Full Text
- View/download PDF
47. SCMDOT: Spatial Clustering with Multiple Density-Ordered Trees
- Author
-
Chongcheng Chen, Hong Jiang, and Xiaozhu Wu
- Subjects
Fuzzy clustering ,Geography, Planning and Development ,Correlation clustering ,Single-linkage clustering ,spatial clustering ,Multiple Density-Ordered Trees (MDOT) ,multi-density clustering ,agglomerative hierarchical clustering ,lcsh:G1-922 ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,CURE data clustering algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Earth and Planetary Sciences (miscellaneous) ,Computers in Earth Sciences ,Cluster analysis ,Mathematics ,business.industry ,Constrained clustering ,Pattern recognition ,Data stream clustering ,020201 artificial intelligence & image processing ,Data mining ,Artificial intelligence ,business ,computer ,lcsh:Geography (General) - Abstract
With the rapid explosion of location-based information, spatial clustering plays an increasingly significant role as an important technique in geographical data analysis. Most existing spatial clustering algorithms are limited by complicated spatial patterns and have difficulty discovering clusters with arbitrary shapes and uneven density. To overcome such limitations, we propose a novel clustering method called Spatial Clustering with Multiple Density-Ordered Trees (SCMDOT). Motivated by the idea of the Density-Ordered Tree (DOT), we first represent the original dataset by constructing Multiple Density-Ordered Trees (MDOT). In the construction process, we impose additional constraints to control the growth of each Density-Ordered Tree, ensuring that they all have high spatial similarity. Furthermore, a series of MDOT can be successively generated from sparse areas to dense areas, where each Density-Ordered Tree, also treated as a sub-tree, represents a cluster. In the merging process, the final clusters are obtained by repeatedly merging a suitable pair of clusters until the expected clustering result is reached. In addition, a heuristic strategy is applied during the process of our algorithm to suit special applications. Experiments on synthetic and real-world spatial databases demonstrate the performance of our proposed method.
- Published
- 2017
48. Hierarchical clustering algorithms with automatic estimation of the number of clusters
- Author
-
Sadaaki Miyamoto, Ryosuke Abe, Yukihiro Hamasuna, and Yasunori Endo
- Subjects
Brown clustering ,05 social sciences ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,Complete-linkage clustering ,Hierarchical clustering ,Determining the number of clusters in a data set ,ComputingMethodologies_PATTERNRECOGNITION ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,050211 marketing ,020201 artificial intelligence & image processing ,Data mining ,Cluster analysis ,computer ,k-medians clustering ,Mathematics - Abstract
Estimating an appropriate number of clusters has long been a central and difficult issue in clustering research. There are different methods for this in hierarchical clustering; a typical approach is to try clustering for different numbers of clusters and compare them using a measure that estimates the cluster number. On the other hand, there is no such method to automatically estimate the number of clusters in agglomerative hierarchical clustering (AHC), since AHC produces a family of clusterings with different cluster numbers at the same time in the form of a dendrogram. An exception is the Newman method in network clustering, but this method does not have a useful dendrogram output. The aim of the present paper is to propose new methods to automatically estimate the number of clusters in AHC. We show two approaches for this purpose: one uses a variation of a cluster validity measure, and the other uses a statistical model selection method such as BIC.
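One common heuristic in this direction can be sketched as follows. This is not the paper's validity measure or BIC approach, and the single-linkage choice and data are assumptions: agglomerate, record each merge distance, and cut the dendrogram just before the largest jump between consecutive merge distances.

```python
# Common merge-distance-gap heuristic (NOT the paper's method): the number
# of clusters is what remained just before the biggest merge-distance jump.
def estimate_k(points):
    def d(a, b):  # single-linkage distance between two clusters
        return min(sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5
                   for p in a for q in b)
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = [(d(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        dist, i, j = min(pairs)
        merges.append(dist)
        clusters[i] += clusters.pop(j)
    jumps = [merges[t + 1] - merges[t] for t in range(len(merges) - 1)]
    # after m merges, n - m clusters remain; cut before the largest jump
    return len(points) - (jumps.index(max(jumps)) + 1)

pts = [(0, 0), (0, 1), (8, 0), (8, 1), (0, 9)]
print(estimate_k(pts))  # two tight pairs plus a singleton -> 3
```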
- Published
- 2017
- Full Text
- View/download PDF
49. Automatic similarity detection and clustering of data
- Author
-
Craig Einstein and Peter Chin
- Subjects
Clustering high-dimensional data ,Fuzzy clustering ,Computer science ,Single-linkage clustering ,Correlation clustering ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Complete-linkage clustering ,010309 optics ,CURE data clustering algorithm ,0103 physical sciences ,Cluster analysis ,k-medians clustering ,business.industry ,Pattern recognition ,021001 nanoscience & nanotechnology ,Determining the number of clusters in a data set ,ComputingMethodologies_PATTERNRECOGNITION ,Data stream clustering ,Affinity propagation ,Artificial intelligence ,Data mining ,0210 nano-technology ,business ,computer - Abstract
An algorithm was created which identifies the number of unique clusters in a dataset and assigns the data to the clusters. A cluster is defined as a group of data which share similar characteristics. Similarity is measured using the dot product between two vectors, where the data are input as vectors. Unlike other clustering algorithms such as K-means, no knowledge of the number of clusters is required. This allows for an unbiased analysis of the data. The automatic cluster detection (ACD) algorithm is executed in two phases: an averaging phase and a clustering phase. In the averaging phase, the number of unique clusters is detected. In the clustering phase, data are matched to the cluster to which they are most similar. The ACD algorithm takes a matrix of vectors as an input and outputs a 2D array of the clustered data. The indices of the output correspond to clusters, and the elements in each cluster correspond to the positions of the data in the dataset. Clusters are vectors in N-dimensional space, where N is the length of the input vectors which make up the matrix. The algorithm is distributed, increasing computational efficiency.
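A dot-product-based assignment in this spirit can be sketched briefly. The threshold `tau` and the single-pass rule are assumptions, not the paper's ACD parameters: vectors are normalized so the dot product acts as cosine similarity, and a vector starts a new cluster when it is not similar enough to any existing cluster representative.

```python
# Hedged sketch of dot-product cluster assignment (threshold and single-pass
# rule are assumptions, NOT the paper's ACD algorithm).
def acd_sketch(vectors, tau=0.9):
    def unit(v):
        n = sum(x * x for x in v) ** 0.5
        return [x / n for x in v]
    reps, clusters = [], []
    for idx, v in enumerate(vectors):
        u = unit(v)
        sims = [sum(a * b for a, b in zip(u, r)) for r in reps]
        if sims and max(sims) >= tau:
            clusters[sims.index(max(sims))].append(idx)
        else:
            reps.append(u)          # new cluster, represented by this vector
            clusters.append([idx])
    return clusters

vecs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]
print(acd_sketch(vecs))  # [[0, 1], [2, 3]]
```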
- Published
- 2017
- Full Text
- View/download PDF
50. A novel clustering oriented closeness measure based on neighborhood chain
- Author
-
Shaoyi Liang, Deqiang Han, Lei Zhang, and Qinke Peng
- Subjects
Mathematical optimization ,business.industry ,Correlation clustering ,Single-linkage clustering ,Constrained clustering ,Pattern recognition ,02 engineering and technology ,Complete-linkage clustering ,CURE data clustering algorithm ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,FLAME clustering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Cluster analysis ,k-medians clustering ,Mathematics - Abstract
Closeness measures are crucial to clustering methods. In most traditional clustering methods, the closeness between data points or clusters is measured by geometric distances alone. These metrics quantify closeness based only on the concerned data points' positions in the feature space, and they can cause problems when dealing with clustering tasks with arbitrary cluster shapes and different cluster scales (varying cluster densities). In this paper, a novel Closeness Measure between data points based on the Neighborhood Chain (CMNC) is proposed. Instead of using geometric distances alone, CMNC measures the closeness between data points by quantifying the difficulty for one data point to reach another through a chain of neighbors. Experimental results show that by substituting CMNC for the geometric-distance-based closeness measures, modified versions of traditional clustering algorithms (e.g., k-means, single-link and CURE) perform much better than their original versions, especially when dealing with clustering tasks whose clusters have arbitrary shapes and different scales.
- Published
- 2017
- Full Text
- View/download PDF