246 results on "Silhouette Coefficient"
Search Results
2. Aggregations of Fuzzy Equivalences in k-means Algorithm
- Author
-
Lasek, Piotr, Rząsa, Wojciech, and Król, Anna
- Published
- 2024
- Full Text
- View/download PDF
3. Employing Aggregations of Fuzzy Equivalences in Clustering and Visualization of Medical Data Sets
- Author
-
Lasek, Piotr, Rząsa, Wojciech, and Król, Anna
- Published
- 2024
- Full Text
- View/download PDF
4. Cascaded RFM-Based Fuzzy Clustering Model for Dynamic Customer Segmentation in the Retail Sector
- Author
-
Sobantu, Sive, Isafiade, Omowunmi E., Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Gerber, Aurona, editor, Maritz, Jacques, editor, and Pillay, Anban W., editor
- Published
- 2025
- Full Text
- View/download PDF
5. Loose–tight cluster regularization for unsupervised person re-identification
- Author
-
Liu, Yixiu, Zhan, Long, Feng, Yu, Si, Pengju, Jiang, Shaowei, Zhao, Qiang, and Yan, Chenggang
- Abstract
Unsupervised person re-identification (Re-ID) is a critical and challenging task in computer vision. It aims to identify the same person across different camera views or locations without using any labeled data or annotations. Most existing unsupervised Re-ID methods adopt a clustering and fine-tuning strategy, which alternates between generating pseudo-labels through clustering and updating the model parameters through fine-tuning. However, this strategy has two major drawbacks: (1) the pseudo-labels obtained by clustering are often noisy and unreliable, which may degrade the model performance; and (2) the model may overfit to the pseudo-labels and lose its generalization ability during fine-tuning. To address these issues, we propose a novel method that integrates silhouette coefficient-based label correction and contrastive loss regularization based on loose–tight cluster guidance. Specifically, we use silhouette coefficients to measure the quality of pseudo-labels and correct the potential noisy labels, thereby reducing their negative impact on model training. Moreover, we introduce a new contrastive loss regularization term that consists of two components: a cluster-level contrast loss that encourages the model to learn discriminative features, and a regularization loss that prevents the model from overfitting to the pseudo-labels. The weights of these components are dynamically adjusted according to the silhouette coefficients. Furthermore, we adopt Vision Transformer as the backbone network to extract more robust features. We conduct extensive experiments on several public datasets and demonstrate that our method achieves significant improvements over the state-of-the-art unsupervised Re-ID methods. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
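As a rough, hypothetical illustration of the silhouette-based label check described in the abstract above (not the authors' code), the sketch below computes per-sample silhouette values for clustering-derived pseudo-labels and flags low-scoring samples as potentially noisy; the synthetic features, the k-means step, and the 0.05 threshold are all assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples

    # Toy stand-in for Re-ID feature embeddings (the paper extracts them with a Vision Transformer).
    features, _ = make_blobs(n_samples=300, centers=4, n_features=32, random_state=0)

    # Pseudo-labels produced by a clustering step, as in clustering-and-fine-tuning pipelines.
    pseudo_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

    # Per-sample silhouette values measure how well each sample fits its pseudo-label.
    sil = silhouette_samples(features, pseudo_labels)

    # Flag low-silhouette samples as potentially noisy pseudo-labels (threshold is an assumption).
    noisy = sil < 0.05
    print(f"flagged {noisy.sum()} of {len(sil)} samples as unreliable")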
6. Game and Application Purchasing Patterns on Steam using K-Means Algorithm
- Author
-
Salman Fauzan Fahri Aulia, Yana Aditia Gerhana, and Eva Nurlatifah
- Subjects
clustering, crisp-dm, game, k-means, purity, silhouette coefficient, Information technology, T58.5-58.64
- Abstract
Online games are visual games that utilize the internet or LAN networks. With the growth of the gaming industry, platforms like Steam offer a wide variety of games, making it challenging for users to decide which game to play. This study employs the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to address this issue by understanding user preferences. The k-means algorithm clusters game data based on similar characteristics, helping users and developers identify the most popular game types. Data sourced from Kaggle, obtained through the Steam API and Steamspy, consists of 85,103 entries. A normalization process is applied to enhance calculation accuracy. The elbow method determines the optimal number of clusters, resulting in three clusters from the k-means algorithm. The evaluation includes the silhouette coefficient, which measures the proximity between variables, and precision purity, which compares labels by assigning a value of 1 (actual) or 0 (false). The study finds an average silhouette coefficient of 0.345 and a precision purity value of 0.734, indicating that the k-means algorithm performs optimally based on the precision purity metric. The findings reveal that free-to-play games are the most popular among users, while the "Animation Modelling" category is the most expensive based on price comparisons
- Published
- 2024
- Full Text
- View/download PDF
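The evaluation described in the abstract above combines the silhouette coefficient with a purity measure. A minimal sketch of that combination is given below, assuming synthetic normalized features, k = 3, and hypothetical reference labels; the purity_score helper is an illustrative definition, not the paper's.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import MinMaxScaler

    def purity_score(y_true, y_pred):
        """Purity: each cluster is credited with its most frequent reference label."""
        total = 0
        for cluster in np.unique(y_pred):
            members = y_true[y_pred == cluster]
            total += np.bincount(members).max()
        return total / len(y_true)

    rng = np.random.default_rng(42)
    X = MinMaxScaler().fit_transform(rng.normal(size=(500, 4)))  # stand-in for normalized game features
    y_true = rng.integers(0, 3, size=500)                        # hypothetical reference labels

    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)  # k = 3 from the elbow method
    print("silhouette:", round(silhouette_score(X, labels), 3))
    print("purity:", round(purity_score(y_true, labels), 3))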
7. Application of Enhanced K-Means and Cloud Model for Structural Health Monitoring on Double-Layer Truss Arch Bridges.
- Author
-
Gui, Chengzhong, Han, Dayong, Gao, Liang, Zhao, Yingai, Wang, Liang, Xu, Xianglong, and Xu, Yijun
- Subjects
INFRASTRUCTURE (Economics), K-means clustering, ARCH bridges, STRUCTURAL models, COMMUNICATION infrastructure, STRUCTURAL health monitoring
- Abstract
Bridges, as vital infrastructure, require ongoing monitoring to maintain safety and functionality. This study introduces an innovative algorithm that refines bridge component performance assessment through the integration of modified K-means clustering, silhouette coefficient optimization, and cloud model theory. The purpose is to provide a reliable method for monitoring the safety and serviceability of critical infrastructure, particularly double-layer truss arch bridges. The algorithm processes large datasets to identify patterns and manage uncertainties in structural health monitoring (SHM). It includes field monitoring techniques and a model-driven approach for establishing assessment thresholds. The main findings, validated by case studies, show the algorithm's effectiveness in enhancing clustering quality and accurately evaluating bridge performance using multiple indicators, such as statistical significance, cluster centroids, average silhouette coefficient, Davies–Bouldin index, average deviation, and Sign-Rank test p-values. The conclusions highlight the algorithm's utility in assessing structural integrity and aiding data-driven maintenance decisions, offering scientific support for bridge preservation efforts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. Silhouette coefficient-based weighting k-means algorithm
- Author
-
Lai, Huixia, Huang, Tao, Lu, BinLong, Zhang, Shi, and Xiaog, Ruliang
- Published
- 2024
- Full Text
- View/download PDF
9. Implementation of the K-Means Clustering Algorithm in Determining Productive Oil Palm Blocks at Pt Arta Prigel
- Author
-
Yesi Pitaloka Anggriani, Alfis Arif, and Febriansyah Febriansyah
- Subjects
k-means clustering, rapid miner, oil palm (sawit), silhouette coefficient, Information technology, T58.5-58.64, Computer software, QA76.75-76.765
- Abstract
The purpose of this study is to implement the K-Means Clustering method to determine the patterns of productive oil palm production based on their blocks at Pt Arta Prigel. The research is motivated by issues within the oil palm blocks, such as the absence of productive block summaries, insufficient plantation land analysis, and erroneous decision-making. The development method utilizes CRISP-DM, with data spanning 2 years from October 2021 to October 2023. From the 1275 production records, after cleaning, 1015 records remain. Filtering the initial 51 blocks results in 37 blocks for the years 2021 and 2022, and 46 blocks for the year 2023. After clustering, the production outcomes for the year 2021 are as follows: cluster_0 has 34 blocks, cluster_1 has 10 blocks. For the year 2022, cluster_0 has 24 blocks, cluster_1 has 37 blocks. In the year 2023, cluster_0 has 44 blocks, cluster_1 has 27 blocks. The testing method employs the silhouette coefficient, and the silhouette score testing results indicate the formation of 2 clusters (K=2) with a value of 0.62, the results obtained from testing with 2 clusters indicate that the formed clusters are accurate. The findings of this study include patterns, graphs, and production tables generated using the K-Means Clustering method at Pt Arta Prigel.
- Published
- 2024
- Full Text
- View/download PDF
10. Sparrow search algorithm-driven clustering analysis of rock mass discontinuity sets.
- Author
-
Wu, Wenxuan, Feng, Wenkai, Yi, Xiaoyu, Zhao, Jiachen, and Zhou, Yongjian
- Subjects
ROCK analysis, SEARCH algorithms, ROCK mechanics, IMPACT strength, ROCK deformation
- Abstract
Rock discontinuity has a crucial impact on the deformation and strength of rock masses, and thus, the clustering of discontinuities is a critical aspect of rock mechanics. Traditional clustering methods require initial cluster centers to be specified and involve a multitude of parameter calculations, leading to a complex and cumbersome process. In this paper, a novel clustering approach based on the sparrow search algorithm (SSA) is introduced to overcome these limitations. This method utilizes a sparrow population coding technique and fitness function tailored to the unique characteristics of rock discontinuity orientation data. The SSA is adeptly applied to the clustering of rock joints, and the optimal number of clusters are automatically determined via the silhouette coefficient method. This methodology was tested on artificial datasets and actual discontinuity survey results from the underground powerhouse of the Henan Wuyue Hydropower Station to evaluate its feasibility and efficacy in analyzing rock discontinuities. Comparative data analysis reveals that the proposed method outperforms classic algorithms such as FCM and KPSO in terms of clustering accuracy and stability. The proposed method stands out among various clustering methods of discontinuity orientation for its ability to achieve convergent results without user intervention, demonstrating significant practical utility. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
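The record above selects the optimal number of clusters with the silhouette coefficient method. The generic loop below illustrates that selection, substituting k-means for the paper's sparrow search algorithm and using synthetic data, so it is only a sketch of the silhouette scan.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

    # Scan candidate cluster counts and keep the one with the highest mean silhouette.
    scores = {}
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print(scores, "-> best k:", best_k)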
11. Clustering analysis for classifying fake real estate listings.
- Author
-
Amin, Maifuza Mohd, Sani, Nor Samsiah, Nasrudin, Mohammad Faidzul, Abdullah, Salwani, Chhabra, Amit, and Kadir, Faizal Abd
- Subjects
REAL estate listings, REAL estate sales, CLUSTER analysis (Statistics), RANDOM forest algorithms, MACHINE learning, DEEP learning
- Abstract
With the rapid growth of online property rental and sale platforms, the prevalence of fake real estate listings has become a significant concern. These deceptive listings waste time and effort for buyers and sellers and pose potential risks. Therefore, developing effective methods to distinguish genuine from fake listings is crucial. Accurately identifying fake real estate listings is a critical challenge, and clustering analysis can significantly improve this process. While clustering has been widely used to detect fraud in various fields, its application in the real estate domain has been somewhat limited, primarily focused on auctions and property appraisals. This study aims to fill this gap by using clustering to classify properties into fake and genuine listings based on datasets curated by industry experts. This study developed a K-means model to group properties into clusters, clearly distinguishing between fake and genuine listings. To assure the quality of the training data, data pre-processing procedures were performed on the raw dataset. Several techniques were used to determine the optimal value for each parameter of the K-means model. The clusters are determined using the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. It was found that the value of cluster 2 is the best and the Camberra technique is the best method when compared to overlapping similarity and Jaccard for distance. The clustering results are assessed using two machine learning algorithms: Random Forest and Decision Tree. The observational results have shown that the optimized K-means significantly improves the accuracy of the Random Forest classification model, boosting it by an impressive 96%. Furthermore, this research demonstrates that clustering helps create a balanced dataset containing fake and genuine clusters. This balanced dataset holds promise for future investigations, particularly for deep learning models that require balanced data to perform optimally. This study presents a practical and effective way to identify fake real estate listings by harnessing the power of clustering analysis, ultimately contributing to a more trustworthy and secure real estate market. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
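A small sketch, under assumed synthetic data and candidate k values, of the three internal indices the abstract above uses to choose the clusters (silhouette coefficient, Calinski-Harabasz index, Davies-Bouldin index):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                                 silhouette_score)

    X, _ = make_blobs(n_samples=600, centers=2, n_features=5, random_state=7)

    for k in (2, 3, 4):
        labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
        print(k,
              round(silhouette_score(X, labels), 3),          # higher is better
              round(calinski_harabasz_score(X, labels), 1),   # higher is better
              round(davies_bouldin_score(X, labels), 3))      # lower is better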
12. An analysis of causative factors for road accidents using partition around medoids and hierarchical clustering techniques.
- Author
-
Manasa, Pendyala, Ananth, Pragya, Natarajan, Priyadarshini, Somasundaram, K., Rajkumar, E. R., Ravichandran, Kattur Soundarapandian, Balasubramanian, Venkatesh, and Gandomi, Amir H.
- Subjects
HIERARCHICAL clustering (Cluster analysis), FACTOR analysis, TRAFFIC accidents, TRAFFIC regulations, ROAD safety measures, TRAFFIC congestion
- Abstract
Insufficient progress in the development of national highways and state highways, coupled with a lack of public awareness regarding road safety, has resulted in prevalent traffic congestion and a high rate of accidents. Understanding the dominant and contributing factors that may influence road traffic accident severity is essential. This study identified the primary causes and the most significant target‐specific causative factors for road accident severity. A modified partitioning around medoids model determined the dominant road accident features. These clustering algorithms will extract hidden information from the road accident data and generate new features for our implementation. Then, the proposed method is compared with the other state‐of‐the‐art clustering techniques with three performance metrics: the silhouette coefficient, the Davies–Bouldin index, and the Calinski–Harabasz index. This article's main contribution is analyzing six different scenarios (different angles of the problem) concerning grievous and non‐injury accidents. This analysis provides deeper insights into the problem and can assist transport authorities in Tamil Nadu, India, in deriving new rules for road traffic. The output of different scenarios is compared with hierarchical clustering, and the overall clustering of the proposed method is compared with other clustering algorithms. Finally, it is proven that the proposed method outperforms other recently developed techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
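For readers unfamiliar with partitioning around medoids, the sketch below runs a plain PAM clustering and scores it with two of the metrics named in the abstract above. It assumes the optional scikit-learn-extra package and synthetic data; it is not the authors' modified PAM model.

    # Assumes the optional scikit-learn-extra package, which provides a PAM-style estimator.
    from sklearn.datasets import make_blobs
    from sklearn.metrics import davies_bouldin_score, silhouette_score
    from sklearn_extra.cluster import KMedoids

    X, _ = make_blobs(n_samples=300, centers=3, random_state=3)

    # Plain PAM as a stand-in for the paper's modified partitioning-around-medoids model.
    pam = KMedoids(n_clusters=3, method="pam", random_state=3).fit(X)
    labels = pam.labels_

    print("silhouette:", round(silhouette_score(X, labels), 3))
    print("Davies-Bouldin:", round(davies_bouldin_score(X, labels), 3))
    print("medoid rows:", pam.medoid_indices_)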
13. An Algorithmic Framework for Clustering Cab Pickup Geo-points for Cab Recommender System (CRS)
- Author
-
Mann, Supreet Kaur, Chawla, Sonal, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Somani, Arun K., editor, Mundra, Ankit, editor, Gupta, Rohit Kumar, editor, Bhattacharya, Subhajit, editor, and Mazumdar, Arka Prokash, editor
- Published
- 2024
- Full Text
- View/download PDF
14. Decision Tree Clustering for Time Series Data: An Approach for Enhanced Interpretability and Efficiency
- Author
-
Higashi, Masaki, Sung, Minje, Yamane, Daiki, Inamuro, Kenta, Nagai, Shota, Kobayashi, Ken, Nakata, Kazuhide, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Fenrong, editor, Sadanandan, Arun Anand, editor, Pham, Duc Nghia, editor, Mursanto, Petrus, editor, and Lukose, Dickson, editor
- Published
- 2024
- Full Text
- View/download PDF
15. Determination of Optimum K Value for K-means Segmentation of Diseased Tea Leaf Images
- Author
-
Das, Anuj Kumar, Ahmed, Syed Sazzad, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, Deka, Jatindra Kumar, editor, Robi, P. S., editor, and Sharma, Bobby, editor
- Published
- 2024
- Full Text
- View/download PDF
16. Implementasi K-Medoids Dalam Pengelompokan Fasilitas Pelayanan Kesehatan Pada Kasus Tuberculosis
- Author
-
Refanisa Putri, Freza Riana, and Berlina Wulandari
- Subjects
k-medoids algorithm, health service facilities (fasyankes), silhouette coefficient, tuberculosis (TB), Information technology, T58.5-58.64
- Abstract
Tuberculosis (TB) is an infectious disease caused by the bacterium Mycobacterium tuberculosis, sometimes also called pulmonary TB. In 2021 there were 4855 cases of tuberculosis in Bogor City. The large number of tuberculosis cases in Bogor City calls for clustering the spread of the disease by Health Service Facility (FASYANKES) using the K-Medoids algorithm, with the aim of determining the characteristics of FASYANKES in tuberculosis cases. The K-Medoids algorithm uses the partitioning clustering method to group n objects into k clusters. In this study, Silhouette Coefficient testing was applied to maximize the clustering results; 2 clusters were formed with Silhouette = 0.574652. With the implementation of K-Medoids clustering, 2 clusters were obtained: cluster 0 contains 15 FASYANKES with a high number of identified patients and a high number of positive TB diagnoses. However, 3 FASYANKES with a high number of identified patients but a low number of positive diagnoses are also included in cluster 0, because they are influenced by the high number of identified patients. Cluster 1 contains 29 FASYANKES with a low number of identified patients and a low number of positive TB diagnoses. However, 9 FASYANKES with a low number of identified patients but a high number of positive diagnoses are included in cluster 1, because they are influenced by the low number of identified patients.
- Published
- 2024
- Full Text
- View/download PDF
17. CLUSTERING OF POPULAR SPOTIFY SONGS IN 2023 USING K-MEANS METHOD AND SILHOUETTE COEFFICIENT
- Author
-
Nur Rohman and Arief Wibowo
- Subjects
clustering, data mining, k-means, silhouette coefficient, spotify, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The rapid advancement of technology and globalization in this era has brought about comprehensive and easily accessible music streaming services, one of which is Spotify. According to Kompas.com, Spotify has experienced a rise in subscribers up to 130 million, as a platform that offers various features besides music streaming. Spotify also provides a better user experience and has the ability to compete with other music streaming platforms. The mission of this research is to classify popular Spotify song data in 2023, which can aid in a deeper understanding of listener preferences or music trends. Based on the test results, there were 2 clusters obtained with cluster 0 containing 863 data and cluster 1 containing 90 data. From the testing results conducted in the K-Means analysis, a Silhouette Coefficient of 0.81 was obtained, which falls into the category of Strong Structure. From these results, it can be suggested that cluster formation was done very well to provide more personalized and relevant music recommendations to Spotify platform users. By understanding the preferences and patterns of listeners revealed through clustering, streaming services can enhance user experience by providing more tailored content.
- Published
- 2024
- Full Text
- View/download PDF
18. Exploratory Data Analytics and PCA-Based Dimensionality Reduction for Improvement in Smart Meter Data Clustering.
- Author
-
Shamim, Gulezar and Rihan, Mohd
- Subjects
DATA distribution, DATA analytics, SUM of squares, PRINCIPAL components analysis, DATA science, SMART meters
- Abstract
The smart meter sends the meter readings to the utilities at desired frequency allowing better visibility of consumer electricity consumption behaviour by providing more data points for in-depth analysis and for generating insights using advanced data analytics and data science techniques. The granulated data helps utilities in designing schemes for audience suitable for demand response management to shift the peak hour demand to off-peak hours. In this paper, a method is proposed for load profile segmentation which can be used by utilities for identifying the characteristics of different users and targeting those whose demand curve can be flattened during peak hours with various demand response management schemes. Firstly, exploratory data analysis is done on the cleaned dataset to find the optimal epoch size, understand the distribution of data in each epoch, and use it for dimensionality reduction. For reducing the clustering computation time, dimensionality reduction is done by around 64% using Principal Component Analysis. The first six principal components are identified as carrying maximum variance using the cumulative variance technique in each epoch. Unsupervised Machine Learning based k-means clustering technique is applied to these principal components. The optimal value of k is evaluated using the WCSS technique where k = 5 and k = 3 for residential and SME users respectively is found. The average silhouette coefficient for residential users is 0.48 and for SME users is 0.51. Hence, well-separated clusters are formed with minimum intra-cluster distance using PCA for dimensionality reduction which is used for load profile segmentation and Post Clustering Analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
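A hedged sketch of the pipeline the abstract above describes: standardize load profiles, keep six principal components, inspect WCSS (inertia) for the elbow, then score the chosen k with the average silhouette coefficient. The random profiles and the choice k = 5 mirror the abstract's residential case but are otherwise placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    profiles = rng.normal(size=(1000, 48))        # stand-in for half-hourly smart-meter load profiles

    X = StandardScaler().fit_transform(profiles)
    pca = PCA(n_components=6).fit(X)              # keep six principal components, as in the abstract
    X6 = pca.transform(X)
    print("explained variance kept:", round(pca.explained_variance_ratio_.sum(), 3))

    # WCSS (inertia) per k for the elbow plot, then silhouette at the chosen k.
    wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X6).inertia_ for k in range(2, 9)}
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X6)   # k = 5 (residential case)
    print("silhouette at k=5:", round(silhouette_score(X6, labels), 3))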
19. Performance Comparison of Dimensional Reduction using Principal Component Analysis with Alternating Least Squares in Modified Fuzzy Possibilistic C-Means and Fuzzy Possibilistic C-Means.
- Author
-
Satriyanto, Edi, Wardhani, Ni Wayan Surya, Anam, Syaiful, and Mahmudy, Wayan Firdaus
- Subjects
PRINCIPAL components analysis, LEAST squares, SUM of squares
- Abstract
The clustering method is said to be good if it has resistance to outlier data. One cluster method resistant to outlier data is Fuzzy Possibilistic C-Means (FPCM). FPCM performance on outlier data still has the potential for overlap between cluster members in different clusters, resulting in decreased cluster quality. The Modified Fuzzy Possibilistic C-Means (MFPCM) method is used to modify FPCM in its objective function by inserting updated weight values to increase FPCM performance. In this research, improving the quality of FPCM and MFPCM clusters was carried out by reducing data dimensions through Principal Component Analysis using Alternating Least Squares (PRINCALS) so that members of each cluster do not overlap in the right cluster. The PRINCALS results of the FPCM method have better performance with silhouette values and BSS/TSS ratios of 0.4108 and 60% compared to values without PRINCALS of 0.355 and 43%. The MFPCM method with PRINCALS also performs better, namely 0.4299 and 61%, compared to 0.368 and 42% without PRINCALS. In this study, the performance of MFPCM with PRINCALS or without PRINCALS was better than that of the FPCM method. Overall, PRINCALS can improve the performance of the MFPCM and FPCM methods, resulting in better clusters. PRINCALS in this cluster produce an average silhouette value greater than 0.3 and an average BSS/TSS ratio greater than 50% so that each cluster member is in the right cluster and does not overlap. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
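The BSS/TSS ratio reported above is the between-cluster sum of squares divided by the total sum of squares. A worked sketch of that ratio (computed here for ordinary k-means on synthetic data rather than FPCM/MFPCM with PRINCALS) is:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=3, random_state=5)
    labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)

    # TSS: total sum of squared distances to the overall mean.
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    # WSS: within-cluster sum of squares; BSS = TSS - WSS.
    wss = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in np.unique(labels))
    bss = tss - wss

    print("BSS/TSS ratio:", round(bss / tss, 3))        # closer to 1 means tighter, better-separated clusters
    print("mean silhouette:", round(silhouette_score(X, labels), 3))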
20. Clustering analysis for classifying fake real estate listings
- Author
-
Maifuza Mohd Amin, Nor Samsiah Sani, Mohammad Faidzul Nasrudin, Salwani Abdullah, Amit Chhabra, and Faizal Abd Kadir
- Subjects
K-means, Clustering analysis, Real estates, Random forest, Silhouette coefficient, Calinski-Harabasz index, Electronic computers. Computer science, QA75.5-76.95
- Abstract
With the rapid growth of online property rental and sale platforms, the prevalence of fake real estate listings has become a significant concern. These deceptive listings waste time and effort for buyers and sellers and pose potential risks. Therefore, developing effective methods to distinguish genuine from fake listings is crucial. Accurately identifying fake real estate listings is a critical challenge, and clustering analysis can significantly improve this process. While clustering has been widely used to detect fraud in various fields, its application in the real estate domain has been somewhat limited, primarily focused on auctions and property appraisals. This study aims to fill this gap by using clustering to classify properties into fake and genuine listings based on datasets curated by industry experts. This study developed a K-means model to group properties into clusters, clearly distinguishing between fake and genuine listings. To assure the quality of the training data, data pre-processing procedures were performed on the raw dataset. Several techniques were used to determine the optimal value for each parameter of the K-means model. The clusters are determined using the Silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. It was found that the value of cluster 2 is the best and the Camberra technique is the best method when compared to overlapping similarity and Jaccard for distance. The clustering results are assessed using two machine learning algorithms: Random Forest and Decision Tree. The observational results have shown that the optimized K-means significantly improves the accuracy of the Random Forest classification model, boosting it by an impressive 96%. Furthermore, this research demonstrates that clustering helps create a balanced dataset containing fake and genuine clusters. This balanced dataset holds promise for future investigations, particularly for deep learning models that require balanced data to perform optimally. This study presents a practical and effective way to identify fake real estate listings by harnessing the power of clustering analysis, ultimately contributing to a more trustworthy and secure real estate market.
- Published
- 2024
- Full Text
- View/download PDF
21. AN ANALYSIS OF CLUSTER TIMES SERIES FOR THE NUMBER OF COVID-19 CASES IN WEST JAVA
- Author
-
Nurfitri Imro'ah and Nur'ainul Miftahul Huda
- Subjects
acf distance cluster, hierarchy, silhouette coefficient, Mathematics, QA1-939
- Abstract
The government may be able to develop more effective strategies for dealing with COVID-19 cases if it groups districts and cities according to the features of the number of Covid-19 cases being reported in each district or city. The data can be more easily summarized with the help of cluster analysis, which organizes items into groups according to the degree of similarity between members. Since it is possible to group more than one period together, the generation of clusters based on time series is a more efficient method than clusters that are created for each individual unit. Using a time series cluster hierarchical technique that has complete linkage, the purpose of this study is to categorize the number of instances of Covid-19 that have been found in West Java by district or city. The data that was used comes from monthly reports of Covid-19 instances compiled by West Java districts from 2020 to 2022. The Autocorrelation Function (ACF) distance cluster was utilized in this investigation to determine how closely cluster members are related to one another. According to the findings, there could be as many as seven separate clusters, each including a unique assortment of districts and cities. Cluster 3, which is comprised of three different cities and regencies, including Bandung City, West Bandung Regency, and Sumedang Regency, has an average number of cases that is 66, making it the cluster with the highest number of cases overall. A value of 0.2787590 is obtained for the silhouette coefficient as a result of the established grouping. This value suggests that the structure of the newly created cluster is quite fragile.
- Published
- 2024
- Full Text
- View/download PDF
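A sketch of ACF-distance hierarchical clustering with complete linkage, the technique described in the abstract above. The synthetic monthly series, the 12-lag cutoff, and the use of Euclidean distance between autocorrelation vectors as the "ACF distance" are assumptions about details the abstract does not spell out.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist
    from sklearn.metrics import silhouette_score
    from statsmodels.tsa.stattools import acf

    rng = np.random.default_rng(2)
    series = rng.normal(size=(20, 36))            # stand-in for 20 districts x 36 monthly case counts

    # Represent each series by its autocorrelation function up to a fixed lag.
    acf_features = np.array([acf(s, nlags=12, fft=True)[1:] for s in series])

    # ACF distance = Euclidean distance between ACF vectors; complete-linkage hierarchy, cut at 7 clusters.
    tree = linkage(pdist(acf_features, metric="euclidean"), method="complete")
    labels = fcluster(tree, t=7, criterion="maxclust")

    print("silhouette:", round(silhouette_score(acf_features, labels), 3))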
22. K-Means Clustering with KNN and Mean Imputation on CPU Benchmark Compilation Data
- Author
-
Rofiq Muhammad Syauqi, Puspita Nurul Sabrina, and Irma Santikarama
- Subjects
clustering, KNN imputation, mean imputation, k-means, silhouette coefficient, Electronic computers. Computer science, QA75.5-76.95
- Abstract
In the rapidly evolving digital age, data is becoming a valuable source for decision-making and analysis. Clustering, as an important technique in data analysis, has a key role in organizing and understanding complex datasets. One of the effective clustering algorithms is k-means. However, this algorithm is prone to the problem of missing values, which can significantly affect the quality of the resulting clusters. To overcome this challenge, imputation methods are used, including mean imputation and K-Nearest Neighbor (KNN) imputation. This study aims to analyze the impact of imputation methods on CPU Benchmark Compilation clustering results. Evaluation of the clustering results using the silhouette coefficient showed that clustering with mean imputation achieved a score of 0.782, while with KNN imputation it achieved a score of 0.777. In addition, the cluster interpretation results show that the KNN method produces more information that is easier for users to understand. This research provides valuable insights into the effectiveness of imputation methods in improving the quality of data clustering results in assisting CPU selection decisions on CPU Benchmark Compilation data.
- Published
- 2023
- Full Text
- View/download PDF
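A minimal sketch of the comparison described above: impute missing values with the mean and with KNN, cluster each filled dataset with k-means, and compare silhouette coefficients. The synthetic data, 10% missingness, and k = 3 are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(8)
    X = rng.normal(size=(400, 6))                 # stand-in for CPU benchmark features
    X[rng.random(X.shape) < 0.1] = np.nan         # knock out roughly 10% of the values

    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("knn", KNNImputer(n_neighbors=5))]:
        X_filled = imputer.fit_transform(X)
        labels = KMeans(n_clusters=3, n_init=10, random_state=8).fit_predict(X_filled)
        print(name, "silhouette:", round(silhouette_score(X_filled, labels), 3))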
23. Comparison of Hierarchical, K-Means and DBSCAN Clustering Methods for Credit Card Customer Segmentation Analysis Based on Expenditure Level
- Author
-
Hafid Ramadhan, Mohammad Rizal Abdan Kamaludin, Muhammad Alfan Nasrullah, and Dwi Rolliawati
- Subjects
clustering, credit card, comparison, segmentation, silhouette coefficient, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The amount of data from credit card users increases from year to year. Credit cards are an important means of payment, and the number of credit card users keeps growing because cards are considered more effective and efficient. The three methods used here serve to determine which yields the most effective segmentation of credit card users. In this study, the Hierarchical Clustering, K-Means, and DBSCAN methods are compared to produce a credit card customer segmentation analysis that can be used as a market strategy. Based on the silhouette coefficient, the best result is hierarchical clustering with two clusters and a score of 0.82322. Based on the mean values, customers are divided into two segments, and it is suggested that strategies be developed for both segments.
- Published
- 2023
- Full Text
- View/download PDF
24. Application of Enhanced K-Means and Cloud Model for Structural Health Monitoring on Double-Layer Truss Arch Bridges
- Author
-
Chengzhong Gui, Dayong Han, Liang Gao, Yingai Zhao, Liang Wang, Xianglong Xu, and Yijun Xu
- Subjects
bridge infrastructure, performance assessment, enhanced k-means clustering, silhouette coefficient, cloud model, structural health monitoring, Technology
- Abstract
Bridges, as vital infrastructure, require ongoing monitoring to maintain safety and functionality. This study introduces an innovative algorithm that refines bridge component performance assessment through the integration of modified K-means clustering, silhouette coefficient optimization, and cloud model theory. The purpose is to provide a reliable method for monitoring the safety and serviceability of critical infrastructure, particularly double-layer truss arch bridges. The algorithm processes large datasets to identify patterns and manage uncertainties in structural health monitoring (SHM). It includes field monitoring techniques and a model-driven approach for establishing assessment thresholds. The main findings, validated by case studies, show the algorithm’s effectiveness in enhancing clustering quality and accurately evaluating bridge performance using multiple indicators, such as statistical significance, cluster centroids, average silhouette coefficient, Davies–Bouldin index, average deviation, and Sign-Rank test p-values. The conclusions highlight the algorithm’s utility in assessing structural integrity and aiding data-driven maintenance decisions, offering scientific support for bridge preservation efforts.
- Published
- 2024
- Full Text
- View/download PDF
25. Multi-level Logistics Network Node Siting Model Based on K-Means
- Author
-
Liu, Jie, Tian, Shuang, Wang, Qingqing, Zhang, Chenguang, Ning, Min, Li, Changlong, Dou, Runliang, Editor-in-Chief, Liu, Jing, Editor-in-Chief, Khasawneh, Mohammad T., Editor-in-Chief, Balas, Valentina Emilia, Series Editor, Bhowmik, Debashish, Series Editor, Khan, Khalil, Series Editor, Masehian, Ellips, Series Editor, Mohammadi-Ivatloo, Behnam, Series Editor, Nayyar, Anand, Series Editor, Pamucar, Dragan, Series Editor, Shu, Dewu, Series Editor, Hemachandran, K., editor, Boddu, Raja Sarath Kumar, editor, and Alhasan, Waseem, editor
- Published
- 2023
- Full Text
- View/download PDF
26. Clustering Individuals Based on Multivariate EMA Time-Series Data
- Author
-
Ntekouli, Mandani, Spanakis, Gerasimos, Waldorp, Lourens, Roefs, Anne, Wiberg, Marie, editor, Molenaar, Dylan, editor, González, Jorge, editor, Kim, Jee-Seon, editor, and Hwang, Heungsun, editor
- Published
- 2023
- Full Text
- View/download PDF
27. Social network analysis of mythology field
- Author
-
Akça, Sümeyye and Akbulut, Müge
- Published
- 2023
- Full Text
- View/download PDF
28. APPLYING NDVI FROM DIFFERENT NUMBER OF SPECTRAL SENSORS IN DELIMITING SOYBEAN FERTILIZATION MANAGEMENT ZONES.
- Author
-
CHEN, H., WANG, X., ZHANG, W., WANG, X. Z., DI, X. D., and QI, L. Q.
- Subjects
NORMALIZED difference vegetation index, DATA acquisition systems, DETECTORS
- Abstract
The normalized difference vegetation index (NDVI) obtained by GreenSeeker sensors can be used to delimit fertilization management zones and realize variable fertilization. We used six GreenSeeker sensors to build a soybean NDVI data acquisition system and studied the effect of different numbers of GreenSeeker sensors on the delimitation of fertilization management zones. The NDVI data collected by the different numbers of GreenSeeker sensors were used to delimit the management zones with the fuzzy c-means algorithm. The silhouette coefficient and adjusted Rand index were used to evaluate the effect. The results show that the number of GreenSeeker sensors used has a certain effect on the delimitation of fertilization management zones. When delimiting fertilization management zones with 10,000 NDVI readings collected by three GreenSeeker sensors, the silhouette coefficient is 0.563 and the adjusted Rand index is 0.76. With 20,000 NDVI readings collected by three GreenSeeker sensors, the silhouette coefficient is 0.559 and the adjusted Rand index is 0.698. This indicates that the effect of using three GreenSeeker sensors on the delimitation of fertilization management zones is not different from that of using six GreenSeeker sensors. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
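A sketch of the two evaluation measures named in the abstract above, the silhouette coefficient of the resulting zones and the adjusted Rand index between zone maps from different sensor configurations. It uses hard k-means on synthetic NDVI values instead of the paper's fuzzy c-means, so it only illustrates how the scores are computed.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    rng = np.random.default_rng(4)
    ndvi_six = rng.uniform(0.2, 0.9, size=(2000, 1))                     # NDVI readings, six-sensor setup (stand-in)
    ndvi_three = ndvi_six + rng.normal(0, 0.02, size=ndvi_six.shape)     # noisier three-sensor proxy

    zones_six = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(ndvi_six)
    zones_three = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(ndvi_three)

    print("silhouette (three-sensor zones):", round(silhouette_score(ndvi_three, zones_three), 3))
    print("adjusted Rand vs six-sensor zones:", round(adjusted_rand_score(zones_six, zones_three), 3))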
29. A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points.
- Author
-
Mustafi, D. and Mustafi, A.
- Subjects
PARTICLE swarm optimization, DIFFERENTIAL forms, CORPORA, DOCUMENT clustering, ALGORITHMS, DIFFERENTIAL evolution
- Abstract
Document clustering is a well established technique used to segregate voluminous text corpora into distinct categories. In this paper we present an improved algorithm for clustering large text corpus. The proposed algorithm tries to overcome the challenges of clustering large corpora, while maintaining high "goodness" values for the proposed clusters. The algorithm proceeds by optimizing a fitness function using Differential Evolution to form the initial clusters. The clusters obtained after the initial phase are then "refined" by re-evaluating the points that fall at the fringes of the clusters and reassigning them to other clusters, if necessary. Two different approaches e.g. Nearest Cluster Based Re-evaluation (N-CBR) and Multiple Cluster Based Re-evaluation (M-CBR) have been proposed to select candidates during the reassignment phase and their performances have been evaluated. The result of such a post processing phase has been demonstrated on a number of standard benchmark text corpora and the algorithm is found to be quite accurate and efficient. The results obtained by the proposed method have also been compared to other evolutionary strategies e.g. Genetic Algorithm(GA), Particle Swarm Optimization(PSO), Harmony Search(HS), and have been found to be quite satisfactory. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
30. A Neutrosophic based C-Means Approach for Improving Breast Cancer Clustering Performance
- Author
-
Ahmed Abdel Hafeez, Hoda K. Mohamed, Ali Maher, and Ahmed Abdel-Monem
- Subjects
breast cancer dataset clusterability, fuzzy c-means clustering, neutrosophic c-means clustering, silhouette coefficient, Mathematics, QA1-939, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Breast cancer is among the most prevalent cancers, and early detection is crucial to successful treatment. One of the most crucial phases of breast cancer treatment is a correct diagnosis. Numerous studies exist about breast cancer classification in the literature. However, analyzing the cancer dataset in the context of clusterability for unsupervised modeling is rare. This work analyzes pointedly the breast cancer dataset clusterability via applying the widely used c-means clustering algorithm and its evolved versions fuzzy and neutrosophic ones. An in-depth comparative study is conducted utilizing a set of quantitative and qualitative clustering efficiency metrics. The study's outcomes divulge the presented neutrosophic c-means clustering superiority in segregating similar breast cancer instances into clusters.
- Published
- 2023
- Full Text
- View/download PDF
31. Spatio-Temporal Pattern Analysis of Forest Fire in Malang based on Remote Sensing using K-Means Clustering.
- Author
-
Kirana, Annisa Puspa, Astiningrum, Mungki, Vista, Candra Bella, Bhawiyuga, Adhitya, and Amrozi, Aris Nur
- Subjects
K-means clustering, REMOTE sensing, FOREST fire prevention & control, DATA mining
- Abstract
Forest and land fires significantly impact the environmental balance, causing haze pollution, destruction of ecosystems, high carbon release into the air, deterioration of health, and losses in various other fields. Based on these factors, developing an early warning system is essential to prevent forest fires, especially in forest and land areas. One source of data that can be used to monitor areas with frequent fires is hotspot data taken from the NASA MODIS Fire satellite. Data mining techniques are applied to the hotspot data to obtain the distribution of hotspot clusters, which is used to detect areas that are prone to fire from year to year. This study uses the K-Means clustering algorithm on hotspot data from Malang District, Indonesia, covering January 2018 to June 2022. Silhouette coefficient testing is used to obtain the best number of clusters, and the K-Means method is applied to analyze the hotspot distribution spatio-temporally. Only hotspots in Malang's forest and land areas with confidence levels >80% are used. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
32. The Implementation of Enhanced K-Strange Points Clustering Method in Classifying Undergraduate Thesis Titles
- Author
-
Madeira, Malcolm Andrew, Jacob, Teslin, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Shakya, Subarna, editor, Balas, Valentina Emilia, editor, Kamolphiwong, Sinchai, editor, and Du, Ke-Lin, editor
- Published
- 2022
- Full Text
- View/download PDF
33. Edutech Digital Start-Up Customer Profiling Based on RFM Data Model Using K-Means Clustering
- Author
-
Dedy Panji Agustino, I Gede Harsemadi, and I Gede Bintang Arya Budaya
- Subjects
customer segmentation, silhouette coefficient, elbow method, davies bouldin index, business intelligence, Mathematics, QA1-939, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Digital start-ups are high-risk companies because they are still looking for the most fitting business model and the right market, and growth is their primary goal. As newly established companies, digital start-ups face one key challenge: ineffective marketing processes and strategic schemes for maintaining customer loyalty, and the same applies to edutech digital start-ups. Ineffective and inefficient plans waste resources, so a method is needed to understand customer characteristics. Business intelligence is applied here through customer profiling on transaction data, based on the RFM (Recency, Frequency, Monetary) model and the K-Means algorithm. In this study, the transaction data come from an education-platform digital start-up assisted by the STIKOM Bali business incubator. Using three metrics, namely the Elbow Method, Silhouette Score, and Davies-Bouldin Index, the sales recency, sales frequency, and sales monetary data are analyzed to find the optimal solution. For this case, K = 2 is the optimal cluster solution, where the first cluster contains customers who need more engagement and the second cluster contains the best customers.
- Published
- 2022
- Full Text
- View/download PDF
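A sketch of the RFM-based profiling the record above describes: build recency, frequency, and monetary features from a transaction log, standardize them, and cluster with k-means at K = 2. The transaction table, snapshot date, and parameters are hypothetical.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(6)
    # Hypothetical transaction log: one row per purchase.
    tx = pd.DataFrame({
        "customer_id": rng.integers(1, 200, size=2000),
        "date": pd.to_datetime("2022-01-01")
                + pd.to_timedelta(rng.integers(0, 365, size=2000), unit="D"),
        "amount": rng.gamma(2.0, 50.0, size=2000),
    })

    snapshot = tx["date"].max() + pd.Timedelta(days=1)
    rfm = tx.groupby("customer_id").agg(
        recency=("date", lambda d: (snapshot - d.max()).days),
        frequency=("date", "count"),
        monetary=("amount", "sum"),
    )

    X = StandardScaler().fit_transform(rfm)
    labels = KMeans(n_clusters=2, n_init=10, random_state=6).fit_predict(X)   # K = 2 as in the record
    print("silhouette:", round(silhouette_score(X, labels), 3))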
34. Pemetaan UMKM dalam Upaya Pengentasan Kemiskinan dan Penyerapan Tenaga Kerja Menggunakan Algoritma K-Means
- Author
-
Herwinda Kurniadewi, Rijal Abdul Hakim, Mohamad Jajuli, and Jajam Haerul Jaman
- Subjects
k-means algorithm, clustering, python, silhouette coefficient, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The Covid pandemic created an economic crisis, raising the poverty figure in Indonesia by double digits within a single year. The pandemic also affected Indonesian employment conditions, making it harder to find work. Labor absorption is closely correlated with poverty, and the workforce has a significant influence on the poverty rate. One regency in West Java whose poverty rate and number of job seekers rose considerably compared with the previous year is Purwakarta Regency. Poverty alleviation through the development of MSMEs (UMKM) has considerable potential: developing MSMEs can absorb more workers and raise community income, thereby driving economic growth. This research uses the CRISP-DM methodology. MSMEs in Purwakarta Regency are grouped based on location, number of MSMEs, number of poor residents, and number of job seekers using the k-means algorithm, and the results are mapped using the Python programming language. The grouping yields three clusters: a high-priority cluster of 6 subdistricts, a low-priority cluster of 8 subdistricts, and a low-priority cluster of 3 subdistricts. To assess the performance of the model, a silhouette coefficient evaluation was carried out, yielding a value of 0.45.
- Published
- 2022
- Full Text
- View/download PDF
35. Dominant partitioning method of rock mass discontinuity based on DBSCAN selective clustering ensemble
- Author
-
ZHANG Hua-jin, WU Shun-chuan, and HAN Long-qiang
- Subjects
rock mass discontinuity, dominant attitude, clustering ensemble, density-based spatial clustering of applications with noise (dbscan), silhouette coefficient, Engineering geology. Rock mechanics. Soil mechanics. Underground construction, TA703-712
- Abstract
For the problems existing in the traditional single discontinuity (structural plane) based clustering model, such as the risk of misclassification or omission and the inability to identify noise and isolated values, a dominant partitioning method of rock mass discontinuity based on selective clustering ensemble using density-based spatial clustering of applications with noise (DBSCAN) algorithm is proposed. Firstly, the spatial coordinate transformation is performed with the attitude of discontinuity, and the sine of the angle between the unit normal vectors is defined as similarity measurement. Then, a certain number of different base clusters are constructed based on the DBSCAN algorithm, with the selective clustering ensemble technology, some excellent base clusters are selected. Finally, the consistent ensemble technology is used to fuse these base clusters to generate a highly reliable selective clustering ensemble result. The DIPS software data set and the discontinuity survey result in the dam site area of Songta hydropower station are used to test the feasibility and effectiveness of the proposed method. The research results show that the clustering effect of the proposed method is significantly better than that of common clustering algorithms. The clustering results are objective and reasonable. It not only effectively identifies noise and isolated values, but also overcomes the shortcomings of over-segmentation or under-segmentation of the single discontinuity based clustering model. The research results are valuable in accurately determining the dominant group of discontinuity.
- Published
- 2022
- Full Text
- View/download PDF
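The abstract above defines similarity between discontinuities as the sine of the angle between their unit normal vectors and clusters with DBSCAN. The sketch below reproduces just that distance construction on hypothetical dip/dip-direction data (the selective-ensemble step is omitted, and eps and min_samples are assumptions).

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(11)
    # Hypothetical dip direction / dip angle pairs (degrees) for two discontinuity sets.
    dip_dir = np.concatenate([rng.normal(120, 8, 80), rng.normal(300, 8, 80)])
    dip = np.concatenate([rng.normal(45, 5, 80), rng.normal(70, 5, 80)])

    # Unit normal vectors of the discontinuity planes.
    dd, d = np.radians(dip_dir), np.radians(dip)
    normals = np.column_stack([np.sin(d) * np.sin(dd), np.sin(d) * np.cos(dd), np.cos(d)])

    # Distance = sine of the angle between normals (absolute dot product treats opposite normals as identical).
    cos_ang = np.clip(np.abs(normals @ normals.T), 0.0, 1.0)
    dist = np.sqrt(1.0 - cos_ang ** 2)

    labels = DBSCAN(eps=0.2, min_samples=5, metric="precomputed").fit_predict(dist)
    print("sets found:", len(set(labels) - {-1}), "| noise points:", int((labels == -1).sum()))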
36. Clustering Data Penduduk Miskin Dampak Covid-19 Menggunakan Algoritma K-Medoids
- Author
-
Novi Widiawati, Betha Nurina Sari, and Tesa Nur Padilah
- Subjects
clustering, data mining, k-medoids, poor population, silhouette coefficient, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Poverty is a fundamental problem that can hinder national development. Several aspects relate to poverty, namely economic, political, and psychosocial factors. Economically, poverty is defined as a lack of resources to meet basic needs and improve welfare. This study uses 2020 data sourced from Statistics Indonesia. Data mining can be used to identify poverty cases arising from the impact of Covid-19. The aim of this study is to group Indonesian regencies/cities with high and low levels of Covid-19-related poverty. The research follows the data mining steps of CRISP-DM (Cross-Industry Standard Process for Data Mining), which consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. The algorithm used in this study is K-Medoids. Measurement was carried out in the R language with the help of the pamk function, and the 2020 Poor Population dataset was found to have an optimal number of 2 clusters. Cluster 1 contains 121 regencies/cities in the high category, while cluster 2 contains 427 in the low category. The evaluation yields a silhouette coefficient of 0.4735719.
- Published
- 2022
- Full Text
- View/download PDF
37. RFM model for customer purchase behavior using K-Means algorithm
- Author
-
P. Anitha and Malini M. Patil
- Subjects
Recency, Frequency, Monetary, Silhouette coefficient, Business intelligence, Segmentation, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The objective of this study is to apply business intelligence in identifying potential customers by providing relevant and timely data to business entities in the Retail Industry. The data furnished is based on systematic study and scientific applications in analyzing sales history and purchasing behavior of the consumers. The curated and organized data as an outcome of this scientific study not only enhances business sales and profit, but also equips with intelligent insights in predicting consumer purchasing behavior and related patterns. In order to execute and apply the scientific approach using K-Means algorithm, the real time transactional and retail dataset are analyzed. Spread over a specific duration of business transactions, the dataset values and parameters provide an organized understanding of the customer buying patterns and behavior across various regions. This study is based on the RFM (Recency, Frequency and Monetary) model and deploys dataset segmentation principles using K-Means Algorithm. A variety of dataset clusters are validated based on the calculation of Silhouette Coefficient. The results thus obtained with regard to sales transactions are compared with various parameters like Sales Recency, Sales Frequency and Sales Volume.
- Published
- 2022
- Full Text
- View/download PDF
38. The detection algorithm for disguised missing value based on filter-Kmeans.
- Author
-
Shi, Jinyu, Sun, Yuming, and Du, Xiaohan
- Subjects
LAW of large numbers, ALGORITHMS
- Abstract
In order to reduce the impact of disguised missing values on data analysis, this paper proposes a new algorithm, the Detection Algorithm for Disguised Missing Value Based on Filter-Kmeans. The algorithm identifies disguised missing values mainly through the clustering effect and is applied to data sets in which data points occur with a certain probability. Firstly, suitable data object points are selected using the silhouette coefficient method and Bernoulli's law of large numbers. Then, the weighted average distance is used to control the cluster traversal. Finally, a filtering operation is performed during cluster traversal. According to the experimental results, the algorithm achieves improvements in precision, recall, and F1-measure on the open dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
39. Identification of geological characteristics from construction parameters during shield tunnelling.
- Author
-
Yan, Tao, Shen, Shui-Long, and Zhou, Annan
- Subjects
TUNNEL design & construction, FUZZY sets, PRINCIPAL components analysis, SET theory, ROCK music, FUZZY algorithms
- Abstract
This paper proposes a framework to identify geological characteristics (GC) based on borehole data and operational data during shield tunnelling using a fuzzy C-means algorithm. The proposed fuzzy C-means model was established by integrating the K-means ++ algorithm into the fuzzy set theory. The identified factors for GC include advance rate, cutterhead rotation speed, thrust, cutterhead torque, penetration rate, torque penetration index, field penetration index, and specific energy. Principal component analysis was employed to reduce the dimensions of these factors. The first six principal components were employed to analyse the GC and establish the input data set in the fuzzy C-means model. The types of GC were determined based on elbow method, silhouette coefficient, fuzzy partition coefficient and the geological profile from borehole data. The proposed approach was validated by a case of Guangzhou intercity tunnel construction. The results present that the proposed fuzzy C-means model can effectively determine GC and provide membership to reveal the proportion of hard rock. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
40. INDONESIAN TERRITORY CLUSTERING BASED ON HARVESTED AREA AND RICE PRODUCTIVITY USING CLUSTERING ALGORITHM.
- Author
-
Kurniawati, Imelda Putri, Pratiwi, Hasih, and Sugiyanto
- Subjects
RICE yields, PLANT development, COEFFICIENTS (Statistics), INDONESIANS, PADDY fields
- Abstract
Rice (Oryza sativa) is one of the most important cultivated plants in human civilization and the staple commodity for almost all Indonesian people. Indonesia is the third-largest rice-producing country in the world; nevertheless, data from Statistics Indonesia show that it was still importing rice as of 2022. The conversion of paddy fields is one reason Indonesia continues to import rice: much land that used to be paddy fields has been turned into airports, industrial areas, housing, and so on. Rice production therefore remains an important topic, particularly for developing production in areas where it is still relatively low. The purpose of this research is to cluster cities and regencies in Indonesia based on 2021 rice production data. Three clustering methods were used, namely Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA) and Fuzzy C-Means (FCM), and the three were compared on their silhouette coefficient values. The PAM algorithm produces two clusters with a silhouette coefficient of 0.58, and the CLARA algorithm with 100 samples produces three clusters with a silhouette coefficient of 0.59. The best-performing method is FCM, with two clusters and a silhouette coefficient of 0.828. The clustering results from the best method are used as the reference for producing cluster maps, and regions with relatively low productivity are expected to increase their rice output. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
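The comparison in the abstract above boils down to scoring alternative clusterings with the silhouette coefficient. Below is a hedged sketch of that comparison using PAM via the KMedoids estimator, which is assumed to be available from the scikit-learn-extra package; CLARA and fuzzy C-means would be plugged into the same loop from other libraries, and the candidate cluster counts are placeholders.

```python
# Hedged sketch: compare candidate clusterings by silhouette coefficient.
# KMedoids (PAM) is assumed to come from scikit-learn-extra; other methods
# (CLARA, FCM) would be scored the same way.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # assumption: scikit-learn-extra installed

def compare_by_silhouette(X: np.ndarray, candidate_k=(2, 3, 4)):
    results = {}
    for k in candidate_k:
        labels = KMedoids(n_clusters=k, method="pam", random_state=0).fit_predict(X)
        results[k] = silhouette_score(X, labels)
    # A silhouette coefficient closer to 1 indicates better-separated clusters.
    best_k = max(results, key=results.get)
    return best_k, results
```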
41. K-RBBSO Algorithm: A Result-Based Stochastic Search Algorithm in Big Data.
- Author
-
Park, Sungjin and Kim, Sangkyun
- Subjects
SEARCH algorithms ,BEES algorithm ,BIG data ,ALGORITHMS ,SIMULATED annealing ,MACHINE learning ,TABU search algorithm ,SWARM intelligence - Abstract
Clustering is widely used in client-facing businesses to categorize their customer base and deliver personalized services. This study proposes an algorithm to stochastically search for an optimum solution based on the outcomes of a data clustering process, i.e. a result-based stochastic search algorithm. To that end, shortcomings of existing stochastic search algorithms are identified, and the k-means-initiated rapid biogeography-based silhouette optimization (K-RBBSO) algorithm is proposed to overcome them. The proposed algorithm is validated by creating a data clustering engine and comparing the performance of K-RBBSO with that of currently used stochastic search techniques, such as simulated annealing and artificial bee colony, on a validation dataset. The results indicate that K-RBBSO is more effective than the other algorithms on larger volumes of data. Finally, we describe some prospective beneficial uses of a data clustering algorithm in unsupervised learning based on the findings of this study. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
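To illustrate the "result-based" idea in the abstract above without reproducing K-RBBSO itself, here is a simple stand-in: run K-Means from many random initializations and keep the solution whose silhouette coefficient, the result being optimized, is highest. This is an illustration of the scoring principle only, not the proposed algorithm.

```python
# Illustration of result-based stochastic search (NOT the K-RBBSO algorithm):
# restart K-Means from different seeds and keep the labelling whose silhouette
# coefficient is highest.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_of_n_restarts(X: np.ndarray, k: int, n_restarts: int = 20):
    best_score, best_labels = -1.0, None
    for seed in range(n_restarts):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels
```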
42. Learning Performance Styles in Gamified College Classes Using Data Clustering.
- Author
-
Park, Sungjin and Kim, Sangkyun
- Abstract
This study aimed to investigate the efficacy of learning gamification in developing sustainable educational environments. To this end, gamified class data were analyzed to identify students' learning performance patterns. The study sample comprised 369 data points collected across four point domains: Activity, Game, Project, and Exam Points, which students obtained in their gamified college courses conducted between 2016 and 2019. A K-means data clustering algorithm and silhouette analysis were utilized to evaluate student performances and determine differential learning styles in gamified environments. Cluster analysis revealed three types of learning patterns centered on performance, mastery, and avoidance. Based on our findings, we propose suggestions regarding class design for instructors considering using gamification strategies to support a sustainable educational environment. We also highlight the scope for future research in both in-person and online gamified learning. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
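The analysis described in the abstract above maps naturally onto a small scikit-learn pipeline: K-Means on a four-column score matrix with silhouette analysis used to settle on the number of clusters (three in the study). The column names mirror the point domains mentioned but are assumptions about the data layout.

```python
# Sketch: cluster students on four point domains and pick the cluster count by
# silhouette analysis. Column names are assumed stand-ins for the four domains.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

FEATURES = ["activity_points", "game_points", "project_points", "exam_points"]

def cluster_learning_styles(scores: pd.DataFrame, k_candidates=range(2, 7)):
    X = StandardScaler().fit_transform(scores[FEATURES])
    silhouettes = {
        k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X))
        for k in k_candidates
    }
    best_k = max(silhouettes, key=silhouettes.get)
    out = scores.copy()
    out["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=1).fit_predict(X)
    return out, silhouettes
```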
43. Comparison Between Davies-Bouldin Index and Silhouette Coefficient Evaluation Methods in Retail Store Sales Transaction Data Clusterization Using K-Medoids Algorithm.
- Author
-
Amrulloh, Kholiq, Pudjiantoro, Tacbir Hendro, Sabrina, Puspita Nurul, and Hadiana, Asep Id
- Subjects
RETAIL stores ,MARKETING channels ,MARKETING strategy ,INVENTORY control ,INDUSTRIAL engineering - Abstract
Retail is the business of selling goods or services to consumers in single units or small quantities. It forms part of the distribution channel, playing a vital role in marketing activities and acting as a liaison between the interests of producers and consumers. Based on 2020 retail-store sales transaction data obtained from www.kaggle.com, inventory levels are not proportional to sales: excessive inventory combined with low sales leads to goods accumulating in stores. When the sales cycle of an item slows, its stock should be prepared according to its sales level, which requires grouping the data so that inventory can be scheduled according to purchasing status. This study groups the data using the K-Medoids algorithm, a partitioning clustering method that groups a set of n objects into k clusters. Based on the elbow method, the optimal number of clusters is 2; the clustering process places 320 records in cluster 1 and 765 records in cluster 2. Evaluated with the Davies-Bouldin Index, the resulting clustering scores 0.662748, while the Silhouette Coefficient is 0.276353. [ABSTRACT FROM AUTHOR]
- Published
- 2022
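The evaluation comparison in the entry above can be sketched in a few lines: score the same K-Medoids clustering with both indices, remembering that they point in opposite directions (Davies-Bouldin is better when lower, silhouette when higher). KMedoids is assumed to come from scikit-learn-extra; the cluster count follows the abstract.

```python
# Sketch: evaluate one K-Medoids clustering with the Davies-Bouldin Index
# (lower is better) and the Silhouette Coefficient (higher is better).
# KMedoids is assumed to be provided by the scikit-learn-extra package.
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn_extra.cluster import KMedoids

def evaluate_kmedoids(X: np.ndarray, k: int = 2):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
    return {
        "davies_bouldin": davies_bouldin_score(X, labels),
        "silhouette": silhouette_score(X, labels),
    }
```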
44. Spatial Rough k-Means Algorithm for Unsupervised Multi-spectral Classification
- Author
-
Raj, Aditya, Minz, Sonajharia, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Senjyu, Tomonobu, editor, Mahalle, Parikshit N., editor, Perumal, Thinagaran, editor, and Joshi, Amit, editor
- Published
- 2021
- Full Text
- View/download PDF
45. Identify Elementary Student Distribution Based on Kompetisi Sains Madrasah Data Using Probabilistic Distance Clustering
- Author
-
Yusuf, Ahmad, Wahyudi, Noor, Ulya, Zakiyatul, Ulinnuha, Nurissaidah, Rolliawati, Dwi, Mustofa, Ali, Fauzi, Ahmad, Asyhar, Ahmad Hanif, Kusaeri, Indriyati, Ratna, Novitasari, Dian Candra Rini, Maryunah, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Zhang, Yu-Dong, editor, Senjyu, Tomonoby, editor, SO–IN, Chakchai, editor, and Joshi, Amit, editor
- Published
- 2021
- Full Text
- View/download PDF
46. Research and Application of Precision Marketing Algorithms for ETC Credit Card Based on Telecom Big Data
- Author
-
Tang, Xinyi, Cheng, Chen, Xu, Lexi, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Möller, Sebastian, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zhang, Junjie James, Series Editor, Wang, Yue, editor, Xu, Lexi, editor, Yan, Yufeng, editor, and Zou, Jiaqi, editor
- Published
- 2021
- Full Text
- View/download PDF
47. Effects of Performance Clustering in User Modelling for Learning Style Knowledge Representation
- Author
-
Teoh, Chin-Wei, Ho, Sin-Ban, Dollmat, Khairi Shazwan, Chai, Ian, Mohd-Isa, Wan-Noorshahida, Tan, Chuie-Hong, Teh, Sek-Kit, Raihan, Manzoor Shahida, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Fujita, Hamido, editor, Selamat, Ali, editor, Lin, Jerry Chun-Wei, editor, and Ali, Moonis, editor
- Published
- 2021
- Full Text
- View/download PDF
48. Secure and Evaluable Clustering Based on a Multifunctional and Privacy-Preserving Outsourcing Computation Toolkit
- Author
-
Jialin Li, Penghao Lu, and Xuemin Lin
- Subjects
Privacy-preserving ,secure outsourcing computation ,K-means ,silhouette coefficient ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Although tremendous advances have been made in cloud computing technologies for digital devices, privacy has become a major concern in outsourced computation. Homomorphic encryption has been proposed to preserve data privacy while computational tasks are executed on ciphertext. However, many existing studies support only limited homomorphic functions, which barely satisfy complex computing tasks such as machine learning that demand massive computing resources and a rich set of operations. To address this problem, this paper proposes a novel multifunctional and privacy-preserving outsourcing computation toolkit, which supports several homomorphic computing protocols, including division and power operations on ciphertexts of integers and floating point numbers. Specifically, we first implement a homomorphic conversion protocol between integer and floating point ciphertexts to balance efficiency and feasibility, since high-precision ciphertext operations on floating point numbers incur roughly 100x the computational overhead of integer operations. Second, we implement a homomorphic K-means algorithm on top of the proposed toolkit for clustering and design a homomorphic silhouette coefficient as the evaluation index, thereby providing an informative cluster assessment for local users with limited resources. We then simulate the toolkit's protocols to explore parameter sensitivity in terms of computational efficiency. Finally, we present a security analysis showing that the toolkit leaks no private information to unauthorized parties. Comprehensive experiments further demonstrate the efficiency and utility of the toolkit.
- Published
- 2022
- Full Text
- View/download PDF
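For readers unfamiliar with the metric the toolkit evaluates on ciphertext, the following is a plaintext reference computation of the silhouette coefficient, written from scratch rather than calling a library. It involves no encryption and is not the paper's homomorphic protocol; it only shows what the homomorphic version must reproduce, and it assumes at least two clusters.

```python
# Plaintext reference for the silhouette coefficient (no homomorphic encryption
# here; this is only the metric the toolkit evaluates over ciphertext).
import numpy as np

def silhouette_coefficient(X: np.ndarray, labels: np.ndarray) -> float:
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)      # pairwise distances
    scores = []
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                                       # exclude the point itself
        if not same.any():                                    # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()                              # mean intra-cluster distance
        b = min(dist[i, labels == c].mean()                   # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```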
49. Identification of the distribution village maturation: Village classification using Density-based spatial clustering of applications with noise
- Author
-
Okfalisa Okfalisa, Angraini Angraini, Shella Novi, Hidayati Rusnedy, Lestari Handayani, and Mustakim Mustakim
- Subjects
clustering ,density-based spatial clustering of applications with noise ,python ,silhouette coefficient ,village maturity ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Measuring rural development is not easy, given each village's particular needs and conditions. This study classifies village performance using social, economic, and ecological indices. One thousand five hundred ninety-one villages from the Community and Village Empowerment Office of Riau Province, Indonesia, are grouped into five village maturity classes: very under-developed, under-developed, developing, developed, and independent. Density-based spatial clustering of applications with noise (DBSCAN) is used to mine 13 village attributes, and Python is used to run and evaluate the DBSCAN analysis. The study obtains a silhouette coefficient of 0.8231 for the grouping, indicating good clustering performance. The epsilon and minimum-points values are examined in the DBSCAN evaluation through percentage-split simulation. The grouping can serve as a guideline for governments to distribute rural development subsidies more optimally.
- Published
- 2021
- Full Text
- View/download PDF
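The workflow in the abstract above, scanning DBSCAN's epsilon and minimum-points values and reporting the silhouette coefficient of each grouping, can be sketched as follows with scikit-learn. The attribute matrix and parameter grids are placeholders, not the study's actual values.

```python
# Sketch: scan eps / min_samples for DBSCAN on standardized village attributes
# and report the silhouette coefficient of each valid grouping.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def scan_dbscan(X: np.ndarray, eps_grid=(0.5, 1.0, 1.5), min_samples_grid=(5, 10)):
    X = StandardScaler().fit_transform(X)
    results = []
    for eps in eps_grid:
        for ms in min_samples_grid:
            labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
            mask = labels != -1                   # drop noise points before scoring
            if len(set(labels[mask])) >= 2:       # silhouette needs at least 2 clusters
                results.append((eps, ms, silhouette_score(X[mask], labels[mask])))
    return sorted(results, key=lambda r: -r[2])   # best silhouette first
```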
50. Kombinasi Single Linkage Dengan K-Means Clustering Untuk Pengelompokan Wilayah Desa Kabupaten Pemalang [Combining Single Linkage with K-Means Clustering to Group Villages in Pemalang Regency]
- Author
-
Sintiya Sintiya, Tri Ginanjar Laksana, and Nia Annisa Ferani Tanjung
- Subjects
cluster ,davies boulldin index ,k-means ,silhouette coefficient ,single linkage ,prediction ,Information technology ,T58.5-58.64 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
K-Means is highly dependent on the choice of the initial cluster centers, which affects the quality of the resulting clusters; the number of clusters k also affects the quality of K-Means results. Poverty is mostly experienced by rural communities, as can be seen from the lack of facilities serving community needs in various fields. To avoid such imbalances, development planning is needed that matches the welfare characteristics of the people in each region, so villages must be grouped so that policy-making is well targeted. One commonly used clustering algorithm is K-Means, because it is simple, easy to implement, and able to group large datasets very quickly. However, K-Means is weak in determining the initial cluster centers: random initialization can produce inconsistent clusters. For this reason, K-Means needs to be combined with a hierarchical method for determining the initial cluster centers. This combined approach, called Hierarchical K-Means, uses the hierarchical step to find the initial cluster centers and the partitioning step to obtain the optimal clusters. The hierarchical method used in this study is single linkage. Based on the elbow method, the recommended number of clusters is k = 4. The combination of the single linkage and K-Means algorithms with k = 4 yields a silhouette coefficient of 0.685, which falls in the feasible (appropriate) cluster category, while evaluation with the Davies-Bouldin Index yields a value of 0.577.
- Published
- 2021
- Full Text
- View/download PDF
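The Hierarchical K-Means idea described above can be sketched generically: use single-linkage agglomerative clustering to form k preliminary groups, take their centroids as the initial centers for K-Means, and score the result with the silhouette coefficient and the Davies-Bouldin Index. This is a generic reading of the approach under assumed inputs, not the authors' exact procedure.

```python
# Sketch of "Hierarchical K-Means": single-linkage clusters supply initial
# centers for K-Means; the result is scored with silhouette and Davies-Bouldin.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

def single_linkage_kmeans(X: np.ndarray, k: int = 4):
    pre = AgglomerativeClustering(n_clusters=k, linkage="single").fit_predict(X)
    init_centers = np.vstack([X[pre == c].mean(axis=0) for c in range(k)])
    labels = KMeans(n_clusters=k, init=init_centers, n_init=1).fit_predict(X)
    return labels, silhouette_score(X, labels), davies_bouldin_score(X, labels)
```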