Ashabi, Ardavan (2022) Enhancement of parallel K-means algorithm for clustering big datasets. PhD thesis, Universiti Teknologi Malaysia.
Official URL: http://dms.library.utm.my:8080/vital/access/manage...
Abstract
Big Data encompasses huge amounts of complex data generated in areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, health service providers generate huge amounts of data daily from sources such as health monitoring and medical diagnosis systems. Data mining aims to extract meaningful and valuable patterns from raw data and transform them into information that supports better decision-making. However, Big Data is very complex and voluminous, and traditional data mining methods cannot process and analyze it efficiently. Data clustering, one of the main methods of data mining, eases the extraction of information from each cluster separately. Since the 1960s, the K-means algorithm has been known as one of the most classical techniques of data clustering. Even though there is an extensive literature on improving the efficiency of K-means, traditional K-means still suffers from some weaknesses, especially in dealing with Big Data. Despite many attempts to optimize the K-means algorithm to handle Big Data using techniques such as parallelization, the proposed methods are still unable to cluster big datasets efficiently because they do not improve influential parameters such as the number of clusters and the initial cluster centroids. This study aims to understand the current limitations of the K-means algorithm and to overcome them in order to cluster big datasets from the healthcare domain more efficiently. To develop the optimized extension of the K-means algorithm, a systematic literature review (SLR) was conducted to investigate the current limitations of K-means over Big Data and the existing solutions to them.
Based on the SLR, this study proposed an enhanced parallel version of the K-means clustering algorithm to reduce the execution time of clustering big datasets with minimal negative impact on clustering accuracy. Determining the optimum number of clusters, obtaining suitable initial centroids, and improving the process of parallelization were the three steps of the optimization process. To avoid any random results, the proposed hybrid solution defined the optimum number of clusters by using the elbow method. In addition, the proposed algorithm obtained the ideal initial centroids by utilizing a careful seed selection method, performed K-means with a fuzzy technique to increase the precision of the clustering, and parallelized the clustering process by using the Hadoop platform with optimized Map and Reduce functions to reduce the execution time of the process. The evaluation of the proposed algorithm revealed that the new method performed the clustering process over multiple big datasets with shorter execution time compared to the study’s benchmarks: Apache Mahout K-means, K-means++, and Fuzzy K-means. Also, the results of the three selected cluster validity indices - Silhouette, Dunn, and Davies-Bouldin - verified that there was no negative impact on the quality of the clusters.
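The three steps named in the abstract (choosing the number of clusters, seeding the centroids, and parallelizing the iterations) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: it assumes a K-means++-style rule for the "careful seed selection", simulates the Map and Reduce phases in a single process instead of running on Hadoop, and omits the fuzzy membership step entirely. All function names and parameters (`careful_seeding`, `map_assign`, `reduce_update`, `elbow_curve`, `splits`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def careful_seeding(points, k, rng):
    """K-means++-style seeding (an assumption, not the thesis's exact rule).
    Each new seed is drawn with probability proportional to its squared
    distance from the nearest seed chosen so far.
    Assumes `points` holds at least k distinct points."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def map_assign(chunk, centroids):
    """Map step: for every point in one data split, emit
    (index of nearest centroid, (point, 1))."""
    for p in chunk:
        idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        yield idx, (p, 1)

def reduce_update(pairs, centroids):
    """Reduce step: sum the points and counts per centroid index,
    then average them to obtain the new centroids."""
    sums = {}
    for idx, (p, n) in pairs:
        if idx in sums:
            s, cnt = sums[idx]
            sums[idx] = ([a + b for a, b in zip(s, p)], cnt + n)
        else:
            sums[idx] = (list(p), n)
    updated = []
    for i, c in enumerate(centroids):
        if i in sums:
            s, cnt = sums[i]
            updated.append(tuple(v / cnt for v in s))
        else:
            updated.append(c)  # an empty cluster keeps its old centroid
    return updated

def kmeans(points, k, splits=2, iters=20, seed=0):
    """Run seeded K-means with simulated Map/Reduce rounds.
    Returns the final centroids and the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = careful_seeding(points, k, rng)
    chunks = [points[i::splits] for i in range(splits)]  # stand-in for input splits
    for _ in range(iters):
        pairs = [kv for chunk in chunks for kv in map_assign(chunk, centroids)]
        updated = reduce_update(pairs, centroids)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse

def elbow_curve(points, k_values, **kwargs):
    """Within-cluster sum of squares for each candidate k (elbow method input)."""
    return {k: kmeans(points, k, **kwargs)[1] for k in k_values}
```

To apply the elbow method, one would call `elbow_curve(points, range(1, 10))` and choose the value of k at which the curve stops dropping sharply. In a real Hadoop job, each call to `map_assign` would run on a separate input split and the framework would group the emitted pairs by key before the reduce phase.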
Item Type: | Thesis (PhD) |
---|---|
Uncontrolled Keywords: | Big Data, K-means, Hadoop platform |
Subjects: | T Technology > T Technology (General) |
Divisions: | Razak School of Engineering and Advanced Technology |
ID Code: | 102827 |
Deposited By: | Widya Wahid |
Deposited On: | 24 Sep 2023 03:20 |
Last Modified: | 24 Sep 2023 03:20 |