K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets

Usman, Dauda (2014) K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets. PhD thesis, Universiti Teknologi Malaysia, Faculty of Science.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Data clustering is an unsupervised classification method that aims to create distinct groups of objects, or clusters. Among clustering techniques, K-means is the most widely used. Two issues are prominent in designing a K-means clustering algorithm: the optimal number of clusters and the cluster centers. In most cases the number of clusters is predetermined by the researcher, which leaves the challenge of determining the cluster centers so that scattered points can be grouped properly. If the cluster centers are not chosen correctly, computational complexity is expected to increase, especially for high-dimensional data sets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed. This is achieved either by data standardization or by applying principal component analysis to rescaled data to reduce its dimensionality. Based on the outcome of this preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters, which makes the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm under three standardization methods, namely decimal scaling, z-score and min-max. The results show that z-score performs best, judged by the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate and high dimension, and the method presented in this thesis is observed to give good and promising performance. It is also observed that the total clustering error is reduced significantly, while the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases.
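
The preprocessing pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the synthetic data, the choice of two principal components, and the plain random center initialization are all assumptions made for illustration, and the thesis's hybrid center-initialization method is not reproduced because the abstract does not specify it. The sketch rescales the data with each of the three standardization methods, reduces dimensionality with PCA computed through a singular value decomposition, runs basic K-means (Lloyd's iterations), and records the sum of squared errors.

```python
import numpy as np

def decimal_scaling(X):
    """Divide each column by 10**j, with j the smallest integer giving max|x| < 1."""
    max_abs = np.abs(X).max(axis=0)
    safe = np.where(max_abs > 0, max_abs, 1.0)          # avoid log10(0)
    j = np.where(max_abs > 0, np.floor(np.log10(safe)) + 1, 0)
    return X / 10.0 ** j

def z_score(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(std > 0, std, 1.0)

def min_max(X):
    """Rescale each column linearly to the interval [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def pca_svd(X, n_components):
    """Project centered data onto its leading right singular vectors."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means with random initial centers (assumption: not the hybrid method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:                       # keep old center if cluster is empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sse = float(dists[np.arange(len(X)), labels].sum())
    return labels, centers, sse

# Synthetic data (assumption): three groups, five features on very different scales.
rng = np.random.default_rng(42)
column_scales = np.array([1.0, 10.0, 100.0, 1000.0, 1.0])
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 5)) * column_scales
               for m in (0.0, 5.0, 10.0)])

results = {}
for name, scale in (("decimal scaling", decimal_scaling),
                    ("z-score", z_score),
                    ("min-max", min_max)):
    reduced = pca_svd(scale(X), n_components=2)
    _, _, sse = kmeans(reduced, k=3)
    results[name] = sse
    print(f"{name}: SSE = {sse:.4f}")
```

Because each method maps the data to a different scale, the raw SSE values printed here are not directly comparable across methods without further normalization; the sketch shows only the mechanics of each step, not the thesis's evaluation protocol or its findings.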

Item Type:	Thesis (PhD)
Additional Information:	Thesis (Ph.D (Matematik)) - Universiti Teknologi Malaysia, 2014; Supervisor: Assoc. Prof. Dr. Ismail Mohamad
Subjects:	Q Science > QA Mathematics
Divisions:	Science
ID Code:	77643
Deposited By:	Fazli Masari
Deposited On:	26 Jun 2018 07:37
Last Modified:	26 Jun 2018 07:37
