Universiti Teknologi Malaysia Institutional Repository

The hybrid feature selection k-means method for Arabic webpage classification

Alghamdi, Hanan and Selamat, Ali (2014) The hybrid feature selection k-means method for Arabic webpage classification. Jurnal Teknologi, 70 (5). pp. 73-79. ISSN 0127-9696

Full text not available from this repository.

Official URL: http://dx.doi.org/10.11113/jt.v70.3518

Abstract

The high dimensionality of the features found in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases processing time. Selecting only the relevant features is the best solution. Therefore, in this paper, we propose a model that combines three feature selection methods (chi-squared, mutual information, and term frequency-inverse document frequency) into a hybrid feature selection model (Hybrid-FS) for k-means clustering. This model represents text data in a high structure (consisting of three types of objects, namely, terms, documents, and categories). We evaluate the model on a set of common Arabic online newspapers and assess the effect of using Hybrid-FS with standard k-means clustering. The experimental results show that the proposed method increases purity by 28% and lowers the runtime by 80% compared with the standard k-means algorithm. We conclude that the proposed hybrid feature selection model enhances the accuracy of the k-means algorithm and successfully produces coherent, compact clusters that are well separated when applied to high-dimensional datasets.
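The abstract reports its results in terms of purity. As a minimal sketch of how that measure is commonly computed (each cluster is credited with its majority category, and the credited documents are divided by the total), the following may help; the function name `purity` and the toy data are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Group the true class labels by assigned cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Add up the size of the majority class in each cluster.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, two clusters, two news categories.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "economy", "economy", "economy", "sport"]
print(purity(clusters, labels))
```

A purity of 1.0 means every cluster contains documents from a single category; lower values indicate more mixed clusters.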

Item Type: Article
Uncontrolled Keywords: feature selection, webpage classification
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 62935
Deposited By: Fazli Masari
Deposited On: 03 Oct 2017 04:26
Last Modified: 01 Nov 2017 04:17
