Universiti Teknologi Malaysia Institutional Repository

Arabic script web page language identification using hybrid-KNN method

Selamat, Ali and Subroto, I. M. I. and Ng, Choon Ching (2009) Arabic script web page language identification using hybrid-KNN method. International Journal of Computational Intelligence and Applications, 8 (3). pp. 315-343. ISSN 1469-0268

Full text not available from this repository.

Official URL: http://dx.doi.org/10.1142/S146902680900262X

Abstract

In this paper, we propose hybrid-KNN methods for language identification of Arabic-script web pages. Two crucial problems in text-based language identification among languages that share a script are how to produce reliable features and how to cope with the large number of languages in the world. Specifically, these problems involve feature representation, feature selection, identification performance, retrieval performance, and noise tolerance. We therefore evaluate several methods in this work: k-nearest neighbor (KNN), support vector machine (SVM), backpropagation neural networks (BPNN), hybrid KNN-SVM, and KNN-BPNN, in order to assess the capability of state-of-the-art methods. KNN is prominent in data clustering and data filtering, while SVM and BPNN are well known in supervised classification; we propose hybrid-KNN for noise removal in web page language identification. We use the standard measurements of accuracy, precision, recall, and F1 to evaluate the effectiveness of the proposed hybrid-KNN. In our experiments, BPNN produces precise identification when the given data set is clean. However, as the level of noise in the training data increases, KNN-SVM performs better than KNN-BPNN on the misclassified data, even at a noise level of 50%. The results therefore show that KNN-SVM produces promising identification performance: KNN reduces the noise in the data set, and SVM is reliable for language identification.
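The abstract does not say how the KNN stage removes noise, so the following is only a minimal sketch under one assumption: a training sample is dropped when its label disagrees with the majority label of its k nearest neighbours. The function names, the 2-D feature vectors, and the labels "A" and "B" are hypothetical and are not taken from the paper. The second stage (SVM or BPNN in the paper) is not shown; any classifier would be trained on the filtered samples. A small helper illustrates the precision, recall, and F1 measurements named above.

```python
import math
from collections import Counter


def knn_filter(samples, labels, k=3):
    """Return indices of samples whose label agrees with the majority
    label of their k nearest neighbours (Euclidean distance)."""
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Distance from sample i to every other sample, paired with its label.
        neighbours = sorted(
            (math.dist(x, other), labels[j])
            for j, other in enumerate(samples)
            if j != i
        )[:k]
        majority = Counter(label for _, label in neighbours).most_common(1)[0][0]
        if majority == y:
            kept.append(i)
    return kept


def precision_recall_f1(true, pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical feature vectors: two well-separated clusters, one per language.
samples = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
]
# Index 3 lies in the first cluster but carries the wrong label (noise).
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]

kept = knn_filter(samples, labels, k=3)
clean_samples = [samples[i] for i in kept]  # input for the second-stage classifier
clean_labels = [labels[i] for i in kept]
print(kept)  # every index except the mislabelled sample at index 3
```

With an odd k and two classes the neighbour vote cannot tie; with more classes a tie-breaking rule would have to be chosen, which this sketch leaves open.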

Item Type: Article
Uncontrolled Keywords: Arabic script language identification, Backpropagation neural networks (BPNN), Hybrid-KNN
Subjects: Q Science > QA Mathematics > QA76 Computer software
Divisions: Computer Science and Information System (Formerly known)
ID Code: 13184
Deposited By: Ms Zalinda Shuratman
Deposited On: 22 Jul 2011 02:09
Last Modified: 22 Jul 2011 02:09
