Universiti Teknologi Malaysia Institutional Repository

Language identifications of Arabic script web documents using independent component analysis

Selamat, Ali and Lee, Zhi-Sam (2008) Language identifications of Arabic script web documents using independent component analysis. In: Proceedings - 2nd Asia International Conference on Modelling and Simulation, AMS 2008. Institute of Electrical and Electronics Engineers, New York, 427-432. ISBN 978-076953136-6

Full text not available from this repository.

Official URL: http://dx.doi.org/10.1109/AMS.2008.46

Abstract

We analyze language identification algorithms for web documents written in Arabic script, such as Arabic, Jawi, Persian and Urdu, using independent component analysis (ICA). We use a combination of the entropy term weighting scheme and class based feature (CPBF) vectors as feature selection methods to select the best features of Arabic script web documents for web page language identification. We then input the selected features, based on the identification of latent semantics of user profiles, using singular value decomposition (SVD). The SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topic of each document is independent of the others. We use the information retrieval measures precision, recall and F1 to evaluate the effectiveness of the proposed algorithm. Our experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian and Urdu languages using ICA.
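
The abstract names precision, recall and F1 as the evaluation measures but does not say how they were computed. The following is a minimal sketch of the standard per-class definitions, assuming each document receives exactly one predicted language. The function name `per_language_scores` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_language_scores(true_labels, predicted_labels):
    """Return {language: (precision, recall, f1)} for single-label predictions.

    Illustrative helper only; it is not the authors' evaluation code.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1   # correctly identified document
        else:
            fp[guess] += 1   # document wrongly assigned to `guess`
            fn[truth] += 1   # document of `truth` that was missed

    scores = {}
    for lang in sorted(set(true_labels) | set(predicted_labels)):
        p_den = tp[lang] + fp[lang]
        r_den = tp[lang] + fn[lang]
        precision = tp[lang] / p_den if p_den else 0.0
        recall = tp[lang] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Hypothetical toy labels, invented for illustration (not results from the paper).
true_labels = ["Arabic", "Arabic", "Persian", "Persian", "Urdu", "Urdu"]
predicted_labels = ["Arabic", "Persian", "Persian", "Persian", "Urdu", "Arabic"]
scores = per_language_scores(true_labels, predicted_labels)
```

Scoring each language separately shows which pairs of languages are confused with each other, which a single overall accuracy figure would hide.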

Item Type: Book Section
Additional Information: ISBN: 978-076953136-6; 2nd Asia International Conference on Modelling and Simulation, AMS 2008; Kuala Lumpur; 13 May 2008 through 15 May 2008
Uncontrolled Keywords: alpha particle spectrometers, asset management, feature extraction, hemodynamics, image retrieval, information services, information theory, linguistics, particle spectrometers, query languages, search engines, security of data, separation, singular value decomposition, speech recognition, world wide web
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computer Science and Information System
ID Code: 12612
Deposited By: Liza Porijo
Deposited On: 14 Jun 2011 05:11
Last Modified: 14 Jun 2011 05:11
