Universiti Teknologi Malaysia Institutional Repository

Improving language identification of web page using optimum profile

Ng, C.-C. and Selamat, Ali (2011) Improving language identification of web page using optimum profile. In: Software Engineering and Computer Systems: Second International Conference, ICSECS 2011, Kuantan, Pahang, Malaysia, June 27-29, 2011, Proceedings, Part II. Springer Berlin Heidelberg, Dordrecht, South Holland, pp. 157-166. ISBN 978-364222190-3

Full text not available from this repository.

Official URL: http://dx.doi.org/10.1007/978-3-642-22191-0_14

Abstract

Language is an indispensable tool for human communication, and at present the language that dominates the Internet is English. Language identification is the process of automatically determining which of a predetermined set of languages (e.g., English, Malay, Danish, Estonian, Czech, Slovak) a given piece of content is written in. The ability to identify other languages in relation to English is highly desirable, and the goal of this research is to improve the method used to achieve this. Three methods are studied: distance measurement, the Boolean method, and the proposed method, optimum profile. Initial experiments show that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. We therefore propose optimum profile, which uses N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%.
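The abstract names the distance measurement baseline and the keyword "rank-order statistics" but does not give an algorithm. As a rough illustration only, the following Python sketch shows one common form of rank-order distance between character N-gram profiles (the "out-of-place" measure). It is an assumption that the paper's baseline resembles this; the function names, the underscore word padding, and the parameters n_max and top_k are hypothetical, and the sketch does not implement the proposed optimum profile method.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character N-grams (lengths 1..n_max), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # hypothetical padding to mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; N-grams missing from lang_profile get a fixed penalty."""
    position = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - position[gram]) if gram in position else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify(text, training_texts):
    """Return the language whose profile is closest to the profile of text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training_texts.items()}
    return min(profiles, key=lambda lang: out_of_place_distance(doc, profiles[lang]))
```

Given a dictionary mapping language names to training text, identify(page_text, training_texts) returns the name with the smallest distance. The sketch uses both N-gram frequency (to build the ranking) and N-gram position (the rank itself), which are the two signals the abstract says the optimum profile combines; how the paper combines them is not stated in this record.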

Item Type: Book Section
Uncontrolled Keywords: boolean method, distance measurement, N-grams profile, optimum profile, rank-order statistics
Subjects: Q Science > QA Mathematics
Divisions: Others
ID Code: 29186
Deposited By: Liza Porijo
Deposited On: 25 Feb 2013 07:09
Last Modified: 04 Feb 2017 08:39