Universiti Teknologi Malaysia Institutional Repository

Features extraction for illicit web pages identification using identification component analysis

Lee, Zhi Sam and Maarof, Mohd. Aizaini and Selamat, Ali and Shamsuddin, Siti Mariyam (2007) Features extraction for illicit web pages identification using identification component analysis. In: International Conference on Intelligent and Advanced Systems (ICIAS’07), 2007, Kuala Lumpur.

Full text not available from this repository.


Illicit web content such as pornography, violence and gambling has greatly polluted the minds of immature web users. Pornography is perhaps one of the biggest threats to the healthy mental life of today's children and teenagers. An efficient way to identify illicit web pages is highly desirable. In this paper, we analyze the textual content of web pages in categories such as pornography, gynecology, sex education and general business news using the independent component analysis (ICA) algorithm. For comparison, we establish three similar models: a principal component analysis (PCA) model, an ICA model and a PCA-ICA model. We evaluate the effectiveness of these models using information retrieval measures such as precision, recall, F1 and accuracy. Our experimental results show that the PCA and PCA-ICA models are capable of identifying illicit web pages correctly, with overall performance above 90%. This research should give researchers insight into textual content-based web page categorization.
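
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name binary_metrics and the sample labels are hypothetical, and it assumes a simple two-class setting in which 1 marks an illicit page and 0 marks any other category.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells for the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = illicit page, 0 = any other category.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

With these made-up labels there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four measures come out to 0.75. The figures reported in the paper depend on its own data set and models, which this record does not include.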

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: illicit web pages, component analysis
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computer Science and Information System
ID Code: 13979
Deposited By: Liza Porijo
Deposited On: 16 Aug 2011 09:54
Last Modified: 06 Aug 2017 04:27
