Universiti Teknologi Malaysia Institutional Repository

Design consideration for improved term weighting scheme for pornographic web sites

Salam, H. and Maarof, M. A. and Zainal, A. (2015) Design consideration for improved term weighting scheme for pornographic web sites. In: 4th World Congress on Information and Communication Technologies, WICT 2014, 8-11 Dec 2014, Melaka.

Full text not available from this repository.

Official URL: http://dx.doi.org/10.1007/978-3-319-17398-6_25

Abstract

Illicit Web content filtering is a content-based analysis technique applied to censor inappropriate content on the Internet. Web content filtering can recognize undesirable content by applying AI techniques, linguistic analysis, or machine learning to classify Web pages into a set of predefined categories. However, distinguishing between useful and harmful Web content remains a major research challenge, and failures lead to underblocking and overblocking. Further, extracting the best term representation for the classifier is a major limitation because of the curse of dimensionality: a feature can have the same term frequency (TF) in two or more categories yet carry different semantic meanings, as in illicit pornography versus sex education contexts (the ambiguity problem). In addition, the high dimensionality of features on a Web page, even one of moderate size, makes the term representation values for the classifier more complex, which degrades classification performance. Thus, this research proposes a modified term weighting scheme (TWS) for narrative and discrete Web content in order to increase classification performance. Characteristics of pornographic Web sites were extracted, and the significant characteristics were identified and mapped against term weighting factors. Initial results revealed that other criteria, such as rare features, have the potential to be regarded as significant criteria in a TWS for distinguishing high-similarity Web content.
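The ambiguity described in the abstract can be illustrated with a minimal sketch. It is not the authors' modified TWS, whose details are not given in this record; it uses the standard TF and smoothed IDF formulas on a hypothetical three-document toy corpus. The category labels, tokens, and function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (category label, tokens). Not data from the paper.
docs = [
    ("illicit", ["sex", "explicit", "video", "sex"]),
    ("education", ["sex", "health", "education", "sex"]),
    ("education", ["health", "education", "clinic", "advice"]),
]


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """Standard IDF with +1 offset: terms found in fewer documents score higher."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for _, tokens in corpus:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tf_idf(tf, idf):
    """Combine per-document TF with corpus-level IDF."""
    return {term: weight * idf[term] for term, weight in tf.items()}


idf = inverse_document_frequency(docs)
tf_illicit = term_frequency(docs[0][1])
tf_education = term_frequency(docs[1][1])
weights_illicit = tf_idf(tf_illicit, idf)

# Raw TF cannot separate the two categories on the shared term.
print(tf_illicit["sex"] == tf_education["sex"])
# A rarity factor ranks the rarer term above the shared one.
print(idf["explicit"] > idf["sex"])
```

In this toy setting the shared term has an identical TF in both categories, so TF alone carries no category signal, whereas a rarity-based factor gives more weight to terms confined to one document. This is only one plausible reading of why rare features might matter in a term weighting scheme.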

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: feature selection, term weighting scheme, text categorization
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 59215
Deposited By: Haliza Zainal
Deposited On: 18 Jan 2017 01:50
Last Modified: 07 Apr 2022 01:52