Universiti Teknologi Malaysia Institutional Repository

Web based cross language semantic plagiarism detection

Chow, Kok Kent (2013) Web based cross language semantic plagiarism detection. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computing.

Full text not available from this repository.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Cross-language and semantic plagiarism have recently been on the rise, and many plagiarism detection tools are not capable of detecting such cases. In this research, we propose a new framework that combines summarisation, cross-language plagiarism detection and semantic plagiarism detection. We consider Bahasa Melayu as the input language of the submitted document and English as the language of the possibly plagiarised documents. In this framework, we shorten the query document using a fuzzy swarm-based summarisation approach. With this approach, sentences are chosen based on their importance level, which is determined by five predefined sentence features integrated with fuzzy logic. This technique is chosen for the effectiveness it achieved in previous research. The summarised input documents are translated into English using the Google Translate Application Programming Interface (API) before the words are stemmed and the stop words are removed. The tokenized documents are sent to the Google AJAX Search API to detect similar documents throughout the World Wide Web. We integrate the Stanford Parser and WordNet to determine the level of semantic similarity between the suspected documents and the candidate source documents. The Stanford Parser assigns each term in a sentence to its corresponding role, such as noun, verb or adjective. Based on these roles, we represent each sentence in predicate form, and similarity is measured on those predicates using information content values from the WordNet taxonomy. The testing dataset is built from two sets of Malay documents that were produced using different plagiarism practices.
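The information content idea mentioned above can be illustrated with a minimal sketch. It assumes a Resnik-style measure (the similarity of two concepts is the information content of their most informative common ancestor); the abstract does not state which measure the thesis uses. The taxonomy and the counts below are hypothetical toy values standing in for WordNet and a real corpus.

```python
import math

# Hypothetical toy taxonomy (child -> parent), a stand-in for WordNet hypernyms.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNT = {"entity": 0, "animal": 2, "artifact": 1, "dog": 5, "cat": 4, "car": 8}


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def subtree_count(concept):
    """Occurrences of the concept plus those of all its descendants."""
    return sum(n for c, n in COUNT.items() if concept in ancestors(c))


TOTAL = subtree_count("entity")


def information_content(concept):
    """IC(c) = -log p(c), with p(c) the relative frequency of c's subtree."""
    return -math.log(subtree_count(concept) / TOTAL)


def resnik_similarity(a, b):
    """Information content of the most informative common ancestor of a and b."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)
```

In this toy setting, "dog" and "cat" share the fairly specific ancestor "animal" and so score higher than "dog" and "car", whose only common ancestor is the root.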
The results show that our proposed semantic-based similarity measurement can achieve higher precision, recall and F-measure than the conventional Longest Common Subsequence (LCS) approach, which determines the similarity between sentences from their longest common subsequence, read from left to right, regardless of whether the matching terms are consecutive.
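As an illustration of the LCS baseline and the evaluation measures named above, the following is a minimal sketch rather than the thesis implementation. Whitespace tokenisation, lower-casing and normalising by the longer sentence are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(s1, s2):
    """LCS length divided by the longer sentence length (an assumed normalisation)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    return lcs_length(t1, t2) / max(len(t1), len(t2))


def precision_recall_f1(detected, relevant):
    """Precision, recall and F-measure for detected vs. truly plagiarised items."""
    detected, relevant = set(detected), set(relevant)
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, `lcs_similarity("the cat sat on the mat", "the cat is on a mat")` matches the tokens "the", "cat", "on" and "mat" in order even though they are not adjacent, which is the non-consecutive behaviour the abstract describes.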

Item Type: Thesis (Masters)
Additional Information: Thesis (Sarjana Sains (Sains Komputer)) - Universiti Teknologi Malaysia, 2013; Supervisor: Prof. Dr. Naomie Salim
Subjects: Q Science > QA Mathematics > QA76 Computer software
Divisions: Computing
ID Code: 42237
Deposited By: Haliza Zainal
Deposited On: 09 Oct 2014 09:21
Last Modified: 19 Aug 2020 08:11
