Universiti Teknologi Malaysia Institutional Repository

Cross-lingual sentiment classification using semi-supervised learning

Hajmohammadi, Mohammad Sadegh (2015) Cross-lingual sentiment classification using semi-supervised learning. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computing.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Cross-lingual sentiment classification aims to use annotated sentiment resources in one language for text sentiment classification in another language. Automatic machine translation services are the most commonly used tools for directly projecting information from one language into another. However, differences in term distribution between translated and original documents, translation errors, and differences in the intrinsic structure of documents across languages lead to low sentiment classification performance. Furthermore, because different languages use different linguistic terms, translated documents cannot cover all of the vocabulary that appears in the original documents. The aim of this thesis is to propose an enhanced framework for cross-lingual sentiment classification that overcomes these problems in order to improve classification performance. A combination of active learning and semi-supervised learning, in both single-view and bi-view frameworks, is proposed to incorporate unlabelled data from the target language in order to reduce term distribution divergence. Using bi-view documents can partially alleviate the negative effects of translation errors. Multi-view semi-supervised learning is also used to overcome the problem of low term coverage by employing multiple source languages. Features extracted from multiple source languages can cover more of the vocabulary in the test data and, consequently, more sentiment terms can be used in the classification process. Content similarities between labelled and unlabelled documents are used, through a graph-based semi-supervised learning approach, to incorporate the structure of documents in the target language into the learning process. Evaluation on sentiment data sets in four different languages confirms the effectiveness of the proposed approaches in comparison with well-known baseline classification methods.
The experiments show that incorporating unlabelled data from the target language can effectively improve classification performance. Experimental results also show that using multiple source languages in the multi-view learning model outperforms the other methods. The proposed framework is flexible enough to be applied to any new language and can therefore be used to develop multilingual sentiment analysis systems.
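
The graph-based idea mentioned in the abstract, using content similarity between labelled and unlabelled documents, can be illustrated with a small sketch. This is not the algorithm from the thesis: it is a generic label-propagation toy, and the function names (`bow`, `cosine`, `propagate`), the cosine similarity over bag-of-words counts, the dense graph, and the example sentences are all assumptions made for illustration. In a cross-lingual setting the labelled documents would be translated source-language documents and the unlabelled ones would come from the target language.

```python
# Illustrative sketch only: toy label propagation, not the thesis's method.
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words term counts for a whitespace-tokenised document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def propagate(labelled, unlabelled, iterations=20):
    """Spread sentiment scores (+1 / -1) over a document similarity graph.

    labelled: list of (text, score) pairs with score in {+1.0, -1.0}
    unlabelled: list of texts (assumed here to come from the target language)
    Returns one score per unlabelled text; its sign is the predicted polarity.
    """
    docs = [bow(t) for t, _ in labelled] + [bow(t) for t in unlabelled]
    n_lab = len(labelled)
    scores = [s for _, s in labelled] + [0.0] * len(unlabelled)
    # Dense similarity graph (assumption); larger corpora would need a sparse one.
    sim = [[cosine(a, b) for b in docs] for a in docs]
    for _ in range(iterations):
        new_scores = list(scores)
        # Only unlabelled nodes are updated; labelled scores stay clamped.
        for i in range(n_lab, len(docs)):
            weights = [(sim[i][j], scores[j]) for j in range(len(docs)) if j != i]
            total = sum(w for w, _ in weights)
            if total > 0:
                new_scores[i] = sum(w * s for w, s in weights) / total
        scores = new_scores
    return scores[n_lab:]


labelled = [("great film loved it", 1.0), ("terrible plot hated it", -1.0)]
unlabelled = ["loved the great acting", "hated the terrible ending"]
result = propagate(labelled, unlabelled)
print(result)  # first score is positive, second is negative
```

Each unlabelled document repeatedly takes the similarity-weighted average of its neighbours' scores, so the polarity of labelled documents flows to unlabelled documents that share terms with them.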

Item Type: Thesis (PhD)
Additional Information: Thesis (Ph.D (Sains Komputer)) - Universiti Teknologi Malaysia, 2015; Supervisors: Dr. Roliana Ibrahim, Prof. Dr. Ali Selamat
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 77727
Deposited By: Fazli Masari
Deposited On: 29 Jun 2018 21:45
Last Modified: 29 Jun 2018 21:45
