Universiti Teknologi Malaysia Institutional Repository

Enhanced text stemmer for standard and non-standard word patterns in Malay texts

Kassim, Mohamad Nizam (2020) Enhanced text stemmer for standard and non-standard word patterns in Malay texts. PhD thesis, Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing.


Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Text stemming is a useful language preprocessing tool in information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles from derived words to obtain their root words. Over the past few years, a few text stemmers have been developed for the Malay language, but these text stemmers suffer from various stemming errors. This is due to the difficulty of dealing with the complexity of Malay morphological rules. These text stemmers were developed to stem affixation words only, whereas the Malay language also contains other affixation, reduplication and compounding words. Furthermore, none of these text stemmers was developed to stem social media texts, which comprise non-standard derived words. Therefore, this research study aims to improve the existing text stemmers' capability to stem affixation, reduplication and compounding words while minimising possible stemming errors. Moreover, this research study aims to address the stemming of non-standard derived words on social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that uses the affix removal method and dictionary lookup in a specific order to correctly stem standard and non-standard affixation, reduplication and compounding words in standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method, and text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer.
The proposed enhanced text stemmer achieves an average stemming accuracy of 98.7% against the standard texts and 73.7% against the social media texts. In the text classification applications, it achieves an average accuracy of 85% for sports news classification and 75% for illicit content classification. By comparison, the baseline text stemmer achieves an average stemming accuracy of 63.5% against the standard texts and is unable to stem non-standard derived words in the social media texts. The baseline text stemmer also performs worse in sports news classification and illicit content classification, with average accuracies of 78% and 63% respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy against both standard texts and social media texts, and that the choice of text stemmer influences the performance of text classification applications.
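As an illustration only, the general idea of combining dictionary lookup with affix removal in a fixed order can be sketched in Python as follows. This is not the thesis's algorithm: the root-word dictionary, the affix lists, the function name `stem` and the ordering of steps are all hypothetical and far smaller than anything a real Malay stemmer would need, and non-standard social media forms are not handled.

```python
# Toy root-word dictionary (hypothetical; a real stemmer would load a full lexicon).
ROOTS = {"makan", "main", "baca", "ajar", "buku", "jalan"}

# Small illustrative affix lists (assumed, not taken from the thesis).
# Longer affixes come first so that "mem" is tried before "me".
PREFIXES = ["mem", "men", "ber", "ter", "pel", "me", "di", "ke", "se"]
SUFFIXES = ["kan", "nya", "lah", "an", "i"]


def stem(word):
    """Return the root of `word` if one can be found, else the word unchanged."""
    word = word.lower()

    # Step 1: dictionary lookup, so that root words are never over-stemmed.
    if word in ROOTS:
        return word

    # Step 2: full reduplication such as "buku-buku": stem one half.
    first, sep, second = word.partition("-")
    if sep and first == second:
        return stem(first)

    # Step 3: affix removal, accepting a candidate only if the dictionary
    # confirms it. Try prefix only, then suffix only, then both together.
    prefix_stripped = [word[len(p):] for p in PREFIXES if word.startswith(p)]
    suffix_stripped = [word[:-len(s)] for s in SUFFIXES if word.endswith(s)]
    both_stripped = [
        c[:-len(s)] for c in prefix_stripped for s in SUFFIXES if c.endswith(s)
    ]
    for candidate in prefix_stripped + suffix_stripped + both_stripped:
        if candidate in ROOTS:
            return candidate

    # No rule produced a known root: leave the word as it is.
    return word


if __name__ == "__main__":
    for w in ["makanan", "bermain", "pelajaran", "buku-buku", "komputer"]:
        print(w, "->", stem(w))
```

In this sketch, checking every stripped candidate against the dictionary is what limits over-stemming: a word such as "komputer" is returned unchanged because no affix removal yields a known root.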

Item Type: Thesis (PhD)
Uncontrolled Keywords: proposed enhanced text stemmer, affixation, non-standard derived words
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 98431
Deposited By: Yanti Mohd Shah
Deposited On: 08 Jan 2023 02:12
Last Modified: 08 Jan 2023 02:12
