Universiti Teknologi Malaysia Institutional Repository

Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients

Dzulkefli, Syasya Farina (2022) Data sampling methods on imbalanced datasets for pneumonia detection in covid-19 patients. Masters thesis, Universiti Teknologi Malaysia.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Data classification is an important part of real-world decision support, and it can be severely affected by an imbalanced class distribution in the training data, especially in the medical field. Medical datasets mostly suffer from class imbalance, being composed of a minority of normal samples and a majority of abnormal samples. A current example is the outbreak of the novel coronavirus disease, also called COVID-19, which began in late 2019 and is still ongoing: new variants are discovered from time to time, which can increase the number of cases around the world. Medical staff can identify patients by checking their symptoms, and one of the common COVID-19 symptoms, pneumonia, is investigated in this research. It is important to detect pneumonia quickly and at an early stage to prevent it from becoming more severe. Chest X-ray images can therefore be considered one of the confirmatory approaches, as they are fast to obtain and easily accessible. More generally, diagnosing diseases is a significant application of data analysis in medical science. In this research, data sampling methods are explored and implemented for pneumonia detection on imbalanced datasets. An imbalanced dataset of pneumonia X-ray images is obtained from Kaggle, and different existing data sampling methods, as well as newly proposed methods obtained by combining or modifying existing methods, are implemented to balance the images between the majority and minority classes of the dataset. Once a balanced dataset is obtained, a CNN model is implemented to benchmark detection performance in terms of confusion matrix, precision, accuracy, F1-score and recall for each method, and the results are compared to determine which method gives the highest accuracy in detecting pneumonia.
The best undersampling method is Near Miss with 85.47% accuracy, the best oversampling method is data augmentation with 88.78% accuracy, and the best combination method is SMOTE + Tomek with 83.20% accuracy, compared to 79% accuracy when no sampling method is applied to the imbalanced dataset. These results indicate that data sampling methods can improve the performance of data classification across applications, especially in detecting pneumonia in COVID-19 patients.
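
To illustrate what balancing the majority and minority classes means in practice, the following is a minimal sketch of the two simplest baselines, random oversampling and random undersampling. It is not the thesis implementation and does not reproduce Near Miss, data augmentation, or SMOTE + Tomek; the function names, file names, and class counts are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Group samples into a dict keyed by class label."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates (with replacement) only for the missing count.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Sample without replacement so no image is kept twice.
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data: 6 abnormal (majority) and 2 normal (minority) images.
samples = [f"img_{i}.jpeg" for i in range(8)]
labels = ["PNEUMONIA"] * 6 + ["NORMAL"] * 2

over_x, over_y = random_oversample(samples, labels)
under_x, under_y = random_undersample(samples, labels)

print(Counter(over_y))   # both classes now have 6 samples
print(Counter(under_y))  # both classes now have 2 samples
```

Oversampling keeps every original image and adds duplicates of the minority class, while undersampling discards majority images, so the two approaches trade off dataset size against information loss.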

Item Type: Thesis (Masters)
Uncontrolled Keywords: abnormal samples, COVID-19 symptoms, SMOTE + Tomek
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering
Divisions: Electrical Engineering
ID Code: 102725
Deposited By: Widya Wahid
Deposited On: 20 Sep 2023 03:24
Last Modified: 20 Sep 2023 03:24
