Universiti Teknologi Malaysia Institutional Repository

Support Vector Machine – Recursive Feature Elimination for feature selection on multi-omics lung cancer data

Azman, Nuraina Syaza and A. Samah, Azurah and Lin, Ji Tong and Abdul Majid, Hairudin and Ali Shah, Zuraini and Nies, Hui Wen and Chan, Weng Howe (2023) Support Vector Machine – Recursive Feature Elimination for feature selection on multi-omics lung cancer data. Progress in Microbes and Molecular Biology, 6 (1). pp. 1-23. ISSN 2637-1049

Official URL: http://dx.doi.org/10.36877/pmmb.a0000327

Abstract

Biological data obtained from sequencing technologies is growing exponentially. Multi-omics data is one kind of biological data that is high-dimensional and therefore subject to the curse of dimensionality, which arises when a dataset contains many features or attributes but significantly fewer samples or observations. This study mitigates the curse of dimensionality by applying Support Vector Machine – Recursive Feature Elimination (SVM-RFE) as the feature selection method on a lung cancer (LUSC) multi-omics dataset integrated from three single-omics datasets (genomics, transcriptomics and epigenomics), and assesses the quality of the selected feature subsets using Stacked Denoising Autoencoder (SDAE) and Variational Autoencoder (VAE) deep learning classifiers. The LUSC datasets first undergo pre-processing, which includes checking for missing values, normalization, and removing zero-variance features. The cleaned LUSC datasets are then integrated to form a multi-omics dataset. Feature selection is performed on the LUSC multi-omics data using SVM-RFE to select several optimal feature subsets. The five smallest feature subsets (FS) are then classified with SDAE and VAE neural networks to assess their quality. The results show that all five VAE models obtain an accuracy and AUC score of 1.000, while only two of the five SDAE models (FS 1000 and FS 4000) do so. The other three SDAE models have an AUC score of 0.500, indicating no ability to separate the binary class labels. The study concludes that, for this specific study, a fine-tuned supervised VAE model classifies better than the SDAE models. Additionally, FS 1000 and FS 4000 are the best of the feature subsets selected by the SVM-RFE algorithm: the SDAE and VAE models built with them achieve the best classification results.
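
The feature selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data is a small synthetic stand-in for an omics matrix, the linear SVM is trained with a plain batch subgradient method, and every name and parameter (train_linear_svm, svm_rfe, make_toy_data, lam, lr, iters) is hypothetical. It assumes the common SVM-RFE ranking criterion, the squared weight of each feature in a linear SVM, and removes one feature per round; pre-processing and the SDAE/VAE classifiers are left out.

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, iters=200):
    """Batch subgradient descent on the primal soft-margin objective
    (lam/2)*||w||^2 + mean(max(0, 1 - y * w.x)). No bias term, for brevity."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [lam * wj for wj in w]          # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:                     # hinge loss is active for this sample
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def svm_rfe(X, y, n_select):
    """Return the indices of the n_select surviving features, in ascending order."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        # Retrain on the features that are still in play.
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        # Rank by squared weight and drop the weakest feature.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        del remaining[worst]
    return remaining

def make_toy_data(n_samples=60, n_features=12, informative=(2, 7), seed=0):
    """Synthetic data (an assumption, not LUSC): only `informative` columns carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1        # labels in {-1, +1}
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[informative[0]] = 3.0 * label + rng.gauss(0.0, 0.5)
        row[informative[1]] = -3.0 * label + rng.gauss(0.0, 0.5)
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy_data()
selected = svm_rfe(X, y, n_select=2)
print("selected feature indices:", selected)
```

On real multi-omics data with many thousands of features, removing one feature per round would be slow; a practical run would more likely drop a block of features each round and rely on an optimized library, for example the RFE class in scikit-learn wrapped around a linear SVM.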

Item Type: Article
Uncontrolled Keywords: multi-omics analysis, Stacked Denoising Autoencoder (SDAE), Support Vector Machine – Recursive Feature Elimination (SVM-RFE), Variational Autoencoder (VAE)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 106549
Deposited By: Yanti Mohd Shah
Deposited On: 09 Jul 2024 06:52
Last Modified: 09 Jul 2024 06:52
