Universiti Teknologi Malaysia Institutional Repository

Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine

Machap, Logenthiran (2021) Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine. PhD thesis, Universiti Teknologi Malaysia.


Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have focused on traditional clustering algorithms such as hierarchical clustering. Most of these methods operate on high-dimensional data and partition the genes into disjoint clusters, so that a gene or a condition belongs to only one cluster. This is a drawback, because a gene may contribute to more than one biological process and may therefore belong to multiple clusters. In addition, the centroid in the objective function of network-assisted co-clustering for the identification of cancer subtypes (NCIS) is dragged by outliers, so these outliers receive their own cluster instead of being ignored. This research therefore focuses on improving the NCIS method. Enhanced NCIS (iNCIS) assigns weights to genes based on a gene interaction network and optimizes the sum-squared residue to obtain co-clusters. Next, supervised infinite feature selection with multiple support vector machine (SinfFS-mSVM) is proposed to select significant genes from high-dimensional data, using the classes obtained from iNCIS, and to improve classification accuracy. The effectiveness of iNCIS and SinfFS-mSVM is evaluated on large-scale breast cancer (BRCA) and glioblastoma multiforme (GBM) datasets from The Cancer Genome Atlas (TCGA) project. Five breast cancer gene subtypes and four glioblastoma multiforme gene subtypes were identified. The weighted co-clustering approach in iNCIS provides a unique solution for integrating gene network interactions into the clustering process. The co-clustering Rand Index and F1-measure improve by 54.5% and 33.9%, respectively, for BRCA and by 34.2% and 31.5%, respectively, for GBM. Meanwhile, SinfFS-mSVM selects a significant gene subset with higher classification accuracy. The classification accuracy for the selected gene subset improves by 3.00% for BRCA and 2.99% for GBM. Furthermore, biological validation is conducted on the selected genes from each subtype to support the validity of the results. In conclusion, the empirical study on large-scale cancer datasets shows that iNCIS and SinfFS-mSVM identify cancer gene subtypes and genes with higher clustering and classification accuracy. Future work is needed to integrate more comprehensive gene network information and to select optimal parameters.
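
The Rand Index reported above measures how often two labelings agree on whether a pair of samples belongs together. The following is a minimal sketch of the standard pair-counting definition, not the thesis's own evaluation code; the function name `rand_index` and the toy label lists are illustrative assumptions.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two
    samples in the same cluster, or both put them in different clusters.
    Hypothetical helper for illustration only.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two samples to form a pair")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Toy example: cluster names differ, but the grouping is identical.
reference = ["A", "A", "B", "B"]
predicted = [1, 1, 0, 0]
score = rand_index(reference, predicted)
```

Because the index only compares pairs, it is unaffected by how the clusters are named, which is why it suits comparing discovered subtypes against reference subtype labels.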

Item Type: Thesis (PhD)
Uncontrolled Keywords: tumour heterogeneity, cancer subtypes, The Cancer Genome Atlas (TCGA) project
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computing
ID Code: 96282
Deposited By: Narimah Nawil
Deposited On: 05 Jul 2022 08:07
Last Modified: 05 Jul 2022 08:07
