Universiti Teknologi Malaysia Institutional Repository

Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology

Ismail, Surayati (2010) Sequence comparison latent semantic analysis and support vector machine to detect remote protein homology. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computer Science and Information System.



Remote protein homology detection refers to detecting structural homology between proteins whose sequence similarity is weak. It is important for identifying the function of new proteins, which could assist in curing genetic diseases, designing drugs, and identifying novel enzymes. Researchers have identified two main problems in detecting remote protein homology: detecting homology among hard-to-align proteins, and the high-dimensional protein feature vectors caused by redundant and noisy data. To address these problems, a new computational framework for remote protein homology detection has been developed. The framework begins by extracting the structural similarity of proteins using a highly sensitive structural similarity algorithm, which consists of four steps: split the protein sequences into substrings, calculate similarity using pairwise protein substring alignment, build a guide tree, and extract the high structural similarity using multiple protein sequence alignment. Then, the Latent Semantic Analysis (LSA) algorithm is used to produce feature vectors. The LSA stage consists of three steps: generate protein pattern blocks using the TEIRESIAS algorithm, remove redundant data using the chi-square algorithm, and remove noisy data using the Singular Value Decomposition (SVD) algorithm. Lastly, the framework uses a Support Vector Machine (SVM) to classify the proteins as homologous or non-homologous members. The proposed framework is evaluated on a dataset from the SCOP database version 1.53, and its performance is compared with other methods such as the PSI-BLAST and SVM-Pairwise sequence comparison models, the SAM and HMMER generative models, and the SVM-Fisher and SVM-I-Sites discriminative classifier models in terms of Receiver Operating Characteristic (ROC), Median Rate of False Positives (MRFP), and family-by-family comparison of ROC. The results show that the proposed framework outperforms the other remote protein homology detection methods.
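The two summary metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the remote homology literature, which the abstract itself does not spell out: the ROC score is taken as the area under the ROC curve, and the MRFP as the fraction of negative test sequences that score as high as or higher than the median-scoring positive sequence. The function names and the toy scores are hypothetical and are not taken from the thesis.

```python
from statistics import median


def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def median_rfp(pos_scores, neg_scores):
    """Assumed MRFP definition: fraction of negatives scoring at least
    as high as the median-scoring positive."""
    threshold = median(pos_scores)
    return sum(1 for n in neg_scores if n >= threshold) / len(neg_scores)


# Toy classifier scores for one protein family (illustrative values only).
pos = [0.9, 0.8, 0.4]          # scores of true homologues
neg = [0.85, 0.3, 0.2, 0.1]    # scores of non-homologues

print("ROC :", roc_auc(pos, neg))
print("MRFP:", median_rfp(pos, neg))
```

A higher ROC and a lower MRFP both indicate better separation of homologues from non-homologues. Under these assumed definitions, a family-by-family comparison would repeat the calculation once per family.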

Item Type: Thesis (Masters)
Additional Information: Supervisor: Dr. Muhamad Razib Othman; Thesis (Master of Science (Computer Science)) - Universiti Teknologi Malaysia 2010
Uncontrolled Keywords: proteins, analysis, computer programs
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Computer Science and Information System (Formerly known)
ID Code: 16677
Deposited By: Zalinda Shuratman
Deposited On: 02 Feb 2012 06:34
Last Modified: 17 Sep 2017 08:13
