Universiti Teknologi Malaysia Institutional Repository

Detection of potential viral sequence from next generation sequencing data using convolutional neural network.

Xin, Ying Lim and Jia, Yee Lim and Weng, Howe Chan and Hui, Wen Nies (2023) Detection of potential viral sequence from next generation sequencing data using convolutional neural network. International Journal of Innovative Computing, 13 (1). pp. 13-19. ISSN 2180-4370

Official URL: http://dx.doi.org/10.11113/ijic.v13n1.382

Abstract

Next Generation Sequencing (NGS) is a modern sequencing technology that can determine RNA and DNA sequences faster and at lower cost than earlier methods. The availability of NGS data has sparked numerous efforts in bioinformatics, especially in the study of genetic variation and viral sequence detection. Viral sequence detection is an important step in studying virus-induced diseases. Common methods for detecting viral sequences align each sequence against existing databases. This approach remains limited because the databases may be incomplete, which makes highly divergent viruses difficult to detect. Machine learning and deep learning have therefore been used to uncover the patterns that distinguish viral sequences by learning from the NGS data itself. This study focuses on viral sequence detection using a convolutional neural network (CNN). It aims to investigate how a CNN model can be used to analyse NGS data and to develop a CNN model for detecting potential viral sequences from NGS data. The CNN architecture used in this study is based on an existing design that is divided into two branches, a pattern branch and a frequency branch, which extract different aspects of information from the data and are finally combined into a full model. This study further implements a slightly modified architecture that includes an additional convolution layer and pooling layer. Parameter tuning is then carried out to identify near-optimal parameters for the CNN and to show their impact on performance. The optimized CNN model is evaluated on a dataset of 18,445 DNA sequences. The results show that the full CNN model in this study outperforms the existing model in terms of area under the receiver operating characteristic curve (AUROC), with an improvement of 0.1434.
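The two-branch idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' model. It assumes that the pattern branch summarises a convolution filter's activations with global max pooling (is the motif present anywhere?) and that the frequency branch uses global average pooling (how often does it occur?), which the abstract does not state. The one-hot encoding, the helper names, and the hand-written "ACG" filter are all hypothetical, chosen only for illustration. A trained CNN would learn many such filters from data.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows; unknown bases become all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def conv1d(x, kernel):
    """Slide one filter over the encoded sequence (no padding) and apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(4))
        out.append(max(0.0, s))
    return out

def pattern_branch(activations):
    """Assumed pattern branch: global max pooling (strongest match anywhere)."""
    return max(activations)

def frequency_branch(activations):
    """Assumed frequency branch: global average pooling (mean match strength)."""
    return sum(activations) / len(activations)

# Hypothetical filter that responds to the motif "ACG".
motif_filter = one_hot("ACG")
activations = conv1d(one_hot("TTACGTTACGAA"), motif_filter)

# The full model would concatenate both branch outputs before a classifier.
features = [pattern_branch(activations), frequency_branch(activations)]
```

Here the max-pooled feature reports only that the motif matched perfectly somewhere, while the average-pooled feature also reflects that it matched in two of the ten windows, which is why the two summaries can carry different information.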

Item Type: Article
Uncontrolled Keywords: Next generation sequencing, viral sequence detection, convolutional neural network, bioinformatics.
Subjects: T Technology > T Technology (General) > T58.6-58.62 Management information systems
Divisions: Computer Science and Information System
ID Code: 108489
Deposited By: Muhamad Idham Sulong
Deposited On: 17 Nov 2024 09:34
Last Modified: 17 Nov 2024 09:34
