Options
Enhancement of needleman-wunsch and smith waterman algorithms to reduce search space and computational time for dna sequence aligment application.
Date Issued
2019
Author(s)
Fatimah Noni Muhamad
Abstract
Sequence comparison is regarded as one of the most fundamental problems of computational biology, which is usually solved with a technique known as sequence alignment. Generally, sequence alignment is the process of comparing two sequences to identify similarities and differences between them, then, a typical approach to solve this problem is to find a good and plausible alignment between the two sequences. Indeed, sequence alignment algorithm plays an important role to find the homologous groups of sequences. The data representation in this work is DNA sequence. This work intends to analyze large sequences as well as reducing search space and computation time without compromising the accuracy and efficiency. This is undertaken by evaluating the performance of the enhancement of Needleman-Wunsch algorithm and Smith-Waterman algorithm, which called Needle Program and Smith Program, respectively. Needle Program and Smith Program find the optimal alignment by exploring all possible alignments and choosing the best through the Dot Plot matrix, Scoring matrix and Traceback techniques, which is NP-hard to optimize. By the aid of parallel paradigm by using OpenMP it becomes more feasible to get the fast and accurate result in efficient time. In order to verify the accuracy and the measurements’ consistency for Needle Program and Smith Program by compared with a publicly available implementation for DNA database searchers, EMBOSS (Global) and EMBOSS. The experiments are conducted with 1000 strands test data and analyzed according to the parameters’ measurement (similarity, gap, mismatch and execution time). Based on the experiments it proved that Needle Program and Smith Program are able to align the optimal alignment and improved the performance in searching similarity as well as reduced gaps and mismatch. The analysis result on mean for Needle Program and Smith Program show that 50.52% and 45.7% of similarity, whereas 30.15% and 35.46% of gap, then 10.38% and 10.74% of mismatch, respectively, were in the range of all parameters correlate highly each other in determining accuracy. It is noteworthy to mention that parallelization is very worthwhile for larger sequence (1000 length of DNA sequence and above) because the execution times are hugely reduced by using multiple cores. OpenMP directives able to parallelize the codes and execute it faster, with four cores it can get an execution time of around 70% reduced.