Information categorisation in biological sequence alignments
Sumedha Gunewardena and Peter Jeavons
Abstract
This is a two-part report. In the first part we introduce the reader to biological sequence alignment. We discuss dynamic programming as it is used in sequence alignment, first for two sequences and then as adapted for multiple sequence alignment. We give references to the strategies reported in the literature for adapting the standard dynamic programming algorithm to biological sequences. A short discussion of how alignments are scored follows. Finally, we describe some of the existing sequence alignment tools.
The second part of this report presents a critical analysis of information as it relates to biological sequence alignment. Information about the sequences being aligned forms the basis on which any alignment is built. In its most basic form, this information might quantify how individual residues are scored when aligned with each other, or how gaps are scored when introduced between two residues. Every biological sequence carries some information, implicit if not explicit, about its residues, and this information forms distinguishing markers along the sequence. It can be extracted in many ways, for example from databases of the relevant sequences, from the literature, or from prior processing. It is reasonable to assume that the more sequence information we use in an alignment, the more confident we can be of the resulting alignment, and hence the better the hypotheses we can make about the unknown sequences. The aim of this part of the report is to build a framework for representing this information in a way that allows it to be incorporated dynamically and flexibly into sequence alignments.