Algorithms For Biological Sequence Analysis - Python Practicals

Overview

These practicals cover most of the Durbin string algorithms, from substitution matrices up to building your own multiple sequence aligner. They start with a relatively simple primer on parsing FASTA sequences (P1). The next series deals with computing a log-odds substitution matrix from a multiple sequence alignment; these practicals require regular expressions and a bit of looping to gather the right statistics. Once you have a substitution matrix you can align sequences, and this is what P3 and P4 are about: they take you from the simplest implementations of Needleman and Wunsch or Smith and Waterman up to Myers and Miller's implementation of Hirschberg's algorithm. Gearing up towards multiple aligners, the next practical (P5) deals with parsing phylogenetic trees in Newick format and manipulating these trees. All these results are then combined to implement a multiple sequence aligner (P6) and finally to adapt Nussinov into your own version of the Alifold algorithm (P7).
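To give a flavour of where P1 starts, here is a minimal sketch of a FASTA parser. It is not the template or the solution from the practicals; the function name parse_fasta and the choice to key records by the first word of the header line are assumptions made for illustration.

```python
def parse_fasta(lines):
    """Return a dict mapping each sequence name to its full sequence.

    Assumption (not from the practicals): the name is the first
    whitespace-delimited word after '>' on the header line.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # sequence data may be wrapped over several lines
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical toy input, given as a list of lines
example = [">seq1 human", "ACGT", "TTGA", "", ">seq2", "GGCC"]
sequences = parse_fasta(example)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly in place of the list.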

General note on the exercises

All these exercises work on the same principle. A task is given (for instance, parsing a FASTA file), and a template file is provided along with data. The template file is named x.y.foo.pb.py. It contains missing bits that are indicated as missing #x. It comes with another file named x.y.foo.pb.output. The purpose of the exercise is to complete the pb file so that it generates the output file. To help you, we also provide a compiled solution file named x.y.foo.sol.pyc. This file is compiled Python bytecode that solves the problem; you can run it with the Python interpreter to generate new sample output.
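One way to check your progress is to run your completed template and compare what it prints with the provided output file. The sketch below is only an illustration: the helper names diff_outputs and check_exercise are hypothetical, and it assumes the exercise script takes no command-line arguments and writes its result to standard output, which may not hold for every practical.

```python
import difflib
import subprocess
import sys


def diff_outputs(expected, actual):
    """Return unified-diff lines; an empty list means the texts match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


def check_exercise(script_path, expected_path):
    """Run a completed template and diff its stdout against the expected file.

    Assumption: the script needs no arguments and prints to stdout.
    """
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        check=True,
    )
    with open(expected_path) as handle:
        expected = handle.read()
    return diff_outputs(expected, result.stdout)
```

Calling check_exercise with the paths of your pb.py file and its pb.output file returns an empty list when the two agree; otherwise, each returned line shows what was expected and what your script produced.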