Algorithms For Biological Sequence Analysis - Python Practicals

Overview

These practicals cover most of the Durbin string algorithms all the way from substitution matrices up to building your own multiple sequence aligner. They start with relatively simple primers on the parsing of FASTA sequences (P1). The next series deals with the computation of a log odd substitution matrix using a multiple sequence alignment. These practicals require regular expression analysis and a bit of looping to do the right statistics. Having a substitution matrix makes it possible to align sequences and this is just what P3 and P4 are all about taking you from the simplest implementation of Needlemen and Wunsch or Smith and Waterman up to Myers and Millers Hirshberg implementation. Gearing up towards multiple aligners, the next practical (P5) deals with the parsing of phylogenetic trees in newick format and the manipulation of these trees. All these results are then combined to implement a multiple sequence aligner (P6) and finally to adapt Nussinov into your own version of the Alifold algorithm(P7).

General Note on the following exercises.

All these exercises work on the same principle. A problem file is provided as x.y.fooo<_pb>.py. The purpose of the exercise is then to modify this initial file following the instruction of the exercise. There may be many ways to implement this solution. In order to help you, I have implemented my own solution and I am providing the output of this solution on the reference datasets. This sample output comes as a file named foo_.output and the way it was generated is explicited on the lines ##.

In order to make sure you have everything under control, you can regenerate different output files with the solution script.

The purpose of the exercises is therefore to figure out a way to modify the pb files so as to have them producing the output files. You can measure your progress by doing a diff between your own output and the reference

P1 - Parsing a FASTA file

1.1-Parse a fasta file using regular expressions. Start with the file 1.1.fastafile2seq.pb.py and use Python regular expressions (python_reg_expressions)

## python2 1.1.fastafile2seq.sol.pyc msa1.fasta > 1.1.fastafile2seq.sol.output

1.2-Parse a FASTA file using biopython using the online biopython documentation on input/output (http://biopython.org/wiki/SeqIO):1.2.fastafile2seqBP.pb.py

## python2 ./1.2.fastafile2seqBP.sol.pyc msa1.fasta > 1.2.fastafile2seqBP.sol.output

1.3-adapt your own script to remove the gaps from the FASTA sequences: 1.3.fastafile2ungappedseq.pb.py

## python2 1.3.fastafile2ungappedseq.sol.pyc msa1.fasta > 1.3.fastafile2ungappedseq.sol.output

1.4-translate an RNA sequence into a DNA sequence:1.4.rna2dna.pb.py

## python2 ./1.4.rna2dna.sol.pyc rna.fasta > 1.4.rna2dna.output

P2 - Computing a log odd substitution matrix

2.1-In an MSA in FASTA remove the columns containing more than X% gaps: 2.1.msa2ungappedmsa.pb.py

## python2 2.1.msa2ungappedmsa.sol.pyc msa1.fasta 50 >2.1.msa2ungappedmsa.sol.output

2.2-Turn an MSA into a substitution matrix: 2.2.fastafile2lom.pb.py

## python2 2.2.fastafile2lom.sol.pyc msa1.fasta > 2.2.fastafile2lom.sol.output

2.3-Turn an MSA into a substitution matrix using bioperl: 2.3.fastafile2lomBP.pb.py

## python2 2.3.fastafile2lomBP.sol.pyc msa1.clustal > 2.3.fastafile2lomBP.sol.output

2.4- Make a script able to parse a substitution matrix and make sure your script is correct by comparing its input and its output using diff: 2.4.readmatrix.pb.py

## python2 2.4.readmatrix.sol.pyc matrix.lom >2.4.readmatrix.sol.output

2.5-compare two substitution matrices starting with the script: 2.5.compare_matrix.pb.py and sort the output

## python2 2.5.compare_matrix.sol.pyc matrix.lom matrix2.lom > 2.5.compare_matrix.sol.output

2.6-remove gaps from an msa1 estimate the substitution matrix and compare it with the substitution matrix you had estimated on the full MSA. What are the residues that change most?

## python2 2.6.sol > 2.6.sol.output

P3 - dynamic programming Session 1

3.1-Modify the nw code so as to print the global alignment value: 3.1.nw.pb.py

## python2 3.1.nw.sol.pyc twoseq.fasta >3.1.nw.sol.output

3.2-what is the function of the table tb?

3.3-what is the meaning of the values assigned in tb

3.4-Complete the traceback of 3.4.nw.pb.py

## python2 3.4.nw.sol.pyc twoseq.fasta >3.4.nw.sol.output

3.5-Set the mismatch score to the minimal value it can accept so as to get the following alignment:3.5.nw.pb.py

            THEBIGCAT
            THER---AT

## python2 3.5.nw.sol.pyc catseq.fasta >3.5.nw.sol.output

P4 - dynamic programming Session 2

4.1-Set the mistmatch value to -8, the match value to 1, the gep tp -4 and modify the rest of the code so as to generate the following alignment when aligning catseq.fasta: 4.2.nw.pb.py

            THEBIGCAT
            THER---AT

## python2 4.2.nw.sol.pyc catseq.fasta matrix.lom >4.1.nw.sol.output

4.2-Modify the code of 4.2.nw.pb.py so as to use your own substitution matrix

## python2 4.2.nw.sol.pyc seq3.fasta matrix.lom >4.2.nw.sol.output

4.3-Modify 4.3.nw.pb.py so as to turn it into a sw algorithm

## python2 4.3.nw.sol.pyc seq3.fasta matrix.lom > 4.3.nw.sol.output

4.4-Modify 4.4.mm.sol.pyc so that it uses a substitution matrix

## python2 4.4.mm.sol.pyc seq3.fasta matrix.lom > 4.4.mm.sol.output

P5 - Parsing and manipulating binary trees

5.1-follow the code and figure out a way to print the list of leafs. To help yourself with this task, you can trace the program while parsing tree1.dnd:5.1.readtree.pb.py

## python2 5.1.readtree.sol.pyc tree1.dnd > 5.1.readtree.sol.output

5.2-Parse the tree tree2.dnd and print all the sister species - that is all the pairs of taxa that are only separated by one edge from the same parent node:5.2.readtree.pb.py

## python2 5.2.readtree.sol.pyc tree2.dnd > 5.2.readtree.sol.output

5.3-Starting with 5.3.readtree.pb.py print the list of all the leafs associated with all the nodes. Given a node as a starting point you will need to use a recursive formulation.

## python2 5.3.readtree.sol.pyc tree1.dnd > 5.3.readtree.stdout

5.4-for every node in the tree, display the list of the left children and the list of the right children. Print the splits in a hierarchical way, starting with those close to the leaf and making your way up along the tree. These splits define the order in which sequences will be aligned when doing a progressive alignment: 5.4.readtree.pb.py

## python2 5.4.readtree.sol.pyc tree2.dnd > 5.4.readtree.sol.output

5.5-print in newik the tree that has just been read in nodes. The printing can be recursive (this is the easiest solution). To do so, you should open a parentesis the first time you visit a node, separate the left and the right subtrees with a comma and close the parenthesis when you are done displaying the left group. In recursive terms, it means you open a parenthesis when entering the node, the function then calls itself to explore the left side, and emist a list when coming back from this exploration. Use diff to make sure you did it right: 5.5.readtree.pb.py

## python2 5.5.readtree.sol.pyc tree2.dnd > 5.5.readtree.output

P6 - Implementing your own multiple sequence aligner.

During this session we will implement a complete tree based progressive multiple sequence aligner. This will be achieved by combining many of the pieces of code generated across the rest of the practicals

6.1 - modify 6.1.readseq4prf.pb.py so that it returns a dictionary (hash list) with the sequence name being the key and the sequence itself being the content. Also make sure readseq does not remove the gaps - you want to read profiles later own. This function will be useful to align subset of sequences when doing the multiple sequence alignment and to subgroup sequences while aligning them

## python2 6.1.readseq4prf.sol.pyc prf1.msa > 6.1.readseq4prf.output

6.2 - modify the following script so that it takes as input two profiles and returns the score of the optimal alignment of these two profiles. The score for matching two columns will be set to the average of the matching score while ignoring gaps. Starting from 6.2.msa.pb.py

## python2 6.2.msa.sol.pyc prf1.msa prf2.msa matrix.lom > 6.2.msa.sol.output

6.3 - integrate 6.2.msa.pb.py into 6.3.msa.pb.py (a tree parser) so has to obtain your full multiple sequence aligner

## python2 6.3.msa.sol.pyc tree4.dnd seq4.fasta matrix.lom > 6.3.msa.sol.output

P7 - Nussinov

7.1 An implementation of the nussinov algorithm was downloaded from the internet (7.1.nussinov.pb.py) but it contains some errors that lead to inaccurate predictions. Using the Durbin (p270) fix the faulty part of this implementation

## python2 7.1.nussinov.sol.pyc > 7.1.nussinov.sol.output

7.2 Modify the following nussinov implementation (7.2.nussinov.pb.py) so that it can take into account the co-variation observed on a multiple sequence alignment

## python2 nussinov_sol2.py rna2.fasta > nussinov_sol2.output

P8 - Blast Tutorial

Follow this Tutorial