All these exercises work on the same principle. A problem file is provided as x.y.fooo<_pb>.py. The purpose of the exercise is then to modify this initial file following the instruction of the exercise. There may be many ways to implement this solution. In order to help you, I have implemented my own solution and I am providing the output of this solution on the reference datasets. This sample output comes as a file named foo_
In order to make sure you have everything under control, you can regenerate different output files with the solution script.
The purpose of the exercises is therefore to figure out a way to modify the pb files so as to have them producing the output files. You can measure your progress by doing a diff between your own output and the reference
1.1-Parse a fasta file using regular expressions. Start with the file 1.1.fastafile2seq.pb.py and use Python regular expressions (python_reg_expressions)
## python 1.1.fastafile2seq.sol.pyc msa1.fasta > 1.1.fastafile2seq.sol.output
## python ./1.2.fastafile2seqBP.sol.pyc msa1.fasta > 1.2.fastafile2seqBP.sol.output
## python 1.3.fastafile2ungappedseq.sol.pyc msa1.fasta > 1.3.fastafile2ungappedseq.sol.output
## python ./1.4.rna2dna.sol.pyc rna.fasta > 1.4.rna2dna.sol.output
2.1-In an MSA in FASTA remove the columns containing more than X% gaps: 2.1.msa2ungappedmsa.pb.py
## python 2.1.msa2ungappedmsa.sol.pyc msa1.fasta 50 >2.1.msa2ungappedmsa.sol.output
2.2-Turn an MSA into a substitution matrix: 2.2.fastafile2lom.pb.py
## python 2.2.fastafile2lom.sol.pyc msa1.fasta > 2.2.fastafile2lom.sol.output
2.3-Turn an MSA into a substitution matrix using bioperl: 2.3.fastafile2lomBP.pb.py
## python 2.3.fastafile2lomBP.sol.pyc msa1.clustal > 2.3.fastafile2lomBP.sol.output
2.4- Make a script able to parse a substitution matrix and make sure your script is correct by comparing its input and its output using diff: 2.4.readmatrix.pb.py
## python 2.4.readmatrix.sol.pyc matrix.lom >2.4.readmatrix.sol.output
2.5-compare two substitution matrices starting with the script: 2.5.compare_matrix.pb.py and sort the output
## python 2.5.compare_matrix.sol.pyc matrix.lom matrix2.lom > 2.5.compare_matrix.sol.output
2.6-remove gaps from an msa1 estimate the substitution matrix and compare it with the substitution matrix you had estimated on the full MSA. What are the residues that change most?
## python 2.2.fastafile2lom.sol.pyc 2.1.msa2ungappedmsa.sol.output > 2.6.sol.output
3.1-Modify the nw code so as to print the global alignment value: 3.1.nw.pb.py
## python 3.1.nw.sol.pyc twoseq.fasta >3.1.nw.sol.output
3.2-what is the function of the table tb?
3.3-what is the meaning of the values assigned in tb
3.4-Complete the traceback of 3.4.nw.pb.py
## python 3.4.nw.sol.pyc twoseq.fasta >3.4.nw.sol.output
3.5-Set the mismatch score to the minimal value it can accept so as to get the following alignment:3.5.nw.pb.py
THEBIGCAT THER---AT
## python 3.5.nw.sol.pyc catseq.fasta >3.5.nw.sol.output
4.1-Set the mistmatch value to -8, the match value to 1, the gep to -4 and modify the rest of the code so as to generate the following alignment when aligning catseq.fasta: 4.2.nw.pb.py
THEBIGCAT THE---RAT
## python 4.1.nw.sol.pyc catseq.fasta >4.1.nw.sol.output
4.2-Modify the code of 4.2.nw.pb.py so as to use your own substitution matrix
## python 4.2.nw.sol.pyc seq3.fasta matrix.lom >4.2.nw.sol.output
4.3-Modify 4.3.nw.pb.py so as to turn it into a sw algorithm
## python 4.3.nw.sol.pyc seq3.fasta matrix.lom > 4.3.nw.sol.output
4.4-Modify 4.4.mm.pb.py so that it uses a substitution matrix
## python 4.4.mm.sol.pyc seq3.fasta matrix.lom > 4.4.mm.sol.output
5.1-follow the code and figure out a way to print the list of leafs. To help yourself with this task, you can trace the program while parsing tree1.dnd:5.1.readtree.pb.py
## python 5.1.readtree.sol.pyc tree1.dnd > 5.1.readtree.sol.output
5.2-Parse the tree tree2.dnd and print all the sister species - that is all the pairs of taxa that are only separated by one edge from the same parent node:5.2.readtree.pb.py
## python 5.2.readtree.sol.pyc tree2.dnd > 5.2.readtree.sol.output
5.3-Starting with 5.3.readtree.pb.py print the list of all the leafs associated with all the nodes. Given a node as a starting point you will need to use a recursive formulation.
## python 5.3.readtree.sol.pyc tree1.dnd > 5.3.readtree.stdout
5.4-for every node in the tree, display the list of the left children and the list of the right children. Print the splits in a hierarchical way, starting with those close to the leaf and making your way up along the tree. These splits define the order in which sequences will be aligned when doing a progressive alignment: 5.4.readtree.pb.py
## python 5.4.readtree.sol.pyc tree2.dnd > 5.4.readtree.sol.output
5.5-print in newik the tree that has just been read in nodes. The printing can be recursive (this is the easiest solution). To do so, you should open a parenthesis the first time you visit a node, separate the left and the right subtrees with a comma and close the parenthesis when you are done displaying the left group. In recursive terms, it means you open a parenthesis when entering the node, the function then calls itself to explore the left side, and emits a list when coming back from this exploration. Use diff to make sure you did it right: 5.5.readtree.pb.py
## python 5.5.readtree.sol.pyc tree2.dnd > 5.5.readtree.output
During this session we will implement a complete tree based progressive multiple sequence aligner. This will be achieved by combining many of the pieces of code generated across the rest of the practicals
6.1 - modify 6.1.readseq4prf.pb.py so that it returns a dictionary (hash list) with the sequence name being the key and the sequence itself being the content. Also make sure readseq does not remove the gaps - you want to read profiles later own. This function will be useful to align subset of sequences when doing the multiple sequence alignment and to subgroup sequences while aligning them
## python 6.1.readseq4prf.sol.pyc prf1.msa > 6.1.readseq4prf.output
6.2 - modify the following script so that it takes as input two profiles and returns the score of the optimal alignment of these two profiles. The score for matching two columns will be set to the average of the matching score while ignoring gaps. Starting from 6.2.msa.pb.py
## python 6.2.msa.sol.pyc prf1.msa prf2.msa matrix.lom > 6.2.msa.sol.output
6.3 - integrate 6.2.msa.pb.py into 6.3.msa.pb.py (a tree parser) so has to obtain your full multiple sequence aligner
## python 6.3.msa.sol.pyc tree4.dnd seq4.fasta matrix.lom > 6.3.msa.sol.output
7.1 An implementation of the nussinov algorithm was downloaded from the internet (7.1.nussinov.pb.py) but it contains some errors that lead to inaccurate predictions. Using the Durbin (p270) fix the faulty part of this implementation
## python 7.1.nussinov.sol.pyc > 7.1.nussinov.sol.output
7.2 Modify the following nussinov implementation (7.2.nussinov.pb.py) so that it can take into account the co-variation observed on a multiple sequence alignment
## python 7.2.nussinov.sol.pyc rna2.fasta > 7.2.nussinov.sol.output
Follow this Tutorial