A-
a) The names given here are SwissProt identifiers. You can align these sequences
by writing the following list file (name: stathmin3.list):
***************cut after this line****************
swissprot:SCG1_HUMAN
swissprot:SCGA_XENLA
swissprot:STHM_MOUSE
***************cut before this line****************
b) Use this file with pileup:
pileup @stathmin3.list
c) pileup will output two files:
stathmin3.msf, which contains your multiple sequence alignment;
pileup.figure, which contains the guide tree used by pileup.
To produce a PostScript file from the figure file, use the program figure,
then use ghostscript to look at the output file (figure.eps).
B-
The phylogeny is not what one would expect. Human should be more closely
related to mouse than to Xenopus, yet the tree does not show this. The reason
for this wrong phylogeny is that although the proteins are homologous, they
are not orthologous. To gain a better understanding of what is going on, one
solution is to add new sequences.
C-
a) Search SwissProt with SCG1_HUMAN using blast with default parameters:
blast swissprot:scg1_human
b) From the blast output, keep only the most significant hits (< 10E-20):
sp|Q930MAPP.|SCG1_HUMAN (SCG10) SCG10 PROTEIN (SUPERIOR CERV...  885  2.5e-89  1
sp|P55821|SCG1_MOUSE (SCG10 OR SCGN10) SCG10 PROTEIN (SUP...  877  1.7e-88  1
sp|P21818|SCG1_RAT (SCG10) SCG10 PROTEIN (SUPERIOR CERV...  863  5.3e-87  1
sp|Q09001|SCGA_XENLA SCG10 PROTEIN HOMOLOG A (CLONE SC15)...  793  1.MAPP.-79  1
sp|Q09002|SCGB_XENLA SCG10 PROTEIN HOMOLOG A (CLONE SC1MAPP....  788  MAPP.7e-79  1
sp|Q0900MAPP.XB3_XENLA (XB3) STATHMIN-LIKE PROTEIN XB3. [XE...  507  2.8e-MAPP.  1
sp|P5MAPP.27|STHM_MOUSE (LAP18 OR PR22 OR LAG) STATHMIN (PHO...  MAPP.7  MAPP.2e-MAPP.  1
sp|P13668|STHM_RAT (LAP18) STATHMIN (PHOSPHOPROTEIN P19...  MAPP.7  MAPP.2e-MAPP.  1
sp|P169MAPP.|STHM_HUMAN (LAP18 OR OP18) STATHMIN (PHOSPHOPRO...  MAPP.6  5.MAPP.-MAPP.  1
sp|P31395|STHM_CHICK (LAP18) STATHMIN. [GALLUS GALLUS]  MAPP.2  1.MAPP.-MAPP.  1
sp|Q09006|STH1_XENLA STATHMIN (CLONE XO35). [XENOPUS LAEVIS]  MAPP.7  MAPP.8e-MAPP.  1
sp|Q09005|STH2_XENLA STATHMIN (CLONE XO20) (FRAGMENT). [X...  236  1.5e-20  1
c) Write the list file for pileup (name it stathmin12.list):
***************cut after this line****************
swissprot:SCG1_HUMAN
swissprot:SCG1_MOUSE
swissprot:SCG1_RAT
swissprot:SCGA_XENLA
swissprot:SCGB_XENLA
swissprot:XB3_XENLA
swissprot:STHM_MOUSE
swissprot:STHM_RAT
swissprot:STHM_HUMAN
swissprot:STHM_CHICK
swissprot:STH1_XENLA
swissprot:STH2_XENLA
***************cut before this line****************
d) Run pileup.
e) This will generate the two files stathmin12.msf and pileup.figure,
from which you can generate a tree as shown in A, step c.
D-
The tree shows that we have at least two groups of stathmins. Within each group,
the genes may be orthologous. However, since the sequences are very similar,
their distances are very small, and it is hard to establish a reliable
phylogeny. For instance, we can see that in each subgroup the relationship
between human, rat and mouse is not correctly depicted.
One possible solution is to obtain the DNA sequences encoding these
proteins, which may contain more information, since DNA sequences diverge
faster than protein sequences.
E-
a) Use fetch to obtain the full SwissProt entry of each protein sequence.
Use the previous list file (stathmin12.list) as the input to fetch.
b) The information regarding the possible nucleotide sequence of a protein
is in the DR field of the SwissProt entry. Here we will only use the EMBL
entries. Use for instance:
grep DR *.swiss_rel | grep EMBL >stathmin12_nuc.list
The first field following EMBL is the EMBL accession number, which you can
use with fetch to retrieve the entry. Some sequences link to several
EMBL entries. In this case you may have to look at each entry and try
to choose the one containing the full cDNA with CDS annotation. With these
sequences there are two problems:
1- the scg1_rat nucleotide sequence is not available. In fact it was never
submitted by the authors.
2- the scg1_mouse entry is incorrectly annotated in EMBL and the nucleotide
sequence cannot be used.
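The grep step in b) can also be sketched in Python. This is only an illustration, assuming DR lines of the form "DR   EMBL; accession; ...": the function name embl_accessions and the sample entry are hypothetical and do not correspond to a real record.

```python
import re

def embl_accessions(entry_text):
    """Return the EMBL accession numbers found in the DR lines of one entry."""
    accessions = []
    for line in entry_text.splitlines():
        # Assumed line layout: "DR   EMBL; <accession>; <other fields>"
        match = re.match(r"DR\s+EMBL;\s*([A-Za-z0-9]+)\s*;", line)
        if match:
            accessions.append(match.group(1))
    return accessions

# Hypothetical entry fragment, for illustration only (not a real record).
sample = """ID   EXAMPLE_HUMAN
DR   EMBL; X00001; CAA00001.1; -.
DR   EMBL; X00002; CAA00002.1; -.
DR   PIR; A00001; A00001.
"""
print(embl_accessions(sample))
```

When an entry returns several accessions, you still have to inspect each EMBL record by hand to find the one carrying the full cDNA and its CDS annotation.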
c) We cannot simply align the nucleotide sequences, because we want to
restrict our alignments to the coding regions. In order to do so, we need
to access this information in the EMBL flat file. Therefore we will need
to fetch these sequences individually.
To do so, we could edit the grep output and turn it into a list file.
However, beware that fetch names its output by default
after the EMBL accession number. This can be confusing (especially
when looking at the tree). To avoid this, fetch comes with
a -OUT=file_name option that makes it possible to specify the name of
the output.
The easiest way to retrieve our sequences is to write the following
small script (name this file get_nucleotide.script):
***************cut after this line****************
fetch embl:S8202MAPP. -OUT=scg1_human.nuc
fetch embl:X71MAPP.3 -OUT=scga_xenla.nuc
fetch embl:X71MAPP.MAPP. -OUT=scgb_xenla.nuc
fetch embl:X71MAPP.1 -OUT=sth1_xenla.nuc
fetch embl:X71MAPP.2 -OUT=sth2_xenla.nuc
fetch embl:X678MAPP. -OUT=sthm_chick.nuc
fetch embl:J0MAPP.91 -OUT=sthm_human.nuc
fetch embl:X9MAPP.15 -OUT=sthm_mouse.nuc
fetch embl:J0MAPP.79 -OUT=sthm_rat.nuc
fetch embl:X71MAPP.5 -OUT=xb3_xenla.nuc
***************cut before this line****************
-OUT= gives the name of the file in which you want the EMBL sequence
to be stored.
To run the script:
chmod u+x get_nucleotide.script
./get_nucleotide.script
d) What we need to do now is write a list file for pileup that
contains the names of the sequences and the portion of each sequence that
codes for the protein. This is the CDS annotation in the EMBL
flat file. We can collect it with:
grep FT *.nuc | grep "CDS " >! stathmin12_nuc.list
The resulting file can then be manually edited and turned into the definitive
list file:
***************cut after this line****************
scg1_human.nuc Begin:29 End:568
scga_xenla.nuc Begin:35 End:571
scgb_xenla.nuc Begin:89 End:625
sth1_xenla.nuc Begin:15 End:MAPP.2
sth2_xenla.nuc Begin:1 End:216
sthm_chick.nuc Begin:MAPP. End:MAPP.6
sthm_human.nuc Begin:10MAPP. End:553
sthm_mouse.nuc Begin:1 End:MAPP.0
sthm_rat.nuc Begin:169 End:618
xb3_xenla.nuc Begin:68 End:625
***************cut before this line****************
on which pileup can be run:
pileup @stathmin12_nuc.list
in order to produce the msf file and the figure file. The eps file is then
produced from the figure file using figure.
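The manual editing in step d) can be partly automated. The sketch below is only an illustration: the function name cds_list_line is hypothetical, and it assumes a feature line of the form "FT   CDS   start..end". It handles only a simple start..end location and returns None for anything else (join, complement), which should then be inspected by hand.

```python
import re

def cds_list_line(file_name, embl_text):
    """Build one list-file line ("name Begin:x End:y") from the first simple CDS."""
    for line in embl_text.splitlines():
        # Accept only a plain location such as "29..568" (optionally "<1..>216").
        match = re.match(r"FT\s+CDS\s+<?(\d+)\.\.>?(\d+)\s*$", line)
        if match:
            return f"{file_name} Begin:{match.group(1)} End:{match.group(2)}"
    return None  # no simple CDS location found

# Hypothetical flat-file fragment; the CDS coordinates are those used for
# scg1_human.nuc in the list file above, the source line is made up.
fragment = "FT   source          1..700\nFT   CDS             29..568\n"
print(cds_list_line("scg1_human.nuc", fragment))
```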
F-
The trees we have produced so far were built by pileup merely to
guide the progressive alignment. THEY ARE NOT phylogenetic trees. They only
use distances computed from pairwise alignments.
Good distances are critical for computing a good tree, and good distances
can only be measured on a multiple alignment, where pairwise alignments
are bound to be more accurate. Furthermore, with DNA, distances should be
computed using more complex evolutionary models than with proteins (there
are only four symbols, which means that the information content is
lower).
In order to compute a tree one first needs to compute distances. For
instance, using the multiple sequence alignment generated in E, we can use
the program distances with the Kimura two-parameter model (more accurate
for closely related DNA sequences):
distances stathmin12_nuc.msf{*}
The resulting distance file can then be used in growtree:
growtree stathmin12_nuc.distances
using the option UPGMA (2). The output, figure.eps, is a phylogenetic tree.
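To make the distance step more concrete, here is a minimal sketch of the standard Kimura two-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of transitions and Q the proportion of transversions between two aligned sequences. It is not taken from the distances program, whose implementation may differ in details such as gap handling; the function name k2p_distance is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A <-> G and C <-> T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent (saturation).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# One transition in ten compared positions.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

Applying such a function to every pair of sequences in the alignment gives the kind of distance matrix that a method such as UPGMA clusters into a tree.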