A-

a) The names given here are SwissProt identifiers. You can align these sequences by writing the following list file (name it stathmin3.list):

***************cut after this line****************
swissprot:SCG1_HUMAN
swissprot:SCGA_XENLA
swissprot:STHM_MOUSE
***************cut before this line****************

b) Use this file with pileup:

pileup @stathmin3.list

c) pileup will output two files:
stathmin3.msf, which contains your multiple sequence alignment.
pileup.figure, which contains the guide tree used by pileup.

To produce a postscript file of the figure file, use the program figure:

figure pileup.figure

and use ghostscript to look at the output file (figure.eps).

B-

The phylogeny is not what one would expect: as species, human is more closely related to mouse than to Xenopus, yet the tree does not reflect this. The reason for this wrong phylogeny is that, although the proteins are homologous, they are not orthologous. To gain a better understanding of what is going on, one solution is to add new sequences.

C-

a) Search SwissProt with SCG1_HUMAN using blast with default parameters:

blast swissprot:scg1_human

b) From the blast output, keep only the most significant hits (<10E-20):

sp|Q930MAPP.|SCG1_HUMAN (SCG10) SCG10 PROTEIN (SUPERIOR CERV...       885 2.5e-89   1
sp|P55821|SCG1_MOUSE (SCG10 OR SCGN10) SCG10 PROTEIN (SUP...          877 1.7e-88   1
sp|P21818|SCG1_RAT   (SCG10) SCG10 PROTEIN (SUPERIOR CERV...             863 5.3e-87   1
sp|Q09001|SCGA_XENLA SCG10 PROTEIN HOMOLOG A (CLONE SC15)...     793 1.MAPP.-79   1
sp|Q09002|SCGB_XENLA SCG10 PROTEIN HOMOLOG A (CLONE SC1MAPP....      788 MAPP.7e-79   1
sp|Q0900MAPP.XB3_XENLA (XB3) STATHMIN-LIKE PROTEIN XB3. [XE...           507 2.8e-MAPP.   1
sp|P5MAPP.27|STHM_MOUSE (LAP18 OR PR22 OR LAG) STATHMIN (PHO...         MAPP.7 MAPP.2e-MAPP.   1
sp|P13668|STHM_RAT   (LAP18) STATHMIN (PHOSPHOPROTEIN P19...           MAPP.7 MAPP.2e-MAPP.   1
sp|P169MAPP.|STHM_HUMAN (LAP18 OR OP18) STATHMIN (PHOSPHOPRO...     MAPP.6 5.MAPP.-MAPP.   1
sp|P31395|STHM_CHICK (LAP18) STATHMIN. [GALLUS GALLUS]                    MAPP.2 1.MAPP.-MAPP.   1
sp|Q09006|STH1_XENLA STATHMIN (CLONE XO35). [XENOPUS LAEVIS]     MAPP.7 MAPP.8e-MAPP.   1
sp|Q09005|STH2_XENLA STATHMIN (CLONE XO20) (FRAGMENT). [X...         236 1.5e-20   1

c) Write the list file for pileup (name it stathmin12.list):

***************cut after this line****************
swissprot:SCG1_HUMAN
swissprot:SCG1_MOUSE
swissprot:SCG1_RAT
swissprot:SCGA_XENLA
swissprot:SCGB_XENLA
swissprot:XB3_XENLA
swissprot:STHM_MOUSE
swissprot:STHM_RAT
swissprot:STHM_HUMAN
swissprot:STHM_CHICK
swissprot:STH1_XENLA
swissprot:STH2_XENLA
***************cut before this line****************
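Steps b and c can also be scripted. The following is only a minimal sketch, not part of the GCG package: it assumes the summary lines look like the legible ones shown above (identifier in the first field, probability in the second-to-last field), and the function names and the weak demo hit are made up for illustration. Note that 10E-20 is read literally here, i.e. as 1e-19.

```python
# Legible summary lines copied from the blast output above, plus one
# made-up weak hit (DEMO_HUMAN) to show the filtering.
SUMMARY = """\
sp|P55821|SCG1_MOUSE (SCG10 OR SCGN10) SCG10 PROTEIN (SUP...          877 1.7e-88   1
sp|P21818|SCG1_RAT   (SCG10) SCG10 PROTEIN (SUPERIOR CERV...             863 5.3e-87   1
sp|Q09005|STH2_XENLA STATHMIN (CLONE XO20) (FRAGMENT). [X...         236 1.5e-20   1
sp|X00000|DEMO_HUMAN HYPOTHETICAL WEAK HIT (NOT A REAL ENTRY)         90 3.0e-05   1
"""


def significant_ids(text, cutoff=10e-20):
    """Return the entry names of hits whose probability is below cutoff."""
    ids = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith("sp|"):
            continue  # not a summary line
        try:
            probability = float(fields[-2])
        except ValueError:
            continue  # unreadable value: leave this line for manual inspection
        if probability < cutoff:
            ids.append(fields[0].split("|")[-1])
    return ids


def to_list_file(ids, database="swissprot"):
    """Format entry names as the lines of a list file."""
    return "".join(f"{database}:{name}\n" for name in ids)


if __name__ == "__main__":
    print(to_list_file(significant_ids(SUMMARY)), end="")
```

The printed lines can be pasted between the cut markers of a list file. Lines whose probability cannot be read are skipped on purpose, so they still have to be checked by hand.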

d) Run pileup:

pileup @stathmin12.list

e) This will generate the two files stathmin12.msf and pileup.figure, from which you can generate a tree as shown in step c of section A.

D-

The tree shows that we have at least two groups of stathmin. Within each group, the genes may be orthologous. However, since the sequences are very similar, their distances are very small, and it is hard to establish a reliable phylogeny. For instance, in each subgroup the relationship between human, rat and mouse is not correctly depicted.
One possible solution is to obtain the DNA sequences encoding these proteins, which may contain more information, since DNA sequences diverge faster than protein sequences.

E-

a) Use fetch to obtain the full SwissProt entry of each protein sequence. Use the previous list file (stathmin12.list) and run it with fetch:

fetch @stathmin12.list

b) The information regarding the possible nucleotide sequence of a protein is in the DR field of the SwissProt entry. Here we will only use the EMBL entries. Use for instance:

grep DR *.swiss_rel | grep EMBL >stathmin12_nuc.list

The first field following EMBL is the EMBL accession number, which you can use with fetch to retrieve the entry. Some sequences link to several EMBL entries. In this case you may have to look at each entry and try to choose the one containing the full cDNA with CDS annotation. With these sequences there are two problems:
1- the scg1_rat nucleotide sequence is not available. In fact it was never submitted by the authors.
2- the scg1_mouse entry is incorrectly annotated in EMBL and the nucleotide sequence cannot be used.
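Reading the accession numbers off the DR lines can be sketched in a few lines of Python. This is an illustration only: it assumes DR lines of the form "DR   EMBL; accession; ...", and the sample entry, its accession numbers and the function name are all made up.

```python
# A made-up fragment of a SwissProt entry; the accession numbers are
# placeholders, not real EMBL entries.
SAMPLE_ENTRY = """\
ID   DEMO_HUMAN     STANDARD;      PRT;   100 AA.
DR   EMBL; X00001; CAA00001.1; -.
DR   EMBL; X00002; CAA00002.1; -.
DR   PROSITE; PS00000; DEMO_PATTERN; 1.
"""


def embl_accessions(entry_text):
    """Return the EMBL accession numbers listed in the DR lines of an entry."""
    accessions = []
    for line in entry_text.splitlines():
        if not line.startswith("DR"):
            continue
        # Drop the line code and the final full stop, then split the fields.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if len(fields) >= 2 and fields[0] == "EMBL":
            accessions.append(fields[1])
    return accessions


if __name__ == "__main__":
    for accession in embl_accessions(SAMPLE_ENTRY):
        print(f"embl:{accession}")
```

When an entry yields several accession numbers, as in this sample, the choice between them still has to be made by looking at each EMBL entry.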

c) We cannot simply align the nucleotide sequences, because we want to restrict our alignments to the coding regions. To do so, we need to access this information in the EMBL flat file. Therefore we will need to fetch these sequences individually.

We could edit the grep output and turn it into a list file. However, beware that by default fetch names its output after the EMBL accession number. This can be confusing for us (especially when looking at the tree). To avoid this, fetch comes with a -OUT=file_name option that makes it possible to indicate the name of the output.
The easiest way to retrieve our sequences is to write the following small script (name this file get_nucleotide.script):

***************cut after this line****************
fetch embl:S8202MAPP. -OUT=scg1_human.nuc
fetch embl:X71MAPP.3 -OUT=scga_xenla.nuc
fetch embl:X71MAPP.MAPP. -OUT=scgb_xenla.nuc
fetch embl:X71MAPP.1 -OUT=sth1_xenla.nuc
fetch embl:X71MAPP.2 -OUT=sth2_xenla.nuc
fetch embl:X678MAPP. -OUT=sthm_chick.nuc
fetch embl:J0MAPP.91 -OUT=sthm_human.nuc
fetch embl:X9MAPP.15 -OUT=sthm_mouse.nuc
fetch embl:J0MAPP.79 -OUT=sthm_rat.nuc
fetch embl:X71MAPP.5 -OUT=xb3_xenla.nuc
***************cut before this line****************

-OUT= indicates the name of the file in which you want the EMBL sequence to be put.
To run the script:

chmod u+x get_nucleotide.script

./get_nucleotide.script

d) What we need to do now is to write a list file for pileup that contains the names of the sequence files and the portion of each sequence that codes for the protein. This annotation is the CDS annotation in the EMBL flat file. We can collect it with:

grep FT *.nuc | grep "CDS " >! stathmin12_nuc.list

The resulting file can then be manually edited and turned into the definitive list file:

***************cut after this line****************
scg1_human.nuc Begin:29 End:568
scga_xenla.nuc Begin:35 End:571
scgb_xenla.nuc Begin:89 End:625
sth1_xenla.nuc Begin:15 End:MAPP.2
sth2_xenla.nuc Begin:1 End:216
sthm_chick.nuc Begin:MAPP. End:MAPP.6
sthm_human.nuc Begin:10MAPP. End:553
sthm_mouse.nuc Begin:1 End:MAPP.0
sthm_rat.nuc Begin:169 End:618
xb3_xenla.nuc Begin:68 End:625
***************cut before this line****************

on which pileup can be run:

pileup @stathmin12_nuc.list

in order to produce the msf file, the figure file and, using figure, the eps file.
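The manual editing in step d can be partly automated. The sketch below is an assumption-laden illustration, not a GCG tool: it assumes grep prefixes each match with the file name and a colon, and that the CDS location is a simple "begin..end" range (optionally with < or > for partial features). The "<1" on the sth2_xenla line and the demo_join.nuc line are made up to show those cases.

```python
import re

# Example grep output; the first two ranges are the ones used in the list
# file above, the last two lines are illustrative assumptions.
GREP_OUTPUT = """\
scg1_human.nuc:FT   CDS             29..568
scga_xenla.nuc:FT   CDS             35..571
sth2_xenla.nuc:FT   CDS             <1..216
demo_join.nuc:FT   CDS             join(10..50,80..120)
"""

# Matches "file:FT   CDS   begin..end", tolerating the < and > markers.
SIMPLE_CDS = re.compile(
    r"^(?P<file>[^:]+):FT\s+CDS\s+<?(?P<begin>\d+)\.\.>?(?P<end>\d+)\s*$"
)


def cds_list_entries(grep_text):
    """Return (list file lines, lines that need manual editing)."""
    entries, skipped = [], []
    for line in grep_text.splitlines():
        match = SIMPLE_CDS.match(line)
        if match:
            entries.append(
                f"{match['file']} Begin:{match['begin']} End:{match['end']}"
            )
        elif line.strip():
            skipped.append(line)  # join(...), complement(...), etc.
    return entries, skipped


if __name__ == "__main__":
    entries, skipped = cds_list_entries(GREP_OUTPUT)
    print("\n".join(entries))
    for line in skipped:
        print("edit by hand:", line)
```

Locations that are not a single range are reported rather than guessed, since they need a decision about which part of the sequence to align.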

F-

The trees we have produced so far were merely computed by pileup to guide the progressive alignment. THEY ARE NOT phylogenetic trees. They are based only on distances from pairwise alignments.
Good distances are critical for computing a good tree, and good distances can only be measured on a multiple alignment, where the pairwise alignments are bound to be more accurate. Furthermore, with DNA, distances should be computed using more complex evolutionary models than with proteins (there are only four symbols, which means that the information content is lower).
To compute a tree, one first needs to compute distances. For instance, using the multiple sequence alignment generated in section E, we can use the program distances with the Kimura two-parameter model (more accurate for closely related DNA sequences):

distances stathmin12_nuc.msf{*}
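To make the Kimura two-parameter correction concrete, here is a small sketch of the textbook formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of transitions and Q the proportion of transversions between two aligned sequences. It is not necessarily how the distances program computes its values (its handling of gaps and ambiguity codes is not described here), and the function name is made up.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (same length)")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    # One transition (A->G) in ten positions.
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

For small numbers of differences the corrected distance stays close to the raw proportion of mismatches, and it grows faster than that proportion as the sequences diverge.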

This distance file can then be used in growtree:

growtree stathmin12_nuc.distances

using the UPGMA option (2).

The output figure.eps is a phylogenetic tree.
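For readers who want to see what UPGMA does with a distance matrix, the following is a minimal sketch of the general algorithm, not of growtree itself: repeatedly join the two closest clusters, place the new node at half their distance, and average the distances to the remaining clusters weighted by cluster size. The labels A, B, C and their distances are invented for the example, and the function name is hypothetical. UPGMA assumes that all lineages evolve at the same rate.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree and return it as a Newick string.

    names: list of sequence labels.
    distances: dict mapping (label, label) pairs to distances; every
    pair of labels must be present once, in either order.
    """
    # Each active cluster maps to (newick text, number of leaves, node height).
    clusters = {name: (name, 1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in distances.items()}
    next_id = 0
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        newick_a, size_a, height_a = clusters.pop(a)
        newick_b, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2.0
        merged = (
            f"({newick_a}:{height - height_a:.3f},"
            f"{newick_b}:{height - height_b:.3f})"
        )
        new_id = f"#node{next_id}"
        next_id += 1
        # Size-weighted average distance from the new cluster to the others.
        for other in clusters:
            d[frozenset((new_id, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[new_id] = (merged, size_a + size_b, height)
    (newick, _size, _height), = clusters.values()
    return newick + ";"


if __name__ == "__main__":
    # Invented distances, for illustration only.
    demo = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
    print(upgma(["A", "B", "C"], demo))
```

In this example A and B are joined first because they are the closest pair, and C is attached afterwards at a greater depth.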