1 Using PILEUP and ClustalW

Multiple alignment of (homologous) sequences is a very powerful tool for finding biologically significant features in sequences, and it is also an essential prerequisite for phylogenetic analysis.

CLUSTALW

ClustalW is a multiple alignment program that also draws phylogenetic trees. This software was described in: Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) Nucleic Acids Research, 22(22):4673-4680.

Input sequences must all be in one file (or two files for a profile alignment) and in one format. The acceptable formats are: FASTA (Pearson); NBRF/PIR; EMBL/Swiss-Prot; GDE; CLUSTAL; GCG/MSF. Note that for Clustal format and MSF format (output from the GCG program pileup), the sequences are already aligned. You can use this facility to read in an alignment in order to calculate a phylogenetic tree OR to output the same alignment in a different format (from the output format options menu of the multiple alignment menu), e.g. read in a GCG/MSF format alignment and output a PHYLIP format alignment. It is also useful for reading in a reference alignment and adding one or more new sequences to it using the "profile alignment" facilities.

The default output format for Clustal-created trees is New Hampshire format (in which the tree topology is indicated by a hierarchy of nested brackets). This format is compatible with PHYLIP, and you can use PHYLIP programs such as RETREE or DRAWTREE/DRAWGRAM to view the output tree.
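As a rough illustration of the nested-bracket idea, the Python sketch below pulls the leaf (sequence) names out of a New Hampshire string and reports how deeply the brackets are nested. The tree string and its branch lengths are invented for the example and are not ClustalW output; the regular expression is a simplification that assumes unquoted names.

```python
import re

# Hypothetical tree in New Hampshire format; the branch lengths are invented.
EXAMPLE_TREE = (
    "((cas1_mouse:0.10,cas1_rat:0.10):0.20,"
    "(cas1_bovin:0.05,cas1_sheep:0.05):0.30,cas1_pig:0.40);"
)


def leaf_names(newick):
    """Return the leaf labels in the order they appear in the string."""
    # A leaf label directly follows "(" or ","; anything after ")" belongs
    # to an internal node and is skipped.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def max_depth(newick):
    """Return the deepest level of bracket nesting in the string."""
    depth = deepest = 0
    for ch in newick:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


print(leaf_names(EXAMPLE_TREE))
print(max_depth(EXAMPLE_TREE))
```

Each pair of brackets groups the sequences (or sub-groups) that share a node, which is why a tree-drawing program can rebuild the topology from the text alone.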

The phylogenetic trees in ClustalW use the Neighbour-Joining method of Saitou and Nei based on a matrix of "distances" between all sequences. Note: do NOT use the .dnd file (a guide tree used to decide which sequences in the dataset align first) as the definitive phylogeny.
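To make the idea of a distance matrix concrete, here is a minimal Python sketch that scores each pair of aligned sequences by the fraction of positions at which they differ, skipping columns where either sequence has a gap. This is only one simple distance measure chosen for illustration; it is not claimed to be the exact calculation ClustalW performs. The three short aligned strings and their names are invented.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over the columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns in common")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Map each unordered pair of names to its distance."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Invented toy alignment (names and residues are hypothetical).
TOY = {"seqA": "MKLLILTC", "seqB": "MRLLILTC", "seqC": "MKLL-FIC"}

for pair, dist in distance_matrix(TOY).items():
    print(pair, round(dist, 3))
```

A tree-building method such as Neighbour-Joining then works from a table like this one rather than from the sequences themselves.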

PILEUP

This is a program available under GCG. Pileup creates a multiple sequence alignment from a group of related sequences using a simplification of the progressive alignment method of Feng and Doolittle.

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. Distance along the vertical axis is proportional to the difference between sequences; distance along the horizontal axis has no significance.
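The clustering step can be sketched in a few lines of Python. The pairwise scores below are copied from the PileUp run shown later in this exercise (sequences numbered 1 to 7 in the order PileUp lists them). The merging rule used here, joining the two clusters with the highest average pairwise score, is an assumption made for illustration; PileUp's own procedure may differ in detail.

```python
from itertools import combinations

# Pairwise similarity scores as printed by the PileUp run in this exercise.
SCORES = {
    (1, 2): 0.59, (1, 3): 0.52, (1, 4): 0.80, (1, 5): 0.65, (1, 6): 0.48,
    (1, 7): 1.35, (2, 3): 0.57, (2, 4): 0.67, (2, 5): 0.77, (2, 6): 0.60,
    (2, 7): 0.59, (3, 4): 0.54, (3, 5): 0.66, (3, 6): 1.22, (3, 7): 0.51,
    (4, 5): 0.72, (4, 6): 0.52, (4, 7): 0.74, (5, 6): 0.61, (5, 7): 0.61,
    (6, 7): 0.49,
}
NAMES = {1: "cas1_bovin", 2: "cas1_human", 3: "cas1_mouse", 4: "cas1_pig",
         5: "cas1_rabit", 6: "cas1_rat", 7: "cas1_sheep"}


def average_similarity(a, b, scores):
    """Mean pairwise score between the members of cluster a and cluster b."""
    total = sum(scores[tuple(sorted((i, j)))] for i in a for j in b)
    return total / (len(a) * len(b))


def merge_order(n, scores):
    """Return the clusters joined at each step, most similar pair first."""
    clusters = [(i,) for i in range(1, n + 1)]
    order = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], scores))
        order.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return order


for a, b in merge_order(7, SCORES):
    print([NAMES[i] for i in a], "+", [NAMES[i] for i in b])
```

With these scores the first two joins are bovin with sheep (1.35) and mouse with rat (1.22), so those pairs would be aligned before anything else.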

GCG 9 incorporates a useful X Windows-based tool called seqlab, which enables you to view, edit by hand, and save multiple sequence alignments. Window-based software is best learned by being shown how to use it in a basic way and then experimenting.

Exercise:

The aim of this exercise is to demonstrate how to operate multiple sequence alignment programs and also to highlight their restrictions and some differences between them.

For the purposes of this exercise it will be necessary to create a new directory. This will be used to hold only the sequences and files relevant to the exercise. This will enable you to use the wild-card * to select all sequences, as demonstrated later. To create a new directory called "mult_seqs" enter the command :

% mkdir mult_seqs

You can then move into this directory (change directory)

% cd mult_seqs

It might be better to use cas instead of mult_seqs because it is quicker to type and the first dataset is mammalian casein peptides. Later you will be analysing the corresponding DNA sequences and, later still, analogous somatotropin genes.

The next step is to fetch the sequences for alignment from the database. To do this you must be in GCG, so you may first need to enter:

% gcg

Fetch these individually into your working directory.

% fetch sw:database_name

where database_name is the name as listed below or is the appropriate accession number.

The first alignment is of the alpha-S1 casein precursor of selected mammals. These sequences have the Swiss-Prot names:
cas1_bovin (p02662)
cas1_human (p47710)
cas1_mouse (p19228)
cas1_pig (p39035)
cas1_rabit (p09115)
cas1_rat (p02661)
cas1_sheep (p04653)
Now enter

% ls

You will see a list of the files in your present directory (mult_seqs). Note that these all end in .sw; this is useful because you can use *.sw to refer to all the sequences that you want to align.

Because these files were brought to your directory using the GCG program fetch, they are all in GCG format. For ClustalW to read in these sequences they need to be all in one file and in one of the accepted formats. We can do both of these at once using the GCG command tofasta, which will write all the sequences into one file in FastA/Pearson format.

To do this enter the following:

% tofasta *.sw

ToFastA converts GCG sequence(s) into FastA format.
What should I call the output file (* tofasta.tfa *) ?  cas1.tfa
CAS1_BOVIN  214 characters.
CAS1_HUMAN  185 characters.
CAS1_MOUSE  313 characters.
CAS1_PIG    206 characters.
CAS1_RABIT  215 characters.
CAS1_RAT    284 characters.
CAS1_SHEEP  206 characters.
 1,623 symbols written into "cas1.tfa".

Note that it is important to use *.sw here because if you repeat the tofasta command for each sequence they will all be in separate files.
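If you want to check what ended up in the combined file, a multi-sequence FastA/Pearson file is easy to read back. The Python sketch below is a minimal reader, assuming the usual layout of a ">" header line followed by one or more sequence lines; the two records in the example are invented, and the exact header text written by tofasta is not assumed.

```python
def read_fasta(text):
    """Return {name: sequence} for every record in FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is a description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Invented two-record example (hypothetical names and residues).
TOY_FASTA = """>SEQ_ONE first made-up record
MKLLILTCLV
AVALA
>SEQ_TWO second made-up record
MRLLILTC
"""

for name, seq in read_fasta(TOY_FASTA).items():
    print(name, len(seq), "aa")
```

To use it on the real file you could pass in open("cas1.tfa").read(); the lengths printed should then match those reported by tofasta and ClustalW.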

You now have a single file (cas1.tfa) containing all the sequences for alignment so you can now run ClustalW :

% clustalw

**************************************************************
******** CLUSTAL W(1.60) Multiple Sequence Alignments ********
**************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees
     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice: 1

Sequences should all be in 1 file.

7 formats accepted:

NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF,RSF



Enter the name of the sequence file:

cas1.tfa

Sequence format is Pearson
Sequences assumed to be PROTEIN
Sequence 1: CAS1_BOVIN      214 aa
Sequence 2: CAS1_HUMAN      185 aa
Sequence 3: CAS1_MOUSE      313 aa
Sequence 4: CAS1_PIG_I      206 aa
Sequence 5: CAS1_RABIT      215 aa
Sequence 6: CAS1_RAT_I      284 aa
Sequence 7: CAS1_SHEEP      206 aa

You will now have returned to the main menu. Enter 2 after the prompt "Your choice: " to choose to do a multiple alignment. The Multiple Alignment Menu (as follows) should then appear on your screen.

****** MULTIPLE ALIGNMENT MENU ****** 
    1.  Do complete multiple alignment now (Slow/Accurate) 
    2.  Produce guide tree file only 
    3.  Do alignment using old guide tree file 
    4.  Toggle Slow/Fast pairwise alignments = SLOW
    5.  Pairwise alignment parameters
    6.  Multiple alignment parameters
    7.  Reset gaps between alignments? = OFF
    8.  Toggle screen display          = ON
    9.  Output format options
    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu
Your choice:

You can choose 6 at this prompt to change the algorithm parameters, i.e. gap creation and extension penalties etc., or just choose 1 to do the multiple alignment now. In this case choose 1. Accept the default filenames by just hitting <return> twice.

When the multiple alignment is complete the Multiple Alignment Menu will again be on your screen. Hit <return> to go back to the main menu (as before). Here choose option X to exit from ClustalW. Then look at the alignment:

% more cas1.aln

CLUSTAL W(1.60) multiple sequence alignment
CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLP-------QEVLNEN-LLRFFVAPFPEVFGKEKV
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNP---SESSE-------PIP----LESREEYM
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQQQHSSSEE-------IFKQPKYLNLNQEFV
CAS1_PIG_I      MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSREELFKERKFLRFPEVPLLSQFRQEII
CAS1_RABIT      MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERKLLRFVQTVPLELREEYV
CAS1_RAT_I      MKLLILTCLVAAALALPRAHRRNAVSSQTQQENSSSEEQE-----IVKQPKYLSLNEEFV
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLS-------PEVLNEN-LLRFVVAPFPEVFRKENI
                *.***  ** * * *  .                                       * .

CAS1_BOVIN      NELSKDIGSESTEDQAMEDIKQMEAESISSSEEIVPNSVEQKHIQKE-------------
CAS1_HUMAN      NGMNRQRNILREK----QTDEIKDTRNESTQNCVVAEPEKMESSISSS------------
CAS1_MOUSE      NNMNRQRALLTE-----QNDEIKVTMDAASEEQAMASAQEDSSISSSS-EESEEAIPNIT
CAS1_PIG_I      NELNRNHG--------MEGHEQRGS-SSSSSEEVVGNSAEQKHVQKEE------------
CAS1_RABIT      NELNRQRELLREK----ENEEIKGTRNEVTEEHVLADRETEASISSSS----EEIVPSST
CAS1_RAT_I      NNLNRQRELLTE-----QDNEIKITMDSSAEEQATASAQEDSSSSSSSSEESKDAIPSAT
CAS1_SHEEP      NELSKDIGSESIEDQAMEDAKQMKAGSSSSSEEIVPNSAEQKYIQKE-------------
                * . .            .      .    .

CAS1_BOVIN      -----DVPSERYLGYLEQLLRLKKYKVPQLEIVPNSAEERLHSMKE---GIHAQQKEPMI
CAS1_HUMAN      ------SEEMSLSKCAEQFCRLNEYNQLQLQAAH--AQEQIRRMN-----ENSHVQVP--
CAS1_MOUSE      EQKNIANEDMLNQCTLEQLQRQFKYNQLLQKASL--AKQASLFQQPSLVQQASLFQQPSL
CAS1_PIG_I      -----DVPSQSYLGHLQGLN---KYKLRQLEAIH---DQELHRTNE---DKHTQQGEPMK
CAS1_RABIT      KQKYVPREDLAYQPYVQQQLLRMKERYQIQE------REPMRVVN---QELAQLYLQP--
CAS1_RAT_I      EQKNIANKEILNRCTLEQLQRQIKYSQLLQQASL--AQQASLAQQASLAQQALLAQQP--
CAS1_SHEEP      -----DVPSERYLGYLEQLLRLKKYNVPQLEIVPKSAEEQLHSMKE---GNPAHQKQPMI
                                .                     .                  *

CAS1_BOVIN      GVNQELAYF-----------------YPELFRQFYQLD--AYPSGAWYYVPLGTQYTDAP
CAS1_HUMAN      ------------------------------FQQLNQL---AAYPYAVWYYPQIMQYVPFP
CAS1_MOUSE      LQQASLFQQPSMAQQASLLQQLLLAQQPSLALQVSPAQQSSLVQQAFLAQQASLAQKHHP
CAS1_PIG_I      GVNQEQAYF-----------------YFEPLHQFYQLD--AYPYATWYYPP---QYIAHP
CAS1_RABIT      -----------------------------FEQPYQLD---AYLPAPWYYTPEVMQYVLSP
CAS1_RAT_I      ----------------------------SLAQQAALAQQASLAQQASLAQQASLAQKHHP
CAS1_SHEEP      AVN-------------------------QLFRQFYQLD--AYPSGAWYYLPLGTQYTDAP
                                                        .                  *
CAS1_BOVIN      SFSDIPNPIGSENSE-KTTMPLW-------------------------------------
CAS1_HUMAN      PFSDISNPTAHENYEKNNVMLQW-------------------------------------
CAS1_MOUSE      RLSQSYYPHMEQPYRMNAYSQVQMRHPMSVVDQALAQFSVQPFPQIFQYDAFPLWAYFPQ
CAS1_PIG_I      LFTNIPQPTAPEKGGKTEIMPQW-------------------------------------
CAS1_RABIT      LFYDLVTPSAFESAEKTDVIPEWLKN----------------------------------
CAS1_RAT_I      RLSQVYYPNMEQPYRMNAYSQVQMRHPMSVVDQ--AQFSVQSFPQLSQYGAYPLWLYFPQ
CAS1_SHEEP      SFSDIPNPIGSENSG-KITMPLW-------------------------------------
                       *   .

CAS1_BOVIN      ----------------------------
CAS1_HUMAN      ----------------------------
CAS1_MOUSE      DMQYLTPKAVLNTFKPIVSKDTEKTNVW
CAS1_PIG_I      ----------------------------
CAS1_RABIT      ----------------------------
CAS1_RAT_I      DMQYLTPEAVLNTFKPIAPKDAENTNVW
CAS1_SHEEP      ----------------------------

If you examine this alignment you might decide that it is not very sensible and consequently any tree drawn on the basis of this alignment would be unsatisfactory. For example, in the alignment above, I have highlighted small sequences that you might think should be aligned but are not. Change the alignment parameters to see how it affects the alignment.

To do this re-enter ClustalW

% clustalw

and input the same sequences as before. Choose option 2 to do an alignment. Now choose option 6 to change the multiple alignment parameters.

********* MULTIPLE ALIGNMENT PARAMETERS *********
     1. Gap Opening Penalty              :10.00
     2. Gap Extension Penalty            :0.05
     3. Delay divergent sequences        :40 %
     4. DNA Transitions Weight           :0.50
     5. Protein weight matrix            :BLOSUM series
     6. DNA weight matrix                :IUB
     7. Use negative matrix              :OFF
         
     8. Protein Gap Parameters
         
     H. HELP
Enter number (or [RETURN] to exit):

As you can see, the gap opening penalty is quite large (the default in pileup v8.1 was only 3.00, although it is now 12). Change these parameters to match the pileup ones (the gap extension penalty in pileup 8.1 was 0.10; it is now 4). To change the gap opening or extension penalty, choose the appropriate menu number and you will be prompted for a value. Depending on the changes you make, this may give a quite different alignment. If you set the parameters to the same values as pileup, the alignment is as follows (note that it is a good idea to give the new alignment a name other than the default; otherwise you will overwrite the existing alignment).
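To get a feel for what these two numbers do, the Python sketch below computes the cost of a single gap under a plain "opening plus extension times length" scheme, using the penalty values quoted in this section. Two assumptions are worth stating: every gap position, including the first, is charged the extension penalty (programs differ on this convention), and the cost is taken at face value, whereas the real programs may adjust penalties internally and may use score matrices on different scales, so the raw numbers are not necessarily comparable between programs.

```python
def affine_gap_cost(length, opening, extension):
    """Cost of one gap of `length` positions: opening + extension * length."""
    if length <= 0:
        return 0.0
    return opening + extension * length


# Penalty pairs (opening, extension) as quoted in the text above.
SETTINGS = [
    ("ClustalW defaults", 10.00, 0.05),
    ("pileup 8.1 defaults", 3.00, 0.10),
    ("pileup current defaults", 12, 4),
]

for label, opening, extension in SETTINGS:
    costs = [round(affine_gap_cost(n, opening, extension), 2) for n in (1, 7, 20)]
    print(label, costs)
```

With a small extension penalty a long gap costs little more than a short one, so the aligner tends to produce a few long gaps; a large extension penalty makes long gaps expensive.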

CLUSTAL W(1.60) multiple sequence alignment
CAS1_BOVIN      MKLLILTCLVAVALARPKHPIK--HQG-LP------QE--VLNEN-LLRFFVAPFPEVFG
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLR--YPERLQNPSESSEP--IP--------------LESR
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQQQHSSSEE--IFK---------QPKYLNLN
CAS1_PIG_I      MKLLIFICLAAVALARPKPPLR--HQEHLQNEPDSREE--LFKERKFLRFPEVPLLSQFR
CAS1_RABIT      MKLLILTCLVATALARHKFHLG--HLKLTQEQPESSEQ-EILKER-KLLRFVQTVPLELR
CAS1_RAT_I      MKLLILTCLVAAALALPRAHRRNAVSSQTQQENSSSEEQEIVK---------QPKYLSLN
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIK--HQG-LS------PE--VLNEN-LLRFVVAPFPEVFR
                *.***  ** * * *  .                      .

CAS1_BOVIN      KEKVNELSKDIGSESTEDQAMEDIKQMEAESISSSEEIVPNSVEQKHIQKE-DVPSE---
CAS1_HUMAN      EEYMNGMNRQRNILR-EK----QTDEIKDTRNESTQNCVVAEPEKMESSISSSS-EE--M
CAS1_MOUSE      QEFVNNMNRQRALLT-E-----QNDEIKVTMDAASEEQAMASAQE-DSSISSSS-EESEE
CAS1_PIG_I      QEIINELNRNHG--------MEGHEQ-RGSSSSSSEEVVGNSAEQKHVQKEEDVPSQ---
CAS1_RABIT      EEYVNELNRQRELLR-EK----ENEEIKGTRNEVTEEHVLADRET-EASISSSS-EE---
CAS1_RAT_I      EEFVNNLNRQRELLT-E-----QDNEIKITMDSSAEEQATASAQE-DSSSSSSSSEESKD
CAS1_SHEEP      KENINELSKDIGSESIEDQAMEDAKQMKAGSSSSSEEIVPNSAEQKYIQKE-DVPSE---
                 * .* . .                .        ..       .            .

CAS1_BOVIN      RYLGYLEQLLRLKKYKVPQLEIVPNSAEERLHSMKEGIHAQQKEPMIGVNQE--------
CAS1_HUMAN      SLSKCAEQFCRLNEYNQLQLQAAH--AQEQIRR------MNENSHVQVPFQQ--------
CAS1_MOUSE      AIPNITEQKNIANEDMLNQCTLEQ--LQRQFKY------NQLLQKASLAKQASLFQQPSL
CAS1_PIG_I      SYLGHLQG---LNKYKLRQLEAIH---DQELHRTNEDKHTQQGEPMKGVNQE--------
CAS1_RABIT      IVPSSTKQKYVPREDLAYQPYVQQQLLRMKERYQ-----IQEREPMRVVNQE--------
CAS1_RAT_I      AIPSATEQKNIANKEILNRCTLEQ--LQRQIKY------SQLLQQASLAQQA--------
CAS1_SHEEP      RYLGYLEQLLRLKKYNVPQLEIVPKSAEEQLHSMKEGNPAHQKQPMIAVNQ---------
                                  .            .                  *

CAS1_BOVIN      --LAYFYPE----------LFRQFYQLDAYPSGAWYYV-PLGTQYTDAPSFSDIPNPI--
CAS1_HUMAN      -----------------------LNQLAAYPYAVWYY--PQIMQYVPFPPFSDISNPT--
CAS1_MOUSE      VQQASLFQQPSLLQQASLFQQPSMAQQASLLQQLLLAQQPSLALQVSPAQQSSLVQQAFL
CAS1_PIG_I      --QAYFYFE----------PLHQFYQLDAYPYATWYYP-P---QYIAHPLFTNIPQPT--
CAS1_RABIT      --LAQLYLQP----------FEQPYQLDAYLPAPWYYT-PEVMQYVLSPLFYDLVTPS--
CAS1_RAT_I      ----------------------SLAQQASLAQQALLAQQPSLAQQAALAQQASLAQQASL
CAS1_SHEEP      -------------------LFRQFYQLDAYPSGAWYYL-PLGTQYTDAPSFSDIPNPI--
                                         *  .          *             .

CAS1_BOVIN      GSEN--SEKTT-MPLW--------------------------------------------
CAS1_HUMAN      AHEN--YEKNNVMLQW--------------------------------------------
CAS1_MOUSE      AQQASLAQKHHPRLSQSYYPHMEQPYRMNAYSQVQMRHPMSVVDQALAQFSVQPFPQIFQ
CAS1_PIG_I      APEK--GGKTEIMPQW--------------------------------------------
CAS1_RABIT      AFES--AEKTDVIPEWLKN-----------------------------------------
CAS1_RAT_I      AQQASLAQKHHPRLSQVYYPNMEQPYRMNAYSQVQMRHPMSVVDQ--AQFSVQSFPQLSQ
CAS1_SHEEP      GSEN--SGKIT-MPLW--------------------------------------------
                  .     *

CAS1_BOVIN      ----------------------------------------
CAS1_HUMAN      ----------------------------------------
CAS1_MOUSE      YDAFPLWAYFPQDMQYLTPKAVLNTFKPIVSKDTEKTNVW
CAS1_PIG_I      ----------------------------------------
CAS1_RABIT      ----------------------------------------
CAS1_RAT_I      YGAYPLWLYFPQDMQYLTPEAVLNTFKPIAPKDAENTNVW
CAS1_SHEEP      ----------------------------------------

This alignment is better than the one above but not as good as the pileup alignment (see later). These casein genes are obviously homologous but difficult to align and may require a lot of hand editing with seqlab, a unix editor or a word-processing package.

Note that there are many sites where there are no asterisks *. This means there are only a few sites where all of the sequences are identical. Any site where there is not an asterisk holds information about the relationship of these proteins to each other. If you have a very "good" alignment with most sites marked with an asterisk not much information about the phylogeny is available i.e. there are very few "informative sites", and any derived tree may not be a true representation of the real phylogeny. It might be worth looking at the DNA sequences if the proteins are all nearly identical.
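The asterisk line is simple to reproduce. The Python sketch below marks each column in which every sequence has the same residue and counts the columns that vary. The rows are the first ten columns of the first ClustalW alignment above; unlike ClustalW, the sketch does not print "." for conservative substitutions.

```python
# First ten alignment columns from the first ClustalW alignment above.
ROWS = [
    "MKLLILTCLV",  # CAS1_BOVIN
    "MRLLILTCLV",  # CAS1_HUMAN
    "MKLLILTCLV",  # CAS1_MOUSE
    "MKLLIFICLA",  # CAS1_PIG_I
    "MKLLILTCLV",  # CAS1_RABIT
    "MKLLILTCLV",  # CAS1_RAT_I
    "MKLLILTCLV",  # CAS1_SHEEP
]


def identity_line(rows):
    """Return '*' under fully identical (non-gap) columns and ' ' elsewhere."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


def variable_columns(rows):
    """Count the columns in which at least two rows differ."""
    return sum(1 for col in zip(*rows) if len(set(col)) > 1)


for row in ROWS:
    print(row)
print(identity_line(ROWS))
print(variable_columns(ROWS), "variable columns")
```

Only the columns without an asterisk can say anything about how the sequences are related to one another.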

On the other hand, if the relationship between the sequences is distant, you may have an alignment filled with gaps. You will not have much success drawing reliable trees with such data either, especially if you use the "toss all gaps" option (as you should!) in, say, ClustalW.
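The effect of tossing gaps can be seen with a small Python sketch that removes every column containing a gap in any sequence. This is an illustration of the idea only, not ClustalW's own code, and the three aligned rows are invented.

```python
def toss_gap_columns(rows):
    """Return the rows with every column that contains a '-' removed."""
    kept = [col for col in zip(*rows) if "-" not in col]
    if not kept:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


# Invented toy alignment.
TOY_ROWS = ["MKL-ILT", "MRLLILT", "MK--IFT"]

trimmed = toss_gap_columns(TOY_ROWS)
for before, after in zip(TOY_ROWS, trimmed):
    print(before, "->", after)
print(len(TOY_ROWS[0]), "columns before,", len(trimmed[0]), "after")
```

In a gap-rich alignment very few columns survive this filter, which is why the resulting tree rests on so little data.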

The GCG program pileup can also be used to align these casein sequences. To run pileup you must first be in GCG. The guide tree on which the order of sequence alignment is based is plotted as graphics output by pileup. To view this tree you will have to set your "graphics environment" before running pileup - see Appendix I.

To use two alignment viewing tools later on, prettyplot and prettybox, you must be using GCGv8 for this next part. Whereas GCGv9 can read and understand files in GCGv8 format, the reverse is not true. Both prettyplot and prettybox are GCGv8 tools, so switch to GCGv8 now.

To do this enter:
% gcg81

To convert back to GCGv9 enter:
% gcg91

NOTE: Local instructions may vary.

In the example below a printable postscript file is created. To start the program enter:

% pileup

PileUp creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments.  It can also plot a
tree showing the clustering relationships used to create the alignment.

PileUp of what sequences ? *.sw

   1   cas1_bovin.sw   214 aa
   2   cas1_human.sw   185 aa
   3   cas1_mouse.sw   313 aa
   4     cas1_pig.sw   206 aa
   5   cas1_rabit.sw   215 aa
   6     cas1_rat.sw   284 aa
   7   cas1_sheep.sw   206 aa
What is the gap creation penalty (* 12 *) ?

What is the gap extension penalty (* 4 *) ?
This program can display the clustering relationships graphically.
Do you want to:

     A) Plot to a FIGURE file called "pileup.figure"
     B) Plot graphics on LASERWRITER attached to PlotPort
     C) Suppress the plot

Please choose one (* A *): C

The minimum density for a one-page plot is 6.0 sequences/100 platen units.
What density do you want (* 6.0 *) ?
What should I call the output file name (* pileup.msf *) ?
Determining pairwise similarity scores...
   1   x     2       0.59
   1   x     3       0.52
   1   x     4       0.80
   1   x     5       0.65
   1   x     6       0.48
   1   x     7       1.35
   2   x     3       0.57
   2   x     4       0.67
   2   x     5       0.77
   2   x     6       0.60
   2   x     7       0.59
   3   x     4       0.54
   3   x     5       0.66
   3   x     6       1.22
   3   x     7       0.51
   4   x     5       0.72
   4   x     6       0.52
   4   x     7       0.74
   5   x     6       0.61
   5   x     7       0.61
   6   x     7       0.49
Aligning...
   1     ..........-..
   2     ..............-..
   3     ..........-.
            ..........-.
   4     ..........-.
   5     ..........-.
            ..........-..
   6     ............-..

           Total sequences:          7
          Alignment length:        330
                  CPU time:      01.23
            Output file:pileup.msf

This tree has an implied root at the top which should be ignored. This tree is quite different to the ClustalW neighbour-joining and PHYLIP PROTPARS trees. This is not really a fair comparison because this is only a UPGMA guide tree, not a tree based on the alignment.

You could look at the pileup alignment simply by entering

% more pileup.msf

But this output is in a different format from the ones previously obtained. To make the alignments more easily comparable, change the format of this alignment to Clustal format. To do this, enter:

% clustalw

Choose 1 to input your alignment (pileup.msf).
Choose 2 for the "Multiple Alignment Menu".
Choose 9 "Output format options" and you will see that the default format is clustal.
Choose 8 to create the alignment output files.

Look at the alignment (first exit from ClustalW)

% more pileup.aln

The pileup alignment in clustal format follows

CLUSTAL W(1.60) multiple sequence alignment

cas1_mouse   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQQQHSSSE--E-----IFKQPKYLNLNQEFV
cas1_rat     MKLLILTCLVAAALALPRAHRRNAVSSQTQQENSSSEEQE-----IVKQPKYLSLNEEFV
cas1_bovin   MKLLILTCLVAVALARPKHPIKHQ-------GLPQEVLNE-NLLRFFVAPFPEVFGKEKV
cas1_sheep   MKLLILTCLVAVALARPKHPIKHQ-------GLSPEVLNE-NLLRFVVAPFPEVFRKENI
cas1_pig     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSREELFKERKFLRFPEVPLLSQFRQEII
cas1_human   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------------PIPLESREEYM
cas1_rabit   MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERKLLRFVQTVPLELREEYV
             *.***  ** * * *  .                                       * .

cas1_mouse   NNMNRQRALLTE-QNDEIKVTMDAASEEQAMASAQED-SSISSSSEESEEAIPNITEQKN
cas1_rat     NNLNRQRELLTE-QDNEIKITMDSSAEEQATASAQEDSSSSSSSSEESKDAIPSATEQKN
cas1_bovin   NELSKDIGS--------------ESTEDQAMEDIKQMEAESISSSEE---IVPNSVEQKH
cas1_sheep   NELSKDIGS--------------ESIEDQAMEDAKQMKAGSSSSSEE---IVPNSAEQKY
cas1_pig     NELNRNHGM--------------EGHEQ---------RGSSSSSSEE---VVGNSAEQKH
cas1_human   NGMNRQRNILREKQTDEIKDTRNESTQNCVVAEPEKMESSISSSSEE---MSLSKCAEQF
cas1_rabit   NELNRQRELLREKENEEIKGTRNEVTEEHVLADRET-EASISSSSEE---IVPSSTKQKY
             * . .                     .               *****           ..

cas1_mouse   IANEDMLNQCTLEQLQRQFKYNQLLQKASLAKQASLFQQPSLVQQASLFQQPSLLQQASL
cas1_rat     IANKEILNRCTLEQLQRQIKYSQLLQQASLAQQASL------------------------
cas1_bovin   IQK-E-------------------------------------------------------
cas1_sheep   IQK-E-------------------------------------------------------
cas1_pig     VQKEE-------------------------------------------------------
cas1_human   CRLNE-------------------------------------------------------
cas1_rabit   VPRED-------------------------------------------------------
                 .
cas1_mouse   FQQPSMAQQASLLQQLLLAQQPSLALQVSPAQQSSLVQQAFLAQQASLAQKHHPRLSQSY
cas1_rat     ------AQQASLAQQALLAQQPSLAQQAALAQQASLAQQASLAQQASLAQKHHPRLSQVY
cas1_bovin   -------------------------------DVPSERYLGYLEQLLRLKKYKVPQLEIVP
cas1_sheep   -------------------------------DVPSERYLGYLEQLLRLKKYNVPQLEIVP
cas1_pig     -------------------------------DVPSQSYLGHLQG---LNKYKLRQLEAIH
cas1_human   ----------------------------------------------------YNQLQLQA
cas1_rabit   -------------------------------------------------------LAYQ-
                                                                    *
cas1_mouse   YPHMEQPYRMNAYSQVQMRHPMSVVDQALAQFSVQPFPQIFQYDAFP--LWAYFPQDMQY
cas1_rat     YPNMEQPYRMNAYSQVQMRHPMSVVDQ--AQFSVQSFPQLSQYGAYP--LWLYFPQDMQY
cas1_bovin   NSAEERLHSMKEGIHAQQKEPMIGVNQELAYFYPELFRQFYQLDAYPSGAWYYVPLGTQY
cas1_sheep   KSAEEQLHSMKEGNPAHQKQPMIAVNQ--------LFRQFYQLDAYPSGAWYYLPLGTQY
cas1_pig     ---DQELHRTNEDKHTQQGEPMKGVNQEQAYFYFEPLHQFYQLDAYPYATWYYPP---QY
cas1_human   AHAQEQIRRMNENSHVQ-----------------VPFQQLNQLAAYPYAVWYY-PQIMQY
cas1_rabit   PYVQQQLLRMKERYQIQEREPMRVVNQELAQLYLQPFEQPYQLDAYLPAPWYYTPEVMQY
                 .           .                     *  *  *.    * * *   **
cas1_mouse   LTPKAVLNTFKPIVSKDTEKTNVW------
cas1_rat     LTPEAVLNTFKPIAPKDAENTNVW------
cas1_bovin   TDAPSFSDIPNPIGSENSEKT-TMPLW---
cas1_sheep   TDAPSFSDIPNPIGSENSGKI-TMPLW---
cas1_pig     IAHPLFTNIPQPTAPEKGGKTEIMPQW---
cas1_human   VPFPPFSDISNPTAHENYEKNNVMLQW---
cas1_rabit   VLSPLFYDLVTPSAFESAEKTDVIPEWLKN
                        *

It is interesting to note that the area of the sequences that we were looking at earlier is perfectly aligned in this alignment.

You can feed the pileup alignment into ClustalW (filename : pileup.msf) and use clustal to draw a tree. This tree can then be viewed as before using DRAWTREE. This is a fairer comparison. Note that, in this case, it is essentially the same tree as obtained from the ClustalW alignment.

GCG's alignment viewing software

A GCG .msf file is not well suited to viewing the aligned sequences. Use GCG pretty to create and display a consensus sequence and generally make the alignment easier to read. If you have X Windows capability, be sure to try seqlab to view and manipulate multiple sequence alignment files. Otherwise you are advised to try EGCG's prettybox and prettyplot to prepare your alignment for publication.

PRETTYPLOT

Prettyplot is an EGCG alternative to the GCG program pretty. It displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it. Prettyplot displays the aligned sequences with boxes around identical sites.

PRETTYBOX

Prettybox displays multiple sequence alignments as shaded boxes in Postscript format (i.e. the output file must be printed and/or displayed on a Postscript-compatible device). Prettybox will optionally calculate a consensus sequence. The program does not create the alignment; it simply displays it.

Like the GCG program pretty, both of these will take pileup output. Note carefully the format of the input. If, for example, your pileup output file was called pileup.msf (the default filename) then when asked "what sequences ?" you must enter pileup.msf{*}. And don't forget the {*} !

Exercise:

Use prettyplot and prettybox to display the pileup output for the casein sequences from the "Multiple Sequence Alignment" section. To use that output file you will first need to be in that directory.

% cd mult_seqs

Then run prettyplot

% prettyplot

PRETTYPLOT displays multiple sequence alignments and calculates a
consensus sequence.  It does not create the alignment, it simply
displays it.

PRETTYPLOT uses any sequences

PRETTYPLOT of what sequence(s) ? pileup.msf{*}
If at this point you encounter an error message telling you that the files cannot be read, you may not have moved into GCGv8. In that case, go back, change into GCGv8 and repeat this exercise from the start before continuing.

                   Start (* 1 *) ?
                   End (* 330 *) ?
                     cas1_mouse  len: 330  wgt: 1.00
                       cas1_rat  len: 330  wgt: 1.00
                     cas1_bovin  len: 330  wgt: 1.00
                     cas1_sheep  len: 330  wgt: 1.00
                       cas1_pig  len: 330  wgt: 1.00
                     cas1_human  len: 330  wgt: 1.00
                     cas1_rabit  len: 330  wgt: 1.00

 Find consensus to what minimum plurality (* 4.0 *) ?
 PostScript instructions for a LASERWRITER are now being sent to output.ps.

To print this output enter

% lpr -Pprintername output.ps

Similarly to run prettybox,

% prettybox

PRETTYBOX displays multiple sequence alignments as shaded boxes in
Postscript format (e.g., the output file must be printed and/or displayed on
a Postscript-compatible device). PrettyBox will optionally calculate a
consensus sequence. The program does not create the alignment; it simply
displays it.

PRETTYBOX uses any sequences

PRETTYBOX of what sequence(s) ? pileup.msf{*}

        pileup.msf{cas1_mouse}, len: 330
          pileup.msf{cas1_rat}, len: 330
        pileup.msf{cas1_bovin}, len: 330
        pileup.msf{cas1_sheep}, len: 330
          pileup.msf{cas1_pig}, len: 330
        pileup.msf{cas1_human}, len: 330
        pileup.msf{cas1_rabit}, len: 330

                   Start (* 1 *) ?
                   End (* 330 *) ?
Orient output as:
     L) Landscape
     P) Portrait
 Please choose one (* L *) ?

Display a consensus (* No *) ? yes

 Find consensus to what plurality (* 3.6 *) ?
Do numbering on:
     R) Right side
     T) Top side
     N) None
 Please choose one (* R *) ?

Printing the output file is exactly as for prettyplot. Remember that the file output.ps will be overwritten each time until you issue another % postscript command.

For the casein dataset the alignments are likely to be "unconvincing" whatever parameters you choose for creating them. You might under these circumstances want to use GCG 9.1's seqlab program to try editing the .msf{*} file by hand.

Remember to downgrade to GCG 8.1 for the EGCG programs prettybox and prettyplot. Also note that to run any of these pretty* programs the input file must be given in the form file.msf{*}. Here the {*} indicates that you wish to include all the sequences in the .msf file. If you leave off the {*}, the programs will not work for you. It's not a bug, but a GCG feature!
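The meaning of the file.msf{*} form can be shown with a short Python sketch that expands such a specification into one entry per sequence, in the same file{name} style that prettybox prints. The function name is hypothetical, and the shell-style matching used here is an assumption for illustration: only the {*} form is described in this tutorial, and the names inside the file are supplied by hand rather than read from a real .msf file.

```python
import re
from fnmatch import fnmatchcase


def expand_spec(spec, names_in_file):
    """Expand 'file{pattern}' into 'file{name}' for each matching name."""
    match = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<pattern>[^{}]+)\}", spec)
    if match is None:
        raise ValueError("expected something like pileup.msf{*}")
    return [
        f"{match['file']}{{{name}}}"
        for name in names_in_file
        if fnmatchcase(name, match["pattern"])
    ]


# Sequence names as they appear in the pileup output of this exercise.
MSF_NAMES = ["cas1_mouse", "cas1_rat", "cas1_bovin", "cas1_sheep",
             "cas1_pig", "cas1_human", "cas1_rabit"]

for entry in expand_spec("pileup.msf{*}", MSF_NAMES):
    print(entry)
```

Leaving off the braces gives the sketch nothing to expand, which mirrors the behaviour described above: the bare filename does not name any sequences.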

You might also like to try a terminal-screen-based tree manipulation and printing program such as njplot, which is widely available for Unix and also for Macintosh. The latter is particularly handy because it prints the tree in a de-constructable PICT format that can quickly be brought up to camera-ready print quality.

You should also compare the clustalw server which, among many other useful, point-and-clickable features and options, enables you to see an alignment with colour-coded residues.
