The ABC of Bioinformatics

11. Significance Testing of Trees

Whatever data you put into Phylip, you can usually get out a tree of some kind. Obviously you will want some assessment of how reliable such a tree is. One of the standard methods of determining the reliability of a tree generated from a dataset is bootstrapping. This involves random resampling, with replacement, of the data/sites from which the original tree was derived. From each resampling a new tree is drawn, and this procedure is repeated 100, 1000 or 10,000 times. After all the resampling you can see how often particular branches are supported under the bootstrapping regime.

With clustalw a bootstrapping module is incorporated into the neighbour joining tree drawing part of the program. Use option 4 from the main menu to get:

    ****** PHYLOGENETIC TREE MENU ******

    1. Input an alignment
    2. Exclude positions with gaps?        = OFF
    3. Correct for multiple substitutions? = OFF
    4. Draw tree now
    5. Bootstrap tree
    6. Output format options
    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Then take option 5 (almost certainly toggling options 2 and 3 at the same time!)

    Enter name for bootstrap output file [reca6.phb]:
    Enter seed no. for random number generator (1..1000) [111]: 559
    Enter number of bootstrap trials (1..10000) [1000]:
    Each dot represents 10 trials
    .......... .......... .......... .......... ..........
    .......... .......... .......... .......... ..........
    Bootstrap output file completed [reca6.phb]

Note that the random number generator is pseudorandom: the same seed always produces the same series of resamplings. Thus you will get exactly the same bootstrap values if you persist in taking the default seed of 111.

With Phylip it is a bit more complicated and a multistep process. Let us suppose that you wish to bootstrap a tree of somatotropin genes drawn with PROTPARS.
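As a brief aside before the Phylip walkthrough, the resampling idea and the effect of the seed described above can be sketched in Python. This is only a minimal illustration of the principle, not the code that clustalw or SEQBOOT run: the function name bootstrap_alignment and the toy sequences are invented for the example.

```python
import random


def bootstrap_alignment(alignment, seed):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    seed: integer used to initialise the pseudorandom generator.
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    # Draw as many column indices as there are sites, with replacement,
    # so some columns appear several times and others not at all.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


# Toy alignment: names and residues are made up for illustration only.
toy = {
    "seq_a": "MKTAYIAKQR",
    "seq_b": "MKTAYLAKQR",
    "seq_c": "MRTAYIAKHR",
}

rep1 = bootstrap_alignment(toy, seed=111)
rep2 = bootstrap_alignment(toy, seed=111)  # same seed, same resampling
rep3 = bootstrap_alignment(toy, seed=559)  # a different seed

print(rep1 == rep2)
```

Because rep1 and rep2 start from the same seed they are identical, which is the same principle behind getting identical bootstrap values whenever the default seed is reused. A different seed, as in rep3, will normally give a different resampling.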
The first step is to run SEQBOOT:

    % seqboot
    /phylip/bin_phylip/seqboot: can't read infile
    Please enter a new filename> soma.phy
    Random number seed (must be odd)? 579

The program will then present you with:

    Bootstrapped sequences algorithm, version 3.53c

    Settings for this run:
      D  Sequence, Morph, Rest., Gene Freqs?   Molecular sequences
      J  Bootstrap, Jackknife, or Permute?     Bootstrap
      R  How many replicates?                  100
      I  Input sequences interleaved?          Yes
      0  Terminal type (IBM PC, VT52, ANSI)?   ANSI
      1  Print out the data at start of run    No
      2  Print indications of progress of run  Yes

    Are these settings correct? (type Y or the letter for one to change)
    Y

In fact it might be better to enter R and reduce the number of replicates to 50 or 20 in a classroom setting, and raise it to 1000 for real work.

    completed replicate number 10
    completed replicate number 20
    completed replicate number 30
    completed replicate number 40
    completed replicate number 50
    completed replicate number 60
    completed replicate number 70
    completed replicate number 80
    completed replicate number 90
    completed replicate number 100

    Output written to output file called outfile !

    % mv outfile soma.sqb
    % protpars
    /phylip/bin_phylip/protpars: can't read infile
    Please enter a new filename> soma.sqb

    Protein parsimony algorithm, version 3.53c

    Setting for this run:
      U  Search for best tree?                   Yes
      J  Randomize input order of sequences?     No. Use input order
      O  Outgroup root?                          No, use as outgroup species 1
      T  Use Threshold parsimony?                No, use ordinary parsimony
      M  Analyze multiple data sets?             No
      I  Input sequences interleaved?            Yes
      0  Terminal type (IBM PC, VT52, ANSI)?     ANSI
      1  Print out the data at start of run      No
      2  Print indications of progress of run    Yes
      3  Print out tree                          Yes
      4  Print out steps in each site            No
      5  Print sequences at all nodes of tree    No
      6  Write out trees onto tree file?         Yes

    Are these settings correct? (type Y or the letter for one to change)
    M
    How many data sets?
Y (to indicate that all parameters are set and the analysis can begin)

protpars will clank through the input data sets 50 times, printing out each time thus:

    Data set # 8:

    Adding species:
       SOMA_BOVIN
       SOMA_SHEEP
       SOMA_MOUSE
       SOMA_RAT
       SOMA_RABIT
       SOMA_PIG
       SOMA_HUMAN

    doing global rearrangements
      !-------------!
       .............

until eventually it declares:

    Output written to output file
    Trees also written onto file
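The next step hands these replicate trees to consense, which tallies how often each grouping of species recurs across them. As a rough sketch of that counting idea only (consense itself reads the trees from the tree file and applies its own rules), assume each replicate tree has already been reduced to a set of species groupings. The function names clade_support and majority_rule and the four toy trees are invented for illustration; they are not results from the somatotropin data.

```python
from collections import Counter


def clade_support(replicate_trees):
    """Count in how many replicate trees each grouping of species appears."""
    counts = Counter()
    for tree in replicate_trees:
        # set() ensures a grouping is counted at most once per tree.
        counts.update(set(tree))
    return counts


def majority_rule(counts, n_trees):
    """Keep only groupings found in more than half of the trees."""
    return {group: n for group, n in counts.items() if n > n_trees / 2}


# Hypothetical replicate trees, each written as a set of groupings.
trees = [
    {frozenset({"SOMA_BOVIN", "SOMA_SHEEP"}), frozenset({"SOMA_MOUSE", "SOMA_RAT"})},
    {frozenset({"SOMA_BOVIN", "SOMA_SHEEP"}), frozenset({"SOMA_MOUSE", "SOMA_RAT"})},
    {frozenset({"SOMA_BOVIN", "SOMA_SHEEP"}), frozenset({"SOMA_MOUSE", "SOMA_RABIT"})},
    {frozenset({"SOMA_BOVIN", "SOMA_PIG"}), frozenset({"SOMA_MOUSE", "SOMA_RAT"})},
]

counts = clade_support(trees)
supported = majority_rule(counts, len(trees))

for group, n in sorted(supported.items(), key=lambda item: sorted(item[0])):
    print(sorted(group), n, "of", len(trees))
```

In this toy input two groupings each occur in 3 of the 4 trees and so pass the majority threshold, while the two groupings seen only once are dropped.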
    % mv treefile soma.sqbtree
    incbi@acer>consense

    Majority-rule and strict consensus tree program, version 3.53c

    Settings for this run:
      O  Outgroup root?                          No, use as outgroup species 1
      R  Trees to be treated as Rooted?          No
      0  Terminal type (IBM PC, VT52, ANSI)?     ANSI
      1  Print out the sets of species           Yes
      2  Print indications of progress of run    Yes
      3  Print out tree                          Yes
      4  Write out trees onto tree file?         Yes

    Are these settings correct? (type Y or the letter for one to change)
    Y

There are two output files from consense. The file outfile has the most accessible information, although it is hardly of camera-ready quality. Each branch in this file carries the number of bootstrap replicates in which that branch was supported. Don't draw the consense treefile expecting to see the bootstraps printed at the branches. It writes these into a New Hampshire format tree in such a way that they are interpreted as branch lengths.

You should now practise the use of these programs with a real dataset of your own, or use SRS at the EBI to download a family or partial family of proteins, such as:

mammalian protein seqs: cytochrome C, alcohol dehydrogenase, actin alpha, actin beta, enolase, catalase, cathepsin D, HSP70 (heat shock protein), hexokinase, histone H3, lactate dehydrogenase, osteonectin/S.P.A.R.C., pyruvate kinase, somatotropin, spectrin alpha, spectrin beta, Thy-1 (membrane) glycoprotein, triose phosphate isomerase, tubulin alpha, tubulin beta.

within prokaryotes: flagellin, recA, glnA/glutamine synthetase

Appendix I Graphics Output

Much of the strength of GCG lies in its graphical display capabilities. However, graphics are very much dependent on the terminal you are using. So it is not possible to preset the correct parameters for bioinformatics work; rather, each user must determine which graphics display mode best suits their local situation. To set up the graphics output interface you must run one of the following programs.
It may be necessary to switch between graphics output modes in one session. 'postscript' can be used for hard copy and final print outs, while 'tektronix' and 'xwindows' are better for instant views on the screen.
% postscript
The following command lines cover several of the more common situations. Note: if you have Xwindows capability, you may want to use the Wisconsin Package (X-)Interface called seqlab. To run this, type:
All the GCG programs are then available from a window with menus.

Appendix II

Ways of representing sequences for findpatterns and motifs.
Expressions:
Appendix III Sequence Symbols

GCG uses the letter codes for amino acids and nucleotide ambiguity proposed by IUB (Nomenclature Committee, 1985, Eur. J. Biochem. 150; 1-5). These codes are compatible with the codes used by the EMBL, GenBank, and PIR databases.

Nucleotides
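As a small illustration of how the nucleotide ambiguity codes can be put to work, the sketch below expands a pattern written with the standard IUB/IUPAC symbols into an ordinary Python regular expression. It uses plain Python regex syntax and is not the pattern syntax of findpatterns; the function name to_regex and the test sequence are invented for the example.

```python
import re

# Standard IUB/IUPAC nucleotide ambiguity codes and the bases they stand for.
IUB_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def to_regex(pattern):
    """Translate a pattern of IUB symbols into a regular expression string."""
    parts = []
    for symbol in pattern.upper():
        bases = IUB_NUCLEOTIDE[symbol]
        # Single bases are kept literal; ambiguity codes become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = to_regex("GGWCC")
# Made-up sequence for illustration only.
hits = [m.start() for m in re.finditer(regex, "TTGGACCAAGGTCCGGCCC")]
print(regex, hits)
```

Here W stands for A or T, so the pattern GGWCC becomes GG[AT]CC and matches at 0-based positions 2 and 9 of the test sequence, but not the GGCCC near the end.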
Amino Acids
The Universal Genetic Code.
Appendix IV

Biochemically meaningful grouping of Amino Acids