Homology SearchingContrary to widely held belief, "homology searching", "Blast" and "the NCBI basic blast server" are not all synonyms ! The GCG programs that you will use for homology searching are blast netblast and fasta. We will also be using some WWW blast servers. Blast finds sequences in a database that are similar to a query sequence. It uses the method of Altschul et al.(J. Mol. Biol. 215; 403-410 (1990)). The query sequence and the database to be searched can be either peptide or nucleotide in any combination
Blast finds regions of relative similarity between your query sequence and a given database. These regions of similarity are known as segment pairs. The segment pairs found have the property that the sum of the scoring matrix of their constituent symbol pairs is above what you may expect to occur by chance. Homology searching is one of the most computationally intensive things you can do in bioinformatics. Care should be taken to ensure whether local computers are able to deal with your search in an effective time-scale. You can use such UNIX based protocols as uptime or top(if installed) to try to determine if running a blast search to the current system load is likely to slow everything down unacceptably. GCG 'netblast' enables you to carry out your searches on a dedicated remote blast server, this could be recommended if:
Options and parametersAll the homology searching programs come with several options and choices to tailor a search to yield the results you want. Some of these parameters change the quality of the search while others merely change the amount of information (eg number of hits) that is returned to you.It is perfectly legitimate to run a homology search taking the default parameters: those defaults have been chosen to achieve the "greatest good for the greatest number". If your query sequence turns out to be a member of a well established and widely sequenced family and you merely want to establish this fact, then you will almost certainly get the same result regardless of what options you choose. For less trivial searches/questions, however, such as finding distant relationships, one homology search is almost certainly inadequate.
For the WWW servers you should be able to find the equivalent switches, using the hypertext-linked help information for each option. For obvious reasons the number of available options is limited on the Basic Blast page. For reproducibility you should note the time, date and database version and size against which you carried out the homology search, this information should appear at the head of the output Hints and suggestions
Examples:GCG -FastaUse the GCG version of fasta to find potential yeast homologs of the E. coli recA gene using the DNA sequence.% fasta FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein)? For nucleotide searches, FastA may be more sensitive than BLAST FASTA with what query sequence ?em:v00328 Begin
(* 1 *) ?
1 Sequences 2,167bp searched EM_FUN:SC
Unfortunately, this does not find the known (but distant) homologue of recA the RAD51 gene in yeast (sw:ra51_yeast) because there are so many gaps in the alignment of the two sequences. If you were to do a blast search against, say, SwissProt to find this yeast homologue you'd have to ignore the first 80+ hits because recA is amongst the most widely sequenced of prokaryotic proteins. This can be a real problem when you already know of the obvious relationships with your protein of choice. In this particular case we are in better shape because the NCBI has created a blast-searchable DB specifically for Saccharomyces cerevisiae (yeast) presumably because it is a) a standard genetic organism and b) the first completely sequenced eukaryotic genome. See exercise 5 below for another approach. GCG - netblastUse the GCG internet version of blast to find homologs of the Bordetella pertussis recA gene using the protein sequence. Or some other more interesting gene of your own choice !% netblast sw:reca_borpe NetBLAST searches for seqs similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA. Search for query in what sequence database: 1) nr
p Non-redundant GB CDS translations+PDB+SwissProt+PIR
Please choose one (* 1 *): 3 Ignore hits expected to occur by chance more than (* 10.0 *) times? 10 Limit the number of sequences in my output to (* 250 *) ? What should I call the output file (* reca_borpe.blastp *) ? reca.blastp Sending query...
% more reca.blastp BLASTP 1.4.11 [24-Nov-97] [Build 24-Nov-97] Reference: Altschul,
Stephen F? Warren Gish, Webb Miller, Eugene W. Myers,
Database: Non-redundant SwissProt sequences
Smallest
sp|P17740|RECA_BORPE RECA PROTEIN
1744 5.5e-229 1
etc. etc. The first part of the output displays in order of statistical significance the 'hits' each line having the following information: database|accession number|database_name a brief description, an absolute score, the probability that you would get a match as good as this by chance - in all these cases this probability is vanishingly small so the 'hits' are likely to be statistically (and biologically) significant, and finally the number of high scoring segment pairs. GCG netblast is using an outmoded gap-free blast algorithm, so most of the hits come in two chunks: >SP|P19690|RECA-BURCE RECA PROTEIN Length = 347 Score = 232 (105.0 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
Query: 11 AEKAKALAAALSQIEKQFGKGSIMRYGDNEVEHDIQVVSTGSLGLDIALGVGGLP
65
Score = 1193 (539.7 bits), Expect = 8.7e-180, Sum P(2) = 8.7e-180
Query: 68 VIEVYGPESSGKTTLTLQVIAEMQKLGGTCAFVDAEHALDVQYASKLGVNLTDLLISQPD
127
Query: 128 TGEQALEITDALVRSGSVDLIVIDSVAALVPKAEIEGEMGDSLPGLQARLMSQALRKLTA
187
Query: 188 TIKRTNCMVIFINQIRMKIAVMFGNPETTTGGNALKFYSSVRLDIRRIGAIKKGDEVVGN
247
Query: 248 ETRVKVVKNKVAPPFKQAEFDIMYGSGISREGEIIDLGVQANVVDKSGAWYSYSGNRIGQ
307
Query: 308 GKDNVREYLKEHKEMAIEIENKVRENQGIVS 338
After the list of hits there is a display of the alignments on which the hits are based. As an aside you may notice that, for the Query sequence, the first chunk above ends with residue 65 and the second begins with residue 68, while for the Sbjct sequence the first chunk ends with residue 57 and the second begins with residue 61. The reca_borpe.sw sequence is one residue shorter than most other recA sequences at this point and the old blastp is unable to bridge the gap. Careful scrutiny of the output will tell you that GCG Netblast is pointing at an NCBI server which is still running the 'gapless' Blast 1.4. This server is quite unreliable and may become intermittently unavailable after the USA wakes up. Output from http://www.ncbi.nlm.nih.gov/BLASTHere below is the first part of the output from a search for the yeast homologue of the prokaryotic recaA gene, using sw:reca_ecoli as query sequence.Reference:
Score E
The hits are reported as gi numbers - a unique identifier for the Genbank seqience, followed by a Genbank/EMBL accession number, then a brief diescriptor, then a score in bits (which should be independent of the scoring matrix) and finally an E value. This lat is the same as the -exp parameter and tellsd you the number of hitys you would expect in the searched database by chance alone. 1e-07 is a small number so the match is likely to be statistically (and biologically!) significant. Exercises
% netblast sw:cas1_bovin swissprot -lis=50 -seg=20 -fil=xs -mat=pam40 -exp=0.001 -out=bovin.blastp You can thus more easily run the several necessary blast searches and keep track of the combination of options that you have used. |
||||||
|