Companion & Solutions

Multiple Sequence Alignments and Profiles

Finding a local homology domain by BLAST and pairwise alignment

     a)
     HINT: Use the lookup program
     SOLUTION: The entry name is XRC1_HUMAN.

     b)
     HINT: Call blast with the option -filter=xs
     SOLUTION: See the blast output file.
     NOTE: To see what happens when the search is run without filters, look at the following file.

     c)
     HINT: For pairwise comparison, use the BLOSUM45 comparison matrix. Since we are interested in
     local homologies, use the bestfit program with gap creation penalties of 20-30 and gap extension
     penalties of 2-3. For assessing the statistical significance of pairwise matches, use the following
     command line:

bestfit -data=blosum62.cmp -gap=18 -len=2 swiss:rad4_schpo swiss:xrcc_human -ran=100

SOLUTION: The only significant match here is Rad4 from S. pombe (see corresponding file).
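
To make the idea behind the -ran=100 option concrete, here is a minimal Python sketch of a shuffle-based significance test. It assumes that the option compares the real alignment score against the scores obtained after shuffling one sequence 100 times. The scoring function is a toy local-alignment score with simple match/mismatch values and linear gap penalties, not bestfit's matrix-based scoring, and all function names are hypothetical.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman score with linear gap penalties (not bestfit's scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def shuffle_z_score(a, b, n_random=100, seed=0):
    """Compare the real score with scores obtained after shuffling sequence b."""
    rng = random.Random(seed)
    real = local_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_random):
        rng.shuffle(letters)  # keeps the composition, destroys the order
        random_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    z = (real - mean) / sd if sd > 0 else float("inf")
    return real, mean, sd, z
```

A real score that lies many standard deviations above the mean of the shuffled scores suggests that the match does not merely reflect sequence composition.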

     d)
     SOLUTION: Use the combination of GCG's compare and dotplot programs. For compare, use a
     window of 35 and a stringency of 20. See the resulting file.

compare -window=36 -STR=20 -INfile1=swiss:RAD4_SCHPO -INfile2=swiss:xrcc_human

dotplot xrcc_human.pnt
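
The roles of the window and stringency parameters can be illustrated with a small Python sketch. It is a simplification: a window scores one point per identical residue, whereas compare scores windows with a comparison matrix. The function name and the default values are hypothetical.

```python
def dot_matrix(seq1, seq2, window=5, stringency=4):
    """Return the (i, j) window centres where a diagonal window passes the threshold.

    A point is recorded when at least `stringency` of the `window` aligned
    residue pairs are identical (identity scoring is an assumption here).
    """
    points = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= stringency:
                points.append((i + window // 2, j + window // 2))
    return points
```

A larger window with a proportionally high stringency suppresses short chance matches, so that only extended diagonals, i.e. candidate homology regions, remain visible in the plot.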


Multiple alignment of homologous domains and profile searches

     a)
     HINT: Create a list file for pileup using the Begin and End specifications.
     SOLUTION: Use the following list file (MAPP.4.list), indicating the approximate conserved
     regions.

swiss:RAD4_SCHPO Begin:1 End:80

swiss:RAD4_SCHPO Begin:100 End:180

swiss:XRCC_HUMAN Begin:310 End:390

and the command

pileup @MAPP.4.list

that produces an msf file.
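
As an illustration of what the Begin and End specifications select, the following Python sketch parses one list-file line and cuts the corresponding region out of a sequence. It assumes that positions are 1-based and inclusive; the function names are hypothetical, and only the simple line layout shown above is handled.

```python
import re

# Matches lines such as "swiss:RAD4_SCHPO Begin:1 End:80".
LINE_RE = re.compile(r"^(\S+)\s+Begin:(\d+)\s+End:(\d+)\s*$")


def parse_list_line(line):
    """Split a list-file line into (sequence name, begin, end)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised list-file line: {line!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def extract_region(sequence, begin, end):
    """Return residues begin..end, counting from 1 and including both ends."""
    return sequence[begin - 1:end]
```

Note that the same sequence (here RAD4_SCHPO) may appear more than once in the list file, each time with a different region.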

b)
HINT: Suppose your alignment file is pileup.msf. Invoke lineup with:

lineup -MSF pileup.msf

     NOTE: This differs from, e.g., reformat, where you would have to say 'reformat -msf
     pileup.msf{*}'. The reason is that lineup always expects a multiple-alignment file, while reformat
     expects one (or more) sequence(s).
     SOLUTION: An example edited alignment file.

c)
HINT: Suppose your edited MSF file is called edited.msf. The command line is:

profilemake -stringent -nologwgt -data=blosum62.cmp edited.msf{*}

See the note above for the {*} syntax.
SOLUTION: An example output of profilemake is in this file.
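
To give a feeling for what a profile is, the following Python sketch derives per-column residue frequencies from a small alignment. This is only a simplified stand-in: profilemake builds position-specific scores using a comparison matrix (blosum62.cmp in the command above), which is not reproduced here. The function name and the set of gap characters are assumptions.

```python
from collections import Counter

GAP_CHARS = ".-~"  # assumed gap symbols in the aligned sequences


def column_frequencies(aligned):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in aligned if s[col] not in GAP_CHARS)
        total = sum(counts.values())
        # Columns consisting only of gaps get an empty dict.
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile
```

Strongly conserved columns end up with one dominant residue, which is why a carefully edited alignment gives a sharper profile.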

     d)
     HINT: For searching small to medium-sized databases, profilesearch is suitable. However, profilesearch
     has a built-in restriction to take at most 100,000 sequences into consideration. For big databases, the
     EGCG program tprofilesearch with the option -nosixframe can be used. tprofilesearch also has a
     restriction to 80,000 sequences, but it accepts the -minscore=xx parameter. With this option, only
     sequences with a score higher than xx are considered (and counted). See the tprofilesearch
     documentation (EGCG package) for details. Suitable commands for searching our example against
     SwissProt are:

profilesearch -noave -nor -gap=21 -len=2 -batch

or, with EGCG:

tprofilesearch -nosixframe -noaverage -normalize -minscore=5.0 -list=100 -batch ..

SOLUTION: The output file of this search is in this file.

     f) HINT: Significant hits have Z-scores > 7 or 8.
     In this example, YD97_SCHPO, YHV4_YEAST, YM8K_YEAST, DNLJ_THESC, DNLJ_THETH,
     DNL4_HUMAN and DNLJ_ECOLI are to be considered significant.
     SOLUTION: The output file of profilesegments is given in this file.

Advanced exercises

     e) HINT: Create matching segments with a command line like
     profilegap -outfile2=newsegment.seg
     In the newly found matches, only a part of the profile is matched to the sequence. In cases like this, it
     might be a good idea to manually check if the flanking regions of the sequence might be forced to also
     match the profile. This can be checked e.g. by
     profilesegments -global.

     f) NOTE: You should be able to find consecutively more significant matches, e.g. bacterial ligases,
     yeast Rev1, mammalian Ect, later also mammalian and yeast ligases, yeast Rad9, Rfc1, PARP, 53BP1
     and even Brca1.


The Psi-Blast Server

     a)
     HINT: Do not use the full XRCC_HUMAN sequence. If you do so, another domain at the N-terminus will obscure
     the hits of interest. The output file you get using the whole sequence is shown here. The major problem is that
     the long N-terminal domain makes it difficult to see the shorter domain we are interested in.

SOLUTION: The correct solution is obtained in several rounds by submitting only the first 400
amino acids of XRCC_HUMAN. See the file here for the third iteration.

     b)
     HINT: Pileup is not sensitive enough to make this alignment. Instead, use ClustalW, but use pileup to
     produce the sequence fragments you wish to align. Use readseq to reformat your sequences.

SOLUTION: You can use the following list file and feed it to pileup. The output alignment is rather
unconvincing (here).

pileup @listfile

Reformat the alignment into a sequence file (it is important to remove the gaps before proceeding).
If you use clustalx, the msf file can be loaded directly and the gaps removed before doing the multiple alignment.
If you are using clustalw, you may want to use readseq:

readseq <your_alignment.msf> -f=8 -degap=~ -o=your_sequences.seq -a [Produces a sequence file in FASTA]

clustalw your_sequences.seq [Produces a .aln file containing the alignment in clustalw format]

You can also edit this alignment using seaview. Do not forget to save it in clustalw format for further processing (e.g. with boxshade).
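
The gap-removal step described above can also be sketched in a few lines of Python, in case readseq is not at hand. This is an illustrative sketch, not a replacement for readseq: it assumes that gaps are written as '.', '-' or '~', and it expects the aligned sequences to be already available as (name, sequence) pairs. The function names are hypothetical.

```python
def degap(aligned_seq, gap_chars=".-~"):
    """Remove gap characters from one aligned sequence."""
    return "".join(ch for ch in aligned_seq if ch not in gap_chars)


def to_fasta(records, width=60):
    """Format (name, aligned sequence) pairs as gap-free FASTA text."""
    lines = []
    for name, aligned in records:
        seq = degap(aligned)
        lines.append(f">{name}")
        # Wrap the sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

The resulting text can be written to a file and passed to clustalw in place of your_sequences.seq.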

     c)
     You can use Boxshade or prettyview to produce a more convincing alignment in PostScript format (here is a
     black-and-white example), obtained with the command:

boxshade -def -in=pileup4domain.aln -out=x.ps -cons
