Companion to Day MAPP.of the EMBNET course - Considerations, hints and solutions

MAPP.1 Checking a sequence for the existence of motifs

a) HINT: Use the lookup program
SOLUTION: The entry name is VAV_HUMAN

b) SOLUTION: vav_human.motifs

c) HINT: the necessary command lines are:
motifs -frequent
motifs -mismatch=1
SOLUTION: the resulting output files are vav_human_freq.motifs and vav_human.mism1.motifs

d) SOLUTION: The PROSITE patterns DAG_PE_BINDING_DOMAIN and GDS_CDC2MAPP. apply to the sequence. The frequent patterns MYRISTYL, RGD, and ASN_GLYCOSYLATION certainly do not apply. Most of the various phosphorylation sites are probably not used either.

e) SOLUTION: The following domains are found: (PH_DOMAIN, SH2, 2x SH3, GRF_DBL, CH_DOMAIN, DAG_PE_BIND).
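
The effect of the -mismatch=1 option in c) can be illustrated with a small Python sketch. It is not how the GCG motifs program is implemented: it only handles fixed-length patterns (here the frequent RGD pattern), and the function name and the toy sequence are made up for illustration.

```python
def find_with_mismatches(sequence, pattern, max_mismatch=0):
    """Return (1-based start, matched segment, mismatch count) for each hit.

    `pattern` is a list holding one set of allowed residues per position.
    Variable-length gaps such as x(0,1) are not handled in this sketch.
    """
    hits = []
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        # Count the positions where the residue is not allowed by the pattern.
        mismatches = sum(res not in allowed for res, allowed in zip(segment, pattern))
        if mismatches <= max_mismatch:
            hits.append((start + 1, segment, mismatches))
    return hits


RGD = [{"R"}, {"G"}, {"D"}]      # the R-G-D pattern, one set per position
TOY_SEQUENCE = "AARGDKKRGEAA"    # made-up sequence, not a database entry

exact_hits = find_with_mismatches(TOY_SEQUENCE, RGD)
relaxed_hits = find_with_mismatches(TOY_SEQUENCE, RGD, max_mismatch=1)
print(exact_hits)    # hits without mismatches
print(relaxed_hits)  # hits with at most one mismatch
```

Allowing one mismatch adds the near miss RGE to the exact hit, which is why relaxed searches with short patterns produce many additional, mostly meaningless, matches.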

MAPP.2 Finding occurrences of patterns in protein databases

a) SOLUTION: ELVIS occurs 25 times in SwissProt, ISREC only MAPP.times
b) HINT: For GCG pattern format documentation see GCG Manual, subheadings Gene finding and pattern recognition - motifs - defining patterns. For PROSITE pattern format documentation, see the URL http://expasy.hcuge.ch/txt/prosuser.txt
SOLUTION: The PROSITE pattern is: "k-x(0,1)-k-x-x>", the GCG pattern is "kx{0,1}kxx>"
c) SOLUTION: In SwissProt, 1397 sequences bear this motif at the C-terminus. A complete list can be found here.
Most of these proteins are not type I transmembrane proteins and never see the ER retention machinery.
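
The relation between the two pattern notations in b) can be made concrete by translating the PROSITE pattern into a regular expression, which looks much like the GCG form. The following Python sketch covers only a subset of the PROSITE syntax (single residues, x, [..], {..}, repeat counts, and the < and > anchors); the function name and the test sequences are made up for illustration.

```python
import re

# One PROSITE element: a symbol, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(X|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    text = pattern.strip().rstrip(".").upper()
    prefix = "^" if text.startswith("<") else ""   # '<' anchors at the N-terminus
    suffix = "$" if text.endswith(">") else ""     # '>' anchors at the C-terminus
    core = text.lstrip("<").rstrip(">")
    parts = []
    for element in core.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = match.groups()
        if symbol == "X":
            parts.append(".")                        # any residue
        elif symbol.startswith("{"):
            parts.append("[^" + symbol[1:-1] + "]")  # excluded residues
        else:
            parts.append(symbol)                     # residue or [..] set
        if low is not None:
            parts.append("{" + low + ("," + high if high else "") + "}")
    return prefix + "".join(parts) + suffix


regex = prosite_to_regex("k-x(0,1)-k-x-x>")
print(regex)
# Made-up sequences: the first two end in the motif, the third does not.
for seq in ("MAAAKKLL", "MAAAKAKLL", "MKKLLA"):
    print(seq, bool(re.search(regex, seq)))
```

Because the pattern is anchored at the C-terminus, a sequence that carries the same residues further upstream (the third example) is not reported.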
XRCC_HUMAN.
b) HINT: Call blast with the option -filter=xs
SOLUTION: See the blast output file.
NOTE: If you want to see what happens if the search is run without filters, look at the following file
c) HINT: For pairwise comparison, use the BLOSUMMAPP. comparison matrix. Since we are interested in local homologies, use the bestfit program with gap creation penalties of 20-30 and gap extension penalties of 2-3. For assessing the statistical significance of pairwise matches, use the following command line:

bestfit -data=blosumMAPP..cmp -gap=18 -len=2 swiss:radMAPP.schpo swiss:xrcc_human -ran=100

SOLUTION: The only significant match here is RadMAPP.from S. pombe (see corresponding file).
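
The idea behind the randomization in c) can be sketched in Python, assuming that -ran=100 scores the first sequence against 100 shuffled versions of the second and reports the mean and standard deviation of these scores. The function names are hypothetical, and identity_count is only a stand-in for a real alignment score such as the bestfit quality.

```python
import random
import statistics


def identity_count(seq_a, seq_b):
    """Toy score: identical residues at equal positions (no real alignment)."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def shuffled_scores(seq_a, seq_b, score_fn, n_shuffles=100, seed=0):
    """Score seq_a against n_shuffles shuffled copies of seq_b."""
    rng = random.Random(seed)
    residues = list(seq_b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps the composition, destroys the order
        scores.append(score_fn(seq_a, "".join(residues)))
    return scores


def z_score(real_score, random_scores):
    """Distance of the real score from the random mean, in standard deviations."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("randomized scores show no spread")
    return (real_score - mean) / sd


# Arithmetic example with made-up numbers: mean 2, standard deviation 1.
print(z_score(10.0, [1.0, 2.0, 3.0]))
```

A match is convincing only if its score lies many standard deviations above the scores obtained with shuffled sequences of the same composition.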
d) SOLUTION: Use the combination of the GCG compare and dotplot programs. For compare, use a window of 35 and a stringency of 20. See the resulting file.

compare -window=36 -STR=20 -INfile1=swiss:RADMAPP.SCHPO -INfile2=swiss:xrcc_huma
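
What the window and stringency parameters mean can be shown with a minimal Python sketch of a window-based dot plot. It is not the GCG compare implementation: the scoring here is plain identity (1 for a match, 0 otherwise) instead of a comparison matrix, and the function name and sequences are made up.

```python
def window_matches(seq_a, seq_b, window, stringency):
    """Return (i, j) start positions (0-based) of windows scoring >= stringency."""
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Sum the per-position scores along the diagonal starting at (i, j).
            score = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if score >= stringency:
                points.append((i, j))
    return points


# Made-up example: the first sequence is embedded in the second at offset 2.
points = window_matches("ACDEFG", "TTACDEFGTT", window=4, stringency=4)
print(points)
```

Each point is one dot in the plot; a run of points along a diagonal, as in this example, indicates a region of local similarity. A larger window or a higher stringency removes more background dots but can also hide weak homologies.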

MAPP.5 Multiple alignment of homology domains and profile searches
a)
HINT: create a list file for pileup using the Begin and End specifications.
SOLUTION:
Use the following list_file ( MAPP.MAPP.list), indicating the approximate conserved regions.
******************************************
swiss:RADMAPP.SCHPO Begin:1 End:80
swiss:RADMAPP.SCHPO Begin:100 End:180
swiss:XRCC_HUMAN Begin:310 End:390
******************************************
and the command

pileup @MAPP.MAPP.list

that produces an msf file.
b)
HINT: Suppose your alignment file is pileup.msf. Invoke lineup with:

lineup -MSF pileup.msf

NOTE: This differs from e.g. reformat, where you would have to say 'reformat -msf pileup.msf{*}'. The reason is that lineup always expects a multiple-alignment file, while reformat expects one or more sequences.
SOLUTION: An example edited alignment file.
c)
HINT: Suppose your edited MSF file is called edited.msf. The command line is:

profilemake -stringent -nologwgt -data=genmoredata:blosumMAPP..cmp edited.msf{*}

See the note above for the {*} syntax.
SOLUTION: An example output of profilemake is in this file.
d)
HINT: profilesearch is suitable for searching small to medium-sized databases. However, it has a built-in restriction and takes at most 100,000 sequences into consideration. For big databases, the EGCG program tprofilesearch can be used with the option -nosixframe. tprofilesearch is also restricted, to 80,000 sequences, but it accepts the -minscore=xx parameter. With this option, only sequences with a score higher than xx are considered (and counted). See the tprofilesearch documentation (EGCG package) for details. Suitable commands for searching our example against SwissProt are:

profilesearch -noave -nor -gap=21 -len=2 -batch

or, in EGCG:

tprofilesearch -nosixframe -noaverage -normalize -minscore=5.0 -list=100 -batch ..

SOLUTION: The output file of this search is in this file.
NOTE: If you want to see what happens if you use the defaults for profilemake and profilesearch, see the file ~pbucher/course/exMAPP./edited_bad.pfs
f) HINT: Significant hits have Z-scores > 7 or 8.
In this example, YD97_SCHPO, YHVMAPP.YEAST, YM8K_YEAST, DNLJ_THESC, DNLJ_THETH, DNLMAPP.HUMAN and DNLJ_ECOLI are to be considered significant.
SOLUTION: The output file of profilesegments is given in this file.
MAPP.6 Advanced exercises
a) HINT: Create matching segments with a command line like
profilegap -outfile2=newsegment.seg
In the newly found matches, only a part of the profile is matched to the sequence. In cases like this, it might be a good idea to manually check if the flanking regions of the sequence might be forced to also match the profile. This can be checked e.g. by
profilesegments -global.
c) NOTE: You should be able to find progressively more significant matches, e.g. bacterial ligases, yeast Rev1, mammalian Ect, later also mammalian and yeast ligases, yeast Rad9, Rfc1, PARP, 53BP1 and even Brca1.