Format Documentation

FASTA format

Each sequence in a FASTA-formatted file is preceded by a line starting with >.
The first word on this line is the name of the sequence; the rest of the line is a description of the sequence.
The first character of the name must be a digit or a letter. The lines that follow contain the sequence itself.
Blank lines in a FASTA file are ignored, and so are spaces and gap symbols (dashes, underscores, periods) within a sequence.

>1aboA
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPS
NYITPVN
>1ycsB
KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDEDEIEWWWARLNDKEGY
VPRNLLGLYP
>1pht
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIG
WLNGYNETTGERGDFPGTYVEYIGRKKISP
>1vie
DRVRKKSGAAWQGQIVGWYCTNLTPEGYAVESEAHPGSVQIYPVAALERI
N
>1ihvA
NFRVYYRDSRDPVWKGPAKLLWKGEGAVVIQDNSDIKVVPRRKAKIIRD
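
The rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by T-COFFEE: the function name `parse_fasta`, the `GAP_CHARS` constant, and the description text in the sample are assumptions made for the example.

```python
# Characters dropped from sequence lines: spaces and the gap symbols
# named in the text (dashes, underscores, periods).
GAP_CHARS = " \t-_."


def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""          # first word: name
            description = header[1] if len(header) > 1 else ""  # rest: description
            chunks = []
        elif name is not None:
            # keep only residues, dropping spaces and gap symbols
            chunks.append("".join(c for c in line if c not in GAP_CHARS))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


# The description after "1aboA" is made up for this example.
example = """>1aboA hypothetical description text
NLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPS
NYITPVN

>1vie
DRVRKKSGAAWQGQIVGWYCTNLTPEGYAVESEAHPGSVQIYPVAALERI
N
"""
records = parse_fasta(example)
```

Each record keeps the name and the description separately, so the description can be discarded without touching the sequence name.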

Clustal format

Clustal format files begin with the word CLUSTAL:

CLUSTAL W (1.82) multiple sequence alignment

1aboA     -NLFV-ALYDFVASGDNTLSITKGEKLRV-------LGYNHNG-------EWCEA--QTK
1ycsB     KGVIY-ALWDYEPQNDDELPMKEGDCMTI-------IHREDEDEI-----EWWWA--RLN
1pht      -GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIGWLNGYNETT
1vie      ---------DRVRKKSG--AAWQGQIVGW---------YCTNLTP----EGYAVESEAHP
1ihvA     ------NFRVYYRDSRD--PVWKGPAKLL---------WKGEG-------AVVIQ---DN
          . . *

1aboA     NGQGWVPSNYITPVN------
1ycsB     DKEGYVPRNLLGLYP------
1pht      GERGDFPGTYVEYIGRKKISP
1vie      GSVQIYPVAALERIN------
1ihvA     SDIKVVPRRKAKIIRD-----
          . *
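
A Clustal file of this kind can be read block by block. The sketch below is an illustration under stated assumptions, not the reader used by T-COFFEE: it assumes that sequence lines start with the sequence name, that conservation lines start with whitespace, and the function name `parse_clustal` and the toy alignment are made up for the example.

```python
def parse_clustal(text):
    """Return {name: aligned sequence} from Clustal-formatted text."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The file must begin with the word CLUSTAL.
    if not lines or not lines[0].strip().upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: first line lacks the word CLUSTAL")
    sequences = {}
    for line in lines[1:]:
        if line[0].isspace():
            continue  # assumed to be a conservation line (symbols only)
        fields = line.split()
        if len(fields) >= 2:
            # join the chunks of each sequence across successive blocks
            sequences[fields[0]] = sequences.get(fields[0], "") + fields[1]
    return sequences


# Toy alignment, made up for this example.
toy = """CLUSTAL W (1.82) multiple sequence alignment

seqA   ACD-EF
seqB   ACDGEF
       *** **

seqA   GHIK
seqB   GH-K
       ** *
"""
alignment = parse_clustal(toy)
```

Gap characters are kept in the returned strings, so every sequence has the same aligned length.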

MSF format

MSF-formatted multiple sequence files are most often created with programs of the GCG suite. An MSF file includes the name of each sequence and the sequence itself, which is usually aligned with the other sequences in the file. An MSF file can hold a single sequence or many sequences.
An example of part of an MSF file, created with the GCG multiple sequence alignment program:

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

                   GapWeight: 8
             GapLengthWeight: 2

 hsp70.msf  MSF: 743  Type: P  October 6, 1998 18:23  Check: 7784 ..

 Name: S11448           Len:   743  Check: 3635  Weight:  1.00
 Name: S06443           Len:   743  Check: 5861  Weight:  1.00
 Name: S29261           Len:   743  Check: 7748  Weight:  1.00

//

           1                                                   50
S11448     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S29261     ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT

Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file:

A first line (all uppercase) reading !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete this file type line if it is present. (optional)
A description line, which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor. (optional)
A dividing line, which contains the number of bases or residues in the sequence, the date the file was created, and, importantly, two dots (..) that separate the descriptive information from the sequence information that follows. (required)

MSF files contain some other information as well:

Name/Weight lines. The name of each sequence included in the alignment, together with its length and checksum (both non-editable) and its weight (editable). (required)
Separating line. Must consist of two slashes (//), which divide the name/weight information from the sequence alignment. (required)
Multiple sequence alignment. Every sequence named in the Name/Weight lines above is included. The alignment lets you view the relationships among the sequences.
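
The required elements listed above (dividing line, Name/Weight lines, // separator, alignment) can be picked out with a short reader. This is a sketch under assumptions, not a GCG-compatible parser: it does not verify checksums, the function name `parse_msf` and the regular expression are illustrative, and the checksum values in the toy file are placeholders rather than real GCG checksums.

```python
import re

# Matches a Name/Weight line such as:
#   Name: S11448  Len: 743  Check: 3635  Weight: 1.00
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def parse_msf(text):
    """Return (weights, alignment) dictionaries keyed by sequence name."""
    lines = text.splitlines()
    try:
        # Required dividing line: carries "MSF:" and ends with two dots.
        divider = next(i for i, line in enumerate(lines)
                       if "MSF:" in line and line.rstrip().endswith(".."))
        # Required separating line: two slashes.
        separator = next(i for i, line in enumerate(lines)
                         if i > divider and line.strip() == "//")
    except StopIteration:
        raise ValueError("missing '..' dividing line or '//' separating line")

    weights = {}
    for line in lines[divider + 1:separator]:
        match = NAME_LINE.search(line)
        if match:
            weights[match.group(1)] = float(match.group(4))

    alignment = {name: "" for name in weights}
    for line in lines[separator + 1:]:
        fields = line.split()
        # Lines that do not start with a known name (for example the
        # column-number ruler) are skipped.
        if fields and fields[0] in alignment:
            alignment[fields[0]] += "".join(fields[1:])
    return weights, alignment


# Toy file made up for this example; the Check values are placeholders.
toy = """!!AA_MULTIPLE_ALIGNMENT 1.0
 toy.msf  MSF: 10  Type: P  Check: 0 ..

 Name: seqA  Len: 10  Check: 0  Weight: 1.00
 Name: seqB  Len: 10  Check: 0  Weight: 0.50

//

          1        10
seqA      ~~MTFD GAIG
seqB      ~~~~MG KIIG
"""
weights, alignment = parse_msf(toy)
```

The optional first line and description line are simply ignored here, since only the dividing line and the // separator are required.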

Visualization of PFAM motifs on T-COFFEE alignments

METHOD:
All sequences in the alignment will be searched individually for known PFAM motifs using the program "hmmpfam".

Every hit to a PFAM motif (E-value < 0.1) will be mapped onto the multiple alignment from T-COFFEE using a unique color (green, red, yellow, blue, ...).
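
The filtering and coloring step might look like the sketch below. It is only an illustration: the `(sequence, motif, evalue)` tuple layout, the function name `color_hits`, and the motif names are hypothetical, and the palette contains only the four colors named above, so it cycles (and colors stop being unique) once more than four motifs are found.

```python
from itertools import cycle

# Only the four colors named in the text; the rest of the palette is unknown.
PALETTE = ["green", "red", "yellow", "blue"]


def color_hits(hits, max_evalue=0.1):
    """Keep hits with E-value < max_evalue and give each motif one color.

    hits: iterable of (sequence_name, motif_name, evalue) tuples
          (a hypothetical layout chosen for this example).
    """
    palette = cycle(PALETTE)
    motif_color = {}
    kept = []
    for sequence_name, motif_name, evalue in hits:
        if evalue >= max_evalue:
            continue  # hit is not significant enough to be mapped
        if motif_name not in motif_color:
            motif_color[motif_name] = next(palette)
        kept.append((sequence_name, motif_name, motif_color[motif_name]))
    return kept


hits = [
    ("s1", "motifA", 1e-5),
    ("s2", "motifA", 0.05),
    ("s1", "motifB", 0.5),   # discarded: E-value above the cut-off
    ("s3", "motifC", 0.01),
]
colored = color_hits(hits)
```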

Such a hit corresponds to an alignment between the sequence and the PFAM motif (see the HMMOUT files that are also available on the T-COFFEE site), in which exactly and weakly conserved residues, as well as dominant residues of the motif, are indicated.

We map the information from this alignment (HMMOUT) onto the T-COFFEE alignment in the following manner:
Residues that are exactly conserved between the sequence and the PFAM motif are colored in a darker shade than the weakly conserved ones. Residues that do not support the alignment are not colored at all. Dominant residues in the PFAM sequence are also boxed.
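
The mapping rule can be sketched per alignment column. The encoding used here is an assumption about the HMMOUT layout, not something stated on this page: a three-line alignment (motif consensus, match line, sequence) in which a letter on the match line marks an exactly conserved residue, a + marks a weakly conserved one, and an uppercase consensus letter marks a dominant motif residue. The function name `classify_columns` is illustrative.

```python
def classify_columns(consensus, match, sequence):
    """Return one (residue, shade, dominant) tuple per sequence residue.

    shade is "dark" (exactly conserved), "light" (weakly conserved)
    or None (residue does not support the alignment).
    dominant is True when the aligned motif position is dominant.
    """
    if not (len(consensus) == len(match) == len(sequence)):
        raise ValueError("the three alignment lines must have equal length")
    result = []
    for motif_residue, mark, residue in zip(consensus, match, sequence):
        if residue in "-.":
            continue  # gap in the sequence: nothing to color
        if mark.isalpha():
            shade = "dark"    # assumed: letter on match line = exact
        elif mark == "+":
            shade = "light"   # assumed: '+' on match line = weak
        else:
            shade = None
        # assumed: uppercase consensus letter = dominant motif residue
        result.append((residue, shade, motif_residue.isupper()))
    return result


# Toy three-line alignment, made up for this example.
columns = classify_columns("aLvD", "aL+ ", "ALIK")
```

The resulting tuples can then be transferred to the T-COFFEE alignment by residue index, since gaps in the sequence line are skipped.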

INTERPRETATION:
Several conclusions can be drawn from such a presentation:

The position of PFAM motifs can be spotted directly on the alignment.

PFAM motifs are themselves derived from multiple alignments. If all sequences support a motif, one would expect most of its dominant residues to be aligned in the same manner in the T-COFFEE alignment. This is thus a (somewhat indirect) way to compare PFAM alignments and T-COFFEE alignments, and regions where the two alignments disagree should be investigated in more detail.

Sometimes, different PFAM motifs are found in the same T-COFFEE alignment, or only some sequences match a motif (for the given E-value cut-off). This again indicates regions of the alignment that should be scrutinized.

Consistency score

The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences.
Ideally, the higher the score, the more biologically relevant the multiple alignment. In the simplest scheme, the overall consistency score is equal to the number of pairs of residues present in the multiple alignment that are also found in the library, divided by the total number of pairs observed in the multiple sequence alignment. This measure gives an overall score between 0 and 1. The maximum value a multiple alignment can reach depends on the library. For the optimal score to be 1, all the alignments in the library need to be compatible with one another (e.g. when all the pairwise alignments have been extracted from the same multiple sequence alignment, or when the sequences are almost identical).
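
The simplest scheme described above can be written down directly. This is a minimal sketch of that unweighted ratio only, not the scoring code used by T-COFFEE; the function names, the dictionary representation of alignments, and the toy data are assumptions made for the example.

```python
from itertools import combinations


def aligned_pairs(alignment, gap_chars="-.~"):
    """Set of residue pairs placed in the same column of an alignment.

    alignment: {name: aligned sequence}. A residue is identified by
    (name, index in the ungapped sequence).
    """
    names = sorted(alignment)  # fixed order so pairs compare equal across alignments
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must have equal length")
    position = {name: 0 for name in names}
    pairs = set()
    for column in range(lengths.pop() if lengths else 0):
        present = []
        for name in names:
            if alignment[name][column] not in gap_chars:
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def consistency_score(msa, library_alignments):
    """Pairs of the MSA also found in the library / all pairs of the MSA."""
    msa_pairs = aligned_pairs(msa)
    library = set()
    for pairwise in library_alignments:
        library |= aligned_pairs(pairwise)
    if not msa_pairs:
        return 0.0
    return len(msa_pairs & library) / len(msa_pairs)


# Toy data, made up for this example.
msa = {"a": "AC-D", "b": "ACGD"}
score_same = consistency_score(msa, [msa])
score_other = consistency_score(msa, [{"a": "ACD-", "b": "ACGD"}])
```

When the library is extracted from the multiple alignment itself, every pair is found and the score is 1; in the second case only two of the three aligned pairs appear in the library.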

Type        Residues
charged     KRDE
polar       NQST
aliphatic   ILMV
aromatic    FYW
others      APCGH
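
For scripts that need this classification, the table can be turned into a lookup. This is a small convenience sketch; the names `RESIDUE_TYPES`, `TYPE_OF` and `residue_type` are illustrative.

```python
# Residue classes exactly as listed in the table above.
RESIDUE_TYPES = {
    "charged": "KRDE",
    "polar": "NQST",
    "aliphatic": "ILMV",
    "aromatic": "FYW",
    "others": "APCGH",
}

# Reverse mapping: one-letter residue code -> type name.
TYPE_OF = {
    residue: type_name
    for type_name, residues in RESIDUE_TYPES.items()
    for residue in residues
}


def residue_type(residue):
    """Return the type of a one-letter residue code, or None if unknown."""
    return TYPE_OF.get(residue.upper())
```

Gap symbols and any letter outside the twenty listed residues return None.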