**1515-1524**

**Cédric **** Notredame* **

** **

** ABSTRACT **

** **

** We describe a new approach to multiple sequence alignment using
genetic algorithms and an associated software package called SAGA. The
method involves evolving a population of alignments in a quasi evolutionary
manner and gradually improving the fitness of the population as measured
by an objective function which measures multiple alignment quality. SAGA
uses an automatic scheduling scheme to control the usage of 22 different
operators for combining alignments or mutating them between generations.
When used to optimise the well known sums of pairs objective function,
SAGA performs better than some of the widely used alternative packages.
This is seen with respect to the ability to achieve an optimal solution
and with regard to the accuracy of alignment by comparison with reference
alignments based on sequences of known tertiary structure. The general
attraction of the approach is the ability to optimise any objective function
that one can invent. ** ** **

The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Multiple alignments are used to help predict the secondary or tertiary structure of new sequences; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families; to suggest primers for PCR and as an essential prelude to phylogenetic reconstruction. The great majority of automatic multiple alignments are now carried out using the `progressive' of Feng and Doolittle ( 1 ) or variations on it ( 2 - 4 ). This approach has the great advantage of speed and simplicity combined with reasonable sensitivity as judged by the ability to align sets of sequences of known tertiary structure. The main disadvantage of this approach is the `local minimum' problem which stems from the greedy nature of the algorithm. This means that if any mistakes are made in any intermediate alignments, these cannot be corrected later as more sequences are added to the alignment. Further, there is no objective function (a measure of overall alignment quality) which can be used to say that one alignment is preferable to another or to say that the best possible alignment, given a set of parameters, has been found.

There are two main alternatives to progressive alignment. One is to use hidden Markov models (HMMs; 5 ) which attempt to simultaneously find an alignment and a probability model of substitutions, insertions and deletions which is most self consistent. Currently, this approach is limited, in practice, to cases with very many sequences (e.g. 100 or more) but does have the great advantage of a sound link with probability analysis. A second approach is to use objective functions (OFs) which measure multiple alignment quality and to find the best scoring alignment. If the OF is well chosen or is an accurate measure of quality, then this approach has the advantage that one can be confident that the resulting alignment really is the best by some criterion. Unfortunately, the number of possible alignments which must be scored in order to choose the best one becomes astronomical for more than four or five sequences of reasonable length.

Two solutions to this problem exist. The MSA program ( 6 , 7 ) attempts to narrow down the solution space to a relatively small area where the best alignment is likely to be. It then guarantees finding the best alignment in this reduced space. Even with this reduction, it is limited to small examples of around seven or eight sequences at most. Nonetheless, it is the only method we know of that seems capable of finding the globally optimal alignment or close to it, starting with completely unaligned sequences. A second approach is to use stochastic optimisation methods such as simulated annealing ( 8 ), Gibbs sampling ( 9 ) or genetic algorithms (GAs; 10 ). Simulated annealing has been used on numerous occasions for multiple alignment (e.g. 11 - 13 ) but can be very slow and usually only works well as an alignment improver i.e. when the method is given an alignment that is already close to optimal and is not trapped in a local minimum. Gibbs sampling has been very successfully applied to the problem of finding the best local multiple alignment block with no gaps but its application to gapped multiple alignment is not trivial. Finally, we know of one attempt at using GAs in this context ( 14 ). Here they used a hybrid iterative dynamic programming/GA scheme.

In this paper, we describe a GA strategy and software package called
SAGA (sequence alignment by genetic algorithm) which appears capable of
finding globally optimal multiple alignments (or close to it) in reasonable
time, starting from completely unaligned sequences. It can find solutions
that are as good as or better than either MSA or CLUSTAL W ( 3
) as measured by the OF score or by reference to alignments of sequences
of known tertiary structure. The approach has a further advantage in that
it can be used to optimise any OF one can invent. Biologically, the key
to successful application of optimisation methods to this problem, depends
critically on the OF. If the OF is not a good descriptor of multiple alignment
quality, then the alignments will not necessarily be best in any real sense.
The search for useful OFs for sequence alignment, perhaps for different
purposes, is surely a key area of research. Without SAGA, however, it is
difficult to consider most new OFs as one cannot optimise them.

The overall approach is to use a measure of multiple alignment quality
(an OF) and to optimise it using a genetic algorithm. A set of well known
test cases is used as a reference to evaluate the efficiency of the optimisation.

Evaluation of the alignments is made using an OF which is simply a measure of multiple alignment quality. We use two OFs related to the weighted sums of pairs with affine gap penalties ( 15 ). The principle is to give a cost to each pair of aligned residues in each column of the alignment (substitution cost), and another cost to the gaps (gap cost). These are added to give the global cost of the alignment. Furthermore, each pair of sequences is given a weight related to their similarity to the other pairs. Variations involve: (i) using different sets of sequence weights; (ii) different sets of costs for the substitutions [e.g. PAM matrices ( 16 ) or BLOSUM tables ( 17 )]; (iii) different schemes for the scoring of gaps ( 18 ). The cost of a multiple alignment (A) is then: A L I G N M E N T ^ C O S T ( A ) = {sum from {i = 2} to N} {sum from {j = 1} to {i - l}} {W sub {i , j}} " " C O S T {{( A} sub i} , {A sub j} )

where COST is the alignment score between two aligned sequences (A
_{ i } and A _{ j } ) and W _{ i,j } is their weight.
The COST function includes gap opening and extension penalties for opening
and extending gaps. Altschul ( 18
) made an extensive review describing the different ways of scoring
gaps in a multiple alignment. Two different methods were used in SAGA:
(i) natural affine gap penalties and (ii) quasi-natural affine gap penalties.
These methods differ in how they treat nested gaps, i.e. a gap in one sequence
that is completely contained in the second. In both cases, positions where
both sequences have a null are removed. With the natural gap penalties,
gap opening and extension penalties are charged for each remaining gap.
With the quasi-natural gap penalties, an additional gap opening penalty
is charged for any gap in one sequence that starts after and ends before
a gap in the second sequence (before the columns of null are removed).
Terminal gaps are penalised for extension but not for opening.

Sequence weights are an attempt to minimise redundant information,
based on the relatedness of the sequences. In MSA, a weight for every pair
of sequences is derived from a phylogenetic tree connecting the sequences.
In CLUSTAL W ( 20
), a weight is calculated for each sequence and the pair weight (W
_{ i,j } ) for two sequences is simply their product. These weights
differ in detail although both are designed for a similar purpose.

In this study we give results for the optimisation of two OFs: (i)
OF1 weighted sums of pairs using the pam250 weight matrix with quasi-natural
gap penalties and MSA, rationale 2, weights ( 19
). This is the function optimised by MSA. (ii) OF2 weighted sums of
pairs using the pam250 weight matrix with natural gap penalties and CLUSTAL
W weights ( 20
).

To align protein sequences, we designed a multiple sequence alignment
method called SAGA. SAGA is derived from the simple genetic algorithm described
by Goldberg ( 21
). It involves using a population of solutions which evolve by means of
natural selection. The overall structure of SAGA is shown in Figure 1
. The population we consider is made of alignments. Initially, a generation
zero (G _{ 0 } ) is randomly created. The size of the population
is kept constant. To go from one generation to the next, children are derived
from parents that are chosen by some kind of natural selection, based on
their fitness as measured by the OF (i.e. the better the parent, the more
children it will have). To create a child, an operator is selected that
can be a crossover (mixing the contents of the two parents) or a mutation
(modifying a single parent). Each operator has a probability of being chosen
that is dynamically optimised during the run.

**Figure **1

These steps are repeated iteratively, generation after generation. During these cycles, new pieces of alignment appear because of the mutations and are combined by the crossovers. The selection makes sure that the good pieces survive and the dynamic setting of the operators helps the population to improve by creating the children it needs.

Following this simple process, the fitness of the population is increased until no more improvement can be made . All these steps, shown in Figure 1 , can be summarised by the following pseudo-code:

Initialisation 1. create G _{ 0 }
Evaluation 2. evaluate the population of generation
n (G _{ n } ) 3. if the population is
stabilised then END 4. select the individuals
to replace 5. evaluate the expected offspring
(EO) Breeding 6. select the parent(s) from G
_{ n } 7. select the operator
8. generate the new child 9. keep or discard
the new child in G _{ n+1 } 10. goto
6 until all the children have been success- fully
put into G _{ n+1 } 11. n = n+1
12. goto EVALUATION End 13. end * Initialisation
* . The first step of the algorithm (Fig. 1
a) is the creation of a random population. This generation zero consists
of a set of alignments containing only terminal gaps. A population size
of 100 was used in all of the results presented here. To create one of
these alignments, a random offset is chosen for all the sequences (the
typical range being from 0 to 50 for sequences 200 residues long) and each
sequence is moved to the right, according to its offset.The sequences are
then padded with null signs in order to have the same length, L. The alignments
of generation zero will be the parents of the children used to populate
generation one. * Evaluation. * To give birth to a new generation,
the first step is the evaluation of the fitness of each individual. This
fitness is assessed by scoring each alignment according to the OF. The
better the alignment, the better its score, and thus the higher its fitness.
If the purpose is to minimise the OF, as is the case for OF1 and OF2, then
the scores are inverted to give the fitness. The expected offspring (EO)
of an alignment is derived from the fitness. It is typically a small integer.
The method we used to derive it is known as remainder stochastic sampling
without replacement ( 22
). In the case of OF1 and OF2 the typical values of the EO are between
0 and 2, which can be considered as an acceptable range ( 21
).

Only a portion (e.g. 50%) of the population is to be replaced during
each generation. This technique, known as overlapping generation ( 23
), means that half of the alignments will survive unchanged, the other
half will be replaced by the children. We chose in SAGA to keep only the
best individuals, and to replace the others. In practice, all the individuals
are ranked according to their fitness, and the weakest are replaced by
new children while creating the generation n+1 from the generation n. The
other individuals (the fittest) will simply survive as they are during
the breeding. * Breeding. * First, the new generation is directly
filled with the fittest individuals from the previous generation (typically
50%). Next, the remaining 50% of the individuals in the new generation
are created by selecting parents and modifying them. During the breeding,
the EO is used as a probability for each individual to be chosen as a parent.
A wheel is spun where each potential parent has a number of slots equal
to its EO. When an individual is chosen to be a parent, its EO is accordingly
decreased before the next turn of selection (selection without replacement).
This weighted wheel selection is carried on until all the parents have
been chosen.

To modify the parent(s), an operator has to be chosen. An operator is a small program that will modify an alignment (e.g. shuffle the gaps or merge two alignments into a new one). We have designed several operators. Each of them has a specific probability of being used. To create a child, one operator is chosen according to this probability (by spinning another weighted wheel). The chosen operator is then applied to the chosen parent(s). Some operators require two parents, others require only one.

An important aspect of the SAGA population structure is the constraint
we put on the absence of duplicates. In the same generation, all the alignments
have to be different. This technique helps maintain a high level of diversity
in a population of small size ( 24
). To do so, each newborn child is checked to ensure it is not identical
to any of the children already generated. If it is not, it will be put
into the new generation. Otherwise, it will simply be discarded along with
its parent(s) and the operator, in order to avoid deadlock problems. This
process is carried on until enough children (e.g. 50% of the population)
have been successfully inserted in the new generation. The Evaluation/Breeding
process will be carried on until the decision is made to stop the search.
* End. * There is no valid proof that a GA must reach the optimum,
even in an infinite amount of time, as there is for Simulated Annealing
( 25 ). Thus
the decision to stop the search has to be an arbitrary choice using more
or less sophisticated heuristic criteria. We use stabilisation as a criterion:
SAGA is stopped when the search has been unable to improve for some specified
number of generations (typically 100). This condition is the most widely
used when working on a population with no duplicates ( 26
).

According to the traditional nomenclature of genetic algorithms ( 21 ), two types of operators are represented in SAGA: the crossovers and the mutations. These programs perform modifications (mutation) or merging of parent alignments (crossover). In SAGA we do not make any distinction between these two types with regard to how we apply them. They are designed as independent programs that input one or two alignments (the parents) and output one alignment (the child). Each operator requires one or more parameters which specify where the operation is to be carried out. For example, an operator which inserts a new gap must be told where (at which position in the alignment) and in which sequences the gap is to be inserted.

The parameters of an operator may be chosen completely randomly in
some range in which case the operator is said to be used in a stochastic
manner. Alternatively, all except one of the parameters may be chosen randomly
and the value of the remaining parameter will be fixed by exhaustive examination
of all possible values. The value which yields the optimal fitness will
be used. When an operator is applied in this way, it is said to be used
in * * semi-hill climbing mode. Most of the SAGA operators may be
used in either way. * The crossovers. * Crossovers are responsible
for combining two different alignments into a new one. We implemented two
different types of crossover: one-point and uniform. The one point crossover
combines two parent alignments through a single exchange. Figure 2
outlines this mechanism. The first parent is cut straight at some
randomly chosen position and the second one is tailored so that the right
and the left pieces of each parent can be joined together while keeping
the original sequence of amino acids. Any void space that appears at the
junction point is filled with null signs. Because of the specificity of
this junction point, where rearrangements can occur, this operator combines
both the traditional properties of a crossover and those of a local rearrangement
mutation. Only the best of the two children produced that way is kept.

**Figure **2

This one point crossover can be very disruptive, especially at the junction point. To avoid this drawback, we added a second operator: the uniform crossover, designed to promote multiple exchanges between two parents in a more subtle manner. This operator is based on an analogy with biological crossover: exchanges are promoted between zones of homology.

The first step consists of mapping the alignment positions that are consistent between the two parents. In an alignment, a position is a column of residues or nulls stacked on top of each other. Two positions are said to be consistent between two alignments, if in each line they contain the same residue (by reference to the original sequence) or a null coming from the same gap (i.e. between the same residues). For instance, if in one line of a given position we have ALA125 and at the same line of a position in the other alignment we have ALA101 then the two position are not consistent. This process is outlined in Figure 3 . Blocks between consistent positions can be directly swapped. One can do so in a semi-hill climbing way, if only the best combination of blocks is chosen, or in a stochastic way, if the block to place between two consistent positions is randomly chosen between the two alignments. Both uniform crossovers, the semi-hill climbing one and the stochastic one, are implemented in SAGA.

**Figure **3

This operator extends alignments by inserting gaps. Its mechanism is detailed in Figure 4 . To keep the sequences aligned, each sequence will get a gap insertion of the same size. The sequences are split into two groups. Within each group, all the sequences get the insertion at the same position. The two groups are chosen, based on an underlying estimated phylogenetic tree between the sequences. The tree is randomly split into two sub-trees (Fig. 4 a). Each group consists of all the sequences in one of the two sub-trees. For one of the groups, a position is randomly chosen (Fig. 4 b). A gap of randomly chosen length is then inserted in each of the sequences of the group at the same position (P1 in Fig. 4 b). A gap of the same length is also inserted into all of the sequences of the second group at some position within a maximum distance from the first gap insertion (P2 in Fig. 4 b). This is the stochastic version of the block insertion operator.

**Figure **4

The semi-hill climbing version of this operator is similar to the stochastic one described above but in this case, all the parameters except P1 (the position of the insertion in the first group of sequences) are chosen randomly and all possible values of P1 are tested. The value of P1 that gives the best scoring alignment is chosen.

In general, it is dangerous to assume that the topology of the underlying
tree is correct. In the current usage, the main effect of an incorrect
tree topology will be to slow the program down. The ability to find the
globally optimal alignment should not be changed, just the speed at which
the solution will be found. When two groups are chosen, using the tree,
one of the groups can consist of a single sequence. This means that, eventually,
all possible arrangements of gaps can be found, even if the tree topology
is completely wrong. Ideally one would use fuzzy groupings based on the
tree but which allows alternative groupings. * Block shuffling. *
Generating an optimal arrangement after a gap insertion can often be a
matter of shifting a gap to the left or to the right. Therefore we designed
an operator that moves blocks of gaps or residues (but not both together)
inside an alignment. Here, we depart from the usual definition of a block
as a section of alignment containing no gaps, with all of the sub-sequences
having the same length ( 27
). For the purposes of this operator, we define a block of residues
to be a set of overlapping stretches of residues from one or more sequences,
each stretch being delimited by a gap or an end of a sequence. Each sub-sequence
can be a different length but all sub-sequences must overlap. Similarly,
a block of gaps is a set of overlapping gaps. An example of each is given
in Figure 5 a.
A block is chosen by first selecting one residue or gap position from the
alignment and then deriving the block to which it belongs. These can be
moved inside the alignment, to generate new configurations. Figure 5
b, c and d show some types of move that can be made inside an alignment.
These moves are an extension of those proposed for a simulated annealing
approach described by ( 12
). The limits of this move are contained in the alignment itself.
A gap can only be shifted until it merges with another gap. Similarly,
a stretch of residues can only be shifted until it merges with another
stretch of residues. We can enumerate the different ways these operations
may be used as follows:

**Figure **5

- Move a full block of gaps or a full block of residues. (Fig. 5 b). - Split the block horizontally and move one of the sub blocks to the left or to the right. The subdivision of a block is made according to the tree (cf. gap insertion operator) (Fig. 5 c). - Split the block vertically and move one half to the left or to the right (Fig. 5 d). - The move can be made in a semi-hill climbing way, looking for the best position, or in a stochastic manner.

These different combinations lead to a total of 16 possible operators,
designed to shuffle gaps, in all possible directions. All sixteen operators
are implemented in SAGA. * Block searching. * A set of operators including
crossovers, gap insertion and block shuffling, is theoretically able to
create any arrangement needed for the correct alignment, but it is also
bound to lose a lot of time, trying to generate some configurations that
a simple heuristic would easily find.

Therefore, we designed a crude method that, given a substring in one of the sequences, tries to find in the alignment, the block to which it may belong. Here we define a block as a short section of alignment without any gaps ( 27 ). First, we select a substring of random length at a random position in one of the sequences. Then, all substrings of the same length in all of the other sequences are compared with the initial substring and the best matching one is selected. This new substring is added to the first one, in order to form a small profile ( 31 ). Then, in the remaining sequences, the best match is located and added to the profile. The process goes on iteratively until a match has been identified in all the sequences. The sequences are then moved to reconstruct the block inside the alignment. This method does not depend on the underlying phylogenetic tree or on the order of the sequences.

The initial substring is randomly chosen (typical length 5-15 residues).
The block searching is not performed on the whole alignment, but only in
a section tailored randomly around the position of the initial substring
(typical size between 50 and 150 alignment positions). The ultimate rearrangement
occurs inside that section only. This precaution is taken in order to minimise
the side effect that could be caused by the existence of repeated motifs
inside some of the sequences. This block searching mutation generates more
dramatic changes than any of the other operators. * Local optimal or
sub-optimal rearrangement. * Some situations remain where the presence
of a very stable local minimum makes it quite difficult for the other operators
to generate the optimal configuration. In order to overcome this problem,
we designed our last operator. It attempts to optimise the pattern of gaps
inside a given block. This is done in two ways: (i) by exhaustive examination
of all gap arrangements inside the block or (ii) by a local alignment GA
(LAGA).

The exhaustive examination is carried out if it requires less than
a specified number of combinations to examine (typically 2000). Otherwise,
LAGA is used. LAGA is a crude version of the simple genetic algorithm described
by Golberg ( 21
). It uses only one point crossovers and the bloc shuffling operator. LAGA
is typically run for a number of generations equal to 10-fold the number
of sequences and with a population size of 20.

The 16 block shuffling operators, the two types of crossover, the block searching, the gap insertion and the local rearrangement operator, make a total of 22 operators (uniform crossovers and gap insertion may be used in a stochastic or semi hill climbing way). During initialisation of the program, all the operators have the same probability of being used, equal to 1/22. There is no guarantee that these probabilities are optimal. Even if they were for the first stages of the run, they could become inadequate in later stages. They could also be test case specific. How to schedule the different operators in a general way that will be efficient in many situations is a difficult problem. In fact, the more operators one has, the more difficult it becomes. We implemented an automatic procedure that deals with this problem and allows us to easily add or remove operators without any need for retuning.

Dynamic schedules, optimised on the run, are an elegant solution to this problem, that was proposed by Davis ( 28 ). In this model, an operator has a probability of being used that is a function of the efficiency it has recently (e.g. 10 last generations) displayed at improving alignments. The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement on its parents, the operator that is directly responsible for its creation gets the largest part of the credit (e.g. 50%). Then the operator(s) responsible for the creation of the parents also get their share of the remaining credit (50% of the remaining credit, i.e. 25% of the original credit), and so on. This report of the credit goes on for some specified number of generations (e.g. 4).

After a given number of generations (e.g. 10) these results are summarised
for each of the operators. The credit of an operator is equal to its total
credit divided by the number of children it generated. This value is taken
as usage probability and will remain unchanged until the next assessment,
10 generations later. To avoid the early loss of some operators that may
become useful later on, all the operators are assigned a minimum probability
of being used (the same for all them, typically equal to half their original
probability i.e. 1/44).

Experience shows that, while monitoring the search, areas containing
gaps are those that are most likely to change during a run. For this reason
we found it useful to bias the choice of the mutation site by some probability
related to the concentration of gaps in an area. This bias is moderated
in order to avoid local minimum problems but it greatly helps the algorithm.
Typically, in the middle of a run, the probability of hitting a position
containing a gap is twice the probability of hitting a position without
gaps.

We used a set of 13 test cases based mainly on alignments of sequences of known tertiary structure. Twelve were chosen from the Pascarella structural alignment data base ( 29 ) and one of chymotrypsin sequences from ( 6 , 12 ). We chose test cases of varying length (60-280 residues) and various numbers of sequences (4-32).

The test cases were divided into two groups. The first group (nine test cases) is made of small alignments (4-8 sequences, and 60-280 residues long) that can be handled by MSA. Because they can be computed by MSA, they allow us to asses SAGAs ability to minimise the MSA OF. The second group (four test cases) is made of larger alignments (9, 12, 15 and 32 sequences). Three of them are only extended versions of some of the small test cases, the fourth contains 32 sequences of immunoglobulins (for details see Tables 1 and 2 ). These test cases cannot be handled by MSA, and are designed to show the ability of SAGA to perform multiple sequence alignments of realistic size. We analysed the results by comparing the scores obtained by MSA and SAGA using OF1 and CLUSTAL W and SAGA using OF2.

To analyse the similarity between the structural alignments and those
obtained by one of these three programs, we use a measure of consistency
between two alignments. This measure gives the percentage of residues that
are aligned in a similar manner in the two alignments. It allows us to
measure the level of sequential consistency between computed alignments
generated by SAGA, MSA and CLUSTAL W and the reference structural alignments.

SAGA was written in ANSI C and was implemented on an open VMS system.
Memory requirements are low, the main usage being to store the separate
alignments in the population. For 10 sequences with an average alignment
length of 200 and a population size of 100, ~1 Mb of memory is sufficient.
The source code is available free of charge from the authors; please send
an e-mail message to Cedric.Notredame@EBI.ac.uk .

We analysed three aspects of SAGA in detail. As the robustness of our optimisation strategy depends on the dynamic operator setting, we checked its behaviour on various test cases. In order to show that SAGA was able to perform a rigorous optimisation, we used the first group of nine test cases, for which, thanks to MSA, a mathematically optimal, or sub-optimal solution is known for OF1. We verified that SAGA was able to find a solution at least as good. Then, using the second set of four test cases, we analysed the ability of SAGA to perform a multiple alignment on sequences that could not be aligned by MSA. We compared these results with those given by CLUSTAL W on the same test cases. With these two sets of experiments, we also tried to assess the biological relevance of the alignments produced by SAGA by reference to the structural alignments.

Table 1 **
** The performance of MSA and SAGA on nine test cases

Test case | Nseq | Length | MSA | MSA versus | CPU-time | SAGA | SAGA versus | CPU-time |

score | structure (%) | score | structure (%) | |||||

Cyt c | 6 | 129 | 1 051 257 | 74.26 | 7 | 1 051 257 | 74.26 | 960 |

Gcr | 8 | 60 | 371 875 | 75.05 | 3 | 371 650 | 82.00 | 75 |

Ac protease | 5 | 183 | 379 997 | 80.10 | 13 | 379 997 | 80.10 | 331 |

S protease | 6 | 280 | 574 884 | 91.00 | 184 | 574 884 | 91.00 | 3500 |

Chtp | 6 | 247 | 111 924 | * | 4525 | 111 579 | * | 3542 |

Dfr secstr | 4 | 189 | 171 979 | 82.03 | 5 | 171 975 | 82.50 | 411 |

Sbt | 4 | 296 | 271 747 | 80.10 | 7 | 271 747 | 80.10 | 210 |

Globin | 7 | 167 | 659 036 | 94.40 | 7 | 659 036 | 94.40 | 330 |

Plasto | 5 | 132 | 236 343 | 54.03 | 22 | 236 195 | 54.05 | 510 |

* Cyt c * : 451c, 1ccr, 1cyc, 5cyt, 3c2c, 155c; * Gcr *
: 2gcr, 2gcr-2, 2gcr-3, 2gcr-4, 1gcr, 1gcr-2, 1gcr-3, 1gcr-4; * Ac protease
* : 1cms, 4ape, 3app, 2apr, 4pep; * S.protease * : 1ton, 2pka,
2ptn, 4cha, 3est, 3rp2; * Dfr secstr * : 1dhf, 3dfr, 4dfr, 8dfr; *
Chtp * : 3rp2, M13143 (EMBL accession number), 1gmh, 2tga, 1est, 1sgt;
* Sbt * : 1cse, 1sbt, 1tec, 2prk; * Globin * : 4hhb-2, 2mhb-2,
4hhb, 2mhb, 1mbd, 2lhb, 2lh1; * Plasto * : 7pcy, 2paz, 1pcy, 1azu,
2aza.

Table 2 **
** The performance of CLUSTAL W and SAGA on four test cases

Test case | Nseq | Length | CLUSTAL W | CLUSTAL W | CPU-time | SAGA | SAGA versus | CPU-time |

score | versus structure (%) | score | structure (%) | |||||

Igb | 32 | 144 | 31 812 824 | 55.86 | 60 | 31 417 736 | 55.97 | 41 135 |

Ac Protease2 | 10 | 186 | 10 514 101 | 41.02 | 16 | 10 393 145 | 43.50 | 12 236 |

S Protease2 | 12 | 281 | 16 354 800 | 64.37 | 21 | 16 282 179 | 66.18 | 20 537 |

Globin2 | 12 | 171 | 5 249 682 | 94.90 | 18 | 5 233 058 | 94.01 | 2538 |

* Igb * : 2fb4, 2fb4-2, 2fb4-3,2fb4-4, 2fbj,
2fbj-2, 2fbj-3, 2fbj-4, 1fc2, 1fc2-2, 1mcp, 1mcp-2, 1pcf, 1rei, 2rhe, 3fab,
3fab-2, 3fab-3, 3fab-4, 2hfl, 2hfl-2, 2hfl-3, 1fl9, 1fl9-2, 1fl9-3, 1fl9-4,
1cd4, 3hla, 3hla-2, 4fab, 3hfm, 1mcw; * Ac protease2 * : 1cms, 4ape,
3app, 2apr, 4pep, 1cms-2, 4ape-2, 3app-2, 2apr-2, 4pep-2; * S protease2
* : 1ton, 2pka, 2ptn, 2trm, 4cha, 3est, 1hne, 3rp2, 1sgt, 2sga, 3sgb,
2alp; * Globin2: * 4hhb, 4hhb-2, 2mhb, 2mhb-2, 1fdh, 1mbd, 1mbs, 2lhb,
1eca, 2lh1, 1pmb, 1mba.

SAGA was run on all the test cases, and the schedules for all the operators were plotted. Figure 6 a and b present some of these results. Figure 6 a shows that the probabilities of being used of the two types of crossover (one point and uniform) evolve according to different schedules. In the early stages, the young population is very heterogeneous and lacks consistency. Thus the uniform crossover can hardly be used. Later in the run, when some order has been created, the uniform crossover can be applied more easily. It then gradually replaces the one point crossover. This graph clearly shows that the two types of crossover are competing with each other, although no extra information regarding the type of the operator is given to the algorithm. All the operators were analysed in the same way, in order to verify that they were needed (data not shown).