ABSTRACT
We describe a new approach to multiple sequence alignment using
genetic algorithms and an associated software package called SAGA. The
method involves evolving a population of alignments in a quasi-evolutionary manner and gradually improving the fitness of the population, as measured by an objective function of multiple alignment quality. SAGA
uses an automatic scheduling scheme to control the usage of 22 different
operators for combining alignments or mutating them between generations.
When used to optimise the well-known sums of pairs objective function,
SAGA performs better than some of the widely used alternative packages.
This is seen with respect to the ability to achieve an optimal solution
and with regard to the accuracy of alignment by comparison with reference
alignments based on sequences of known tertiary structure. The general
attraction of the approach is the ability to optimise any objective function
that one can invent.
The simultaneous alignment of many nucleic acid or amino acid sequences is one of the most commonly used techniques in sequence analysis. Multiple alignments are used to help predict the secondary or tertiary structure of new sequences; to help demonstrate homology between new sequences and existing families; to help find diagnostic patterns for families; to suggest primers for PCR; and as an essential prelude to phylogenetic reconstruction. The great majority of automatic multiple alignments are now carried out using the `progressive' approach of Feng and Doolittle ( 1 ) or variations on it ( 2 - 4 ). This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, as judged by the ability to align sets of sequences of known tertiary structure. The main disadvantage of this approach is the `local minimum' problem which stems from the greedy nature of the algorithm. This means that if any mistakes are made in any intermediate alignments, these cannot be corrected later as more sequences are added to the alignment. Further, there is no objective function (a measure of overall alignment quality) which can be used to say that one alignment is preferable to another or to say that the best possible alignment, given a set of parameters, has been found.
There are two main alternatives to progressive alignment. One is to use hidden Markov models (HMMs; 5 ) which attempt to simultaneously find an alignment and a probability model of substitutions, insertions and deletions that is most self-consistent. Currently, this approach is limited, in practice, to cases with very many sequences (e.g. 100 or more) but does have the great advantage of a sound link with probability analysis. A second approach is to use objective functions (OFs) which measure multiple alignment quality and to find the best scoring alignment. If the OF is well chosen or is an accurate measure of quality, then this approach has the advantage that one can be confident that the resulting alignment really is the best by some criterion. Unfortunately, the number of possible alignments which must be scored in order to choose the best one becomes astronomical for more than four or five sequences of reasonable length.
Two solutions to this problem exist. The MSA program ( 6 , 7 ) attempts to narrow down the solution space to a relatively small area where the best alignment is likely to be. It then guarantees finding the best alignment in this reduced space. Even with this reduction, it is limited to small examples of around seven or eight sequences at most. Nonetheless, it is the only method we know of that seems capable of finding the globally optimal alignment or close to it, starting with completely unaligned sequences. A second approach is to use stochastic optimisation methods such as simulated annealing ( 8 ), Gibbs sampling ( 9 ) or genetic algorithms (GAs; 10 ). Simulated annealing has been used on numerous occasions for multiple alignment (e.g. 11 - 13 ) but can be very slow and usually only works well as an alignment improver, i.e. when the method is given an alignment that is already close to optimal and is not trapped in a local minimum. Gibbs sampling has been very successfully applied to the problem of finding the best local multiple alignment block with no gaps, but its application to gapped multiple alignment is not trivial. Finally, we know of one attempt at using GAs in this context ( 14 ), in which a hybrid iterative dynamic programming/GA scheme was used.
In this paper, we describe a GA strategy and software package called
SAGA (sequence alignment by genetic algorithm) which appears capable of
finding globally optimal multiple alignments (or close to it) in reasonable
time, starting from completely unaligned sequences. It can find solutions
that are as good as or better than either MSA or CLUSTAL W ( 3
) as measured by the OF score or by reference to alignments of sequences
of known tertiary structure. The approach has a further advantage in that
it can be used to optimise any OF one can invent. Biologically, the successful application of optimisation methods to this problem depends critically on the OF. If the OF is not a good descriptor of multiple alignment quality, then the alignments will not necessarily be best in any real sense.
The search for useful OFs for sequence alignment, perhaps for different
purposes, is surely a key area of research. Without SAGA, however, it is
difficult to consider most new OFs as one cannot optimise them.
The overall approach is to use a measure of multiple alignment quality
(an OF) and to optimise it using a genetic algorithm. A set of well known
test cases is used as a reference to evaluate the efficiency of the optimisation.
Evaluation of the alignments is made using an OF which is simply a measure of multiple alignment quality. We use two OFs related to the weighted sums of pairs with affine gap penalties ( 15 ). The principle is to give a cost to each pair of aligned residues in each column of the alignment (substitution cost), and another cost to the gaps (gap cost). These are added to give the global cost of the alignment. Furthermore, each pair of sequences is given a weight related to their similarity to the other pairs. Variations involve: (i) using different sets of sequence weights; (ii) different sets of costs for the substitutions [e.g. PAM matrices ( 16 ) or BLOSUM tables ( 17 )]; (iii) different schemes for the scoring of gaps ( 18 ). The cost of a multiple alignment (A) is then:

\[ \mathrm{ALIGNMENT\ COST}(A) \;=\; \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{i,j}\, \mathrm{COST}(A_i, A_j) \]
where COST is the alignment score between two aligned sequences (A_i and A_j) and W_i,j is their weight. The COST function includes gap opening and extension penalties. Altschul ( 18 ) made an extensive review describing the different ways of scoring gaps in a multiple alignment. Two different methods were used in SAGA: (i) natural affine gap penalties and (ii) quasi-natural affine gap penalties. These methods differ in how they treat nested gaps, i.e. a gap in one sequence that is completely contained within a gap in the second. In both cases, positions where both sequences have a null are removed. With the natural gap penalties, gap opening and extension penalties are charged for each remaining gap. With the quasi-natural gap penalties, an additional gap opening penalty is charged for any gap in one sequence that starts after and ends before a gap in the second sequence (before the columns of nulls are removed). Terminal gaps are penalised for extension but not for opening.
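To make the scoring concrete, here is a minimal Python sketch of the weighted sums-of-pairs cost with natural affine gap penalties. The helper names, the toy substitution function and the penalty values are illustrative assumptions rather than the exact constants used by SAGA; terminal gaps are charged for extension only, as described above.

```python
# Illustrative sketch of the weighted sums-of-pairs objective (assumed helper
# names and penalty values; not the exact SAGA implementation).

GAP = '-'

def pairwise_cost(a, b, sub, gap_open=10, gap_ext=1):
    """Cost of two aligned rows a, b using natural affine gap penalties.
    Columns where both rows contain a null are removed first."""
    cols = [(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP)]
    cost = 0
    for seq in (0, 1):
        row = [c[seq] for c in cols]
        i = 0
        while i < len(row):                    # locate runs of gaps in this row
            if row[i] == GAP:
                j = i
                while j < len(row) and row[j] == GAP:
                    j += 1
                terminal = (i == 0) or (j == len(row))
                cost += gap_ext * (j - i)      # extension is always charged
                if not terminal:
                    cost += gap_open           # opening charged for internal gaps only
                i = j
            else:
                i += 1
    # substitution costs for columns where both rows carry a residue
    cost += sum(sub(x, y) for x, y in cols if x != GAP and y != GAP)
    return cost

def sums_of_pairs(alignment, weights, sub):
    """ALIGNMENT_COST(A) = sum over i > j of W[i][j] * COST(A_i, A_j)."""
    n = len(alignment)
    return sum(weights[i][j] * pairwise_cost(alignment[i], alignment[j], sub)
               for i in range(1, n) for j in range(i))

# toy substitution cost: 0 for a match, 4 for a mismatch (a real run would use PAM250-derived costs)
toy_sub = lambda x, y: 0 if x == y else 4
aln = ["ACD-E", "AC-FE", "GCDFE"]
w = [[1.0] * 3 for _ in range(3)]
print(sums_of_pairs(aln, w, toy_sub))
```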
Sequence weights are an attempt to minimise redundant information, based on the relatedness of the sequences. In MSA, a weight for every pair of sequences is derived from a phylogenetic tree connecting the sequences. In CLUSTAL W ( 20 ), a weight is calculated for each sequence and the pair weight (W i,j ) for two sequences is simply their product. These weights differ in detail although both are designed for a similar purpose.
In this study we give results for the optimisation of two OFs: (i) OF1: weighted sums of pairs using the PAM250 weight matrix with quasi-natural gap penalties and MSA, rationale 2, weights ( 19 ). This is the function optimised by MSA. (ii) OF2: weighted sums of pairs using the PAM250 weight matrix with natural gap penalties and CLUSTAL W weights ( 20 ).
To align protein sequences, we designed a multiple sequence alignment method called SAGA. SAGA is derived from the simple genetic algorithm described by Goldberg ( 21 ). It involves using a population of solutions which evolve by means of natural selection. The overall structure of SAGA is shown in Figure 1 . The population we consider is made of alignments. Initially, a generation zero (G 0 ) is randomly created. The size of the population is kept constant. To go from one generation to the next, children are derived from parents that are chosen by some kind of natural selection, based on their fitness as measured by the OF (i.e. the better the parent, the more children it will have). To create a child, an operator is selected that can be a crossover (mixing the contents of the two parents) or a mutation (modifying a single parent). Each operator has a probability of being chosen that is dynamically optimised during the run.
[Figure 1: the overall structure of SAGA]
These steps are repeated iteratively, generation after generation. During these cycles, new pieces of alignment appear because of the mutations and are combined by the crossovers. The selection makes sure that the good pieces survive and the dynamic setting of the operators helps the population to improve by creating the children it needs.
Following this simple process, the fitness of the population is increased until no more improvement can be made . All these steps, shown in Figure 1 , can be summarised by the following pseudo-code:
Initialisation
1. create G0
Evaluation
2. evaluate the population of generation n (Gn)
3. if the population is stabilised then END
4. select the individuals to replace
5. evaluate the expected offspring (EO)
Breeding
6. select the parent(s) from Gn
7. select the operator
8. generate the new child
9. keep or discard the new child in Gn+1
10. goto 6 until all the children have been successfully put into Gn+1
11. n = n+1
12. goto EVALUATION
End
13. end

Initialisation. The first step of the algorithm (Fig. 1 a) is the creation of a random population. This generation zero consists of a set of alignments containing only terminal gaps. A population size of 100 was used in all of the results presented here. To create one of these alignments, a random offset is chosen for all the sequences (the typical range being from 0 to 50 for sequences 200 residues long) and each sequence is moved to the right, according to its offset. The sequences are then padded with null signs in order to have the same length, L. The alignments of generation zero will be the parents of the children used to populate generation one.

Evaluation. To give birth to a new generation, the first step is the evaluation of the fitness of each individual. This fitness is assessed by scoring each alignment according to the OF. The better the alignment, the better its score, and thus the higher its fitness. If the purpose is to minimise the OF, as is the case for OF1 and OF2, then the scores are inverted to give the fitness. The expected offspring (EO) of an alignment is derived from the fitness. It is typically a small integer. The method we used to derive it is known as remainder stochastic sampling without replacement ( 22 ). In the case of OF1 and OF2 the typical values of the EO are between 0 and 2, which can be considered as an acceptable range ( 21 ).
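As an illustration of how the expected offspring could be derived, the following sketch implements remainder stochastic sampling without replacement; the fitness inversion (1/score) and the function names are assumptions for the example, not SAGA's exact code.

```python
import math
import random

def expected_offspring(scores):
    """Remainder stochastic sampling without replacement (sketch).
    scores: OF values to be minimised; lower is better, so they are
    inverted to obtain fitnesses before normalisation."""
    fitnesses = [1.0 / s for s in scores]            # assumed inversion scheme
    mean_fit = sum(fitnesses) / len(fitnesses)
    eo = []
    for f in fitnesses:
        ratio = f / mean_fit                         # expected number of children
        count = math.floor(ratio)
        # the fractional part grants at most one extra child, sampled once
        if random.random() < ratio - count:
            count += 1
        eo.append(count)
    # a real implementation would also adjust the totals so that exactly the
    # required number of children is produced
    return eo

random.seed(0)
print(expected_offspring([1051257, 371650, 574884, 236195]))
```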
Only a portion (e.g. 50%) of the population is to be replaced during each generation. This technique, known as overlapping generations ( 23 ), means that half of the alignments will survive unchanged, while the other half will be replaced by the children. We chose in SAGA to keep only the best individuals, and to replace the others. In practice, all the individuals are ranked according to their fitness, and the weakest are replaced by new children when creating generation n+1 from generation n. The other individuals (the fittest) will simply survive as they are during the breeding.

Breeding. First, the new generation is directly filled with the fittest individuals from the previous generation (typically 50%). Next, the remaining 50% of the individuals in the new generation are created by selecting parents and modifying them. During the breeding, the EO is used as a probability for each individual to be chosen as a parent. A wheel is spun where each potential parent has a number of slots equal to its EO. When an individual is chosen to be a parent, its EO is accordingly decreased before the next turn of selection (selection without replacement). This weighted wheel selection is carried on until all the parents have been chosen.
To modify the parent(s), an operator has to be chosen. An operator is a small program that will modify an alignment (e.g. shuffle the gaps or merge two alignments into a new one). We have designed several operators. Each of them has a specific probability of being used. To create a child, one operator is chosen according to this probability (by spinning another weighted wheel). The chosen operator is then applied to the chosen parent(s). Some operators require two parents, others require only one.
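The weighted wheel used to pick an operator can be sketched as a simple roulette over the current operator probabilities; the operator names below are placeholders.

```python
import random

def spin_wheel(operators, probabilities):
    """Pick one operator with probability proportional to its current weight
    (roulette-wheel, or weighted wheel, selection)."""
    total = sum(probabilities)
    r = random.uniform(0, total)
    acc = 0.0
    for op, p in zip(operators, probabilities):
        acc += p
        if r <= acc:
            return op
    return operators[-1]   # guard against floating point round-off

# placeholder operator table: names and equal initial probabilities (1/22 each)
ops = ["one_point_crossover", "uniform_crossover", "gap_insertion", "block_shuffle"]
probs = [1 / 22.0] * len(ops)
print(spin_wheel(ops, probs))
```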
An important aspect of the SAGA population structure is the constraint
we put on the absence of duplicates. In the same generation, all the alignments
have to be different. This technique helps maintain a high level of diversity
in a population of small size ( 24
). To do so, each newborn child is checked to ensure it is not identical
to any of the children already generated. If it is not, it will be put
into the new generation. Otherwise, it will simply be discarded along with
its parent(s) and the operator, in order to avoid deadlock problems. This
process is carried on until enough children (e.g. 50% of the population)
have been successfully inserted in the new generation. The Evaluation/Breeding
process will be carried on until the decision is made to stop the search.
End. There is no formal proof that a GA must reach the optimum, even given an infinite amount of time, as there is for simulated annealing ( 25 ). Thus
the decision to stop the search has to be an arbitrary choice using more
or less sophisticated heuristic criteria. We use stabilisation as a criterion:
SAGA is stopped when the search has been unable to improve for some specified
number of generations (typically 100). This condition is the most widely
used when working on a population with no duplicates ( 26
).
According to the traditional nomenclature of genetic algorithms ( 21 ), two types of operators are represented in SAGA: the crossovers and the mutations. These programs perform modifications (mutation) or merging of parent alignments (crossover). In SAGA we do not make any distinction between these two types with regard to how we apply them. They are designed as independent programs that input one or two alignments (the parents) and output one alignment (the child). Each operator requires one or more parameters which specify where the operation is to be carried out. For example, an operator which inserts a new gap must be told where (at which position in the alignment) and in which sequences the gap is to be inserted.
The parameters of an operator may be chosen completely randomly in some range, in which case the operator is said to be used in a stochastic manner. Alternatively, all except one of the parameters may be chosen randomly and the value of the remaining parameter will be fixed by exhaustive examination of all possible values. The value which yields the optimal fitness will be used. When an operator is applied in this way, it is said to be used in semi-hill climbing mode. Most of the SAGA operators may be used in either way.

The crossovers. Crossovers are responsible for combining two different alignments into a new one. We implemented two different types of crossover: one-point and uniform. The one-point crossover combines two parent alignments through a single exchange. Figure 2 outlines this mechanism. The first parent is cut straight at some randomly chosen position and the second one is tailored so that the right and the left pieces of each parent can be joined together while keeping the original sequence of amino acids. Any void space that appears at the junction point is filled with null signs. Because of the specificity of this junction point, where rearrangements can occur, this operator combines both the traditional properties of a crossover and those of a local rearrangement mutation. Only the best of the two children produced in this way is kept.
[Figure 2: the one-point crossover]
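A minimal sketch of the one-point crossover described above, assuming alignments are represented as lists of equal-length strings; only one of the two possible children is built, and the row representation and padding details are illustrative.

```python
GAP = '-'

def one_point_crossover(parent_a, parent_b, cut):
    """Take columns [0, cut) of parent_a and splice on the matching tail of
    parent_b, padding the junction with nulls so the child stays rectangular."""
    tails = []
    for row_a, row_b in zip(parent_a, parent_b):
        n_res = sum(1 for c in row_a[:cut] if c != GAP)   # residues used so far
        # find where row_b has consumed the same number of residues
        consumed, pos = 0, 0
        while consumed < n_res:
            if row_b[pos] != GAP:
                consumed += 1
            pos += 1
        tails.append(row_b[pos:])
    width = max(len(t) for t in tails)
    # the complementary child (left of parent_b plus right of parent_a) would be
    # built the same way; SAGA keeps the better of the two
    return [row_a[:cut] + GAP * (width - len(t)) + t
            for row_a, t in zip(parent_a, tails)]

a = ["AC-DE", "A-CDE", "ACD-E"]
b = ["ACD-E", "-ACDE", "AC-DE"]
print(one_point_crossover(a, b, cut=3))
```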
This one point crossover can be very disruptive, especially at the junction point. To avoid this drawback, we added a second operator: the uniform crossover, designed to promote multiple exchanges between two parents in a more subtle manner. This operator is based on an analogy with biological crossover: exchanges are promoted between zones of homology.
The first step consists of mapping the alignment positions that are consistent between the two parents. In an alignment, a position is a column of residues or nulls stacked on top of each other. Two positions are said to be consistent between two alignments if, in each line, they contain the same residue (by reference to the original sequence) or a null coming from the same gap (i.e. between the same residues). For instance, if one line of a given position contains ALA125 and the same line of a position in the other alignment contains ALA101, then the two positions are not consistent. This process is outlined in Figure 3 . Blocks between consistent positions can be directly swapped. One can do so in a semi-hill climbing way, if only the best combination of blocks is chosen, or in a stochastic way, if the block to place between two consistent positions is randomly chosen from one of the two alignments. Both uniform crossovers, the semi-hill climbing one and the stochastic one, are implemented in SAGA.
[Figure 3: mapping of consistent positions for the uniform crossover]
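One way to identify consistent positions is to label every column by the residue indices (or gap locations) it stacks and to intersect the label sets of the two parents. The sketch below follows that idea; the representation is an assumption, not SAGA's internal data structure.

```python
GAP = '-'

def column_labels(alignment):
    """Label every column by, for each row, either the index of the residue it
    holds or the index of the residue preceding its gap (identifying which gap
    the null comes from)."""
    n_cols = len(alignment[0])
    labels = []
    for col in range(n_cols):
        key = []
        for row in alignment:
            n_res = sum(1 for c in row[:col] if c != GAP)   # residues left of col
            if row[col] != GAP:
                key.append(('res', n_res + 1))
            else:
                key.append(('gap_after', n_res))
        labels.append(tuple(key))
    return labels

def consistent_positions(aln_1, aln_2):
    """Columns (as label tuples) present in both parents; the blocks between
    two such columns are the ones the uniform crossover can swap."""
    return set(column_labels(aln_1)) & set(column_labels(aln_2))

a = ["AC-DE", "AGCDE"]
b = ["ACD-E", "AGCDE"]   # hypothetical parents over the same two sequences
print(len(consistent_positions(a, b)))
```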
Gap insertion. While the crossovers combine patterns, there is
still a need to generate these patterns. All the remaining operators were
designed to serve this purpose. The gap insertion operator is the simplest.
This operator extends alignments by inserting gaps. Its mechanism is detailed in Figure 4 . To keep the sequences aligned, each sequence will get a gap insertion of the same size. The sequences are split into two groups. Within each group, all the sequences get the insertion at the same position. The two groups are chosen, based on an underlying estimated phylogenetic tree between the sequences. The tree is randomly split into two sub-trees (Fig. 4 a). Each group consists of all the sequences in one of the two sub-trees. For one of the groups, a position is randomly chosen (Fig. 4 b). A gap of randomly chosen length is then inserted in each of the sequences of the group at the same position (P1 in Fig. 4 b). A gap of the same length is also inserted into all of the sequences of the second group at some position within a maximum distance from the first gap insertion (P2 in Fig. 4 b). This is the stochastic version of the block insertion operator.
[Figure 4: the gap insertion operator]
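A sketch of the stochastic gap insertion: in this simplified version the two groups of sequences are passed in directly rather than derived from an estimated phylogenetic tree, and the length and offset ranges are illustrative.

```python
import random

GAP = '-'

def insert_gap_block(alignment, group_one, gap_len=None, max_shift=5):
    """Stochastic gap insertion sketch. group_one lists the row indices of one
    sub-tree; in SAGA the two groups come from randomly splitting an estimated
    phylogenetic tree, which is simplified away here."""
    n_cols = len(alignment[0])
    gap_len = gap_len or random.randint(1, 5)
    p1 = random.randint(0, n_cols)                        # position for group one
    shift = random.randint(-max_shift, max_shift)
    p2 = min(max(p1 + shift, 0), n_cols)                  # nearby position for group two
    child = []
    for idx, row in enumerate(alignment):
        pos = p1 if idx in group_one else p2
        child.append(row[:pos] + GAP * gap_len + row[pos:])
    return child   # every row receives a gap of the same length, so the child stays rectangular

random.seed(1)
aln = ["ACDEFG", "ACDQFG", "ASDEFG"]
print(insert_gap_block(aln, group_one={0, 1}))
```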
The semi-hill climbing version of this operator is similar to the stochastic one described above but in this case, all the parameters except P1 (the position of the insertion in the first group of sequences) are chosen randomly and all possible values of P1 are tested. The value of P1 that gives the best scoring alignment is chosen.
In general, it is dangerous to assume that the topology of the underlying tree is correct. In the current usage, the main effect of an incorrect tree topology will be to slow the program down. The ability to find the globally optimal alignment should not be changed, just the speed at which the solution will be found. When two groups are chosen, using the tree, one of the groups can consist of a single sequence. This means that, eventually, all possible arrangements of gaps can be found, even if the tree topology is completely wrong. Ideally, one would use fuzzy groupings based on the tree which still allow alternative groupings.

Block shuffling. Generating an optimal arrangement after a gap insertion can often be a matter of shifting a gap to the left or to the right. Therefore we designed an operator that moves blocks of gaps or residues (but not both together) inside an alignment. Here, we depart from the usual definition of a block as a section of alignment containing no gaps, with all of the sub-sequences having the same length ( 27 ). For the purposes of this operator, we define a block of residues to be a set of overlapping stretches of residues from one or more sequences, each stretch being delimited by a gap or an end of a sequence. Each sub-sequence can be a different length but all sub-sequences must overlap. Similarly, a block of gaps is a set of overlapping gaps. An example of each is given in Figure 5 a. A block is chosen by first selecting one residue or gap position from the alignment and then deriving the block to which it belongs. These can be moved inside the alignment, to generate new configurations. Figure 5 b, c and d show some types of move that can be made inside an alignment. These moves are an extension of those proposed for a simulated annealing approach described in ( 12 ). The limits of these moves are set by the alignment itself. A gap can only be shifted until it merges with another gap. Similarly, a stretch of residues can only be shifted until it merges with another stretch of residues. We can enumerate the different ways these operations may be used as follows:
[Figure 5: blocks of residues and gaps, and the block shuffling moves]
- Move a full block of gaps or a full block of residues (Fig. 5 b).
- Split the block horizontally and move one of the sub-blocks to the left or to the right. The subdivision of a block is made according to the tree (cf. gap insertion operator) (Fig. 5 c).
- Split the block vertically and move one half to the left or to the right (Fig. 5 d).
- The move can be made in a semi-hill climbing way, looking for the best position, or in a stochastic manner.
These different combinations lead to a total of 16 possible operators, designed to shuffle gaps in all possible directions. All sixteen operators are implemented in SAGA.

Block searching. A set of operators including crossovers, gap insertion and block shuffling is theoretically able to create any arrangement needed for the correct alignment, but it is also bound to lose a lot of time trying to generate some configurations that a simple heuristic would easily find.
Therefore, we designed a crude method that, given a substring in one of the sequences, tries to find in the alignment, the block to which it may belong. Here we define a block as a short section of alignment without any gaps ( 27 ). First, we select a substring of random length at a random position in one of the sequences. Then, all substrings of the same length in all of the other sequences are compared with the initial substring and the best matching one is selected. This new substring is added to the first one, in order to form a small profile ( 31 ). Then, in the remaining sequences, the best match is located and added to the profile. The process goes on iteratively until a match has been identified in all the sequences. The sequences are then moved to reconstruct the block inside the alignment. This method does not depend on the underlying phylogenetic tree or on the order of the sequences.
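The greedy, profile-based block search might be sketched as follows; the toy match/mismatch score stands in for a real substitution matrix, and the random windowing around the seed (described in the next paragraph) is omitted.

```python
def profile_score(profile, window, score):
    """Score a candidate window against every stretch already in the profile."""
    return sum(score(window[k], stretch[k])
               for stretch in profile for k in range(len(window)))

def best_window(profile, seq, length, score):
    """Best-scoring window of seq against the current profile."""
    positions = range(len(seq) - length + 1)
    return max(positions, key=lambda p: profile_score(profile, seq[p:p + length], score))

def search_block(sequences, seed_seq, seed_start, length, score):
    """Greedy block search sketch: grow a gap-free block one sequence at a
    time, always adding the best remaining match to the profile."""
    profile = [sequences[seed_seq][seed_start:seed_start + length]]
    hits = {seed_seq: seed_start}
    remaining = set(range(len(sequences))) - {seed_seq}
    while remaining:
        # choose the remaining sequence holding the best-matching window
        scored = [(i, best_window(profile, sequences[i], length, score)) for i in remaining]
        idx, pos = max(scored,
                       key=lambda ip: profile_score(profile,
                                                    sequences[ip[0]][ip[1]:ip[1] + length],
                                                    score))
        profile.append(sequences[idx][pos:pos + length])
        hits[idx] = pos
        remaining.remove(idx)
    return hits   # start position of the reconstructed block in every sequence

toy_score = lambda a, b: 2 if a == b else -1   # stand-in for a PAM-style matrix
seqs = ["MKVLITGAG", "AKVLITGGG", "MKVITGAGQ"]
print(search_block(seqs, seed_seq=0, seed_start=2, length=4, score=toy_score))
```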
The initial substring is randomly chosen (typical length 5-15 residues). The block searching is not performed on the whole alignment, but only in a section tailored randomly around the position of the initial substring (typical size between 50 and 150 alignment positions). The ultimate rearrangement occurs inside that section only. This precaution is taken in order to minimise the side effects that could be caused by the existence of repeated motifs inside some of the sequences. This block searching mutation generates more dramatic changes than any of the other operators.

Local optimal or sub-optimal rearrangement. Some situations remain where the presence of a very stable local minimum makes it quite difficult for the other operators to generate the optimal configuration. In order to overcome this problem, we designed our last operator. It attempts to optimise the pattern of gaps inside a given block. This is done in two ways: (i) by exhaustive examination of all gap arrangements inside the block or (ii) by a local alignment GA (LAGA).
The exhaustive examination is carried out if it requires less than
a specified number of combinations to examine (typically 2000). Otherwise,
LAGA is used. LAGA is a crude version of the simple genetic algorithm described by Goldberg ( 21 ). It uses only the one-point crossover and the block shuffling operators. LAGA is typically run for a number of generations equal to 10-fold the number of sequences and with a population size of 20.
The 16 block shuffling operators, the two types of crossover, the block searching, the gap insertion and the local rearrangement operator make a total of 22 operators (uniform crossovers and gap insertion may be used in a stochastic or semi-hill climbing way). During initialisation of the program, all the operators have the same probability of being used, equal to 1/22. There is no guarantee that these probabilities are optimal. Even if they were for the first stages of the run, they could become inadequate in later stages. They could also be test case specific. How to schedule the different operators in a general way that will be efficient in many situations is a difficult problem. In fact, the more operators one has, the more difficult it becomes. We implemented an automatic procedure that deals with this problem and allows us to easily add or remove operators without any need for retuning.
Dynamic schedules, optimised during the run, are an elegant solution to this problem and were proposed by Davis ( 28 ). In this model, an operator has a probability of being used that is a function of the efficiency it has recently (e.g. over the last 10 generations) displayed at improving alignments. The credit an operator gets when performing an improvement is also shared with the operators that came before and may have played a role in this improvement. Thus, each time a new individual is generated, if it yields some improvement on its parents, the operator that is directly responsible for its creation gets the largest part of the credit (e.g. 50%). Then the operator(s) responsible for the creation of the parents also get their share of the remaining credit (50% of the remaining credit, i.e. 25% of the original credit), and so on. This propagation of credit goes back for some specified number of generations (e.g. 4).
After a given number of generations (e.g. 10) these results are summarised
for each of the operators. The credit of an operator is equal to its total
credit divided by the number of children it generated. This value is taken as its usage probability and remains unchanged until the next assessment, 10 generations later. To avoid the early loss of some operators that may become useful later on, all the operators are assigned a minimum probability of being used (the same for all of them, typically equal to half their original probability, i.e. 1/44).
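Putting the last two paragraphs together, a sketch of the credit propagation and of the probability update might look like this; the sharing fraction, the window of ancestors and the final renormalisation are assumptions based on the values quoted above.

```python
MIN_PROB = 1 / 44.0   # floor quoted above: half the original 1/22

def report_credit(credit, child_lineage, amount, share=0.5, depth=4):
    """Propagate the credit for an improvement back along the chain of
    operators that produced the child and its ancestors (most recent first)."""
    for op in child_lineage[:depth]:
        credit[op] = credit.get(op, 0.0) + amount * share
        amount *= (1 - share)                 # the remainder goes to older operators

def update_probabilities(credit, children_made, operators):
    """Every 10 generations: usage probability proportional to credit per child
    generated, bounded below by MIN_PROB and renormalised (renormalisation is
    an assumed detail so the probabilities sum to one)."""
    raw = {op: (credit.get(op, 0.0) / children_made[op]) if children_made.get(op) else 0.0
           for op in operators}
    total = sum(raw.values()) or 1.0
    probs = {op: max(raw[op] / total, MIN_PROB) for op in operators}
    norm = sum(probs.values())
    return {op: p / norm for op, p in probs.items()}

# toy bookkeeping for three hypothetical operators
credit, made = {}, {"gap_insertion": 5, "block_shuffle": 7, "uniform_crossover": 3}
report_credit(credit, ["block_shuffle", "gap_insertion"], amount=1.0)
report_credit(credit, ["gap_insertion"], amount=1.0)
print(update_probabilities(credit, made, list(made)))
```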
Experience gained from monitoring the search shows that areas containing gaps are those most likely to change during a run. For this reason
we found it useful to bias the choice of the mutation site by some probability
related to the concentration of gaps in an area. This bias is moderated
in order to avoid local minimum problems but it greatly helps the algorithm.
Typically, in the middle of a run, the probability of hitting a position
containing a gap is twice the probability of hitting a position without
gaps.
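The bias towards gap-containing regions can be sketched as a weighted sampling of alignment columns, with gap columns given twice the weight of gap-free ones (the factor of two being the mid-run value quoted above).

```python
import random

GAP = '-'

def pick_mutation_column(alignment, gap_bias=2.0):
    """Sample an alignment column, giving columns that contain at least one
    gap gap_bias times the weight of columns with no gaps."""
    n_cols = len(alignment[0])
    weights = [gap_bias if any(row[c] == GAP for row in alignment) else 1.0
               for c in range(n_cols)]
    return random.choices(range(n_cols), weights=weights, k=1)[0]

random.seed(2)
aln = ["AC-DE", "ACD-E", "ACDFE"]
print(pick_mutation_column(aln))
```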
We used a set of 13 test cases based mainly on alignments of sequences of known tertiary structure. Twelve were chosen from the Pascarella structural alignment data base ( 29 ) and one of chymotrypsin sequences from ( 6 , 12 ). We chose test cases of varying length (60-280 residues) and various numbers of sequences (4-32).
The test cases were divided into two groups. The first group (nine test cases) is made of small alignments (4-8 sequences, 60-280 residues long) that can be handled by MSA. Because they can be computed by MSA, they allow us to assess SAGA's ability to minimise the MSA OF. The second group (four test cases) is made of larger alignments (9, 12, 15 and 32 sequences). Three of them are simply extended versions of some of the small test cases; the fourth contains 32 sequences of immunoglobulins (for details see Tables 1 and 2 ). These test cases cannot be handled by MSA, and are designed to show the ability of SAGA to perform multiple sequence alignments of realistic size. We analysed the results by comparing the scores obtained by MSA and SAGA using OF1, and by CLUSTAL W and SAGA using OF2.
To analyse the similarity between the structural alignments and those
obtained by one of these three programs, we use a measure of consistency
between two alignments. This measure gives the percentage of residues that
are aligned in a similar manner in the two alignments. It allows us to
measure the level of sequential consistency between computed alignments
generated by SAGA, MSA and CLUSTAL W and the reference structural alignments.
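One plausible reading of this consistency measure is the fraction of aligned residue pairs of the reference alignment that are reproduced in the test alignment; the sketch below implements that reading and may differ in detail from the measure actually used.

```python
GAP = '-'

def residue_map(alignment):
    """For every column, record for each row the index of the residue it holds
    (or None for a gap)."""
    cols = []
    counters = [0] * len(alignment)
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == GAP:
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(col)
    return cols

def pairings(alignment):
    """Set of (seq_i, res_i, seq_j, res_j) residue pairs placed in the same column."""
    pairs = set()
    for col in residue_map(alignment):
        for i, ri in enumerate(col):
            for j, rj in enumerate(col):
                if i < j and ri is not None and rj is not None:
                    pairs.add((i, ri, j, rj))
    return pairs

def consistency(aln_test, aln_reference):
    """Percentage of aligned residue pairs of the reference that are reproduced
    in the test alignment (a sketch of one plausible definition)."""
    ref = pairings(aln_reference)
    test = pairings(aln_test)
    return 100.0 * len(ref & test) / len(ref) if ref else 100.0

ref = ["AC-DE", "ACDFE"]
test = ["ACD-E", "ACDFE"]
print(round(consistency(test, ref), 1))
```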
SAGA was written in ANSI C and was implemented on an OpenVMS system.
Memory requirements are low, the main usage being to store the separate
alignments in the population. For 10 sequences with an average alignment
length of 200 and a population size of 100, ~1 Mb of memory is sufficient.
The source code is available free of charge from the authors; please send
an e-mail message to Cedric.Notredame@EBI.ac.uk .
We analysed three aspects of SAGA in detail. As the robustness of our optimisation strategy depends on the dynamic operator setting, we checked its behaviour on various test cases. In order to show that SAGA was able to perform a rigorous optimisation, we used the first group of nine test cases, for which, thanks to MSA, a mathematically optimal, or sub-optimal solution is known for OF1. We verified that SAGA was able to find a solution at least as good. Then, using the second set of four test cases, we analysed the ability of SAGA to perform a multiple alignment on sequences that could not be aligned by MSA. We compared these results with those given by CLUSTAL W on the same test cases. With these two sets of experiments, we also tried to assess the biological relevance of the alignments produced by SAGA by reference to the structural alignments.
Table 1 The performance of MSA and SAGA on nine test cases
Test case | Nseq | Length | MSA score | MSA versus structure (%) | CPU-time | SAGA score | SAGA versus structure (%) | CPU-time
Cyt c | 6 | 129 | 1 051 257 | 74.26 | 7 | 1 051 257 | 74.26 | 960 |
Gcr | 8 | 60 | 371 875 | 75.05 | 3 | 371 650 | 82.00 | 75 |
Ac protease | 5 | 183 | 379 997 | 80.10 | 13 | 379 997 | 80.10 | 331 |
S protease | 6 | 280 | 574 884 | 91.00 | 184 | 574 884 | 91.00 | 3500 |
Chtp | 6 | 247 | 111 924 | * | 4525 | 111 579 | * | 3542 |
Dfr secstr | 4 | 189 | 171 979 | 82.03 | 5 | 171 975 | 82.50 | 411 |
Sbt | 4 | 296 | 271 747 | 80.10 | 7 | 271 747 | 80.10 | 210 |
Globin | 7 | 167 | 659 036 | 94.40 | 7 | 659 036 | 94.40 | 330 |
Plasto | 5 | 132 | 236 343 | 54.03 | 22 | 236 195 | 54.05 | 510 |
Table 2 The performance of CLUSTAL W and SAGA on four test cases
Test case | Nseq | Length | CLUSTAL W score | CLUSTAL W versus structure (%) | CPU-time | SAGA score | SAGA versus structure (%) | CPU-time
Igb | 32 | 144 | 31 812 824 | 55.86 | 60 | 31 417 736 | 55.97 | 41 135 |
Ac Protease2 | 10 | 186 | 10 514 101 | 41.02 | 16 | 10 393 145 | 43.50 | 12 236 |
S Protease2 | 12 | 281 | 16 354 800 | 64.37 | 21 | 16 282 179 | 66.18 | 20 537 |
Globin2 | 12 | 171 | 5 249 682 | 94.90 | 18 | 5 233 058 | 94.01 | 2538 |
SAGA was run on all the test cases, and the schedules for all the operators were plotted. Figure 6 a and b present some of these results. Figure 6 a shows that the usage probabilities of the two types of crossover (one-point and uniform) evolve according to different schedules. In the early stages, the young population is very heterogeneous and lacks consistency. Thus the uniform crossover can hardly be used. Later in the run, when some order has been created, the uniform crossover can be applied more easily. It then gradually replaces the one-point crossover. This graph clearly shows that the two types of crossover are competing with each other, although no extra information regarding the type of the operator is given to the algorithm. All the operators were analysed in the same way, in order to verify that they were needed (data not shown).
[Figure 6: operator usage schedules during a run]
Some operators are stochastic and some work in a semi-hill climbing
way. We verified that these semi-hill climbing operators were not over-weighted
with respect to the other operators. The results are shown in Figure 6
b. This figure clearly reveals that the semi-hill climbing and stochastic
operators behave in a complementary way during the run. During the early
stages, semi-hill climbing operators that easily generate improvements
are favoured. Once all the possible easy improvements have been made, however,
stochastic operators gradually replace the semi-hill climbing ones, opening
the way for new configurations. This alternating use of both types of mutation is repeated through a series of cycles, until the evolution stops. When this point is reached, operators find it difficult or impossible to generate new improvements and they stabilise at their original, default level of 1/22 (when no improvement has been made for some specified number of generations). Although these schedules vary from one test case to another, the main patterns described above occur consistently. Closely related operators compete with each other and may display cycles of oscillation. Each possible modification may be viewed as a niche for which the various operators compete during the run. It must be emphasised that these schedules emerge naturally, in the sense that no explicit coding is responsible for the oscillations we observe. These results suggest that SAGA, through the dynamic operator setting, is able to optimise the use of each operator according to its real behaviour.
SAGA and MSA were compared with regard to their ability to optimise OF1. This is the OF that MSA attempts to optimise. For nine of the test cases, we compared the alignments produced by MSA and those generated by SAGA. SAGA was run with default parameter settings. The results shown in Table 1 are the best from three trials. On a larger number of runs it was verified that SAGA reaches this solution in at least one third of the runs. In all the cases, SAGA was able to produce a score at least as good as that produced by MSA (note that lower scores are better). In four cases, this score was better. We tried to derive a correlation between the mathematical optimisation of OF1 and its biological relevance. To do so, these four alignments were compared with the structural reference alignments for consistency. This analysis reveals that an improvement of the optimisation consistently correlates with an improvement of the accuracy: the alignments for which SAGA outperforms MSA are more similar to their structural references. Recently, MSA has been upgraded to a newer, faster version, but the results are identical ( 7 ).
In principle, MSA can be used to find the guaranteed optimal alignment
for a set of sequences. In practice, however, the parameter settings required
to do so will often be prohibitively expensive in terms of time and memory.
By default, MSA uses heuristic bounds which do not guarantee optimality.
In cases where SAGA achieves a better score than MSA, one can calculate
new bounds from the SAGA alignment and use these to run MSA. In this case,
MSA achieves the same score as SAGA (data not shown). In practice, if one does not have a higher scoring reference alignment (e.g. from SAGA), adjusting the bounds is not trivial. If they are set incorrectly, one either does not get the optimal alignment or MSA runs out of memory. Attempts to find better scoring solutions than those found by SAGA, by increasing the bounds used by MSA, failed. Starting with the bounds calculated from the SAGA alignment, the bounds were increased as far as possible, up to the point where the problem became uncomputable with the available computer time and memory. This suggests that the solutions presented in Table 1 could indeed be optimal.
MSA uses quasi-natural gap penalties because of the computational cost of using natural ones. It can be argued that natural gap penalties are more biologically realistic ( 18 ) and we therefore use them for the remaining four test cases. MSA is also severely limited regarding the number and the length of the sequences it can align. In these four test cases, there were too many sequences for MSA to perform its task. Without the MSA reference, it becomes difficult to assess the efficiency of the optimisation. Therefore, we replaced the MSA reference with an alignment produced by CLUSTAL W. It must be stressed here that CLUSTAL W does not explicitly try to optimise any OF. Despite these limitations, by choosing an appropriate set of parameters, we used CLUSTAL W in conditions where it would produce a result as close as possible to the optimisation of OF2. These alignments were compared with those obtained from SAGA while optimising OF2.
Both sets of alignments were then compared to the structural reference alignments of Pascarella ( 29 ). These results are presented in Table 2 and show that in all four test cases SAGA builds an alignment with a better score than CLUSTAL W with respect to OF2. The table also shows that in three out of four test cases, the alignment generated by CLUSTAL W is less similar to the structural alignment than is that produced by SAGA. These results suggest that with similar types of weights, similar types of substitution cost (PAM250) and a similar range of gap penalties, SAGA performs more accurately than CLUSTAL W on data sets of realistic size.
Figure 7 presents the N-terminal portion of the S protease2 test case obtained with SAGA. The reference structural alignment contains 12 completely conserved positions. SAGA is able to reconstitute 11 of these positions, while CLUSTAL W only finds 10 of them. Overall, the comparison of the SAGA alignment with the structural reference shows that the main features are accurately found by our algorithm.
[Figure 7: N-terminal portion of the S protease2 alignment produced by SAGA]
We believe SAGA to be a powerful and flexible tool for sequence alignment. This can be seen by the ability of SAGA to achieve what appear to be optimal alignment scores and by the consistency of our alignments with test cases of known tertiary structure. The consistency of the SAGA alignments with structural reference alignments is mainly a measure of the usefulness of the particular OFs we have tested. Nonetheless, even with the very limited range of OFs that we have tried, SAGA performs extremely well. SAGA is still fairly slow for large test cases (e.g. with >20 or so sequences) but we have made little effort at optimising the program for sheer speed. In the future, it may be desirable to use a hybrid progressive/genetic algorithm approach in order to combine the speed of the former with the accuracy of the latter.
Currently, we seed the starting population of alignments completely randomly. We could use heuristic alignments generated by CLUSTAL W, for example, perhaps with different parameter settings and refine these. We prefer not to, however, as the starting alignments could be trapped in local minima. If the starting alignment is close to the optimal solution, SAGA could be used very easily as an alignment improver. This would provide an easy method for generating hybrid alignments for very large test cases but we have not evaluated SAGA in detail in this respect.
Genetic algorithms have been used successfully as a practical way to solve many computationally difficult problems. They are intellectually satisfying in their simplicity and the way they attempt to mimic biological evolution. From the point of view of multiple sequence alignment, the use of stochastic optimisation methods has proved to be difficult, with just a few exceptions ( 9 , 32 ). We found that a simple GA, applied in a straightforward fashion to the alignment problem, was not very successful. The main device which allowed us to efficiently reach very high quality solutions was to use a large number of mutational and crossover operators and to automatically schedule them. At first glance, this is not very satisfactory in that it makes the method seem very complicated and cumbersome. Multiple alignment, however, is not a simple problem. The most useful of our operators are the ones which appear most grounded in biological reality, e.g. moving blocks using the tree as a guide. In reality, during the course of the evolution of a sequence family, many different evolutionary events may take place. The automatic scheduling has a further advantage. Should it turn out, in the future, that SAGA is not very efficient at handling certain types of situation, it is a simple matter to invent some new operators designed specifically for the problem and to slot them into the existing scheme. The automatically assigned probabilities of usage at different stages in the alignment give a direct measure of usefulness or redundancy for a new operator.
The second major reason for using GAs in the context of multiple alignment is the complete freedom to use any OF one can think of. This is perhaps the most important single feature of the approach. One key to successfully tackling the multiple alignment problem is to have a good measure of multiple alignment quality. The GA used in SAGA offers the opportunity to implement and test new OFs.
After sequence alignment, there are two related questions which one might wish to ask. First, one might like to know if the alignment is significant with respect to some statistical model, e.g. one might like to know the probability of observing any particular alignment by chance alone. This is a very difficult problem which has solutions for two sequences under certain conditions ( 30 ). The second question is how stable the alignment is, or which pieces of it are stable, i.e. are there alternative alignments with similar scores? This is important if one is to usefully interpret new alignments, and there are some solutions, again, for just two sequences ( 33 ). A by-product of the GA strategy in SAGA is a measure of consistency for each column in the final alignment. This shows which columns are stable and which have high scoring alternative arrangements. The consistency is derived by counting how often a particular column occurs in the 100 alignments of a SAGA population during or after optimisation. We have no statistical interpretation of this consistency measure, but it is an extremely useful by-product of the SAGA alignment process, obtained at no extra computational cost.
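The per-column consistency could be computed by identifying each column by the residue indices it stacks and counting how often that column occurs across the population; the following sketch illustrates the idea on a toy population.

```python
from collections import Counter

GAP = '-'

def column_keys(alignment):
    """Identify each column by the residue indices (or gap markers) it stacks,
    so identical columns can be matched across different alignments."""
    keys = []
    counters = [0] * len(alignment)
    for c in range(len(alignment[0])):
        key = []
        for r, row in enumerate(alignment):
            if row[c] == GAP:
                key.append(('gap_after', counters[r]))
            else:
                key.append(('res', counters[r]))
                counters[r] += 1
        keys.append(tuple(key))
    return keys

def column_consistency(best_alignment, population):
    """For every column of the best alignment, the fraction of population
    members containing an identical column (a sketch of the by-product measure)."""
    counts = Counter(k for aln in population for k in set(column_keys(aln)))
    return [counts[k] / len(population) for k in column_keys(best_alignment)]

best = ["AC-DE", "ACDFE"]
population = [best, ["ACD-E", "ACDFE"], ["AC-DE", "ACDFE"]]  # toy population of 3
print(column_consistency(best, population))
```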
We wish to thank Stephen Altschul for advice on using MSA.