Multiple Sequence Alignment

An Introduction to Bioinformatics Multiple Sequence Alignment

AIMS
• To introduce the different approaches to multiple sequence alignment
• To identify criteria for selecting a multiple sequence alignment program

OBJECTIVES
• To select an appropriate multiple sequence alignment program
• To carry out a multiple sequence alignment using CLUSTALX

• The result of searching databases is a list of sequences, either protein or nucleotide, which exhibit significant similarity and are inferred to be homologous
• These sequences can then be subjected to multiple sequence alignment
• The process is an attempt to place in the same column residues that derive from a common ancestral residue by substitution
• The most successful alignment is the one that most closely represents the evolutionary history of the sequences

Why create multiple sequence alignments?
• To attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees
• To identify functional sites
• To identify modules in multimodular proteins
• To identify motifs
• To detect weak similarities in databases using profiles
• To design PCR primers for the identification of related genes

Global versus local alignments
• Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned
• Homology is often restricted to certain regions of a sequence
• Many proteins are multi-modular, and the shuffling of modules is part of the evolutionary process
• An attempt to align, over their entire length, sequences that share some but not all of their modules would be bound to lead to errors
• In such a case a series of multiple local sequence alignments, one for each module, would be appropriate

Substitutions and Gaps
• In trying to establish the evolutionary trajectories of a group of related sequences, the same problem is encountered as in pairwise alignment
• How do you deal with substitutions and gaps?
• The solution is the same: use gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM
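A minimal sketch of how a substitution score, a gap penalty and a gap extension penalty combine into one alignment score. The function names, the toy match/mismatch scores and the penalty values are illustrative assumptions; they are not PAM or BLOSUM values, and real programs may define the penalties differently.

```python
def toy_substitution(x, y):
    """Toy stand-in for a substitution matrix (assumed values, not PAM/BLOSUM)."""
    return 2 if x == y else -1


def score_alignment(a, b, substitution=toy_substitution, gap_open=-10, gap_extend=-1):
    """Score two already-aligned sequences of equal length ('-' marks a gap).

    Convention assumed here: the first position of a gap costs gap_open,
    every further position of the same gap costs gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_in = None  # which sequence currently has an open gap: 'a', 'b' or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a gap-only column says nothing about this pair
        if x == "-" or y == "-":
            who = "a" if x == "-" else "b"
            total += gap_extend if gap_in == who else gap_open
            gap_in = who
        else:
            total += substitution(x, y)
            gap_in = None
    return total


print(score_alignment("AC--GT", "ACTTGT"))
```

With these assumed values, one long gap is penalised less than several short ones of the same total length, which is the usual reason for separating the two penalties.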

There are essentially four major approaches to multiple sequence alignment:
• Optimal global sequence alignment
• Progressive global alignment
• Block-based global alignment
• Motif-based local alignment

Optimal global sequence alignment
• Attempts to align sequences along their entire length
• ‘Optimal’ means that it gives the best alignment among all the possible solutions for a given scoring scheme
• Whether the optimal alignment corresponds to the biologically correct alignment depends on a variety of factors, e.g. the substitution matrix, the gap penalty and the scoring scheme
• Optimal global sequence alignment programs are very computationally intensive, and the complexity of the task increases exponentially with the number of sequences
• Few programs employ this approach; one is available on the Web
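The exponential growth can be made concrete with a little arithmetic. The sketch below assumes the straightforward extension of pairwise dynamic programming to N sequences, in which one cell is filled for every combination of positions; the source does not say which algorithm a given program uses, and the function names are hypothetical.

```python
def dp_cells(lengths):
    """Number of cells in an N-dimensional dynamic programming lattice.

    Each sequence of length L contributes L + 1 positions
    (the extra one is the empty prefix).
    """
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


def moves_per_cell(n_sequences):
    """Predecessor cells to examine per cell: every sequence either
    advances or takes a gap, except the case where none advances."""
    return 2 ** n_sequences - 1


for n in (2, 3, 4, 5):
    print(n, dp_cells([100] * n), moves_per_cell(n))
```

Under these assumptions, each additional sequence of 100 residues multiplies the number of cells by about 100 and roughly doubles the work per cell.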

Progressive global alignment
• Employs multiple pairwise alignments in a series of three steps:
1. Estimate alignment scores between all possible pairwise combinations of sequences in the set
2. Build a ‘guide tree’ determined by the alignment scores
3. Align the sequences on the basis of the guide tree
• Each step can be carried out in a number of ways designed to increase speed or accuracy
• Progressive global alignment is the most commonly used method, and the best-known programs employing this approach are the CLUSTAL family
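The first two steps can be sketched in a few lines. This is only an illustration under stated assumptions: the pairwise score is replaced by the similarity ratio from Python's difflib (not a real alignment score), the guide tree is built by simple average-linkage clustering, and step 3 (the actual profile alignment) is left out. The function names are hypothetical and this is not how any CLUSTAL program is implemented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Step 1 stand-in: distance = 1 - difflib similarity, for every pair of names."""
    return {
        frozenset((a, b)): 1.0
        - SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        for a, b in combinations(seqs, 2)
    }


def guide_tree(seqs):
    """Step 2: average-linkage clustering; the tree is returned as nested tuples."""
    dist = pairwise_distances(seqs)
    members = {name: [name] for name in seqs}  # cluster -> leaf names it contains

    def average_distance(pair):
        x, y = pair
        total = sum(dist[frozenset((i, j))] for i in members[x] for j in members[y])
        return total / (len(members[x]) * len(members[y]))

    while len(members) > 1:
        x, y = min(combinations(members, 2), key=average_distance)
        members[(x, y)] = members.pop(x) + members.pop(y)
    return next(iter(members))


def merge_order(tree):
    """Order in which step 3 would align clusters: closest pairs first."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return merge_order(left) + merge_order(right) + [tree]


example = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "GGGGCCCC"}
tree = guide_tree(example)
print(tree)
print(merge_order(tree))
```

Sequence names are assumed to be plain strings, so that leaves can be told apart from merged clusters.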

Block-based global alignment
• Divides the sequences into blocks which, depending on the program, are exact (identical regions of sequence) or not exact, and uniform (found in every sequence) or not uniform
• Once the blocks have been defined, other approaches are employed to align the regions between the blocks
• Once blocks have been identified, other programs (e.g. CLUSTAL X) can be used to multiply align individual modules
• Examples of block-based global alignment programs available on the Web are DCA and DIALIGN2

Motif-based local alignment
• Most recent local alignment programs employ computationally efficient heuristics to solve the optimization calculations for local alignments
• The Gibbs iterative sampling approach is used to find blocks in programs such as the excellent MACAW
• MACAW, although available as freeware, is not available as a Web-based application
• MEME is Web-based

Which method to use
• Optimal global alignment programs are rarely employed: they are computationally intensive and can only handle a very small number of sequences
• When the sequences to be aligned are homologous over their entire length, a progressive global alignment program should be used
• Where the sequences share conserved modules in a consistent order, block-based global alignment or motif-based local alignment is appropriate
• Where the sequences share conserved modules, but the order of modules is not consistent, a motif-based local alignment is the approach of choice
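The selection rules above can be written down as a small decision function. This is only a restatement of the rules as code; the function name and its three yes/no parameters are hypothetical, and in practice judging each condition requires inspecting the sequences.

```python
def recommend_approach(homologous_full_length, shares_modules, consistent_module_order):
    """Return the approach suggested by the selection rules for a set of sequences."""
    if homologous_full_length:
        return "progressive global alignment"
    if shares_modules and consistent_module_order:
        return "block-based global alignment or motif-based local alignment"
    if shares_modules:
        return "motif-based local alignment"
    return "no approach suggested by these rules"


print(recommend_approach(False, True, False))
```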

Multiple sequence alignment file types
• The various multiple sequence alignment programs require different input file types, and there are also a variety of output file types
• The sequences to be aligned are usually placed in a single file, commonly in the Fasta format
• The common output file formats are: NBRF/PIR, EMBL/SWISS-PROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file
• Multiple sequence files can be interconverted using

Sequence formats that allow one or more sequences:
• IG/Stanford, used by Intelligenetics and others
• * GenBank/GB, genbank flatfile format
• * NBRF format
• * EMBL, EMBL flatfile format
• * DNAStrider, for common Mac program
• * Fitch format, limited use
• * Pearson/Fasta, a common format used by Fasta programs and others
• * Zuker format, limited use. Input only.
• * Olsen, format printed by Olsen VMS sequence editor. Input only.
• * Phylip3.2, sequential format for Phylip programs
• * Phylip, interleaved format for Phylip programs (v3.3, v3.4)
• + MSF multi sequence format used by GCG software
• + PAUP's multiple sequence (NEXUS) format
• + PIR/CODATA format used by PIR
• + ASN.1 format used by NCBI
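As an illustration of the simplest of these formats, the sketch below reads a Pearson/Fasta file holding several sequences. It assumes the usual layout (a header line starting with ">", followed by one or more lines of sequence) and keeps only the first word of each header as the name; the function name is hypothetical.

```python
def parse_fasta(text):
    """Read Fasta text into a dict mapping sequence name -> sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("Fasta header without a name")
            name = fields[0]  # first word of the header is used as the name
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = ">seq1 first example\nKSKERYKDEN\nGGNYFQL\n>seq2\nYEGLTTANG\n"
print(parse_fasta(example))
```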

Phylip
The first line of the input file contains the number of species and the number of characters separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential.

7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
          TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
          TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
          TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP-
          TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------
          ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP-----
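The layout described above can be read with a short parser. This is a minimal sketch for interleaved files, assuming the names occupy exactly the first ten characters of the first block, later blocks carry no names, and blanks inside the sequence data are only for readability; the function name is hypothetical.

```python
def parse_phylip_interleaved(text):
    """Read interleaved Phylip text into a dict mapping name -> aligned sequence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # drop blank separators
    if not lines:
        raise ValueError("empty input")
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    names, chunks = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_species:
            names.append(line[:10].strip())  # ten-character species name
            chunks.append([line[10:]])
        else:
            chunks[i % n_species].append(line)  # later blocks follow the same order
    seqs = {}
    for name, parts in zip(names, chunks):
        seq = "".join("".join(parts).split())  # remove all blanks
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, found {len(seq)}")
        seqs[name] = seq
    return seqs


example = (
    "2 12\n"
    "seqA      ACGT-ACG\n"
    "seqB      ACGTTACG\n"
    "\n"
    "          TTAA\n"
    "          TT-A\n"
)
print(parse_phylip_interleaved(example))
```

The length check against the header is a deliberate choice: a file whose header promises more characters than the blocks contain is rejected rather than silently accepted.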

Clustal
Clustal format files contain the word clustal at the beginning. Sequences can be interleaved (as in the example below) or sequential. (Note: the multiple sequence alignment programs ClustalW and ClustalX produce Clustal format files by default, but you can specify in "output format options" if you want your results in a different format.)

CLUSTAL W (1.74) multiple sequence alignment

seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7 -------------------------------------------------KELWEALTCSR

seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
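A minimal sketch of reading an interleaved Clustal file back into whole sequences. It assumes each sequence line has the form "name, whitespace, residues", that blocks are separated by blank lines, and that any line starting with whitespace is a conservation line to be skipped; the function name is hypothetical.

```python
def parse_clustal(text):
    """Read Clustal-format text into a dict mapping name -> aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: first line must start with 'CLUSTAL'")
    seqs = {}
    for line in lines[1:]:
        if not line.strip() or line[0].isspace():
            continue  # blank separator or conservation line
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"cannot read sequence line: {line!r}")
        name, chunk = fields[0], fields[1]  # any further fields are ignored
        seqs[name] = seqs.get(name, "") + chunk  # join the blocks in file order
    return seqs


example = (
    "CLUSTAL W (1.74) multiple sequence alignment\n"
    "\n"
    "seq1  AC-GT\n"
    "seq2  ACGGT\n"
    "      ** **\n"
    "\n"
    "seq1  TTA\n"
    "seq2  T-A\n"
    "      * *\n"
)
print(parse_clustal(example))
```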
