1 / 48

From Pairwise Alignment to Database Similarity Search

From Pairwise Alignment to Database Similarity Search. Outline. Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments. Local. Global. Best score for aligning the full length sequences Dynamic programming Algorithm: Needelman- Wunch

loc
Download Presentation

From Pairwise Alignment to Database Similarity Search

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. From Pairwise Alignment to Database Similarity Search

  2. Outline • Summary Local and Global alignments • FASTA and BLAST algorithms • Evaluating significance of alignments

  3. Local Global • Best score for aligning the full length sequences • Dynamic programming • Algorithm: • Needelman- Wunch • Table cells are allowed any score • Best score for aligning part of sequences • Dynamic programming • Algorithm: • Smith-Waterman • Table cells never score below zero Pairwise Alignment Summary

  4. Available tools for Sequence Alignments ALIGN -- GLOBAL (N-W)/ LOCAL (S-W) BLAST2SEQ– Only LOCAL using word match spidey -- aligns mRNAs to genomic sequence est2genome -- aligns ESTs to genomic sequence

  5. Gap Scores >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA Biologically, indels occur in groups we want our gap score to reflect this

  6. Gap Scores • Standard solution: affine gap model • Once-off cost for opening a gap • Lower cost for extending the gap • Changes required to algorithm

  7. Affine Gap Penalty • wx = g + r(x-1) • wx : total gap penalty; g: gap open penalty; r: gap extend penalty;x: gap length • gap penalty chosen • Gaps not excluded • Gaps not over included • Typical Values: g=-12; r = -4

  8. Is this good enough ???

  9. Drawbacks to DP Approaches • Compute intensive • Memory Intensive

  10. Complexity • Complexity is determined by size of table • Aligning a sequence of lengthmagainst one of lengthnrequires calculating(mn)cells • Estimate: we calculate108 cells per second • Aligning two mRNA sequences of8,000 bprequires64,000,000 cells • Aligning an mRNA and a107 bpchromosome requires~1011 cells

  11. Searching databases • Goal: Find homologue sequences in database to query input.

  12. new sequence ? similar function Similar function ≈ Sequence Database

  13. Searching databases • Goal: Find homologue sequences in database to query input. • Naïve solution:Use exact algorithm to compare each sequence in the database to query.

  14. Complexity for genomes • Human genome contains3  109base pairs • Searching an mRNA against HG requires~1013 cells • -Even efficient exact algorithms will be extremely slow when preformed millions of times. • -Running the computations in parallel is expensive.

  15. So what can we do?

  16. Searching databases • Solutions: • Use a heuristic (approximate) algorithm to discard most irrelevant sequences. • Perform the exact algorithm on the small group of remaining sequences.

  17. Heuristic strategy • Homologous sequences are expected to contain common short segments (probably with substitutions, but without ins/dels) • Preprocess database (DB) into new data structure to enable fast accession • Remove low-complexity regions that are not useful for meaningful alignments

  18. Low Complexity Sequences • AAAAAAAAAAA • ATATATATATATA • Transposable elements (LINEs, SINEs)

  19. Low Complexity Sequences Whats wrong with them? produce artificial high scoring alignments. So what do we do?: We apply Low Complexity masking to the query sequence Mask TCGATCGTATATATACGGGGGGTA TCGATCGNNNNNNNNCNNNNNNTA

  20. Low Complexity Sequences Complexity is calculated as: Where N=4 in DNA (4 bases), L is the length of the sequence And ni the number of each residue in the sequence K=1/L logN(L!/Π ni!) all i For the sequence GGGG: L! =4x3x2x1=24 ng =4 nc =0 na =0 nt =0 Πni =24x1x1x1=24 K =1/4 log4 (24/24)=0 For the sequence CTGA: L! =4x3x2x1=24 ng =1 nc =1 na =1 nt =1 Πni =1x1x1x1 K =1/4 log4 (24/1)=0.573

  21. Heuristic (approximate solution) Methods: FASTA and BLAST • FASTA (Lipman & Pearson 1985) • First fast sequence searching algorithm for comparing a query sequence against a database • BLAST - Basic Local Alignment Search Technique (Altschul et al 1990) • improvement of FASTA: Search speed, ease of use, statistical rigor • Gapped BLAST (Altschul et al 1997)

  22. FASTA and BLAST • Common idea - a good alignment contains subsequences of absolute identity: • First, identify very short (almost) exact matches. • Next, the best short hits from the 1st step are extended to longer regions of similarity. • Finally, the best hits are optimized using the Smith-Waterman algorithm.

  23. FastA (fast alignment) • Assumption: a good alignment probably matches some identical ‘words’ • Example: Aligning a query sequence to a database • Database record: • ACTTGTAGATACAAAATGTG • Query sequence: • A-TTGTCG-TACAA-ATCTG • Matching words of size 4

  24. FastA • Preprocess of all the sequences in the database. Find short words and organize in dictionaries. • Process the query sequence and prepare a dictionary. • ATGGCTGCTCAAGT…. ATGG TGGC GGCT … … Query

  25. FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6. seq1 seq2

  26. The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix. seq1 seq2

  27. FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. seq1 seq2

  28. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. seq1 seq2

  29. FastA final stage • Apply an exact algorithm of local alignment on surviving records, computing the final alignment score. • Calculate an Alignment score (S) • Evaluate the statistical significance

  30. Assessing Alignment Significance Determine probability of alignment occurring at random Random Related No Good Ideal

  31. FastA at EMBL

  32. FastA • A set of programs for database searching. FastA EMBL

  33. FASTA SEQUENCE FORMAT This format contains a one line header followed • by lines of sequence data. • Sequences in fasta formatted files are preceded by a line starting • with a" >" symbol. • The first word on this line is the name of the sequence. • The remaining lines contain the sequence itself. • Blank lines in a FASTA file are ignored,

  34. FastA at EMBL • Output, general information: • Z’ score = deviation (in sd) of the actual score from the mean of random scores Z=(x-mean)/sd • Opt: the number of optimized scores observed. Lower limit • E( ): the number of sequences expected in the score range.

  35. FastA at EMBL

  36. FastA at EMBL

  37. BLAST Basic Local Alignment Search Tool • Developed to be as sensitive as FastA but much faster. • Also searches for short words. • Protein 3 letter words • DNA 11 letter words. • Words can be similar, not only identical

  38. Word Search -BLAST • Identity - CAT : CAT • Similarity – CAT : CAT, CAY, HAT … • But even CAT : XTX can be similar • For each three letter word there are many similar words (depending on the alphabet). • Similar words are only the ones that have a minimum cut-off score (T). Y= C or T H= A, C, T X=A or T or C or G

  39. BLAST • Find matching word pairs • Extend word pairs as much as possible,i.e., as long as the total weight increases • Result:High-scoring Segment Pairs (HSPs) THEFIRSTLINIHAVEADREAMESIRPATRICKREAD INVIEIAMDEADMEATTNAMHEWASNINETEEN

  40. BLAST • Try to connect HSPs by aligning the sequences in between them: THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD INVIEIAMDEADMEATTNAMHEW___ASNINETEEN The Gapped Blast algorithm allows several segments that are separated by short gaps to be connected together to one alignment

  41. Score and E-value • The score is a measure of the similarity of the query to the sequence shown. • The E-value is a measure of the reliability of the score. • E-value is the probability due to chance, that there is another alignment with a similarity greater than the given S score.

  42. BLAST- Score Score (S): (identities + mismatches)-gaps Bit score (S) : • Similar to alignment score • Normalized • Higher means more significant

  43. BLAST- E value:Expected # of alignments with score at least S Increases linearly with length of query sequence Decreases exponentially with score of alignment Increases linearly with length of database m = length of query ; n= length of database ; s= score • K ,λ: statistical parameters dependent upon scoring system • and background residue frequencies

  44. What is a Good E-value - thumb rules • E values of less than 0.00001 show that sequences are almost always homologues. • Greater E values, can represent homologues as well. • Generally the decision whether an E-value is biologically significant depends on the size of database that is searched

  45. Significance of Gapped Alignments • Gapped alignments use same statistics •  and K cannot be easily estimated • Empirical estimations and gap scores determined by looking at random alignments

  46. BLAST • Blast is a family of programs: BlastN, BlastP,BlastX, tBlastN, tBlastX • BlastN - nt versus nt database • BlastP - protein versus protein database • BlastX - translated nt versus protein database • tBlastN - protein versus translated nt database • tBlastX - translated nt versus translated nt database Query: DNA Protein Database: DNA Protein

  47. BLAST at NCBI • Output • Graphical out put of top results • The alignments for top scores • Scores for each alignment: • E value • Bits score: a score normalized with respect to the scoring system. Can be used to compare different searches.

More Related