Sequence based searching

Lesson 7
Based on a presentation by Irit Gat-Viks
Based on a presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University.

Reminder – Importance of Homology
Use a sequence as a search query in order to find homologous sequences in a database.
Homology – similarity between sequences that results from a common ancestor.
Basic assumption: sequence homology → similar structure/function.
Why?
• Characterizing an ORF.
• Finding duplicate genes in the same organism (known function, variants).
• Finding homologous genes in other organisms (phylogeny, known function).
Study a sequence through its homologs.

Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 (TfR2).
>gi|20140567|sp|Q07891|TFR1_CRIGR Transferrin receptor protein 1 (TfR1) (TR) (TfR) (Trfr)
Length = 757
Score = 540 bits (1392), Expect = e-152
Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)
Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136
+ ++ F F++GY+ + R + C G+S ++ E++ RLY
Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123
Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196
W+DL+ + + L DTI+Q S R AGS L I KL VW D H
Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183
Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256
YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D
Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240
Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316
+DL+ I+FA+KV NAQ F A GVLIY + F P + ++
Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291
Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376
+++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++
Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351
Identity, Similarity, Homology

>gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa]
Length = 509
Score = 405 bits (1043), Expect = e-111
Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)
Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER 120
NL+ G + H + ++YGP+ + G+V + PE + K++
Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN 99
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK 347
D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N +
Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 330
Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406
L+ EV L+ +P LKASIKE+LRLH P+ + + R D +
Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390
Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464
Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C G
Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
• For proteins, finding distant relatives is a difficult task.
• Distant protein family members may share <20% amino acid identity (!).

Query types
DNA vs. protein (e.g., UCAUAC or Serine–Tyrosine)
• The sequence query can be a nucleotide sequence or an amino acid sequence.
• The search is performed against a nucleotide or amino acid database.
Which search is preferable?
1. Which sequence is more conserved during evolution? Answer: the protein sequence. The genetic code is redundant: most amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence remains the same.
2. Nucleotides: a four-letter alphabet. Amino acids: a twenty-letter alphabet. Two random DNA sequences will share on average 25% identity. Two random protein sequences will share on average 5% identity.
3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between pairs of amino acids (PAM/BLOSUM).
4. DNA databases are much larger, meaning more random hits.
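
The 25% and 5% figures in point 2 follow from the alphabet sizes. A minimal sketch, assuming every letter is equally frequent (real base and amino acid compositions are not uniform, so observed values differ somewhat); the function name is illustrative only:

```python
def expected_identity(frequencies):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in frequencies)


# Assumption: uniform composition for both alphabets.
dna = [1 / 4] * 4
protein = [1 / 20] * 20

print(f"DNA:     {expected_identity(dna):.0%}")
print(f"Protein: {expected_identity(protein):.0%}")
```

Passing measured residue frequencies instead of the uniform lists gives the expected chance identity for a specific database composition.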

Using the amino acid sequence is preferable for homology search.
• Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.
• Evolutionarily distant proteins will exhibit high similarity rather than high identity.
• Hits can exhibit a long alignment (homology) or a short alignment (conserved domains).
Why use a nucleotide sequence after all?

Query type
• The sequence query can be a nucleotide sequence or an amino acid sequence. But … we can translate the query sequence!
• The search is performed against a nucleotide or amino acid database. But … we can use translated databases! (e.g., TrEMBL)
All types of searches are possible.
• A nucleotide query can be translated and searched against protein databases:
  • Translate all reading frames (3 + 3).
  • Find a long ORF.
• Can an amino acid query be back-translated and searched against nucleotide databases?
  • During translation we lose information.
  • A single amino acid sequence can be back-translated to many possible nucleotide sequences.
Query: DNA / Protein
Database: DNA / Protein
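
The "3 + 3" reading frames can be sketched as follows. This is an illustrative sketch, not the code BLAST uses: it assumes the standard genetic code and an upper- or lower-case DNA string, and all function names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on the forward strand and three on the reverse."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, peptide in six_frame_translation("ATGGCCTAA").items():
    print(name, peptide)
```

The longest stretch without a stop codon (`*`) in any of the six frames is the candidate ORF mentioned above.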

Query types
1. Amino acid query against a protein database (blastp)
• Identifying a protein sequence.
• Finding similar sequences in protein databases.
2. Nucleotide query against a nucleotide database (blastn)
• In non-coding regions (no ORF found): identify the query sequence or find similar sequences.
• Find primer binding sites or map short contiguous motifs.
3. Translated nucleotide query against a protein database (blastx)
• Useful when the query includes a coding region and we try to find homologous proteins.
• Used extensively in analyzing EST sequences. This search is more sensitive than nucleotide BLAST since the comparison is performed at the protein level.
4. Protein query against a translated nucleotide database (tblastn)
• Useful for finding protein homologs in unannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)).
5. Translated nucleotide query against a translated nucleotide database (tblastx)
• Useful for identifying novel genes in error-prone query sequences.
• Used for identifying potential proteins encoded by single-pass read ESTs.
* All translations are six-frame.
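
The five combinations above can be summarized as a small lookup. This is only a sketch of the decision table; the function and argument names are hypothetical and do not correspond to any BLAST API.

```python
def choose_blast_program(query, database, compare_as_protein=False):
    """Pick a BLAST program name from the query and database types.

    query, database: "protein" or "nucleotide".
    compare_as_protein: for nucleotide vs. nucleotide searches, translate
    both sides in six frames (tblastx) instead of comparing DNA directly.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if compare_as_protein else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return programs[(query, database)]
    except KeyError:
        raise ValueError(f"unknown sequence types: {query!r}, {database!r}")


print(choose_blast_program("nucleotide", "protein"))
```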

Searching databases
• Naïve solution: use an exact algorithm to compare each sequence in the database to the query sequence.
• Problems:
  • Databases are huge, with millions of sequences.
  • Running the computations in parallel is expensive.
• Solutions:
  • Use a heuristic algorithm to discard most irrelevant sequences.
  • Perform the alignment on the small group of remaining sequences.
• Key concept of BLAST (Basic Local Alignment Search Tool): homologous sequences are expected to contain short ungapped segments (with substitutions, but without gaps).
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

BLAST
• Preprocess
  • Low complexity regions are removed.
  • A dictionary of K-tuple words is prepared for the query sequence and the database. Protein: 3-letter words; DNA: 4-6 or even 11-letter words.
• Search for K-tuple words and find database records with common words. Words can be similar, not only identical.
  • Identity – CAT : CAT
  • Similarity – CAT : CAT, CAR, HAT …
  • But even CAT : ZTX can be similar
• For each three-letter word there are at most 20^3 (8,000) similar words.
• Similar words are only the ones that reach a minimum cut-off score (T).
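
The neighborhood of a query word can be sketched by scoring all 20^3 candidate words and keeping those that reach T. The scoring function below is a toy stand-in invented for this example (identical +5, same chemical group +2, otherwise -2); BLAST itself uses a substitution matrix such as BLOSUM62, and the groups and threshold here are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups, assumed for illustration only (not a real matrix).
SIMILAR_GROUPS = ["ILVM", "KR", "DE", "FYW", "ST", "NQ"]


def toy_score(a, b):
    """Score a residue pair: identical +5, same group +2, otherwise -2."""
    if a == b:
        return 5
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -2


def neighborhood(word, threshold):
    """Return every word of the same length scoring >= threshold against word."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


print(len(AMINO_ACIDS) ** 3)      # number of candidate three-letter words
print(neighborhood("CAT", 12))    # words similar enough to CAT under the toy scores
```

Lowering the threshold enlarges the neighborhood, which raises sensitivity at the cost of more word hits to extend.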

BLAST Stage I
• Find matching word pairs.
• Extend word pairs as much as possible, i.e., as long as the total weight increases.
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
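
The extension step can be sketched as an ungapped extension in both directions from a word hit. This sketch assumes a simple +1/-1 match/mismatch score and a drop-off parameter (`x_drop`) that stops the extension once the running score falls too far below the best score seen; the parameter values and names are illustrative, not BLAST defaults.

```python
def extend_seed(query, subject, q_start, s_start, word_len,
                match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps and return (score, query_seg, subject_seg)."""

    def score(a, b):
        return match if a == b else mismatch

    def extend(q_pos, s_pos, step):
        # Walk outward, remembering the best score and the length that gave it.
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q_pos += step
            s_pos += step
        return best, best_len

    seed_score = sum(
        score(query[q_start + i], subject[s_start + i]) for i in range(word_len)
    )
    right_score, right_len = extend(q_start + word_len, s_start + word_len, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)

    q_from, s_from = q_start - left_len, s_start - left_len
    length = left_len + word_len + right_len
    return (seed_score + left_score + right_score,
            query[q_from:q_from + length],
            subject[s_from:s_from + length])


query = "THEFIRSTLINIHAVEADREAMESIRPATRICKREAD"
subject = "INVIEIAMDEADMEATTNAMHEWASNINETEEN"
print(extend_seed(query, subject, query.find("EAD"), subject.find("EAD"), 3))
```

Here the shared word EAD from the slide's two sequences is used as the seed; the returned pair of segments is the resulting HSP under the toy scores.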

BLAST Stage II
• Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

BLAST: versions of the program
[t]BLAST[x/n/p]
t : Translate a DNA database in all 6 reading frames for comparison with a protein query.
x : Translate a nucleotide query in all 6 reading frames for comparison with a protein database.
p : Comparison is against a protein database.
n : Comparison is against a nucleotide database.

Masking low complexity
• There is one frequent case where the random models, and therefore the statistics discussed here, break down: regions with highly biased amino acid composition ("low complexity" regions).
• Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due to segment composition.
• Such regions are usually generated by slippage and are thus not interesting. The BLAST programs employ the SEG (protein) or DUST (DNA) algorithm to filter low complexity regions from the query before executing a database search.
• Masking is applied to the query sequence only, not to the database sequences!
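
The idea of compositional masking can be illustrated with a sliding-window Shannon entropy filter. This is a simplified stand-in, not the SEG or DUST algorithm; the window size and entropy cut-off are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with X (assumed parameters)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(mask_low_complexity("MKTAYIAKQRQQQQQQQQQQQQQQGSLVDEHACW"))
```

Windows dominated by one or two residue types have low entropy and are masked, while regions with a varied composition are left untouched.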

BLAST http://www.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/BLAST/ BLAST

BLAST

BLAST
• I can limit my search to a selected organism …
• … or even construct my own searchable database by an Entrez query.
• Filter low complexity regions by SEG or DUST.
• Mask for the lookup table hit search stage, but NOT for the hit extension stage.
• Mask according to the case (lower-case letters) within the query sequence.

BLAST
Lineage Report
root
. Bilateria [animals]
. . Coelomata [animals]
. . . Euteleostomi [vertebrates]
. . . . Tetrapoda [vertebrates]
. . . . . Eutheria [mammals]
. . . . . . Homo sapiens (man) ------------ 571 18 hits [mammals] retinoic acid induced 3; retinoic acid responsive gene [Hom
. . . . . . Mus musculus (mouse) .......... 432 15 hits [mammals] retinoic acid inducible protein 3 [Mus musculus]
. . . . . . Rattus norvegicus (brown rat) . 411 5 hits [mammals] similar to retinoic acid inducible protein 3 [Rattus norveg
. . . . . Xenopus laevis (clawed frog) ---- 216 1 hit [amphibians] MGC68729 protein [Xenopus laevis]
. . . . Takifugu rubripes (torafugu) ------ 40 4 hits [bony fishes] pheromone receptor [Takifugu rubripes]
. . . Drosophila melanogaster ------------- 48 4 hits [flies] CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815|B
. . . Drosophila virilis .................. 39 1 hit [flies] Bride of sevenless protein precursor >gi|1079166|pir||A4755
. . . Anopheles gambiae str. PEST ......... 38 1 hit [flies] ENSANGP00000013404 [Anopheles gambiae] >gi|21296536|gb|EAA0
. . Caenorhabditis elegans ---------------- 41 2 hits [nematodes] calcium-sensing receptor, similar to human metabotropic glu
. environmental sequence ------------------ 40 2 hits [unclassified] unknown [environmental sequence]

BLAST

The statistics of sequence similarity scores
• Bit score – a score for the alignment according to the number of similarities, identities, etc.
• Expect value (E-value) of an alignment having a score S: the number of times one expects to find alignments with a score >= S when searching a random sequence against a random database (having the same lengths and compositions). The closer the E-value is to zero, the greater the confidence that the match is real.
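
Bit scores and E-values are linked by the Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. A minimal sketch of that relation; note that BLAST reports E-values computed with effective (edge-corrected) lengths, so reported values will differ somewhat from this raw calculation, and the lengths used below are made-up numbers.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'), using raw rather than effective lengths."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 300-residue query against 10^8 database residues.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, 300, 100_000_000))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.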

BLAST
• What about:
  • Short sequences?
  • Large sequences and queries?

Short sequences:

PAM vs. BLOSUM – reminder
• Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from blocks in which sequences sharing at least 62% identity are clustered together). Larger n → smaller evolutionary distance.
• A single PAM matrix (PAM1) was constructed from a dataset of sequences with at least 85% identity. The other PAM matrices were computationally derived from it. Larger n → larger evolutionary distance.
Examples: BLOSUM62, PAM120, PAM250.
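
The "computationally derived" step for PAM matrices is matrix exponentiation: the PAM-n mutation probability matrix is PAM1 raised to the n-th power. A toy sketch with an invented two-letter alphabet and a made-up 1% change rate (the real PAM1 is a 20 × 20 matrix estimated from data):

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical PAM1 for a two-letter alphabet: 1% chance of change per unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 120, 250):
    pam_n = mat_power(TOY_PAM1, n)
    print(n, round(pam_n[0][0], 3))  # probability a residue is unchanged
```

The diagonal shrinks as n grows, which is why larger PAM numbers correspond to larger evolutionary distances.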

How to generate results from large sequences and queries?
1. Some sequences contain large regions of Alu repeats. In this case you can select the "Human Repeat" filtering option on the main BLAST search page. This will mask repeat regions, which generate a large number of biologically uninteresting hits to the databases.
2. Increase the Word Size to 20-25. Compared with a default Word Size of 7, this limits the number of small initial fragments to be extended to HSPs.
3. Decrease the Expect value to 1.0 or lower. This eliminates many hits and concentrates the results on those more likely to contain large coding regions and genomic fragments.
4. Processing multiple query sequences in one run can be much faster than processing them in separate runs because the database is scanned only once for the entire set of queries.

PSI-BLAST (Position-Specific Iterated BLAST)
Sensitive protein-protein similarity searches.
• The most sensitive BLAST program, making it useful for finding very distantly related proteins.
• Use PSI-BLAST when your standard protein-protein BLAST search fails to find significant hits.
Algorithm:
• The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from an alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (default = 0.005).
• The PSSM is used to evaluate the alignments in the next iteration of the search. Any new database hits below the inclusion threshold are included in the construction of the new PSSM.
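
The profile-building step can be sketched as a per-column log-odds matrix. This is a deliberately simplified illustration: it assumes ungapped, equal-length aligned hits, a uniform background and a fixed pseudocount, whereas PSI-BLAST itself weights sequences and uses more elaborate pseudocounts. All names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["CAT", "CAT", "CAS"]  # toy set of included first-round hits
profile = build_pssm(hits)
for word in ("CAT", "CAS", "WWW"):
    print(word, round(score_with_pssm(profile, word), 2))
```

Unlike a general substitution matrix, the profile scores the same residue differently at different positions, which is what lets later iterations pick up more distant family members.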
