
Exercise: BIOINFORMATIC DATABASES and BLAST

Presentation Transcript


  1. Exercise: BIOINFORMATIC DATABASES and BLAST

  2. Outline • NCBI and Entrez • PubMed • Google Scholar • RefSeq • Swissprot • Fasta format • PDB: Protein Data Bank • Organism specific databases • Summary • Pairwise Sequence Alignment and BLAST • Overview • Query type: DNA or Protein

  3. What’s in a database? • Sequences – genes, proteins, etc. • Full genomes • Annotation – information about genes/proteins: function, cellular location, chromosomal location, introns/exons, protein structure, phenotypes, diseases • Publications

  4. NCBI and Entrez • NCBI: National Center for Biotechnology Information • One of the largest and most comprehensive collections of databases, belonging to the NIH (National Institutes of Health) • The NIH is the primary Federal agency for conducting and supporting medical research in the USA • Entrez is the search engine of NCBI • Search for: genes, proteins, genomes, structures, diseases, publications and more • http://www.ncbi.nlm.nih.gov/

  5. PubMed: search for published papers • Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

  6. Use fields! • Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J Virol[TA] • For the full list of field tags, go to Help -> Search Field Descriptions and Tags
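
The fielded query on this slide can also be assembled programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint as the target (the slides themselves only use the web interface); the helper names `build_pubmed_query` and `build_esearch_url` are hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(terms):
    """Join (value, field_tag) pairs with AND, e.g. ("Yang", "AU") -> Yang[AU]."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def build_esearch_url(query, retmax=20):
    """Return a search URL for the PubMed database; no request is sent here."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The query from the slide: author, title word, publication date, journal.
query = build_pubmed_query(
    [("Yang", "AU"), ("glycoprotein", "TI"), ("2006", "DP"), ("J Virol", "TA")]
)
print(query)
print(build_esearch_url(query))
```

Keeping each term as a (value, tag) pair makes it easy to swap a tag or add a term without editing a long query string by hand.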

  7. Exercise • Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R

  8. Using limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years

  9. Google Scholar http://scholar.google.com/

  10. NCBI gene & protein databases: GenBank • GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations) • Holds 99 billion bases (2008)

  11. Searching NCBI for the protein human CD4 Search demonstration

  12. Using field descriptions, qualifiers, and boolean operators • Cd4[GENE] AND human[ORGN], or equivalently Cd4[gene name] AND human[organism] • List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers • Boolean operators: AND, OR, NOT • Note: do not use the field Protein name [PROT], only GENE!

  13. This time we directly search in the protein database

  14. RefSeq • RefSeq: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

  15. An explanation on GenBank records

  16. Swissprot • A protein sequence database which strives to provide a high level of annotation: the function of a protein, domain structure, post-translational modifications, variants • One entry for each protein

  17. GenBank vs. Swissprot [Figure panels: Swiss-Prot results; GenBank results]

  18. Fasta format • Header line: begins with ">" and holds the ID/accession followed by a description • Sequence: everything after the header line • Example header: > gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens] • Example sequence: MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI • Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_000607.1
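
To make the header/sequence split concrete, here is a minimal sketch of a Fasta reader in plain Python. The function names `parse_fasta` and `split_header` are hypothetical, the example record is the CD4 entry from the slide truncated to its first 120 residues, and pulling the accession out of the fourth `|`-separated field assumes the `gi|...|ref|...|` header layout shown on the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split a header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip()


# First 120 residues of the slide's CD4 record, wrapped at 60 per line.
example = """> gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

header, sequence = parse_fasta(example)[0]
identifier, description = split_header(header)
# Assumption: the accession is the 4th field of a gi|...|ref|...| identifier.
accession = identifier.split("|")[3]
print(accession, description, len(sequence))
```

Sequence lines are joined without separators, so a record wrapped over many lines comes back as one continuous string.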

  19. Downloading

  20. PDB: Protein Data Bank • Main database of 3D structures • Includes ~56,000 entries (proteins, nucleic acids, others) • Proteins organized in groups, families etc • Is highly redundant • different conformations (e.g., ligand dependent) • http://www.rcsb.org

  21. Human CD4 in complex with HIV gp120 (PDB ID 1G9M) [Figure: structure with gp120 and CD4 labeled]

  22. Organism specific databases • Model organisms have independent databases • HIV database: http://hiv-web.lanl.gov/content/index • http://gmod.org/wiki/Main_Page?q=node/71

  23. Summary • General and comprehensive databases: NCBI, EMBL, DDBJ • Genome specific databases: ENSEMBL, UCSC genome browser • Highly annotated databases: proteins (Swissprot, RefSeq); structures (PDB)

  24. And always remember: • Google (or any search engine) • RTFM - read the manual! (Help/FAQ)

  25. Pairwise Sequence Alignment and BLAST

  26. What is sequence alignment? • Alignment: comparing two (pairwise) or more (multiple) sequences • Searching for a series of identical or similar characters in the sequences
      MVNLTSDEKTAVLALWNKVDVEDCGGE
      || ||  ||||| ||| || |   |||
      MVHLTPEEKTAVNALWGKVNVDAVGGE

  27. Local vs. Global • Global alignment – finds the best alignment across the full length of both sequences • Local alignment – finds regions of high similarity in parts of the sequences
      Global:
      ADLGAVFALCDRYFQ
      ||||     |||| |
      ADLGRTQN-CDRYYQ
      Local:
      ADLG   CDRYFQ
      ||||   |||| |
      ADLG   CDRYYQ
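
The difference between the two modes can be sketched with the classic dynamic-programming recurrences, Needleman-Wunsch for global and Smith-Waterman for local alignment. The slides do not name these algorithms, so treat them as an illustrative choice; the scores (match +1, mismatch -1, gap -1) are likewise assumptions picked for simplicity, and only the best score is returned, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a with all of b (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart instead of going negative.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two sequences from the slide, with the gap character removed.
seq_a = "ADLGAVFALCDRYFQ"
seq_b = "ADLGRTQNCDRYYQ"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

The only structural difference is the floor at zero in the local version: a poorly matching stretch cannot drag down the score of a well-matching region next to it.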

  28. Evolutionary changes in sequences • In the course of evolution, sequences diverge from the ancestral sequence through random mutations • Three types of mutations: • Insertion: AAGA -> AAGTA • Deletion: AAGA -> AGA • Substitution: AAGA -> AACA • Insertion + Deletion = Indel

  29. Scoring scheme • Match/mismatch scores: substitution matrices • Nucleic acids: transition-transversion • Amino acids: evolution-based, from empirical data (PAM, BLOSUM), or physico-chemical properties based (Grantham, McLachlan) • Gap penalty
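
As a small illustration of a transition-transversion scheme for nucleic acids, the sketch below scores two already-aligned DNA strings. The numeric values (match +1, transition -1, transversion -2, gap -2) are placeholder assumptions rather than values from the slides, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of bases (illustrative values only)."""
    if x == y:
        return match
    # Transition: purine<->purine or pyrimidine<->pyrimidine substitution.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(a, b, gap=-2):
    """Sum the scores of two aligned, equal-length strings ('-' is a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        total += gap if "-" in (x, y) else nucleotide_score(x, y)
    return total


print(score_alignment("AAGA", "AACA"))  # one G/C transversion
print(score_alignment("AAGA", "AAAA"))  # one G/A transition
print(score_alignment("AAGA", "A-GA"))  # one gap
```

Protein scoring follows the same pattern, except that the per-pair score is looked up in a substitution matrix such as PAM or BLOSUM instead of being computed from two base classes.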

  30. Computation time: How do we search a database? • If each pairwise alignment takes 1/10 of a second and the database contains 10^7 sequences, one search takes 10^6 seconds, or about 11.6 days • At least 150,000 searches are performed per day • GenBank holds more than 82,000,000 sequence records
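
The back-of-the-envelope estimate on this slide is easy to reproduce. The sketch below simply multiplies the number of database sequences by the time per alignment and converts to days; the function name `full_scan_days` is hypothetical, and the estimate assumes alignments run one after another with no parallelism.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def full_scan_days(n_sequences, seconds_per_alignment):
    """Days needed to align one query against every database sequence."""
    return n_sequences * seconds_per_alignment / SECONDS_PER_DAY


# Slide figures: 10^7 sequences at 0.1 s each.
days = full_scan_days(10**7, 0.1)
print(f"{days:.1f} days for a single search")
```

With the slide's figures this comes to roughly 11.6 days per query, which is what motivates replacing the exhaustive scan with a faster approximate method.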

  31. Conclusion • Running the exact pairwise alignment algorithm between the query and every database entry is too slow

  32. Heuristic • Definition: a heuristic is a method for solving a problem that does not guarantee the exact solution, but gives a reasonably good one in much less time than the exact algorithm

  33. BLAST • BLAST: Basic Local Alignment Search Tool • A heuristic for searching a database for similar sequences • The heuristic is based on restricting the kind of similarity that is searched for (such as matching short ungapped words instead of single characters)
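
The word-matching idea can be sketched in a few lines: index every length-k word of the database sequence, look up each query word, and extend the resulting hits without gaps. This is a deliberately simplified illustration under stated assumptions (exact word matches only, extension stops at the first mismatch, k = 3); the real BLAST implementation is considerably more elaborate, and the function names here are hypothetical.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every length-k word of the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, k):
    """Grow a seed left and right while letters stay identical (no gaps).

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = 0
    while (q + k + right < len(query) and s + k + right < len(subject)
           and query[q + k + right] == subject[s + k + right]):
        right += 1
    return q - left, s - left, k + left + right


query = "MVNLTSDEKTAVLALW"
subject = "MVHLTPEEKTAVNALW"
for q, s in find_seeds(query, subject):
    start_q, start_s, length = extend_seed(query, subject, q, s, 3)
    print(query[start_q:start_q + length], "at", (start_q, start_s))
```

The saving comes from the index lookup: only positions that already share a word are examined further, instead of aligning the query against every position of every database entry.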

  34. Query type: DNA or Protein • All types of searches are possible • Query: DNA or protein • Database: DNA or protein • blastn – nucleotide query vs. nucleotide database • blastp – protein query vs. protein database • blastx – translated query vs. protein database • tblastn – protein query vs. translated nucleotide database • tblastx – translated query vs. translated database
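
The program table on this slide amounts to a lookup on the query type and the database type. A minimal sketch follows; the function name `choose_blast_program` and the `compare_translations` flag are hypothetical, introduced only to separate blastn from tblastx when both sides are DNA.

```python
def choose_blast_program(query_type, database_type, compare_translations=False):
    """Pick a BLAST program name from the query and database types.

    Types are "dna" or "protein". compare_translations only matters for
    DNA vs. DNA, where it selects tblastx instead of blastn.
    """
    table = {
        ("dna", "dna"): "tblastx" if compare_translations else "blastn",
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",   # query is translated
        ("protein", "dna"): "tblastn",  # database is translated
    }
    key = (query_type.lower(), database_type.lower())
    if key not in table:
        raise ValueError(f"unknown combination: {key}")
    return table[key]


print(choose_blast_program("DNA", "protein"))
print(choose_blast_program("protein", "DNA"))
```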

  35. Query type • Information content in the letters: nucleotides have a 4 letter alphabet, amino acids a 20 letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity • Selection (and hence conservation) works (mostly) at the protein level • The amino-acid sequence is often preferable for homology search
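
The 25% and 5% figures follow from the alphabet sizes if every letter is assumed to be equally likely, which is a simplification since real sequences have uneven composition. The sketch below computes the expectation and checks it with a quick seeded simulation; the function names are hypothetical.

```python
import random

DNA = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def expected_identity(alphabet_size):
    """Chance that two independent, uniformly random letters are equal."""
    return 1.0 / alphabet_size


def simulated_identity(alphabet, length=100_000, seed=0):
    """Fraction of identical positions between two random sequences."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    a = rng.choices(alphabet, k=length)
    b = rng.choices(alphabet, k=length)
    return sum(x == y for x, y in zip(a, b)) / length


for name, alphabet in (("DNA", DNA), ("protein", AMINO_ACIDS)):
    print(name, expected_identity(len(alphabet)), simulated_identity(alphabet))
```

Because chance identity is so much lower for proteins, a given level of identity between two protein sequences is harder to explain by chance than the same level between two DNA sequences.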

  36. E-value • The expected number of alignments with a score ≥ Y obtained when searching a random sequence against a random database • Theoretically, we could trust any result with an E-value ≤ 1 • In practice, BLAST uses estimations: • E-values of 10^-4 and lower indicate a significant homology • E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous) • E-values between 10^-2 and 1 do not indicate a good homology
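
The rule-of-thumb thresholds on this slide can be written as a small helper for triaging hits. This is only a sketch of the slide's guidance: the function name `interpret_evalue` is hypothetical, and the handling of the exact boundary values and of E-values above 1 is an assumption.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the rule-of-thumb thresholds."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    # Assumption: everything above 10^-2, including values above 1,
    # is treated as weak evidence.
    return "not a good indication of homology"


for e in (3e-50, 5e-3, 0.4):
    print(e, "->", interpret_evalue(e))
```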

  37. Filtering low complexity • Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA) • Regions of low complexity generate high alignment scores, BUT this does not indicate homology
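
One simple way to picture low-complexity filtering is to slide a window along the sequence and mask windows whose letter composition is nearly uniform, measured by Shannon entropy. This is a simplified stand-in and not the filter BLAST itself applies; the window length of 12, the threshold of 2.2 bits, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a string."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(sequence, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Hypothetical protein fragment with a proline-rich stretch in the middle.
protein = "MKTAYIAKQR" + "PPPPPPPPPPPP" + "QISFVKSHFS"
print(mask_low_complexity(protein))
```

Residues flanking the repeat can also be masked, because any window that overlaps the proline run heavily still has low entropy; a stricter threshold or shorter window narrows the masked region.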

  38. BLAST 2 sequences at NCBI • Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine • Does not use an optimal algorithm but a heuristic

  39. Back to NCBI

  40. BLAST – bl2seq

  41. Bl2Seq - query • blastn – nucleotide • blastp – protein

  42. Bl2seq results

  43. Bl2seq results [Figure labels: dissimilarity, low complexity, gaps, similarity, match]

  44. BLAST – programs • Query: DNA or protein • Database: DNA or protein

  45. BLAST – Blastp

  46. Blastp - results
