Exercise: BIOINFORMATIC DATABASES and BLAST

Outline • NCBI and Entrez • Pubmed • Google scholar • RefSeq • Swissprot • Fasta format • PDB: Protein Data Bank • Organism specific databases • Summary • Pairwise Sequence Alignment and BLAST • Overview • Query type: DNA or Protein

What’s in a database?
• Sequences – genes, proteins, etc.
• Full genomes
• Annotation – information about genes/proteins:
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - protein structure
  - phenotypes, diseases
• Publications

NCBI and Entrez
NCBI: National Center for Biotechnology Information
• One of the largest and most comprehensive collections of databases; it belongs to the NIH (National Institutes of Health)
• The NIH is the primary Federal agency for conducting and supporting medical research in the USA
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications and more
• http://www.ncbi.nlm.nih.gov/

PubMed: search for published papers • Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
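A fielded query like the one above can also be assembled programmatically. The following is a minimal sketch; the helper name `build_pubmed_query` is hypothetical, and only the field tags shown on the slide are used.

```python
def build_pubmed_query(terms):
    """Join (value, field_tag) pairs into one query string using AND."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


query = build_pubmed_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # title word
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
```

The resulting string can be pasted into the PubMed search box as is.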

Exercise • Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R

Using limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years

Google scholar http://scholar.google.com/

NCBI gene & protein databases: GenBank • GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations) • Holds 99 billion bases (2008)

Searching NCBI for the protein human CD4 Search demonstration

Using field descriptions, qualifiers, and boolean operators
• Cd4[GENE] AND human[ORGN], or equivalently: Cd4[gene name] AND human[organism]
• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
• Boolean operators: AND, OR, NOT
Note: do not use the field Protein name [PROT], only GENE!

This time we directly search in the protein database

RefSeq • RefSeq: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

An explanation on GenBank records

Swissprot
• A protein sequence database which strives to provide a high level of annotation:
  - the function of a protein
  - domain structure
  - post-translational modifications
  - variants
• One entry for each protein

GenBank vs. Swissprot (screenshots: Swiss-Prot results, GenBank results)

Fasta format
The header line starts with ">" and holds the ID/accession followed by a description; the sequence follows on the next line(s):
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_000607.1
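To make the header/sequence split concrete, here is a minimal FASTA reader, written as a sketch rather than a full parser. The function names are hypothetical, the header splitter assumes the "gi|number|db|accession|" layout shown above, and the example record is truncated to the first 20 residues.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a 'gi|number|db|accession| description' header into its parts."""
    ident, _, description = header.partition(" ")
    fields = ident.strip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("header does not follow the gi|...|db|accession| layout")
    return {"gi": fields[1], "db": fields[2],
            "accession": fields[3], "description": description}


example = (">gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]\n"
           "MNRGVPFRHL\n"
           "LLVLQLALLP\n")
records = parse_fasta(example)
info = split_ncbi_header(records[0][0])
print(info["accession"], len(records[0][1]))
```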

Downloading

PDB: Protein Data Bank • Main database of 3D structures • Includes ~56,000 entries (proteins, nucleic acids, others) • Proteins organized in groups, families etc • Is highly redundant • different conformations (e.g., ligand dependent) • http://www.rcsb.org

Human CD4 in complex with HIV gp120, PDB ID 1G9M (figure labels: gp120, CD4)

Organism specific databases • Model organisms have independent databases: HIV database http://hiv-web.lanl.gov/content/index http://gmod.org/wiki/Main_Page?q=node/71

Summary • General and comprehensive databases: • NCBI, EMBL, DDBJ • Genome specific databases: • ENSEMBL, UCSC genome browser • Highly annotated databases: • Proteins: • Swissprot, RefSeq • Structures: • PDB

And always remember: • Google (or any search engine) • RTFM -Read the manual!!! (/help/FAQ)

Pairwise Sequence Alignment and BLAST

What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
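The match line and the percent identity of an alignment like the one above can be computed directly. This is a small sketch with hypothetical function names; it assumes the two sequences are already aligned and of equal length.

```python
def match_line(a, b):
    """Return a string with '|' where aligned characters are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of alignment columns that hold identical characters."""
    return 100.0 * match_line(a, b).count("|") / len(a)


seq1 = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq2 = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(seq1)
print(match_line(seq1, seq2))
print(seq2)
print(round(percent_identity(seq1, seq2), 1))
```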

Local vs. Global
• Global alignment – finds the best alignment across the whole of the two sequences.
• Local alignment – finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
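The difference between the two modes shows up clearly in the dynamic-programming recurrence. The sketch below computes only the optimal score (not the alignment itself) in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not values taken from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal global (default) or local alignment score with a linear gap penalty."""
    # Row 0: a global alignment pays for leading gaps, a local one starts free.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)    # a local alignment may restart anywhere
                best = max(best, score)  # and may end anywhere
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


a, b = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(a, b))
print("local: ", alignment_score(a, b, local=True))
```

With a negative gap penalty, the local score is never lower than the global score for the same pair, because a local alignment is free to drop poorly matching ends.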

Evolutionary changes in sequences
In the course of evolution, sequences diverge from the ancestral sequence through random mutations. Three types of mutations:
• Insertion: AAGA → AAGTA
• Deletion: AAGA → AGA
• Substitution: AAGA → AACA
Insertion + Deletion = Indel

Scoring scheme • Match/mismatch scores: substitution matrices • Nucleic acids: • Transition-transversion • Amino acids: • Evolution (empirical data) based: (PAM, BLOSUM) • Physico-chemical properties based (Grantham, McLachlan) • Gap penalty
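As an illustration of a transition-transversion scheme for nucleic acids, the sketch below scores a pair of nucleotides. A transition stays within purines (A, G) or within pyrimidines (C, T); a transversion crosses between the two groups. The numeric scores are assumptions for illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair under a transition/transversion scheme."""
    x, y = x.upper(), y.upper()
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected nucleotides A, C, G or T")
    if x == y:
        return match
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


print(nucleotide_score("A", "G"))  # transition (purine to purine)
print(nucleotide_score("G", "C"))  # transversion (purine to pyrimidine)
```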

Computation time: how do we search a database?
If each pairwise alignment takes 1/10 of a second, and the database contains 10^7 sequences, one search will take 10^6 seconds ≈ 11.6 days to complete.
150,000 searches (at least!!) are performed per day.
>82,000,000 sequence records in GenBank.
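The estimate can be checked with a few lines of arithmetic, using only the figures given on the slide (0.1 s per alignment, 10^7 database sequences).

```python
alignments_per_second = 10        # one alignment takes 1/10 of a second
database_size = 10**7             # sequences in the database
seconds_per_day = 24 * 60 * 60

search_seconds = database_size / alignments_per_second
search_days = search_seconds / seconds_per_day
print(search_seconds, round(search_days, 1))
```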

Conclusion: running the exact pairwise alignment algorithm between the query and every DB entry is too slow

Heuristic
Definition: a heuristic is a method for solving a problem that does not guarantee an exact (optimal) solution, but gives a reasonably good one while taking much less time than the exact algorithm

BLAST
BLAST - Basic Local Alignment Search Tool
A heuristic for searching a database for similar sequences. The heuristic is based on restricting the similarity search (for example, matching ungapped words instead of single characters).
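The word-matching idea can be sketched as follows: index every word of length k in the query, then scan the database sequence for identical words. This is a simplified illustration with hypothetical function names, not BLAST's actual implementation; real BLAST also considers similar (not only identical) words for proteins and then extends each hit into a longer alignment.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an identical k-word occurs."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return seeds


print(find_seeds("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ", k=3))
```

Only the neighbourhoods of these seed positions need a detailed alignment, which is where the time saving comes from.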

Query type: DNA or Protein
All types of searches are possible (query: DNA or protein; database: DNA or protein):
• blastn – nucleotide query vs. nucleotide database
• blastp – protein query vs. protein database
• blastx – translated nucleotide query vs. protein database
• tblastn – protein query vs. translated nucleotide database
• tblastx – translated nucleotide query vs. translated nucleotide database
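The choice of program follows mechanically from the query type and the database type. The lookup below is a sketch; the function name and the `translated` flag (which distinguishes blastn from tblastx when both sides are DNA) are hypothetical conveniences.

```python
def choose_blast_program(query, database, translated=False):
    """Pick a BLAST program from the query/database types ('dna' or 'protein')."""
    programs = {
        ("dna", "dna", False): "blastn",
        ("dna", "dna", True): "tblastx",
        ("protein", "protein", False): "blastp",
        ("dna", "protein", False): "blastx",
        ("protein", "dna", False): "tblastn",
    }
    try:
        return programs[(query, database, translated)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query!r} vs. {database!r}") from None


print(choose_blast_program("dna", "protein"))
```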

Query type
• Information content in the letters:
  - Nucleotides: 4 letter alphabet
  - Amino acids: 20 letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity
• Selection (and hence conservation) works (mostly) at the protein level
The amino-acid sequence is often preferable for homology search
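The 25% and 5% figures follow if letters are assumed to be independent and equally frequent: the chance that two random positions carry the same letter is the sum of the squared letter frequencies, which is 1/n for a uniform n-letter alphabet. The sketch below makes that assumption explicit; real nucleotide and amino-acid frequencies are not uniform, so actual values differ somewhat.

```python
from fractions import Fraction


def expected_identity(frequencies):
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies)


dna = [Fraction(1, 4)] * 4        # uniform 4-letter alphabet (assumption)
protein = [Fraction(1, 20)] * 20  # uniform 20-letter alphabet (assumption)
print(float(expected_identity(dna)), float(expected_identity(protein)))
```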

E-value
The number of alignments with a score ≥ Y that we expect to find by chance when searching a random sequence against a random database.
Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations:
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.
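These rules of thumb can be written down as a small classifier. The thresholds are the ones from the slide; the function name and the wording of the returned labels are hypothetical, and values above 1 are simply grouped with the last category.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the rule-of-thumb thresholds above."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually"
    return "no good evidence of homology"


for e in (3e-50, 5e-3, 0.4):
    print(e, "->", interpret_evalue(e))
```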

Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high scores of alignment, BUT – this does not indicate homology
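One simple way to see what "low complexity" means is to measure the Shannon entropy of the letters in a region: a proline-rich stretch uses few distinct letters and scores close to zero. This is an illustrative sketch only; it is not the filter BLAST itself applies, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(region):
    """Shannon entropy (bits per letter) of the letter composition of a region."""
    if not region:
        raise ValueError("region must not be empty")
    counts = Counter(region)
    total = len(region)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


print(shannon_entropy("PPPPPPPPPP"))  # a single repeated letter
print(shannon_entropy("ACGT"))        # four equally frequent letters
```

Regions whose entropy falls below some chosen cutoff could then be masked before searching.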

BLAST 2 sequences at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment
• Does not use an optimal algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2Seq - query • blastn – nucleotide • blastp – protein

Bl2seq results

Bl2seq results (figure labels: dissimilarity, low complexity, gaps, similarity, match)

BLAST – programs Query: DNA Protein Database: DNA Protein

BLAST – Blastp

Blastp - results
