2010-2011
Bioinformatics
Lecture 2: Databases
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
Main Lines
• Different database types
• Types of data within databases
• The FASTA format
• The Genbank format of the EMBL
• Gene file format
• Protein databank (PDB) format
• Literature databases
• Create a local database
Different database types
• Primary databases store the raw data that come directly from experiments, e.g. GenBank, the European Molecular Biology Laboratory (EMBL) database, the DNA Data Bank of Japan (DDBJ) and the Protein Data Bank (PDB).
• Secondary databases contain computationally processed or manually curated information derived from the original data in primary databases, e.g. Swiss-Prot and the Protein Information Resource (PIR).
• Tertiary (specialized) databases provide the most sophisticated additional information around the raw data and cater to a particular research interest. E.g. FlyBase, the HIV Sequence Database and the Ribosomal Database Project specialize in a particular organism or a particular type of data.
Different database types
• For example, if the raw data are a genome sequence, a tertiary database might provide not only the location of genes and the encoded amino-acid sequences of the corresponding proteins, but also tell the user in which tissue types the genes are expressed. A tertiary database may combine data from several underlying primary or secondary databases.
Types of data within databases
• DNA sequences
• RNA sequences
• RNA secondary structures
• Genes
• Protein structures
• Expression array data (i.e. which genes are expressed and when)
• Metabolic pathways (i.e. protein interaction networks)
• Haplotypes
• Literature
Primary DNA Databases
• GenBank: National Center for Biotechnology Information (NCBI), USA
  http://www.ncbi.nlm.nih.gov/Genbank
• EMBL: European Bioinformatics Institute, UK
  http://www.ebi.ac.uk/embl
• DDBJ (DNA Data Bank of Japan): National Institute of Genetics, Japan
  http://www.ddbj.nig.ac.jp
The FASTA format
A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The example below is a protein of 338 amino acids (aa):
>FOSB_MOUSE Protein fosB. 338 aa
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
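To make the layout concrete, here is a minimal Python sketch of reading FASTA records. The function name `parse_fasta` is hypothetical, and the example uses only the first two sequence lines of the record shown on the slide; it is an illustration under those assumptions, not a replacement for an established parser.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


# Shortened excerpt of the slide's record (first two sequence lines only).
example = """>FOSB_MOUSE Protein fosB.
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
"""

records = list(parse_fasta(example.splitlines()))
name, seq = records[0]
print(name.split()[0], len(seq))
```

The first word of the header is commonly used as the record identifier; the rest is free-text description.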
The Genbank format of the EMBL
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
The Genbank format of the EMBL
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaaccaaatatggattttattgtagccatatttgctctgtttgttattagctcatt 60
     cacaattacttccacaaatgcagttgaagcttctactcttcttgacataggtaacctgag 120
     tcggagcagttttcctcgtggcttcatctttggtgctggatcttcagcataccaatttga 180
     aggtgcagtaaacgaaggcggtagaggaccaagtatttgggataccttcacccataaata 240
     tccagaaaaaataagggatggaagcaatgcagacatcacggttgaccaatatcaccgcta 300
     caaggaagatgttgggattatgaaggatcaaaatatggattcgtatagattctcaatctc 360
     ~ ~ ~ ~ ~ ~ ~
     tggattaaaaaggtaccctaagctttctgcccaatggtacaagaactttctcaaaagaaa 1560
     ctagctagtattattaaaagaactttgtagtagattacagtacatcgtttgaagttgagt 1620
     tggtgcacctaattaaataaaagaggttactcttaacatatttttaggccattcgttgtg 1680
     aagttgttaggctgttatttctattatactatgttgtagtaataagtgcattgttgtacc 1740
     agaagctatgatcataactataggttgatccttcatgtatcagtttgatgttgagaatac 1800
     tttgaattaaaagtctttttttatttttttaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 1859
The Genbank format of the EMBL
• The ID line:
  ID TRBG361 standard; mRNA; PLN; 1859 BP.
  The ID line is always the first line of a sequence entry. It gives the name of the entry and the length of the sequence in base pairs.
• The XX line:
  XX
  An XX line is an empty spacer line, inserted to make the entry easier to read.
• The AC line:
  AC X56734; S46826;
  An AC line contains the ACcession number(s) of the sequence entry.
• The SV line:
  SV X56734.1
  An SV line contains the Sequence Version of the sequence entry.
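As an illustration of how these header lines can be read by a program, the following Python sketch splits the ID and AC lines from the example entry. The function names and the dictionary keys (`data_class`, `division`, and so on) are labels chosen here for illustration; the sketch assumes the ID line layout shown on the slide (name and class, then molecule type, division and length, separated by semicolons) and would need adjusting for other layouts.

```python
def parse_id_line(line: str) -> dict:
    """Split an ID line of the form shown on the slide into named fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    body = line[2:].strip().rstrip(".")
    # Expected: "<name> <class>; <molecule>; <division>; <length> BP"
    first, molecule, division, length = [p.strip() for p in body.split(";")]
    name, data_class = first.split()
    return {
        "name": name,
        "data_class": data_class,
        "molecule": molecule,
        "division": division,
        "length_bp": int(length.split()[0]),
    }


def parse_ac_line(line: str) -> list:
    """Return the accession numbers listed on an AC line."""
    if not line.startswith("AC"):
        raise ValueError("not an AC line")
    return [acc.strip() for acc in line[2:].split(";") if acc.strip()]


id_info = parse_id_line("ID   TRBG361    standard; mRNA; PLN; 1859 BP.")
accessions = parse_ac_line("AC   X56734; S46826;")
print(id_info["name"], id_info["length_bp"], accessions)
```

The first accession number on the AC line is conventionally treated as the primary one.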
The Genbank format of the EMBL
• The DT line:
  DT 12-SEP-1991 (Rel. 29, Created)
  A DT line contains the DaTe on which the sequence entry was created or updated.
• The DE line:
  DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
  Each DE line contains a DEscription of the sequence entry. The DE line is free-format text.
• The KW line:
  KW beta-glucosidase.
  Lines starting with KW contain KeyWords, which are used to generate cross-reference indices of the sequence entries.
• The OS line:
  OS Trifolium repens (white clover)
  An OS line specifies the Organism Species from which the sequence was derived.
The Genbank format of the EMBL
• The OC line:
  OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
  An OC (Organism Classification) line contains the taxonomic classification of the organism from which the sequence was derived.
• The reference (RN, RP, RX, RA, RT, RL) lines:
  RN [5]
  RP 1-1859
  RX MEDLINE; 91322517.
  RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
  RT "Nucleotide and derived amino acid sequence of the cyanogenic
  RL Plant Mol. Biol. 17(2):209-219(1991).
  Each such block of lines contains one reference to the original literature. Within each reference block, the lines that are present always appear in the order RN, RC, RP, RX, RG, RA, RT, RL.
• The DR line:
  DR GOA; P26204.
  A DR line contains a Database cross-Reference to another database.
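Because every line starts with a two-letter line code, an entry can be read by grouping lines on that code. The sketch below is one simple way to do this in Python; `group_by_line_code` is a hypothetical helper, and the input is a short excerpt of the example entry rather than a complete record.

```python
from collections import defaultdict


def group_by_line_code(entry_text: str) -> dict:
    """Collect the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[2:].strip()
        # Skip XX spacer lines and lines without a line code (sequence data).
        if code == "XX" or not code.strip():
            continue
        grouped[code].append(content)
    return dict(grouped)


excerpt = """OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
DR   GOA; P26204.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
"""

grouped = group_by_line_code(excerpt)

# The OC lines continue one another, so join them before splitting on ";".
oc_text = " ".join(grouped["OC"]).rstrip(".")
taxonomy = [t.strip() for t in oc_text.split(";") if t.strip()]

# The first field of each DR line names the cross-referenced database.
cross_refs = [dr.rstrip(".").split(";")[0].strip() for dr in grouped["DR"]]
print(taxonomy[0], taxonomy[-1], cross_refs)
```

Grouping by line code loses the order of lines across different codes, so a reference block (RN to RL) would need a slightly richer structure than this dictionary.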
The Genbank format of the EMBL
• The FH line:
  FH Key Location/Qualifiers
  The FH (Feature Header) lines are present only to improve the readability of a sequence entry.
• The FT lines:
  FT source 1..1859
  FT /db_xref="taxon:3899"
  ...
  FT CDS 14..1495
  FT /db_xref="GOA:P26204"
  ...
  FT IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
  FT mRNA 1..1859
  FT /evidence=EXPERIMENTAL
  The set of FT (Feature Table) lines provides different types of annotation for the sequence of an entry. Each feature has a key (e.g. source, CDS, mRNA), a location on the sequence and optional qualifiers that start with "/".
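The feature keys and locations can be pulled out of the FT lines with a few lines of Python. This is only a sketch: `parse_feature_keys` is a hypothetical name, and it handles just the simple `start..end` locations that occur in the example entry, not compound locations such as joins or complements.

```python
import re

# Matches simple locations such as "14..1495".
SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_feature_keys(ft_lines):
    """Return (key, start, end) for FT lines that open a new feature."""
    features = []
    for line in ft_lines:
        if not line.startswith("FT"):
            continue
        parts = line[2:].split()
        # A feature line has exactly a key and a location; qualifier lines
        # start with "/" and continuation lines have a single token.
        if len(parts) == 2 and not parts[0].startswith("/"):
            match = SIMPLE_LOCATION.match(parts[1])
            if match:
                features.append((parts[0], int(match.group(1)), int(match.group(2))))
    return features


ft_excerpt = [
    'FT   source          1..1859',
    'FT                   /db_xref="taxon:3899"',
    'FT                   /clone_lib="lambda gt10"',
    'FT   CDS             14..1495',
    'FT                   /product="beta-glucosidase"',
    'FT   mRNA            1..1859',
    'FT                   /evidence=EXPERIMENTAL',
]

features = parse_feature_keys(ft_excerpt)
for key, start, end in features:
    print(key, start, end, end - start + 1)

# Locations are 1-based and inclusive, so the CDS spans end - start + 1 bases.
cds_length = [end - start + 1 for key, start, end in features if key == "CDS"][0]
```

For the example entry the CDS length is a multiple of three, as expected for a complete coding region that includes its stop codon.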
The Genbank format of the EMBL
• The SQ line:
  SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
  The SQ (SeQuence header) line comes before the lines with the sequence data and summarizes the sequence: its total length and the number of each base.
• The sequence data lines:
  aaacaaaccaaatatggattttattgtagccatatttgctctgtttgttattagctcatt 60
  ...
  tttgaattaaaagtctttttttatttttttaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 1859
  Lines containing the actual sequence data have two blank spaces in place of a line code. The sequence is always given from 5' to 3', in chunks of 10 bases with up to 60 bases per line. The number at the end of each line is the position of the last base on that line.
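The SQ header and the sequence lines can be checked against each other. The Python sketch below, with hypothetical helper names, parses the base counts from the SQ line of the example entry and strips the trailing position number from a sequence line. It uses only the first sequence line, so it does not validate the whole entry.

```python
import re


def parse_sq_header(line: str):
    """Return (total length, per-base counts) from an SQ header line."""
    total = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    counts = {
        base: int(number)
        for number, base in re.findall(r"(\d+)\s+(A|C|G|T|other);", line)
    }
    return total, counts


def clean_sequence_line(line: str) -> str:
    """Drop spacing and the trailing position number from a sequence line."""
    tokens = line.split()
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]
    return "".join(tokens)


sq_line = "SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;"
total, counts = parse_sq_header(sq_line)
# The per-base counts should add up to the stated sequence length.
print(total, sum(counts.values()))

first_line = "     aaacaaaccaaatatggattttattgtagccatatttgctctgtttgttattagctcatt 60"
bases = clean_sequence_line(first_line)
# Positions are 1-based: the CDS feature starts at base 14, i.e. index 13.
print(len(bases), bases[13:16])
```

Splitting on whitespace also handles sequence lines written in space-separated chunks of 10 bases.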
Gene file format
http://www.ensembl.org/info/data/ftp/index.html
Principles of genome annotation
• Once sequencing has been completed, the genome sequence is deposited in a database.
• One of the first annotation tasks is usually to find all protein-coding genes within a newly sequenced genome.
• Another task is to locate RNA-encoding genes (as opposed to protein-coding genes, RNA-encoding genes are only transcribed, not translated).
• Because the groups annotating a genome are not necessarily in the same place, they need to exchange their annotation in a common format that all of them understand.
Examples of the GTF format
A protein-coding gene consisting of three exons that falls on the reverse strand would be described in GTF format as follows. The columns are sequence name, source, feature, start, end, score, strand, frame and attributes:
Chr22 src exon        649 700 . - . gene_id 1; transcript_id 1; exon_number 1
Chr22 src CDS         649 700 . - ? gene_id 1; transcript_id 1; exon_number 1
Chr22 src exon        351 500 . - . gene_id 1; transcript_id 1; exon_number 2
Chr22 src CDS         351 500 . - ? gene_id 1; transcript_id 1; exon_number 2
Chr22 src exon        150 250 . - . gene_id 1; transcript_id 1; exon_number 3
Chr22 src CDS         153 250 . - ? gene_id 1; transcript_id 1; exon_number 3
Chr22 src start_codon 698 700 . - 0 gene_id 1; transcript_id 1; exon_number 1
Chr22 src stop_codon  150 152 . - 0 gene_id 1; transcript_id 1; exon_number 3
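A line in this layout can be split into its nine columns with a short Python sketch. The record type `GtfRecord` and the function `parse_gtf_line` are hypothetical names. GTF files are normally tab-separated; the sketch splits on any whitespace so that it also accepts the space-separated lines as printed on the slide, which assumes that none of the first eight columns contains a space.

```python
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one GTF line into its nine columns."""
    fields = line.split(None, 8)  # the ninth column keeps its inner spaces
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    attributes = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if item:
            key, value = item.split(None, 1)
            attributes[key] = value.strip('"')
    return GtfRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], fields[7], attributes)


# The three exon lines of the reverse-strand example.
lines = [
    "Chr22 src exon 649 700 . - . gene_id 1; transcript_id 1; exon_number 1",
    "Chr22 src exon 351 500 . - . gene_id 1; transcript_id 1; exon_number 2",
    "Chr22 src exon 150 250 . - . gene_id 1; transcript_id 1; exon_number 3",
]
exons = [parse_gtf_line(line) for line in lines]

# Coordinates are 1-based and inclusive on both ends.
exon_lengths = [rec.end - rec.start + 1 for rec in exons]
print(exon_lengths, sum(exon_lengths))
```

Start is always less than or equal to end, even on the reverse strand; the strand column alone carries the orientation.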
Examples of the GTF format
For a protein-coding gene consisting of five exons on the forward strand, a valid description in GTF format would be:
Chr1 src exon        150 200  . + . gene_id 1; transcript_id 1; exon_number 1
Chr1 src exon        300 401  . + . gene_id 1; transcript_id 1; exon_number 2
Chr1 src CDS         380 401  . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src exon        501 650  . + . gene_id 1; transcript_id 1; exon_number 3
Chr1 src CDS         501 650  . + 2 gene_id 1; transcript_id 1; exon_number 3
Chr1 src exon        700 800  . + . gene_id 1; transcript_id 1; exon_number 4
Chr1 src CDS         700 707  . + 2 gene_id 1; transcript_id 1; exon_number 4
Chr1 src exon        900 1000 . + . gene_id 1; transcript_id 1; exon_number 5
Chr1 src start_codon 380 382  . + 0 gene_id 1; transcript_id 1; exon_number 2
Chr1 src stop_codon  708 710  . + 0 gene_id 1; transcript_id 1; exon_number 4
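The frame column of a CDS line gives the number of bases (0, 1 or 2) to skip at the 5' end of that segment before the first complete codon begins. Under that convention the frames can be derived from the CDS coordinates alone, as in this Python sketch. The function name `cds_frames` is hypothetical, and the sketch assumes a single transcript whose first CDS segment starts exactly at the start codon.

```python
def cds_frames(segments, strand):
    """Return [((start, end), frame), ...] in transcription order.

    segments: list of (start, end) CDS coordinates, 1-based inclusive.
    strand:   "+" or "-".
    """
    # On the reverse strand the segment with the highest coordinates comes first.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    frame = 0  # assumption: the first segment starts with a whole codon
    for start, end in ordered:
        result.append(((start, end), frame))
        length = end - start + 1
        # Bases left over after the last whole codon are completed by the
        # first bases of the next segment.
        frame = (3 - (length - frame) % 3) % 3
    return result


forward = cds_frames([(380, 401), (501, 650), (700, 707)], "+")
reverse = cds_frames([(649, 700), (351, 500), (153, 250)], "-")

for (start, end), frame in forward + reverse:
    print(start, end, frame)
```

For the forward-strand example this reproduces the frame values 0, 2 and 2 given in the CDS lines. Applied to the three CDS segments of the reverse-strand example, the same rule gives one frame value per segment in the order 649-700, 351-500, 153-250.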
GrainGenes
A Genomic Database for Triticeae and Avena
• Genetic maps
• Genes
• Alleles
• Genetic markers
• Phenotypic data
• Quantitative trait loci studies
• Experimental protocols
• Publications
Literature databases
• PubMed: http://www.ncbi.nlm.nih.gov/pubmed/
• PubMed Central: http://www.pubmedcentral.gov/
• HighWire Press: http://highwire.stanford.edu/
• DRIVER Project: http://www.driver-community.eu/
• Web of Science®: http://isiknowledge.com
• arXiv: http://www.arxiv.org
• CiteSeer: http://citeseer.ist.psu.edu