Introduction to Biological sequences Sushmita Roy www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu September 4, 2014 BMI/CS 576
Goals for today • A few key concepts in molecular biology • Nucleic acids • Genes • Proteins • The Central Dogma • Connection between DNA, RNA and proteins • Problems in sequence similarity • Sequence alignment • Sequence search
A Living Cell • The fundamental unit of life • There are unicellular (one cell) and multi-cellular organisms • A cell has different cellular components • We will be concerned with • Nucleus • Ribosomes • Cytoplasm • Prokaryotes (single-celled organisms lacking a nucleus) • Eukaryotes (organisms with a nucleus)
An animal cell http://www.genome.gov/Glossary/index.cfm?id=25
Deoxyribonucleic acid (DNA) image from the DOE Human Genome Program http://www.ornl.gov/hgmis
DNA is a double helical molecule • In 1953, James Watson and Francis Crick discovered that the DNA molecule has two strands arranged in a double helix • This was possible through the X-ray diffraction data from Maurice Wilkins and Rosalind Franklin • Pictured: Watson and Crick, Maurice Wilkins, Rosalind Franklin http://www.chemheritage.org/discover/online-resources/chemistry-in-history/themes/biomolecules/dna/watson-crick-wilkins-franklin.aspx
Nucleotides • DNA is composed of small chemical units called nucleotides • A nucleotide has three parts • Nitrogen-containing base • 5-carbon sugar: deoxyribose • Phosphate group • Phosphate-hydroxyl (phosphodiester) bonds connect the nucleotides • Four nucleotides make up DNA • adenine (A), cytosine (C), guanine (G) and thymine (T) • The nucleotides differ only in the base • Figure labels: Phosphate, Base, Sugar, Hydroxyl
Bases in the nucleotides • Purines (Two rings) • Pyrimidines (one ring) Adenine (A) Guanine (G) Thymine (T) Cytosine (C)
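Nucleotides are linked to form one strand of DNA [Figure: two nucleotides, each with a base, a sugar with carbons numbered 1’ to 5’, and a phosphate group; the phosphate on the 5’ carbon of one sugar is bonded to the 3’ carbon of the neighboring sugar]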
5’ and 3’ of a DNA molecule • Each strand is made up of linkages from the 5’ position (phosphate) of one nucleotide to the 3’ position of the adjacent nucleotide • At one end, there is a free phosphate group: the 5’ end • At the other end, there is a free OH group: the 3’ end • Therefore we can talk about directionality • the 5’ and the 3’ ends of a DNA strand • The two strands are held together through base pairing
5’ and 3’ of a DNA molecule (contd.) • A DNA sequence is read from 5’ to 3’ • The two strands run anti-parallel to each other • One is the complement of the other • For example, if AAG is the sequence on one strand (read 5’ to 3’), the sequence on the other strand (also read 5’ to 3’) is CTT • Not TTC
Watson-Crick base pairing • A always bonds to T • C always bonds to G • This base-pairing is also called “complementary base-pairing” • Each strand has a base sequence that is complementary to the sequence on the other strand • If you know the sequence on one strand, you know the sequence on the other strand
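The AAG/CTT example and the pairing rules above can be sketched in a few lines of Python. This is only an illustration; the function name `reverse_complement` is our own choice, not something from the lecture.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    # Complement every base, then reverse, because the two strands are anti-parallel.
    return strand.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("AAG"))  # CTT (complementing without reversing would give TTC)
```

Reversing is what turns TTC into CTT: both strands are conventionally written 5’ to 3’, and the partner strand runs in the opposite direction.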
DNA stores the blueprint of an organism • The heredity molecule • Has the information needed to make an organism • Double strandedness of the DNA molecule provides stability and helps prevent errors in copying • one strand has all the information • DNA replication is the process by which this information is copied through generations of daughter cells
DNA replication • Helicase, an enzyme, separates the double helix • DNA polymerase makes a copy of each strand using free nucleotides • Each strand of DNA serves as a template [Figure: the parent DNA double helix, strand A (CATTGCCCAGT) paired with strand B (GTAACGGGTCA), separates; template strand A pairs with a newly made strand B, and template strand B pairs with a newly made strand A, giving two copies of the parent helix] Adapted from “Understanding Bioinformatics”
Videos on DNA replication https://www.youtube.com/watch?v=zdDkiRw1PdU https://www.youtube.com/watch?v=27TxKoFU2Nw
Chromosomes • All the DNA of an organism is divided up into individual chromosomes • Each chromosome is really a DNA molecule • Different organisms have different numbers of chromosomes Image from www.genome.gov
Genes • Genes are the units of heredity • A gene is a sequence of bases which specifies a protein or RNA molecule • The human genome has ~ 25,000 protein-coding genes (still being revised) • One gene can have many functions • One function can require many genes …GTATGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTC…
Genomes • A genome is the complete complement of DNA for a given species • The human genome consists of 2 x 23 chromosomes (23 pairs) • Every cell (except egg and sperm cells and mature red blood cells) contains the complete genome of an organism
The central dogma of molecular biology: DNA -> (transcription) -> RNA -> (translation) -> proteins
RNA: Ribonucleic acid • RNA • Made up of repeating nucleotides • The sugar is ribose • U is used in place of T • A strand of RNA can be thought of as a string composed of the four letters: A, C, G, U • RNA is single stranded • More flexible than DNA • Can double back and form loops • Such structures can be more stable
Transcription • In eukaryotes: happens inside the nucleus • RNA polymerase (RNA Pol) is an enzyme that builds an RNA strand from a gene • RNA Pol is recruited to specific parts of the genome in a condition-specific way • Transcription factor proteins recruit RNA Pol • RNA that is transcribed from a protein-coding region is called messenger RNA (mRNA)
Transcription The RNA string produced is identical to the non-template strand except T is replaced by U.
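As a rough sketch of the rule above, transcription can be modeled as a string operation. The function names and the short example sequences are illustrative assumptions, not part of the lecture.

```python
# Pairing rules used to derive one DNA strand from the other.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe(non_template: str) -> str:
    """RNA equals the non-template DNA strand with T replaced by U."""
    return non_template.upper().replace("T", "U")

def transcribe_from_template(template: str) -> str:
    """Same RNA, starting from the template strand (given 5' to 3')."""
    # The non-template strand is the reverse complement of the template strand.
    non_template = template.upper().translate(COMPLEMENT)[::-1]
    return transcribe(non_template)

print(transcribe("ATGGTGCTG"))                # AUGGUGCUG
print(transcribe_from_template("CAGCACCAT"))  # AUGGUGCUG
```

Both calls give the same RNA because CAGCACCAT is the reverse complement of ATGGTGCTG.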
The central dogma of molecular biology: DNA -> (transcription) -> RNA -> (translation) -> proteins
Translation • Process of turning mRNA into proteins • Happens outside the nucleus, in the cytoplasm, at ribosomes • Ribosomes are the machines that synthesize proteins from mRNA
Proteins • Proteins are polymers too • The repeating units are amino acids • There are 20 different amino acids known • DNA codes for protein • How many nucleotides are needed to specify 20 amino acids?
Codons • Each triplet of bases is called a codon • How many codons are possible? • There are four special codons • One start codon, AUG: start of translation (AUG also codes for methionine) • Three stop codons (UAA, UAG, UGA): end of translation • All others code for a particular amino acid
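The two counting questions (how many nucleotides are needed to specify 20 amino acids, and how many codons are possible) can be checked by enumerating all words of a given length over the four-letter RNA alphabet. The helper name `count_words` is hypothetical, chosen for this sketch.

```python
from itertools import product

BASES = "ACGU"      # the four RNA letters
AMINO_ACIDS = 20    # number of amino acids quoted on the slide

def count_words(length: int) -> int:
    """Count all distinct base strings of the given length (equals 4 ** length)."""
    return sum(1 for _ in product(BASES, repeat=length))

for length in (1, 2, 3):
    n = count_words(length)
    print(length, n, "enough" if n >= AMINO_ACIDS else "too few")
# 1 4 too few
# 2 16 too few
# 3 64 enough
```

Three is the smallest word length that gives at least 20 distinct words, and the 64 triplets are more than the 20 amino acids need, which is why several codons can map to the same amino acid.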
The Genetic Code: Specifies how mRNA is translated into protein Genetic code is degenerate
Codons and Reading Frames • mRNA (5’ to 3’): CUCAGCGUUACCAU • Frame 1: CUC AGC GUU ACC AU -> Leu Ser Val Thr • Frame 2: C UCA GCG UUA CCA U -> Ser Ala Leu Pro • Frame 3: CU CAG CGU UAC CAU -> Gln Arg Tyr His
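The reading-frame example can be reproduced with a small translation sketch. It assumes the standard genetic code, written in the usual compact layout (first base varies slowest over U, C, A, G), and uses one-letter amino acid codes with `*` for stop; the function name `translate` is our own.

```python
from itertools import product

# Standard genetic code; codons are enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str, frame: int = 0) -> str:
    """Translate every complete codon starting at offset `frame` (0, 1 or 2)."""
    codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

mrna = "CUCAGCGUUACCAU"
for frame in range(3):
    print(frame + 1, translate(mrna, frame))
# 1 LSVT  (Leu Ser Val Thr)
# 2 SALP  (Ser Ala Leu Pro)
# 3 QRYH  (Gln Arg Tyr His)
```

This sketch translates straight through the sequence and ignores start codons; a real gene finder would look for AUG and stop at the first stop codon in frame.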
Proteins are the workhorses of the cell • structural support • transport of substances • coordination of an organism’s activities • response of cell to chemical stimuli • protection against disease • Catalyzing chemical reactions
Proteins are complex molecules • Primary amino acid sequence • Secondary structure • Tertiary structure • Quaternary structure • These structures are formed through different levels of protein folding and packaging
Some well-known proteins • Actin: maintenance of cell structure (http://en.wikipedia.org/wiki/Actin) • Hemoglobin: carries oxygen (http://en.wikipedia.org/wiki/Hemoglobin) • Insulin: metabolism of sugar (http://en.wikipedia.org/wiki/Insulin)
Hemoglobin protein HBA1 DNA sequence (491 bp) Amino acid sequence (142 aa) >gi|224589807:226679-227520 Homo sapiens chromosome 16, GRCh37.p9 Primary Assembly 1 cccacagactcagagagaacccaccatggtgctgtctcctgacgacaagaccaacgtcaa 61 ggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcggaggccctggagag 121 gatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacctgagccacgg 181 ctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaacgccgtggc 241 gcacgtggacgacatgcccaacgcgctgtccgccctgagcgacctgcacgcgcacaagct 301 tcgggtggacccggtcaacttcaagctcctaagccactgcctgctggtgaccctggccgc 361 ccacctccccgccgagttcacccctgcggtgcacgcctccctggacaagttcctggcttc 421 tgtgagcaccgtgctgacctccaaataccgttaagctggagcctcggtggccatgcttct 481 tgcccctttgg >sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2 MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
RNA genes • Not all genes encode proteins • For some genes the end product is RNA • ribosomal RNAs (rRNAs), which are major constituents of ribosomes • transfer RNAs (tRNAs), which carry amino acids to ribosomes • micro RNAs (miRNAs), which play an important regulatory role in various plants and animals • lincRNAs (long intergenic non-coding RNAs), which also play important regulatory roles
RECAP • Key components of a eukaryotic cell • Nucleus, Cytoplasm, Ribosome • What are DNA and RNA? • Large molecules called polymers • Made up of repeated units • Nucleotides • DNA: ATGC • RNA: AUGC • What is a protein? • Also a polymer, but the units are amino acids • The Central Dogma: DNA->RNA->protein • Important processes • DNA replication, Transcription, Translation • Some resources • http://www.genome.gov/Glossary/index.cfm
http://www.youtube.com/watch?v=41_Ne5mS2ls A video on transcription and translation
Things we did not talk about • DNA packaging • Alternative splicing • Polyadenylation • Post translational modifications
A few important biological data/knowledge bases • The 2014 Nucleic Acids Research Database Issue reports 1,552 databases • National Center for Biotechnology Information (NCBI) • http://www.ncbi.nlm.nih.gov • GenBank: Database of sequences • RefSeq: Reference sequences • Ensembl • http://useast.ensembl.org/info/about/index.html • UniProt: Protein sequence and protein function • Protein Data Bank: Protein structure • Pathway databases • Gene Ontology • KEGG • Interaction databases • BioGRID • STRING See also http://nar.oxfordjournals.org/content/42/D1/D1.full#T1
Number of genomes in RefSeq Source: http://www.ncbi.nlm.nih.gov/refseq/statistics/
Sequence similarity • Sequence similarity is central to addressing many questions in biology • Are two sequences related? • Similarity in sequence can imply similarity in function • Assign function to uncharacterized sequences based on characterized sequences • Sequences from different species can be compared to estimate the evolutionary relationships between species • We will come back to this in phylogenetic trees
Overview of sequence similarity problems • Assessing similarity between a small number of DNA or protein sequences • Pairwise sequence alignment • Multiple sequence alignment • Searching databases for a query sequence • Heuristic search using BLAST
What is sequence alignment? The task of locating equivalent regions of two or more sequences to assess their overall similarity
A very simple alignment of two sequences • T H I S S E Q U E N C E • T H A T S E Q U E N C E • Aligned/matched positions: every position except the third and fourth letters
How to align these two sequences? • T H I S S E Q U E N C E • T H A T I S A S E Q U E N C E • The problem arises when the sequences to be compared are of unequal length
How do sequences change? • Sequences change through mutations • substitutions: ACGA -> AGGA • insertions: ACGA -> ACGGA • deletions: ACGA -> AGA
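The three mutation types can be written as simple string edits. This sketch starts from the sequence ACGA used on the slide; the function names and the 0-based position arguments are our own assumptions.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]

def insert(seq: str, pos: int, base: str) -> str:
    """Insert `base` immediately before 0-based position `pos`."""
    return seq[:pos] + base + seq[pos:]

def delete(seq: str, pos: int) -> str:
    """Remove the base at 0-based position `pos`."""
    return seq[:pos] + seq[pos + 1:]

print(substitute("ACGA", 1, "G"))  # AGGA
print(insert("ACGA", 2, "G"))      # ACGGA
print(delete("ACGA", 1))           # AGA
```

Insertions and deletions change the sequence length, which is why alignments need gaps.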
Need to incorporate gaps while aligning sequences • Alignment 1: _ _ _ T H I S S E Q U E N C E aligned with T H A T I S A S E Q U E N C E (3 gaps, 9 matches) • Alignment 2: T H I S _ _ _ S E Q U E N C E aligned with T H A T I S A S E Q U E N C E (3 gaps, 10 matches)
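Gap and match counts for a given alignment can be tallied column by column. The sketch below assumes both rows are already padded to the same length, with `_` marking a gap as on the slide; `summarize_alignment` is a hypothetical helper name.

```python
def summarize_alignment(row_a: str, row_b: str, gap: str = "_") -> dict:
    """Count gap, match and mismatch columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"gaps": 0, "matches": 0, "mismatches": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gaps"] += 1        # a column where one sequence has no letter
        elif a == b:
            counts["matches"] += 1     # identical letters in both sequences
        else:
            counts["mismatches"] += 1  # two different letters
    return counts

target = "THATISASEQUENCE"
print(summarize_alignment("___THISSEQUENCE", target))
print(summarize_alignment("THIS___SEQUENCE", target))
```

Comparing the two dictionaries shows how moving the same three gaps changes the number of matched columns.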
Issues in sequence alignment • What type of alignment? • Align the entire sequence or part of it? • Two sequences or multiple sequences? • How to find the alignment? • Search algorithms for alignment • How to score an alignment? • the sequences we’re comparing typically differ in length • some characters (nucleotides or amino acids) are more substitutable than others • How to tell if the alignment is biologically meaningful? • Assessing how likely it is that the alignment could have arisen by random chance
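One simple way to get a feel for the "random chance" question is a shuffling experiment: compare the observed number of matches with the number obtained after randomly shuffling one sequence many times. The sketch below is a deliberately simplified, ungapped illustration for equal-length sequences, with an assumed trial count and seed; it is not how alignment tools such as BLAST compute significance.

```python
import random

def count_matches(a, b) -> int:
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

def shuffle_p_value(a: str, b: str, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of shuffles of `b` that match `a` at least as well as `b` does."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    observed = count_matches(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)   # same letters as b, in random order
        if count_matches(a, letters) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)

print(count_matches("THISSEQUENCE", "THATSEQUENCE"))  # 10
print(shuffle_p_value("THISSEQUENCE", "THATSEQUENCE"))
```

A small value suggests that a match count this high is unlikely for sequences with the same letter composition but random order.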