Evolution and the Santa Cruz Genome Browser Jim Kent and the Genome Bioinformatics Group University of California Santa Cruz Pennsylvania State University
Typical Gene Level View: Sialic Acid Binding/Ig-like Lectin 7
PDB Ribbon Diagram 4 clicks away by the wonder of the world wide web
Squished mode is ideal for ESTs and mouse/human homology. ESTs hint at a smaller version of exon 2.
Chaining Alignments • Chaining bridges the gulf between syntenic blocks and base-by-base alignments. • Local alignments tend to break at transposon insertions, inversions, duplications, etc. • Global alignments tend to force non-homologous bases to align. • Chaining is a rigorous way of joining together local alignments into larger structures.
Chains join together related local alignments Protease Regulatory Subunit 3
Affine penalties are too harsh for long gaps Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
Gaps are needed in both sequences in the general case of pairwise alignment; otherwise non-homologous bases can be forced to pair.
2-D histogram of observed gaps. The horizontal axis is gaps in human, the vertical axis is gaps in mouse. The logarithm of counts of gaps in bins of 10 (left) and bins of 500 (right) is plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.
Chaining Algorithm • Input - blocks of gapless alignments from blastz • Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\left(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\right)$ • Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Timing is O(N log N), where N is the number of blocks (in the hundreds of thousands).
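The recurrence can be illustrated with a minimal Python sketch. This is not the KD-tree method named on the slide: it is a plain quadratic-time dynamic program, shown only to make the recurrence concrete. The `Block` fields, the linear `gap_cost`, and the rule that a block with no predecessor scores `match(B_i)` alone are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A gapless alignment block (hypothetical layout, half-open coordinates)."""
    q_start: int   # start in the first sequence (e.g. human)
    q_end: int     # end in the first sequence (exclusive)
    t_start: int   # start in the second sequence (e.g. mouse)
    t_end: int     # end in the second sequence (exclusive)
    score: float   # match(B): score of the block itself


def gap_cost(prev: Block, cur: Block) -> float:
    """Assumed gap penalty: one point per skipped base in either sequence."""
    return float((cur.q_start - prev.q_end) + (cur.t_start - prev.t_end))


def chain_blocks(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the best chain score and its blocks using an O(N^2) scan."""
    if not blocks:
        return 0.0, []
    # Sorting by start guarantees every valid predecessor comes earlier.
    ordered = sorted(blocks, key=lambda b: (b.q_start, b.t_start))
    best = [b.score for b in ordered]   # score(B_i) with no predecessor
    back = [-1] * len(ordered)          # back-pointers for traceback
    for i, cur in enumerate(ordered):
        for j in range(i):
            prev = ordered[j]
            # A predecessor must end before cur starts in both sequences.
            if prev.q_end <= cur.q_start and prev.t_end <= cur.t_start:
                candidate = best[j] + cur.score - gap_cost(prev, cur)
                if candidate > best[i]:
                    best[i], back[i] = candidate, j
    end = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(ordered[k])
        k = back[k]
    chain.reverse()
    return best[end], chain


example = [
    Block(0, 10, 0, 10, 100.0),
    Block(20, 30, 20, 30, 100.0),
    Block(12, 18, 500, 506, 50.0),   # off-diagonal block, left out of the chain
    Block(40, 50, 40, 50, 100.0),
]
best_score, best_chain = chain_blocks(example)
```

The inner loop over all earlier blocks is the part a KD-tree search would prune, which is where the O(N log N) timing on the slide comes from.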
Netting Alignments • Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions. • Net finds the best mouse match for each human region. • Highest scoring chains are used first. • Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
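The "highest scoring chains first, lower scoring chains fill the gaps" idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual netting program: chains are reduced to a name, a score, and a list of human-coordinate blocks, all of which are assumed representations, and the nesting levels of the real net are not tracked.

```python
def subtract(interval, covered):
    """Return the parts of `interval` not overlapped by the disjoint intervals in `covered`."""
    start, end = interval
    pieces = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue                      # no overlap with this covered interval
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def net(chains):
    """Assign each human region to the highest-scoring chain that covers it.

    `chains` is a list of (name, score, blocks) tuples, where blocks are
    half-open (start, end) intervals on the human sequence.
    """
    covered = []    # human intervals already claimed, kept disjoint
    result = []     # (chain name, start, end) pieces in order of assignment
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        for block in blocks:
            pieces = subtract(block, covered)
            covered.extend(pieces)
            result.extend((name, s, e) for s, e in pieces)
    return result


chains = [
    ("A", 1000, [(0, 100), (200, 300)]),   # top-level chain with a gap
    ("B", 500, [(90, 210)]),               # fills the gap in A
    ("C", 100, [(120, 150)]),              # already covered by B
]
netted = net(chains)
```

In this toy input, chain B only contributes the stretch that A leaves open, and chain C contributes nothing because B already covers it.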
Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.
Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!
Mouse/Human Rearrangement Statistics Number of rearrangements of given type per megabase.
A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.
Rat Genome year of the rat - 2008
Rat/Mouse/Human Genome-Wide Multiz Alignments Available Eye lens protein gamma crystallin a. Upstream region (on right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.
Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.
Zoom to Base Level Detail near translation start of tubulin 8
Zoom to Base Level Intron consensus sequence visible.
Zoom to Base Level Possible alt-splice not consensus and not conserved.
Cross-hybridization at Work Zoomed in on right side:
200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters
>hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr22.234 range=chr22:24125896-24126095 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr22.313 range=chr22:29356156-29356355 5'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr22.337 range=chr22:30433286-30433485 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr22.356 range=chr22:32640965-32641164 5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc
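For readers who want to work with headers of this shape, here is a small Python sketch that splits one into its fields. The function name `parse_header` and the returned dictionary keys are hypothetical, and treating `range` as 1-based and inclusive is an assumption (it does give the 200 bases named in the slide title).

```python
import re


def parse_header(header: str) -> dict:
    """Split a FASTA header like the ones above into named fields."""
    name, *fields = header.lstrip(">").split()
    info = dict(field.split("=", 1) for field in fields)
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", info["range"])
    if match is None:
        raise ValueError("unexpected range field: " + info["range"])
    start, end = int(match.group(2)), int(match.group(3))
    return {
        "name": name,
        "chrom": match.group(1),
        "start": start,
        "end": end,
        "length": end - start + 1,          # assumes 1-based inclusive coordinates
        "strand": info["strand"],
        "rev_comp": info["revComp"] == "TRUE",
    }


header = (">hg15_rnaCluster_chr22.246 range=chr22:25204375-25204574 "
          "5'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none")
record = parse_header(header)
```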
Acknowledgements
Individuals: Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler, Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins, LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats. Everyone else!
Institutions: NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.
A Cautionary Note • Infant digestive systems very permeable, uptake antibodies • ~10% of infants are allergic to cow’s milk based formula • These infants get soy/corn based formula • As we engineer plants, let’s be careful what we put in infant formula
New Algorithms and Data • ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements. • Rat genome is browsable and used in rat/mouse/human multiple alignments. • Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.
Ideal Gap Penalties • Would allow gaps in both sequences at once • Would penalize long gaps less than affine gap scores. • Still would be quick to compute. • We use a piecewise linear function of the sum of gap sizes plus a substantial penalty for gaps that are in both sequences at once.
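A gap cost of the kind described in the last bullet can be sketched as follows. The breakpoint table, the `BOTH_PENALTY` value, and the affine parameters are invented for illustration and are not the values used by the authors; only the overall shape (piecewise linear in the summed gap size, plus an extra penalty when both sequences have a gap) comes from the slide.

```python
import bisect

# Hypothetical (gap size, cost) breakpoints; the slope flattens as gaps grow.
BREAKPOINTS = [(1, 350), (10, 600), (100, 900), (1000, 1500), (10000, 2500)]
BOTH_PENALTY = 750   # assumed extra cost when both sequences are gapped at once


def piecewise_linear(size, points=BREAKPOINTS):
    """Interpolate the cost of a gap of `size` bases from the breakpoint table."""
    if size <= 0:
        return 0.0
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if size <= xs[0]:
        return float(ys[0])
    if size >= xs[-1]:
        # Beyond the table, continue with the slope of the last segment.
        slope = (ys[-1] - ys[-2]) / (xs[-1] - xs[-2])
        return ys[-1] + slope * (size - xs[-1])
    i = bisect.bisect_right(xs, size)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


def gap_cost(gap_a, gap_b):
    """Cost of a gap of `gap_a` bases in one sequence and `gap_b` in the other."""
    cost = piecewise_linear(gap_a + gap_b)
    if gap_a > 0 and gap_b > 0:
        cost += BOTH_PENALTY
    return cost


def affine_cost(size, gap_open=400, gap_extend=30):
    """Classic affine penalty, included only for comparison."""
    return 0.0 if size <= 0 else float(gap_open + gap_extend * size)


single = gap_cost(55, 0)      # gap in one sequence only
double = gap_cost(30, 25)     # same total size, split across both sequences
long_pw = gap_cost(1000, 0)
long_affine = affine_cost(1000)
```

With these made-up numbers a 1000-base gap costs far less under the piecewise function than under the affine one, while a table lookup plus one interpolation keeps it quick to compute.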