The Role of Bioinformatics in Cancer Biotechnology. Bob Stephens Advanced Biomedical Computing Center Information Systems Program Feb 24, 2012.
Outline
● Origin of Bioinformatics
● Why the expanding importance?
● Nextgen Sequencing (Big Data)
● Integrative and Systems Biology (Complex Data)
● How to pursue interest in bioinformatics
● Discussion
What is bioinformatics?
● Bioinformatics is the application of computational methods to the analysis of any type of biological data.
● Bioinformatics has become a diverse, multidisciplinary field that originally derived from the computer and biological sciences.
● It now has sub-disciplines such as medical informatics, systems biology and clinical informatics.
● As a result, PhD and master's level programs have emerged dedicated to different aspects of bioinformatics.
Evolution of bioinformatics
● In the 1980s, bioinformatics began as methods to scan protein and nucleic acid sequence databases for similarity (both available in print form).
● Rapid technological advances across multiple biological domains set the pace for data acquisition:
  ● Microarrays
  ● Nextgen sequencing
  ● Imaging
  ● Proteomics
● Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments.
● Co-evolution with web browser and programming language technologies (now cloud).
What is the ABCC?
● The ABCC is part of the Information Systems Program (ISP), a division of SAIC-F.
● The ISP is interconnected with the Advanced Technology Program and supports its computational needs.
● The ISP has computational infrastructure, system administration, networking, security, bioinformatics support and program development all under one roof.
● One of several NIH and NCI intramural bioinformatics resources.
Layered Infrastructure Support
● The CCR-IFX core consists of analyzers and users.
● The BSG uses databases and applications, tools, utilities and resources.
● The SCPD uses algorithms, development and optimization.
● The ISP computes, stores and networks.
Bioinformatics infrastructures
● Command-line implementations (open source).
● Primitive GUI implementations.
● Sophisticated GUI interfaces and application packaging.
● Web interfaces and the Java language give platform-independent access.
● PC-based, web-based and server-based architectures.
● Multi-tier infrastructures distribute the computational burden.
● Cloud-based: limited by data volumes.
How can bioinformatics facilitate cancer research?
• Diagnosis - identify classifiers to better sub-divide cancer etiologies into groups. Better individual data helps match the treatment to the individual.
• Treatment - identify better methods to track treatment progress and indicate problems earlier.
• Prevention - understand mechanisms for cancer initiation, progression and development, and identify targets in this process.
• Connect data from geographically distributed cancer patients for more complete analysis, especially for rare cancers.
NCI-F NGS Instrument Landscape
The SF-ATC, POB-STC, PCC/CADC, LMT, CGF-ATC and NCI labs interact.
NCI-F NGS Geographical Landscape
[Map; labels as extracted: ATRF LMT NCI-F ATC NIH]
NGS Analysis Steps
● Primary analysis includes base calling and QC/QA filtering.
● Secondary analysis includes mapping, coverage analysis, expression analysis, variant identification and impact assessment.
● Tertiary analysis includes comparison tools and interpretation.
Typical Variant Identification Pipeline
● A raw read after QC filtering yields an input read, which may be mapped, unmapped, ambiguous or not mapped.
● The mapped read can be split, and read depths yield concordant and discordant mates.
● The split reads yield SV, CNV, cell variants, class enrichment, impact analysis and disease association.
Mapping Considerations
● Many different mapping applications are available.
● Each has a complex set of mapping parameters.
● Mapping is only as good as the reference it maps against.
● More mapping is not necessarily better: a mapper finds the best available site(s).
● On the same platform, different mappers will yield different mapping percentages and influence variant calls.
● Mappers will need to continue to evolve to allow multiple references to be searched.
Potential Mapping Issues
● The reference genome is incomplete: Ns and many breaks per chromosome.
● The reference genome contains many repeats that are very large and very similar (much longer than current read lengths).
● The reference genome contains many regions known to vary by copy number or be subject to structural variation.
● The reference genome contains an ancestry bias.
What does our reference look like?
● Ns: 7.5% (234 Mbp)
● Repeats: 47% (1.45 Gbp)
● Non-N, non-repeat: 45% (1.45 Gbp)
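A breakdown of this kind can be computed for any reference by counting bases. The following is a minimal sketch, assuming a soft-masked sequence in which repeats are written in lowercase and unknown bases as N; the function name `composition` and the toy sequence are illustrative and are not taken from the slides.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Return the fraction of N, repeat (lowercase) and unique bases."""
    counts = Counter()
    for base in seq:
        if base in "Nn":
            counts["N"] += 1        # unknown base / assembly gap
        elif base.islower():
            counts["repeat"] += 1   # soft-masked repeat (assumed convention)
        else:
            counts["unique"] += 1   # non-N, non-repeat
    total = len(seq)
    return {key: counts[key] / total for key in ("N", "repeat", "unique")}


# Toy sequence: 4 Ns, 8 repeat bases, 8 unique bases.
toy = "NNNN" + "acgtacgt" + "ACGTACGT"
print(composition(toy))
```

For a real genome the same counting would be run per chromosome over a FASTA file and summed.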
CNV Coverage By Percent
Read fate mapping
● What are the unmapped reads? Most map to alternate assemblies; some do not map.
● When alternative alleles are considered, some 2/3 of the reads that should have mapped to them were mapped elsewhere!
● Although only a small fraction of reads do not map, we cannot easily estimate the number that did not map correctly.
1k genomes (NA18508)
● Mapped to chromosomes and MT: 430 M reads (96%)
● Unmapped reads: 13.1 M reads (3%)
● Mapped to Chr_Un: 6 M reads (1%)
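The percentages follow directly from the read counts. A small sketch of the arithmetic, assuming the three categories together account for all reads (the slide does not state a total):

```python
# Read counts as given on the slide for NA18508.
read_counts = {
    "Mapped to chromosomes and MT": 430_000_000,
    "Unmapped": 13_100_000,
    "Mapped to Chr_Un": 6_000_000,
}

# Assumption: the total is the sum of the three categories.
total = sum(read_counts.values())

percentages = {
    fate: round(100 * count / total) for fate, count in read_counts.items()
}

for fate, pct in percentages.items():
    print(f"{fate}: {pct}%")
```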
PSPHL Complexities
● The gene is located on chromosome 7 with a 55 kb insert location.
● There is also:
  ● an indel locus, 427 bp with 99.6% identity
  ● an ancestral locus, 106 kb with 95% identity
  ● an additional locus, 465 bp with 95% identity
The PSPHL gene structure and the deletion breakpoint were determined by sequence mining. However, there was a huge gap to fill. (hg18)
Variant Callers
● Many variant callers exist; some are components of larger applications, some are stand-alone.
● Like mappers, they have many variables and filtering steps, all of which alter the false discovery and false negative rates.
● SNVs are fairly well worked out; indels are more difficult to identify and have lower validation rates.
● Emerging consensus is to modularize the workflow: a best-in-breed mapper followed by a best-in-breed variant caller, which is more dynamic.
● Need a “truth” set to validate (Venter?)
Overlap amongst SNVs called by 3 popular variant callers (likely SAMTools, GATK and CLCBio)
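An overlap of this kind can be tabulated by treating each caller's output as a set of variants. This is a minimal sketch, assuming each variant is keyed by (chromosome, position, ref, alt); the caller names and the toy calls are placeholders, not results from the slide.

```python
def overlap_regions(calls: dict) -> dict:
    """Count variants in each exclusive Venn region.

    The key of the result is the frozenset of callers that reported
    a variant; the value is how many variants fall in that region.
    """
    regions = {}
    all_variants = set().union(*calls.values())
    for variant in all_variants:
        group = frozenset(
            name for name, variants in calls.items() if variant in variants
        )
        regions[group] = regions.get(group, 0) + 1
    return regions


# Toy call sets (hypothetical callers and variants).
calls = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")},
}

for group, count in overlap_regions(calls).items():
    print(sorted(group), count)
```

Exact-key matching is the simplest choice; indels often need normalization (left-alignment) before callers can be compared fairly.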
Goals
● Background details
● Mechanisms of connectivity
● Level 1 integration
● More sophisticated integration
● Complex interpretation needs
Supporting infrastructure
● Deep and complete genomic annotation for species of interest
● System to connect different data ID types
● Ontology/controlled vocabulary for harmonization (apples==apples) [common data elements]
● Visualization capabilities for networks, heatmaps and genomic context
Pathway Gene Set Analysis
● Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches, etc.
● Clustering genes based on expression etc. provides only the first dimension.
● View prospective pathways impacted by changes in expression, protein levels, phosphorylation, etc.
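One common way to ask whether a gene set hits a pathway more often than chance is an over-representation test based on the hypergeometric distribution. The sketch below illustrates that general technique only; the slides do not say that WPS or SLEPR scores pathways this way, and the function name and example numbers are assumptions.

```python
from math import comb


def enrichment_p(universe: int, pathway: int, selected: int, hits: int) -> float:
    """P(X >= hits) when `selected` genes are drawn at random from a
    universe of `universe` genes, `pathway` of which are in the pathway."""
    max_hits = min(pathway, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / comb(universe, selected)


# Hypothetical example: 20 genes measured, 5 in the pathway,
# 4 genes in the experimental set, 2 of them in the pathway.
print(enrichment_p(universe=20, pathway=5, selected=4, hits=2))
```

In practice the test is repeated for every pathway, so the resulting p-values need multiple-testing correction.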
In-House NextGen WPS: Pathway-based Platform from Array/Proteomics to NGS Tertiary Analysis
Visualization
● Pathway/network: Cytoscape and WPS
● Genome: many viewers available
● Heatmap: simple R tools available
2 principal integration tiers
● Gene level: measurements associated/collected at or below the gene level, such as expression, proteomics, phosphorylation, binding, metabolomics, etc.
● Genome coordinate level: ChIP-seq, cytogenetics, GWAS, array CGH, etc.
● Both are incompletely annotated and complex.
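The two tiers meet when a coordinate-level feature is assigned to the genes it overlaps. A minimal sketch of that join, assuming half-open intervals (0-based start, exclusive end); the gene names, coordinates and function name are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlapping_genes(feature: Interval, genes: list) -> list:
    """Return the names of genes whose interval overlaps the feature."""
    return [
        gene.name
        for gene in genes
        if gene.chrom == feature.chrom
        and gene.start < feature.end
        and feature.start < gene.end
    ]


# Hypothetical annotation and coordinate-level features.
genes = [
    Interval("chr1", 100, 200, "GENE_A"),
    Interval("chr1", 300, 400, "GENE_B"),
    Interval("chr2", 100, 200, "GENE_C"),
]
peak_in_gene = Interval("chr1", 150, 250, "peak_1")
peak_intergenic = Interval("chr1", 200, 300, "peak_2")

print(overlapping_genes(peak_in_gene, genes))
print(overlapping_genes(peak_intergenic, genes))
```

The intergenic peak returns no gene, which is one concrete form of the incomplete annotation noted above: coordinate-level data does not always reduce to a gene-level measurement.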
Integration
Goal: Use the database and application infrastructure to create new integrated applications
Q: Tell me everything and anything you know about my gene
Where we are:
● biodbnet.abcc
  ● Data-level integration of all 32 databases
  ● Seamless updates at the backend
  ● Ortholog, batch and custom conversions
● medXminer
  ● Complete Medline in Oracle XMLDB
Near future:
● Expand bioDBnet
● Integrate semantics layer into literature
● Development of new applications
Integration Example: bioDBnet
● 193 biological identifiers from 32 biological databases
● bioDBnet integrates proteomics, genomics, transcriptomics and metabolomics to yield functional annotation, gene, drug, taxon, disease, interaction, protein, microarray, protein features, pathway and variation/polymorphism.
bioDBnet: Biological DataBase Network
● Integrates 32 widely used biological databases
● The network has 193 nodes and 658 edges
● Handles batch conversions across databases (db2db), orthology conversions (dbWalk), organism-wide conversions (dbOrg) and generates detailed annotation reports (dbReport)
● Major advantage: the database update procedure is completely automated and does not impact operations
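Viewing identifier types as nodes and direct mappings as edges, a conversion between two identifier types becomes a path through the network. The sketch below uses a breadth-first search over a toy graph to find the shortest such path; the edges and the function name are invented for illustration, and the slides do not describe how bioDBnet actually traverses its network.

```python
from collections import deque


def conversion_path(graph: dict, source: str, target: str):
    """Breadth-first search for the shortest chain of identifier types
    linking `source` to `target`; returns None if they are not connected."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in graph.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Toy network (illustrative edges only, not the real 193-node graph).
graph = {
    "Affy ID": ["Gene ID"],
    "Gene ID": ["Affy ID", "Gene Symbol", "UniProt Accession"],
    "Gene Symbol": ["Gene ID"],
    "UniProt Accession": ["Gene ID", "PDB ID"],
    "PDB ID": ["UniProt Accession"],
}

print(conversion_path(graph, "Affy ID", "PDB ID"))
```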
Integration layers
● The first layer connects measurements through gene associations.
● The second layer recognizes feedback, interactions and network complexities, and builds on top of the first.
WPS Overview (Java Web Start Version)
● WPS: an in-house pathway analysis, visualization and data integration tool
● Tertiary analysis for NGS: to WPS, NGS data is just another source of data from different platforms, in parallel with metabolome, microarray, proteomics data, etc.
● Migration: NextGen WPS for NGS data and other high-throughput data
● Whole pathway scope
SLEPR and Pathway-level Pattern Extraction (PPEP)
● SLEPR: Yi and Stephens, PLoS ONE 2008, 3(9):e3288
● PPEP: Yi M, Mudunuri U, Che A, Stephens R, BMC Bioinformatics 2009, 10:200
● WPS: Yi et al, BMC Bioinformatics 2006, 7:30
BioCarta Pathways Uncovered by SLEPR but not by the Conventional Method, from a Breast Cancer Array Dataset (NCI)
[Figure: pathways significant under the SLEPR method but not under the conventional method]
Computational Integration in Biomarker Discovery: Testing and validation include mechanistic studies in mice and biomarker validation in the clinic.
Computational Integration across species. Mouse or clinical data are analyzed using feature selection and modeling. Network signatures, molecular signatures and candidate biomarkers are calculated.
Bioinformatics Directions/Growth
● Data visualization: required to pull together and interpret the huge volumes of data now being produced.
● Data integration: signs of disease can often be diagnosed at different levels, requiring the “big picture” to be drawn for full understanding.
● Natural Language Processing: there are simply too many papers for humans to do all of the reading and comprehension.
● Controlled vocabularies: allow apples to be compared with apples.
Clinical Sequencing and Cancer
● Companies already offering cancer diagnostic panels
● Tied to proprietary in-house clinical variation databases
● Connect sets of mutations to clinically actionable treatments and/or trials
● Identify likely responders/non-responders
● Issues:
  ● Ethics
  ● Counseling
  ● Non-actionable targets
Training in bioinformatics?
● Skill set needs to encompass aspects of both biological science and computer science.
● Direct access to relevant scientific questions through own research or close ties to the scientific community.
● Ability to adapt to new questions, applications and data types.