The Role of Bioinformatics in Cancer Biotechnology

Bob Stephens, Advanced Biomedical Computing Center, Information Systems Program. Feb 24, 2012

Outline
● Origin of bioinformatics
● Why the expanding importance?
● Nextgen sequencing (big data)
● Integrative and systems biology (complex data)
● How to pursue an interest in bioinformatics
● Discussion

What is bioinformatics?
● Bioinformatics is the application of computational methods to the analysis of any type of biological data.
● It has become a diverse, multidisciplinary field that originally derived from the computer and biological sciences.
● It now has sub-disciplines such as medical informatics, systems biology and clinical informatics.
● As a result, PhD and master's level programs dedicated to different aspects of bioinformatics have emerged.

Evolution of bioinformatics
● In the 1980s, the field began as methods to scan protein and nucleic acid sequence databases for similarity (both available in print form).
● Rapid technological advances across multiple biological domains set the pace for data acquisition: microarrays, nextgen sequencing, imaging and proteomics.
● Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments.
● Co-evolution with web browser and programming language technologies (now cloud).

What is the ABCC?
● The ABCC is part of the Information Systems Program (ISP), a division of SAIC-F.
● The ISP is interconnected with the Advanced Technology Program and supports its computational needs.
● The ISP has computational infrastructure, system administration, networking, security, bioinformatics support and program development all under one roof.
● It is one of several NIH and NCI intramural bioinformatics resources.

Layered Infrastructure Support
● The CCR-IFX core consists of analyzers and users.
● The BSG uses databases and applications, tools, utilities and resources.
● The SCPD uses algorithms, development and optimization.
● The ISP computes, stores and networks.

Bioinformatics infrastructures
● Command-line implementations (open source).
● Primitive GUI implementations.
● Sophisticated GUI interfaces and application packaging.
● Web interfaces and the Java language give platform-independent access.
● PC-based, web-based and server-based architectures.
● Multiple-tier infrastructures distribute the computational burden.
● Cloud-based: limited by data volumes.

How can bioinformatics facilitate cancer research?
• Diagnosis: identify classifiers to better sub-divide cancer etiologies into groups. Better individual data help match each patient to the right treatment.
• Treatment: identify better methods to track treatment progress and indicate problems earlier.
• Prevention: understand mechanisms for cancer initiation, progression and development, and identify targets in this process.
• Connect data from geographically distributed cancer patients for more complete analysis, especially for rare cancers.

NCI NGS ENVIRONMENT

NCI-F NGS Instrument Landscape The SF-ATC, POB-STC, PCC/CADC, LMT, CGF-ATC and NCI labs interact.

NCI-F NGS Geographical Landscape (sites shown: ATRF, LMT, NCI-F, ATC, NIH)

“Next-Generation” Sequencing Technologies

Clinical Samples K

NGS Analysis Steps
● Primary analysis includes base calling and QC/QA filtering.
● Secondary analysis includes mapping, coverage analysis, expression analysis, variant identification and impact assessment.
● Tertiary analysis includes comparison tools and interpretation.
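The QC filtering part of primary analysis can be sketched roughly as follows. This is a minimal illustration, assuming reads are already parsed into (name, sequence, quality) tuples with Phred+33 qualities; the function names and both thresholds are hypothetical and are not taken from the talk.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def qc_filter(
    reads: Iterable[Read],
    min_mean_q: float = 20.0,   # assumed threshold, for illustration only
    max_n_frac: float = 0.1,    # assumed threshold, for illustration only
) -> Iterator[Read]:
    """Yield reads that pass a mean-quality and an N-content check."""
    for name, seq, qual in reads:
        if not seq:
            continue  # drop empty reads
        n_frac = seq.upper().count("N") / len(seq)
        if mean_phred(qual) >= min_mean_q and n_frac <= max_n_frac:
            yield name, seq, qual
```

Production pipelines typically add adapter trimming and per-base quality trimming; this sketch only shows the pass/fail idea.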

Typical Variant Identification Pipeline
● After QC filtering, a raw read yields an input read, which may be mapped, ambiguous or unmapped.
● Mapped reads can be split, and read depths yield concordant and discordant mates.
● The split reads yield SV, CNV, cell variants, class enrichment, impact analysis and disease association.
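To make the concordant/discordant distinction concrete, here is a small sketch of how a mate pair might be classified. It assumes a forward/reverse paired-end library and uses the distance between leftmost mapping positions as a stand-in for insert size; the `MatePair` type, the function name and the insert-size bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    chrom1: str
    pos1: int        # leftmost mapped position of mate 1
    reverse1: bool   # True if mate 1 maps to the reverse strand
    chrom2: str
    pos2: int
    reverse2: bool


def classify_pair(pair: MatePair, min_insert: int = 100, max_insert: int = 600) -> str:
    """Return 'concordant' or 'discordant' for a mapped mate pair."""
    if pair.chrom1 != pair.chrom2:
        return "discordant"  # mates on different chromosomes
    # Order the mates by position so orientation can be checked.
    (left_pos, left_rev), (right_pos, right_rev) = sorted(
        [(pair.pos1, pair.reverse1), (pair.pos2, pair.reverse2)]
    )
    if left_rev or not right_rev:
        return "discordant"  # expected forward mate on the left, reverse on the right
    span = right_pos - left_pos  # simplified proxy for insert size
    return "concordant" if min_insert <= span <= max_insert else "discordant"
```

Discordant pairs (wrong chromosome, orientation or distance) are the ones that structural-variant callers usually examine further.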

Mapping Considerations
● Many different mapping applications are available.
● Each has a complex set of mapping parameters.
● Mapping is only as good as the reference it maps against.
● More mapping is not necessarily better: a mapper finds the best available site(s).
● On the same platform, different mappers will yield different mapping percentages and influence variant calls.
● Mappers will need to continue to evolve to allow multiple references to be searched.

Potential Mapping Issues
● The reference genome is incomplete: Ns and many breaks per chromosome.
● The reference genome contains many repeats that are very large and very similar (far longer than current read lengths).
● The reference genome contains many regions known to vary by copy number or be subject to structural variation.
● The reference genome contains an ancestry bias.

What does our reference look like?
● Ns: 7.5%, 234 Mbp
● Repeats: 47%, 1.45 Gbp
● Non-N, non-repeat: 45%, 1.45 Gbp
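A breakdown like this one can be approximated directly from a reference sequence. The sketch below assumes a soft-masked sequence, where repeats are lowercase and gaps are `N` (a common convention for distributed genome FASTA files, but an assumption here); the function name is hypothetical and the talk does not say how its figures were computed.

```python
from typing import Dict


def composition(sequence: str) -> Dict[str, float]:
    """Fractions of N, soft-masked repeat, and remaining bases in a sequence."""
    total = len(sequence)
    if total == 0:
        return {"N": 0.0, "repeat": 0.0, "unique": 0.0}
    n_count = sequence.count("N") + sequence.count("n")
    # Lowercase bases other than 'n' are treated as masked repeats.
    repeat_count = sum(1 for base in sequence if base.islower() and base != "n")
    unique_count = total - n_count - repeat_count
    return {
        "N": n_count / total,
        "repeat": repeat_count / total,
        "unique": unique_count / total,
    }
```

For a whole genome, the same counts would be accumulated chromosome by chromosome while streaming the FASTA file rather than loading it into one string.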

CNV Coverage By Percent

Read fate mapping
● What are the unmapped reads? Most map to alternate assemblies; some do not map.
● When alternative alleles are considered, some 2/3 of the reads that should have mapped to them were mapped elsewhere!
● Although only a small fraction of reads do not map, we cannot easily estimate the number that did not map correctly.

1k genomes (NA18508)
● Mapped to Chr_Un: 6 M reads (1%)
● Unmapped reads: 13.1 M reads (3%)
● Mapped to chromosomes and MT: 430 M reads (96%)
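The percentages on this slide follow from the read counts. A short worked check, using the three counts given above (in millions of reads) and a hypothetical helper name:

```python
from typing import Dict


def read_fate_percentages(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert per-category read counts into percentages of the total."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Counts in millions of reads, as stated on the slide.
na18508 = {"chr_Un": 6.0, "unmapped": 13.1, "chromosomes_and_MT": 430.0}
percentages = read_fate_percentages(na18508)
```

Rounded to whole numbers, the three categories come out at 1%, 3% and 96%, matching the slide.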

Worst case scenario - PSPHL

PSPHL Complexities
● The gene is located on chromosome 7 with a 55 kb insert location.
● There are also an indel locus (427 bp, 99.6% identity), an ancestral locus (106 kb, 95% identity) and an additional locus (465 bp, 95% identity).

The PSPHL gene structure and the deletion breakpoint were determined by sequence mining. However, there was a huge gap to fill (hg18).

Variant Callers
● Many variant callers exist; some are components of larger applications, some are stand-alone.
● Like mappers, they have many variables and filtering steps, all of which alter the false discovery and false negative rates.
● SNVs are fairly well worked out; indels are more difficult to identify and have lower validation rates.
● The emerging consensus is to modularize the workflow: a best-in-breed mapper followed by a best-in-breed variant caller, which is more dynamic.
● A “truth” set is needed for validation (Venter?).

Overlap amongst SNVs called by 3 popular variant callers (likely SAMTools, GATK and CLCBio)
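An overlap comparison of this kind can be sketched with plain sets. The example assumes each caller's output has already been reduced to (chromosome, position, ref, alt) tuples; the function name is hypothetical, and this is not necessarily how the figure on the slide was produced.

```python
from typing import Dict, FrozenSet, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def venn_regions(calls: Dict[str, Set[Variant]]) -> Dict[FrozenSet[str], int]:
    """Count variants by the exact set of callers that reported them."""
    membership: Dict[Variant, Set[str]] = {}
    for caller, variants in calls.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(caller)
    regions: Dict[FrozenSet[str], int] = {}
    for callers in membership.values():
        key = frozenset(callers)
        regions[key] = regions.get(key, 0) + 1
    return regions
```

Each key of the result corresponds to one region of the Venn diagram, for example `frozenset({"A", "B"})` for SNVs called by callers A and B but not by C. In practice, variants should be normalized (same reference, left-aligned representation) before they are compared.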

Double identity

Systems biology overview

Goals
● Background details
● Mechanisms of connectivity
● Level 1 integration
● More sophisticated integration
● Complex interpretation needs

Supporting infrastructure
● Deep and complete genomic annotation for the species of interest.
● A system to connect different data ID types.
● Ontology/controlled vocabulary for harmonization (apples==apples) [common data elements].
● Visualization capabilities for networks, heatmaps and genomic context.

Pathway Gene Set Analysis
● Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches etc.
● Clustering genes based on expression etc. provides only the first dimension.
● View prospective pathways impacted by changes in expression, protein levels, phosphorylation etc.
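One generic way to ask whether a gene set hits a pathway more often than chance is a hypergeometric over-representation test. The sketch below shows that standard test only; it is not the SLEPR or WPS method, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes in the pathway,
    n: genes in the experimental set, k: experimental genes in the pathway.
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)
```

For example, with a background of 10 genes of which 4 are in the pathway, drawing 3 genes and finding at least 2 in the pathway has probability 40/120 = 1/3. When many pathways are tested, the resulting p-values need multiple-testing correction.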

In-House NextGen WPS: Pathway-based Platform from Array/Proteomics to NGS Tertiary Analysis

Visualization
● Pathway/network: Cytoscape and WPS
● Genome: many viewers available
● Heatmap: simple R tools available

2 principal integration tiers
● Gene level: measurements associated or collected at or below the gene level (expression, proteomics, phosphorylation, binding, metabolomics etc.).
● Genome coordinate level: ChIP-seq, cytogenetics, GWAS, array CGH etc.
● Both are incompletely annotated and complex.
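Connecting the two tiers usually means assigning coordinate-level features to genes by interval overlap. A minimal sketch, assuming half-open coordinates on a shared reference assembly; the record layout and function name are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, str, int, int]  # (name, chromosome, start, end), half-open


def features_by_gene(genes: List[Interval], features: List[Interval]) -> Dict[str, List[str]]:
    """Map each gene name to the names of features that overlap it."""
    hits: Dict[str, List[str]] = {gene[0]: [] for gene in genes}
    for g_name, g_chrom, g_start, g_end in genes:
        for f_name, f_chrom, f_start, f_end in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if g_chrom == f_chrom and f_start < g_end and g_start < f_end:
                hits[g_name].append(f_name)
    return hits
```

The nested loop keeps the idea visible but scales poorly; genome-scale data would normally use sorted intervals or an interval tree.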

Integration
Goal: Use the database and application infrastructure to create new integrated applications.
Q: Tell me everything and anything you know about my gene.
Where we are:
● biodbnet.abcc: data-level integration of all 32 databases; seamless updates at the backend; ortholog, batch and custom conversions.
● medXminer: complete Medline in Oracle XMLDB.
Near future:
● Expand bioDBnet.
● Integrate a semantics layer into literature.
● Development of new applications.

Integration Example: bioDBnet
● 193 biological identifiers from 32 biological databases.
● bioDBnet integrates proteomics, genomics, transcriptomics and metabolomics to yield functional annotation, gene, drug, taxon, disease, interaction, protein, microarray, protein features, pathway and variation/polymorphism.

bioDBnet: Biological DataBase Network
● Integrates 32 widely used biological databases.
● The network has 193 nodes and 658 edges.
● Handles batch conversions across databases (db2db), orthology conversions (dbWalk), organism-wide conversions (dbOrg) and generates detailed annotation reports (dbReport).
● Major advantage: the database update procedure is completely automated and does not impact operations.
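A network of identifier types lends itself to graph search: converting one identifier type into another amounts to finding a path between two nodes. The sketch below uses breadth-first search over a toy graph. The node names and edges are made up for illustration and are not bioDBnet's actual schema or API.

```python
from collections import deque
from typing import Dict, List, Optional


def conversion_path(graph: Dict[str, List[str]], source: str, target: str) -> Optional[List[str]]:
    """Shortest chain of identifier types from source to target, or None."""
    if source == target:
        return [source]
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already reached by a path at least as short
            previous[neighbor] = node
            if neighbor == target:
                # Walk back through the predecessors to rebuild the path.
                path = [neighbor]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical toy network of identifier types.
toy_graph = {
    "Affy ID": ["Gene ID"],
    "Gene ID": ["Affy ID", "UniProt Accession", "Pathway ID"],
    "UniProt Accession": ["Gene ID"],
    "Pathway ID": ["Gene ID"],
}
```

With this toy graph, `conversion_path(toy_graph, "Affy ID", "Pathway ID")` goes through `"Gene ID"`, which mirrors how a hub identifier can connect otherwise unrelated databases.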

Integration layers
● The first layer connects measurements through gene associations.
● The second layer recognizes feedback, interactions and network complexities, and builds on top of the first.

Java Web Start Version Overview
● WPS: an in-house pathway analysis, visualization, and data integration tool.
● Tertiary analysis for NGS: to WPS, NGS data is just another source of data from different platforms, in parallel with metabolome, microarray, proteomics data etc.
● Migration: NextGen WPS for NGS data and other high-throughput data.
● Whole pathway scope.

SLEPR and Pathway-level Pattern Extraction (PPEP)
● SLEPR: Yi and Stephens, PLoS ONE 2008, 3(9):e3288
● PPEP: Yi M, Mudunuri U, Che A, Stephens R, BMC Bioinformatics 2009, 10:200
● WPS: Yi et al, BMC Bioinformatics 2006, 7:30

BioCarta Pathways Uncovered by SLEPR but not by the Conventional Method, from a Breast Cancer Array Dataset (NCI). Significant under the SLEPR method but not under the conventional method.

BioCarta Pathway: Spliceosomal Assembly

GBrowse

IGV

Computational Integration in Biomarker Discovery: Testing and validation include mechanistic studies in mice and biomarker validation in the clinic.

Computational Integration across species. Mouse or clinical data are analyzed with feature selection and modeling. Network signatures, molecular signatures and candidate biomarkers are calculated.

SysBioCube

Bioinformatics Directions/Growth
● Data visualization: required to pull together and interpret the huge volumes of data now being produced.
● Data integration: signs of disease can often be diagnosed at different levels, so a “big picture” must be drawn for full understanding.
● Natural language processing: there are simply too many papers for humans to do all of the reading and comprehension.
● Controlled vocabularies: allow apples to be compared with apples.

Clinical Sequencing and Cancer
● Companies are already offering cancer diagnostic panels.
● Tied to proprietary in-house clinical variation databases.
● Connect sets of mutations to clinically actionable treatments and/or trials.
● Identify likely responders/non-responders.
● Issues: ethics, counseling, non-actionable targets.

Training in bioinformatics?
● The skill set needs to encompass aspects of both biological science and computer science.
● Direct access to relevant scientific questions through one's own research or close ties to the scientific community.
● Ability to adapt to new questions, applications and data types.
