Research Update, April 2006

David Wild
Assistant Professor of Chemical Informatics
Indiana University School of Informatics, Bloomington
djwild@indiana.edu

David Wild – Research Overview April 2006. Page 1

Overview
• Smart mining of drug discovery information
• Project goals
• Workflow examples & demonstrations
• Collaborations with scientists
• Workflow interoperability
• Data mining of the DTP tumor cell line dataset
• Fast clustering of PubChem using Divisive K-means & Linux clusters
• Distributed Drug Discovery for neglected diseases
• Visualization & end-user layer tools
• Usability of chemical informatics tools
• Collaboration areas with Peter Murray Rust group

Smart mining of drug discovery information
• Technique for making the large volumes and diverse sources of chemical & related information manageable for scientists
• Observation: many information needs of scientists are straightforward to state, but complex and time-consuming to implement
• This project aims to match information needs with use-cases and workflows of web services, along with imaginative human interfaces
• Supported by a Microsoft eScience grant

3-layer model

[Diagram: a request from the human interface passes through the agent and a use-case script to web services]
Request from Human Interface
AGENT / SMART CLIENT
• Parse request
• Select appropriate use cases and/or web service(s)
• Schedule as necessary
USE-CASE SCRIPT
• Invoke New Structure Service
• Convert structures to 3D
• Dock results & protein file
• Extract any hits
• Return links for visualization
UDDI (?) / WSDL / SOAP
Aggregate services
• New Structure Service
  • Search online databases for recent structures
  • Search local databases for recent structures
  • Merge results
Atomic services
• Online database (e.g. PubChem)
• Local database
• 3D docking tool
• 2D-3D converter
• 3D visualizer
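The use-case script in the diagram can be sketched as a short pipeline. This is only an illustration under stated assumptions: every function name, the dictionary-based "databases", the dummy docking score and the link format are hypothetical stand-ins for the real web services, which would be invoked over SOAP rather than as local functions.

```python
def new_structure_service(online_db, local_db):
    """Aggregate service (hypothetical): search both sources, merge results."""
    merged = {}
    for source in (online_db, local_db):
        for compound_id, smiles in source.items():
            merged.setdefault(smiles, compound_id)  # de-duplicate on SMILES
    return merged

def convert_to_3d(smiles):
    """Stand-in for the 2D-3D converter service; returns a placeholder."""
    return {"smiles": smiles, "coords": "3D-placeholder"}

def dock(structure, protein_file):
    """Stand-in for the docking service; the score is a dummy value."""
    return float(len(structure["smiles"]))

def use_case_script(online_db, local_db, protein_file, cutoff):
    """Invoke the services in the order listed in the use-case script."""
    structures = new_structure_service(online_db, local_db)
    hits = []
    for smiles, compound_id in structures.items():
        score = dock(convert_to_3d(smiles), protein_file)
        if score >= cutoff:                      # extract any hits
            hits.append((compound_id, score))
    hits.sort(key=lambda hit: -hit[1])
    return [f"view?id={cid}" for cid, _ in hits]  # links for visualization

online = {"A1": "CCO", "A2": "c1ccccc1"}
local = {"L1": "CCO", "L2": "CC(=O)O"}
links = use_case_script(online, local, "protein.pdb", cutoff=5)
```

In this sketch the agent layer would decide which script to run and when; only the script itself is shown.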


Web services implemented
• Database Services
  • Local DTP Tumor Cell Line Database
  • PDB Ligand Database
  • Distributed Drug Discovery Database
• OpenEye
  • FRED Docking
  • FILTER Property Calculation and Filtering
  • OMEGA 2D-3D Conversion
• BCI
  • Various BCI Clustering services
• VOTables
• InChIGoogle
• InChiServer
• CMLRSSServer
• CDK Web services
• Open Babel

• A 2D structure is supplied as input to the similarity search (in this case, the bound ligand extracted from the PDB 1Y4 complex).
• The workflow uses our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures.
• A protein implicated in tumor growth is supplied to the docking program (in this case, HSP90 taken from the PDB 1Y4 complex).
• Similar structures are filtered for druggability and automatically passed to the OpenEye FRED docking program for docking into the target protein.
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
• Correlating docking results with “biological fingerprints” across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds.

Workflow interoperability
• Taverna SCUFL <-> BPEL conversion
  • Working with Beth Plale & Dennis Gannon at IU Computer Science
• Use of developing data standards for Chemical Informatics
  • CML & InChI
  • XML metadata
• Interoperability of Taverna with other workflow systems
• Use of workflows in experiment execution environments
  • See http://www.extreme.indiana.edu/portals/index.shtml

DTP Tumor Cell Line Data Mining
• Collaboration with Melanie Wu, Database & Data Mining expert at the School of Informatics
• Local PostgreSQL database exposed as a web service
• Building on existing published data mining research on this dataset
• Current projects:
  • Comparing compound clusterings based on structure (MACCS keys) and “bioprint” (vector of screening results)
  • Investigating fingerprint and bioprint correlations with mechanisms of action (MOAs) of ~100 compounds (a correlation is definitely found)
  • Application of workflows to associate docking results with screening results
  • Collaboration with Dr. Faming Zhang at IU Department of Chemistry for mining of kinase-related information
• Next projects:
  • Correlation of structural and gene expression information (without naïve combination of screen & gene information)
  • Application of COMPARE
  • Integration into a wider oncology information system

Database architecture
• PostgreSQL database with gNova CHORD for structure & fingerprint searching, exposed as a web service
• Compound table contains ~200,000 compounds: SMILES, ID, properties, MACCS keys
• Screen tables contain GI50/LD50/TGI values; a gene expression table is in development
• Can search on a mix of structure and numeric / categorical data
• Active research into optimizing search efficiency
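To illustrate what a search on a mix of structure and numeric data involves, here is a minimal sketch in plain Python. It does not use the gNova CHORD or PostgreSQL APIs; the record layout, the field names (`fp`, `gi50`) and the thresholds are assumptions made for the example. Fingerprints are stored as integers whose set bits stand for keys such as MACCS keys.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints held as Python ints."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both
    total = bin(fp_a | fp_b).count("1")    # bits set in either
    return common / total if total else 0.0

def mixed_search(compounds, query_fp, min_similarity, max_gi50):
    """Return IDs that pass both the structural and the numeric filter."""
    return [
        c["id"]
        for c in compounds
        if tanimoto(c["fp"], query_fp) >= min_similarity
        and c["gi50"] <= max_gi50
    ]

# Hypothetical toy records; real fingerprints would be far longer.
compounds = [
    {"id": "C1", "fp": 0b1111, "gi50": 0.5},
    {"id": "C2", "fp": 0b1110, "gi50": 9.0},
    {"id": "C3", "fp": 0b0001, "gi50": 0.1},
]
result = mixed_search(compounds, query_fp=0b1110, min_similarity=0.7, max_gi50=1.0)
```

In a database setting both filters would normally be pushed into a single query so that the numeric condition can prune candidates before similarity is computed.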

Cluster Analysis and Chemical Informatics
• Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
• Organizational usage has not been as well studied as the other two, but see
  • Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward’s Clustering. Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
• Essentially helping large datasets become manageable
• Methods used:
  • Jarvis-Patrick and variants
    • O(N²), single partition
  • Ward’s method
    • Hierarchical, regarded as best, but at least O(N²)
  • K-means
    • < O(N²), requires a set number of clusters, a little “messy”
  • Sphere-exclusion (Butina)
    • Fast, simple, similar to JP
  • Kohonen network
    • Clusters arranged in 2D grid, ideal for visualization

Limitations of Ward’s for large datasets (>1m)
• Best algorithms have O(N²) time requirement (RNN)
• Requires random access to fingerprints
  • hence substantial memory requirements (O(N))
• Problem of selection of best partition
  • can select desired number of clusters
• Easily hit 4GB memory addressing limit on 32 bit machines
  • Approximately 2m compounds
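A rough back-of-envelope check of these figures, assuming the 4GB limit means 4 × 2^30 bytes and that the whole address space is available for per-compound data (both simplifications; the slide does not state the fingerprint size):

```python
def bytes_per_compound(address_limit_bytes, n_compounds):
    """Memory budget per compound if everything must fit in the address space."""
    return address_limit_bytes / n_compounds

def n_pairs(n_compounds):
    """Number of distinct compound pairs, the source of the O(N^2) cost."""
    return n_compounds * (n_compounds - 1) // 2

limit = 4 * 2**30            # assumed meaning of "4GB"
n = 2_000_000                # "approximately 2m compounds"

budget = bytes_per_compound(limit, n)   # roughly 2 KB per compound
pairs = n_pairs(n)                      # roughly 2 x 10^12 pairs
```

The pair count shows why quadratic time becomes the dominant concern at this scale even when the fingerprints themselves fit in memory.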

Divisive K-means Clustering
• New hierarchical divisive method
  • Hierarchy built from top down, instead of bottom up
  • Divide complete dataset into two clusters
  • Continue dividing until all items are singletons
  • Each binary division done using K-means method
• Originally proposed for document clustering
  • “Bisecting K-means”
  • Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
  • Found to be more effective than agglomerative methods
  • Forms more uniformly-sized clusters at given level
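The top-down procedure can be sketched in a few lines of Python. This is a minimal illustration, not the BCI implementation: it assumes points are numeric tuples, uses squared Euclidean distance, picks random seeds, always divides the largest cluster next, and falls back to an arbitrary half split if a K-means step leaves one side empty. All function names are made up for the example.

```python
import random

def two_means(points, rng, max_iter=20):
    """One binary division: plain K-means with k=2."""
    centroids = rng.sample(points, 2)            # random seed choice (assumption)
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[0 if d[0] <= d[1] else 1].append(p)
        if not groups[0] or not groups[1]:
            break                                 # degenerate split, caller handles it
        updated = [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]
        if updated == centroids:
            break                                 # converged
        centroids = updated
    return groups

def divisive_kmeans(points, n_clusters, seed=0):
    """Split top-down until n_clusters exist or only singletons remain."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        target = clusters.pop()                   # divide the largest cluster next
        if len(target) < 2:
            clusters.append(target)
            break
        left, right = two_means(target, rng)
        if not left or not right:                 # fallback: arbitrary half split
            mid = len(target) // 2
            left, right = target[:mid], target[mid:]
        clusters += [left, right]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = divisive_kmeans(points, 2)
```

Recording each split as it happens, rather than only the final list, would give the full hierarchy from which a partition at any level can be selected.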

BCI Divkmeans
• Several options for detailed operation
  • Selection of next cluster for division
    • size, variance, diameter
    • affects selection of partitions from hierarchy, not shape of hierarchy
  • Options within each K-means division step
    • distance measure
    • choice of seeds
    • batch-mode or continuous update of centroids
    • termination criterion
• Have developed MPI parallel version for Linux clusters / grids in conjunction with BCI (now Digital Chemistry)
• For more information, see Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm
• Now available as a web service at IU (along with other BCI programs)

Comparative execution times
NCI subsets, 2.2 GHz Intel Celeron processor
[Chart; run times shown: 7h 27m, 3h 06m, 2h 25m, 44m]

MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem Compound, Nov 2005)

Distributed Drug Discovery
• Project run by Dr. Bill Scott at IUPUI
• Tackling neglected diseases using distributed chemistry (while educating undergraduates about combinatorial chemistry)
• Each student makes 4 compounds on cheap equipment. Each class will typically make around 60 compounds. Many universities around the world are participating
• Reaction transformations, virtual compounds and made compounds are stored in a PostgreSQL database exposed as a web service
• This information can then be drawn into our workflows. For example, searches for similar compounds can be done on PubChem, the Tumor Cell Line database, etc.

Distributed Drug Discovery
William L. Scott
A Distributed Drug Discovery Concept to Search for Developing World Disease Drug Leads

Visualization and end-user tools
• PubChemSR
• 2D structure visualizer using CDK
• VoPlot
• VisualiSAR - modal fingerprints
• Similarity Matrix Visualization
• General approaches to end user tools
  • Portlets and .NET
  • Usability & Contextual Design

PubChemSR (Junguk Hur)
http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR

Simple 2D viewer applet (using CDK) - David Jiao

VoPlot

VisualiSAR - modal fingerprints with a nod to Edward Tufte.
See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

Visual Similarity Matrices
Student: Christopher Mueller
Data: NCI Compound Database - Compounds with positive AIDS screens
[Figure panels, one per vertex ordering: Original (curated), Degree, Breadth-first Search, Sloan’s Algorithm]
• Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
• Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
• In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include SIMD code optimized for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
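A small sketch of how a vertex ordering changes the picture, using the breadth-first ordering named among the panels. It assumes the similarity graph is given as a dictionary mapping each vertex to the set of its neighbours, and it renders the matrix as text; the function names and the toy graph are illustrative only and are not taken from the libraries described above.

```python
from collections import deque

def bfs_order(adjacency):
    """Breadth-first vertex ordering, restarting for each unvisited component."""
    order, seen = [], set()
    for root in sorted(adjacency):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adjacency[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order

def dot_matrix(adjacency, order):
    """Rows and columns follow `order`; '#' marks an edge or the diagonal."""
    return [
        "".join("#" if (u == v or v in adjacency[u]) else "." for v in order)
        for u in order
    ]

# Toy graph: two similar pairs, (0, 3) and (1, 2).
graph = {0: {3}, 1: {2}, 2: {1}, 3: {0}}
original = dot_matrix(graph, sorted(graph))
reordered = dot_matrix(graph, bfs_order(graph))
```

With the original ordering the dots are scattered, while the breadth-first ordering gathers each connected pair into a block on the diagonal, which is the kind of structure a reordering is meant to expose.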

General approaches to end-user tools
• Main interface-level vehicle should be portlets, allowing reuse and interchangeability
• Other interfaces, such as .NET clients, email and RSS interfaces, will also be investigated
• No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
• Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
• Possibility of multiple interfaces for different groups of people (Cooper’s “primary personas”)
• Don’t assume the browser interface: email / NLP?
• Start with the basics
  • 2D chemical structure drawing (input)
  • Visualization of large numbers of chemical structures in 2D
  • 3D chemical structure visualization
• Current project is looking at usability of online chemical databases (including PubChem)

Usability of 2D structure drawing tools
• Key difference between “sequential” and “random” drawers
• Huge difference in intuitiveness
• Key factor: how badly you can mess things up
• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw

Cambridge-Indiana Collaboration
• Weekly Access Grid meetings
• Bringing together areas of expertise in the UK and USA
• Applying OSCAR text mining to NIH data
• Looking toward joint presentations & publications


Contributors
• My students: Xiao Dong, Huijung Wang, Jason Lee, Junguk Hur, David Jiao, Usha Cheemakurthi, Waiping Kam
• Geoffrey’s group at CGL: Marlon Pierce, Jake Kim, Sima Patel, Smitha Ajay
• Others: Gary Wiggins, Melanie Wu, Dennis Gannon, Beth Plale, Rajarshi Guha, Peter Murray Rust, Peter Corbett, Dan Zaharevitz
