Microsoft Cloud Futures Conference, Redmond, 8th April 2010. Cloud Computing for Chemical Property Prediction. Paul Watson, School of Computing Science, Newcastle University, UK. Paul.Watson@ncl.ac.uk. The team: David Leahy, Jacek Cala, Hugo Hiden,
Microsoft Cloud Futures Conference, Redmond, 8th April 2010
Cloud Computing for Chemical Property Prediction
Paul Watson, School of Computing Science, Newcastle University, UK
Paul.Watson@ncl.ac.uk
• The team: David Leahy, Jacek Cala, Hugo Hiden, Dominic Searson, Vladimir Sykora, Martyn Taylor, Simon Woodman
• With thanks to:
• Microsoft External Research for their financial support for Project Junior
• Christophe Poulain, Savas Parastatidis
Chemists want to know:
Q1. What are the properties of this molecule? (Toxicity, biological activity, solubility)
Q2. What molecule would have an aqueous solubility of 0.1 μg/mL?
Answering the questions by performing experiments is time-consuming and expensive, and raises ethical issues.
An alternative to experimentation: QSAR
Quantitative Structure Activity Relationship: predict properties based on similar molecules
Activity ≈ f(quantifiable structural attributes), e.g. number of atoms, logP, shape, .....
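To make "Activity ≈ f(structural attributes)" concrete, here is a minimal sketch in which f is a k-nearest-neighbour average, one simple way of predicting from similar molecules. It is not the Discovery Bus method; the descriptor values, the activities, and the names `TRAINING` and `predict_activity` are all made up for illustration.

```python
import math

# Hypothetical training set: (descriptor vector, measured activity).
# Descriptors are (atom count, logP); every number here is invented.
TRAINING = [
    ((10.0, 1.0), 2.0),
    ((12.0, 1.5), 3.0),
    ((30.0, 4.0), 9.0),
    ((32.0, 4.5), 11.0),
]

def predict_activity(descriptors, training=TRAINING, k=2):
    """Estimate activity as the mean activity of the k most similar molecules."""
    # Similarity is plain Euclidean distance in descriptor space (an assumption).
    ranked = sorted(training, key=lambda row: math.dist(row[0], descriptors))
    nearest = ranked[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

print(predict_activity((11.0, 1.2)))  # 2.5, the mean of the two small molecules
```

A real model would scale the descriptors first, since atom count would otherwise dominate the distance.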
Generating the models: Discovery Bus (Leahy et al.)
[Diagram: Data + Model-Builders → Models (www.openqsar.com); New Data or Model-Builders → Model Generation → New/Improved Models]
[Workflow diagram]
Chemical Structures & their Activities
→ Separate Training & Test Data (Training Data, Test Data)
→ Calculate Descriptors from Structures (Descriptors + Responses)
→ Combine Descriptors (Combined Descriptors + Responses)
→ Filter Descriptors (Selected Descriptors + Responses)
→ Build & Test Models Independently: Multiple Linear Regression, Neural Network, Partial Least Squares, Classification Trees, .....
→ Select Best Models
→ Add to Model Database
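The first stages of this workflow can be sketched as follows. This is a toy illustration under stated assumptions, not the Discovery Bus code: the function names, the 25% test fraction, and the variance-based filter are all hypothetical choices.

```python
import random
import statistics

def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then split records into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def filter_descriptors(rows, min_variance=1e-9):
    """Return the indices of descriptor columns that actually vary across molecules."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns)
            if statistics.pvariance(col) > min_variance]

# Three molecules, three descriptors; the middle descriptor is constant.
rows = [(1.0, 5.0, 2.0), (2.0, 5.0, 2.5), (3.0, 5.0, 1.0)]
print(filter_descriptors(rows))  # [0, 2]
```

Dropping constant columns is only one possible filter; the slides do not say which criteria the real workflow applies.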
Increasing amounts of data for model building...
• CHEMBL: data on 622,824 compounds, collected from 33,956 publications
• WOMBAT: data on 251,560 structures, for over 1,966 targets
• WOMBAT-PK: data on 1,230 compounds, for over 13,000 clinical measurements
All contain structure information & numerical activity data
More models, better models
• Computationally expensive:
• 5 years for new datasets on existing server
JUNIOR Project Aim
Use Azure to generate models in weeks, not years
.... using as much of the available data as possible
.... make models available on www.openqsar.com
.... so that researchers can generate predictions for their own molecules
Potential for concurrency...
[Workflow diagram]
Chemical Structures & their Activities
→ Separate Training & Test Data (Training Data, Test Data)
→ Calculate Descriptors from Structures (Descriptors + Responses)
→ Combine Descriptors (Combined Descriptors + Responses)
→ Filter Descriptors (Selected Descriptors + Responses)
→ Build & Test Models Independently: Multiple Linear Regression, Neural Network, Partial Least Squares, Classification Trees, .....
→ Select Best Models
→ Add to Model Database
Approach
• avoid rewriting all existing Discovery Bus software
• move existing Discovery Bus to Amazon Cloud
• without parallelisation
• move critical tasks to run concurrently on Azure
• base solution around e-Science Central ....
Clouds to the rescue?
• Building scalable, dependable science applications is still hard .....
• e-Science Central
• Science Cloud Platform
Science Cloud Options
[Diagram, option 1: Users → Science App 1 .... Science App n → Cloud Infrastructure: Storage & Compute]
[Diagram, option 2: Users → Science App 1 .... Science App n → Science Cloud Platform → Cloud Infrastructure: Storage & Compute]
e-Science Central
• Science as a Service, for users
• Science Cloud Platform, for developers
[Diagram: Users → Science App 1 .... Science App n → Science Cloud Platform → Cloud Infrastructure: Storage & Compute]
What should the Science Cloud Platform include?
Identify the common needs of our e-Science users
Data (instruments, experimental data, sensors...)
[Architecture diagram: e-Science Central]
Apps: App .... App, connecting through the e-Science Central API
Science Cloud Platform: Security, Social Networking, Provenance, Workflow Enactment, Metadata, Analysis Services
Cloud Infrastructure: Processing, Storage
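[Architecture diagram: Discovery Bus with e-Science Central]
Amazon: Discovery Bus Planner, connecting through the e-Science Central API (alongside other Apps)
e-Science Central: Security, Social Networking, Provenance, Workflow Enactment, Metadata, Analysis Services
Azure: Processing, Storage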
1 Discovery Bus invokes e-Science Central Workflow via API
2 Workflow decomposed to Message Plan
3 Temporary workflow storage assigned, Message Plan queued for execution
4 Messages sent in sequence:
• Call Message → Internal Service (RMI / JMS, NFS) → Response Message
• Call Message → Azure Service (HTTP, HTTP Post) → Response Message
• Workflow temporary storage
5 Workflow Execution Completes
• Discovery Bus notified with results
• Results data stored in e-Science Central folder
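The sequence above can be sketched as a small message-plan executor. This is a toy model, not the e-Science Central API: `CallMessage`, `execute_plan`, and the transport callables are hypothetical, and plain Python functions stand in for the RMI/JMS and HTTP transports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CallMessage:
    service: str    # hypothetical service name
    transport: str  # "internal" (RMI/JMS on the slide) or "azure" (HTTP)

def execute_plan(plan: List[CallMessage],
                 transports: Dict[str, Callable[[str, dict], dict]],
                 storage: dict) -> dict:
    """Send each call message in sequence; every response updates the shared storage."""
    for message in plan:
        send = transports[message.transport]
        response = send(message.service, storage)  # blocks until the response arrives
        storage.update(response)                   # stand-in for workflow temporary storage
    return storage

# Fake transports that just record what they were asked to do.
log = []
def internal(service, data):
    log.append(("internal", service))
    return {service: "done"}
def azure(service, data):
    log.append(("azure", service))
    return {service: "done"}

plan = [CallMessage("split", "internal"), CallMessage("descriptors", "azure")]
print(execute_plan(plan, {"internal": internal, "azure": azure}, {}))
```

Keeping the transport behind a lookup table means a task can move from an internal service to an Azure service without changing the plan itself.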
[Deployment diagram: e-Science Central and Azure]
• Blob Storage
• Web Node
• Worker Node (×3)
• Results Queue
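A rough single-machine analogue of this layout is sketched below: a dictionary stands in for blob storage, threads for worker nodes, and in-memory queues for the Azure queues. The task queue is an assumption, since the slide labels only a results queue, and summing numbers is a placeholder for a real calculation.

```python
import queue
import threading

def worker(tasks, results, blob_storage):
    """Worker node: take a task key, read its input from storage, post the result."""
    while True:
        key = tasks.get()
        if key is None:  # sentinel: no more work for this worker
            break
        results.put((key, sum(blob_storage[key])))  # placeholder computation

def run(blob_storage, n_workers=3):
    """Web node role: enqueue one task per stored input, then collect all results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results, blob_storage))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for key in blob_storage:
        tasks.put(key)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(blob_storage)))

print(run({"a": [1, 2], "b": [3, 4, 5]}))
```

Workers share nothing except the queues and read-only storage, so adding workers needs no change to the task code.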
Current Status – Work in Progress
• Running across up to 100 Azure nodes
• Azure utilisation increasing (average ~60% over runs)
• Moving more admin tasks to Azure / e-Science Central
• model validation
• co-ordination
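As a back-of-envelope check on the "weeks not years" aim, the figures from the slides can be combined under idealised assumptions. The sketch assumes that one Azure node is as fast as the existing server, that the work divides perfectly, and that only compute time matters. None of these assumptions comes from the talk, so the result is an optimistic bound and not a measured runtime.

```python
def ideal_weeks(serial_years, nodes, utilisation):
    """Idealised wall-clock weeks if the work divides perfectly across busy nodes."""
    return serial_years * 52 / (nodes * utilisation)

# 5 years on the existing server, 100 nodes, ~60% average utilisation.
print(round(ideal_weeks(5, 100, 0.6), 1))  # 4.3
```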
Summary
• Discovery Bus exemplifies a good Cloud pattern
• large, variable, bursty requirements
• proposal to apply to software verification
• clouds do NOT make it easier to build complex, scalable, dependable distributed systems
• we need higher-level “Science Cloud Platforms”
• e-Science Central is our attempt at this
• using the Azure Cloud we have a scalable system that can handle large new datasets
• the models are being made freely available
• www.openqsar.com