
Can we Predict Anything Useful from 2-D Molecular Structure?

Dr John Mitchell, Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, U.K.


Presentation Transcript


  1. Can we Predict Anything Useful from 2-D Molecular Structure? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.

  2. We look at data, analyse data, use data to find correlations ... ... to develop models ... ... and to make (hopefully) useful predictions. Let’s look at some data ...

  3. New York Times, 4th October 2005.

  4. Happiness ≈ (GNP/$5000) − 1. Poor fit to linear model.

  5. Outliers? Happiness (GNP/$5000) -2

  6. Fitting with a curve: reduce RMSE

  7. Outliers? Different linear models for different regimes

  8. Only one obvious (to me) conclusion: this area is empty. No country is both rich and unhappy; all other combinations are observed. Happiness (GNP/$5000) -2

  9. ... but this is nothing to do with 2-D molecular structure

  10. QSPR • Quantitative Structure-Property Relationship • Physical property related to more than one other variable • First examples from Hansch et al. in the 1960s • General form (for non-linear relationships): y = f(descriptors)

  11. QSPR Y = f(X1, X2, ..., XN) • Optimisation of Y = f(X1, X2, ..., XN) is called regression. • Model is optimised upon N “training” molecules and then tested upon M “test” molecules.
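The train-then-test workflow on this slide can be sketched in a few lines. This is a minimal illustration, assuming a linear form for f and ordinary least squares solved through the normal equations; the function names and the toy data are invented for the example and are not from the talk.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares fit of y = b0 + b1*x1 + ... using the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 gives the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs: Sequence[float], X: Sequence[Sequence[float]]) -> List[float]:
    """Apply the fitted coefficients to new descriptor rows."""
    return [coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], r)) for r in X]


# Toy "training molecules": two made-up descriptors, y = 1 + 2*x1 + 3*x2.
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_train = [1, 3, 4, 6, 8]
coeffs = fit_mlr(X_train, y_train)

# Toy "test molecules", never seen during fitting.
X_test = [[2, 2], [3, 0]]
y_pred = predict(coeffs, X_test)
```

The point of the separate test rows is that the quality of the model is assessed on molecules that played no part in the optimisation.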

  12. QSPR • Quality of the model is judged by three parameters:
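The three parameters themselves did not survive in this transcript. The results reported later in the deck are given as RMSE, r2 and Bias, so the sketch below assumes those are the three. The exact definitions are also assumptions: r2 is taken as the squared Pearson correlation between observed and predicted values, and bias as the mean of (predicted − observed).

```python
import math
from typing import Sequence


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient (one common definition of r2)."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return (sxy * sxy) / (sxx * syy)


def bias(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean signed error; the sign convention (pred - obs) is an assumption."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


# Made-up numbers: every prediction is 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
stats = (rmse(observed, predicted), r_squared(observed, predicted), bias(observed, predicted))
```

In this toy case the correlation is perfect even though every prediction is shifted, which is why a correlation measure alone is not enough and an error measure and a bias are reported alongside it.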

  13. QSPR • Different methods for carrying out regression: • LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. • NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

  14. QSPR • However, this does not guarantee a good predictive model…

  15. QSPR • Problems with experimental error. • A QSPR equation is only as accurate as the data it is trained upon. • Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).

  16. QSPR • Problems with “chemical space”. • “Sample” molecules must be representative of the “Population”. • Prediction results will be most accurate for molecules similar to the training set. • Global or local models?

  17. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.

  18. Drug Disc. Today, 10 (4), 289 (2005). Cohesive interactions in the lattice reduce solubility. Predicting lattice (or almost equivalently sublimation) energy should help predict solubility.

  19. Relationship of Chemical Structure With Lattice Energy: Can we predict lattice energy from molecular structure? Dr Carole Ouvrard & Dr John Mitchell, Unilever Centre for Molecular Informatics, University of Cambridge. C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

  20. Why Do We Need a Predictive Model? • A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials • From 2-D molecular structure only • Without knowing the crystal packing • Without expensive theoretical calculations • Should help predict solubility.

  21. Why Do We Think it Will Work? • Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule. • Many molecules have a plurality of different experimentally observable polymorphs. • We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

  22. [Figure: calculated lattice energy (kJ/mol, axis from about -92.0 to -98.0) plotted against density (g/cc, about 1.40 to 1.60) for candidate crystal packings in space groups P-1, P212121, P21/c and P21. The experimental crystal structure and the calculated lowest energy structure are marked.]

  23. Expression for the Lattice Energy • U crystal = U molecule + U lattice • Theoretical lattice energy • Crystal binding = Cohesive energy • Experimental lattice energy is related to −ΔH sublimation: ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
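As a worked illustration of the size of the 2RT term, assume room temperature (T = 298.15 K, not stated on the slide) and a purely hypothetical lattice energy of −96.0 kJ/mol:

```latex
\begin{align*}
2RT &= 2 \times 8.314\ \mathrm{J\,mol^{-1}\,K^{-1}} \times 298.15\ \mathrm{K}
      \approx 4.96\ \mathrm{kJ\,mol^{-1}} \\
\Delta H_{\mathrm{sublimation}} &= -U_{\mathrm{lattice}} - 2RT
      = -(-96.0) - 4.96 \approx 91.0\ \mathrm{kJ\,mol^{-1}}
\end{align*}
```

Under these assumptions the correction is about 5 kJ/mol, which is small next to the lattice energy itself.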

  24. Partitioning of the Lattice Energy • U crystal = U molecule + U lattice • ΔH sublimation = −U lattice − 2RT • Partitioning the lattice energy in terms of structural contributions • Choice of the significant parameters • Number of atoms of each type? • Number of rings, aromatics? • Number of bonds of each type? • Symmetry? • Hydrogen bond donors and acceptors? Intramolecular? • We choose counts of atom type occurrences.

  25. Analysis of the Sublimation Energy Data • Experimental data: ΔHsublimation, taken from NIST (National Institute of Standards and Technology, USA) and the scientific literature • Atom types: SATIS codes (10-digit connectivity code + bond types); each 2-digit code = atomic number • Examples: HN = 01 07 99 99 99; HO = 01 08 99 99 99; O=C = 08 06 99 99 99; -O- = 08 06 06 99 99 • Typically, several similar SATIS codes are grouped to define an atom type. • Statistical analysis: Multi-Linear Regression Analysis relating ΔHsub to the number of atoms of each type
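The four example codes on this slide suggest a simple layout: the atomic number of the atom itself, then the atomic numbers of its bonded neighbours, padded with 99 to five two-digit fields. The sketch below reproduces the HN, HO and -O- examples under that reading. The function name, the ascending ordering of neighbours and the limit of four neighbours are assumptions, and the bond-type part of the code is not modelled.

```python
from typing import Sequence


def satis_connectivity(center_z: int, neighbour_zs: Sequence[int]) -> str:
    """Build a 10-digit connectivity string from atomic numbers.

    Hypothetical helper: neighbour ordering (ascending) is an assumption,
    and bond types are ignored.
    """
    if len(neighbour_zs) > 4:
        raise ValueError("at most four neighbours fit in five two-digit fields")
    fields = [center_z] + sorted(neighbour_zs)
    fields += [99] * (5 - len(fields))  # 99 marks an unused neighbour slot
    return " ".join(f"{z:02d}" for z in fields)


hn_code = satis_connectivity(1, [7])        # hydrogen bonded to nitrogen
ho_code = satis_connectivity(1, [8])        # hydrogen bonded to oxygen
ether_code = satis_connectivity(8, [6, 6])  # oxygen bonded to two carbons
```

Grouping several such codes into one atom type, as the slide describes, would then be a lookup from code to type name.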

  26. Training Dataset of Model Molecules • 226 organic compounds (cumulative totals in parentheses): 19 linear alkanes (19), 14 branched alkanes (33), 17 aromatics (50), 106 other non-H-bonders (156), 70 H-bond formers (226) • Non-specific interacting: hydrocarbons; nitrogen compounds; nitro-, CN, halogen, S and Se substituents; pyridine • Potential hydrogen bonding interactions: amides, carboxylic acids, amino acids…

  27. Study of Non-specific Interactions: Linear Alkanes • 19 compounds: CH4 to C20H42 • Limit for van der Waals interactions • ΔHsub = 7.955 C − 2.714 (C = number of carbon atoms) • r2 = 0.977 • s = 7.096 kJ/mol • Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems. • Note the odd-even variation in ΔHsub for this series. [Figure: boiling point (BPt) and ΔHsub for the linear alkanes.]

  28. Include Branched Alkanes • Add 14 branched alkanes to the dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents. • 33 compounds: CH4 to C20H42 • ΔHsub = 7.724 Cnonbranched + 3.703 • r2 = 0.959 • s = 8.117 kJ/mol • If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

  29. All Hydrocarbons: Include Aromatics • Add 17 aromatics to the dataset (note: we have no alkenes or alkynes). • 50 compounds • ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162 aliphatic • r2 = 0.958 • s = 7.478 kJ/mol • As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

  30. All Non-Hydrogen-Bonded Molecules • Add 106 non-hydrocarbons to the dataset. Include elements H, C, N, O, F, S, Cl, Br & I. • 156 compounds • ΔHsub predicted by 16 parameter model • r2 = 0.896 • s = 9.976 kJ/mol • Parameters in model are counts of atom type occurrences.

  31. General Predictive Model • Add 70 hydrogen bond forming molecules to the dataset. • 226 compounds • ΔHsub predicted by 19 parameter model • r2 = 0.925 • s = 9.579 kJ/mol • Parameters in model are counts of atom type occurrences.

  32. Predictive Model Determined by MLRA • ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 − 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether aliphatic • All these parameters are significantly larger than their standard errors.
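Because the model is a constant plus a weighted sum of atom-type counts, applying it is a dictionary lookup once the counts are known. The sketch below copies the coefficients from the slide; the key spellings and the function name are illustrative. Assigning atoms to types follows the grouped SATIS codes, which this transcript does not spell out, so the counts must be supplied by hand. The word "aliphatic" that trails the last term in the slide text is ambiguous and is not represented here.

```python
from typing import Dict

INTERCEPT = 6.942  # kJ/mol

# Coefficients (kJ/mol per atom of each type), copied from the slide.
COEFFS: Dict[str, float] = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}


def predict_sublimation_enthalpy(counts: Dict[str, int]) -> float:
    """Estimate the sublimation enthalpy in kJ/mol from atom-type counts."""
    unknown = set(counts) - set(COEFFS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFS[t] * n for t, n in counts.items())


# Illustration only: n-decane, assuming all ten carbons count as Cnonbranched.
decane_estimate = predict_sublimation_enthalpy({"Cnonbranched": 10})
```

Hydrogens bonded to carbon carry no term in the equation, so only the carbon count enters for an alkane.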

  33. Distribution of Residuals • The residuals between calculated and experimental data are approximately normally distributed, as expected.

  34. Validation on an Independent Test Set • 35 diverse compounds • r2 = 0.928 • s = 7.420 kJ/mol • Very encouraging result: accurate prediction possible. • Nitro-compounds are often outliers.

  35. Conclusions • We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of approximately 9 kJ/mol. • A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies. • Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing. • Avoids need for expensive calculations. • May help predict solubility. • Model gives good chemical insight.

  36. A Chemoinformatics Approach To Predicting the Aqueous Solubility of Pharmaceutical Molecules David Palmer & Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.

  37. Pfizer Project P13: Novel Methods for Predicting Solubility • David Palmer • Dr Antonio Llinàs • Pfizer Institute for Pharmaceutical Materials Science • http://www.msm.cam.ac.uk/pfizer

  38. Datasets • Compiled from Huuskonen dataset and AquaSol database • All molecules solid at room temperature • n = 1000 molecules • Aqueous solubility: the thermodynamic solubility in unbuffered water (at 25 °C)

  39. Diversity-Conserving Partitioning • MACCS Structural Key fingerprints • Tanimoto coefficient • MaxMin Algorithm • Full dataset: n = 1000 molecules • Training: n = 670 molecules • Test: n = 330 molecules
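The MaxMin idea can be sketched as follows: starting from one molecule, repeatedly pick the molecule whose nearest already-picked neighbour is furthest away, using 1 − Tanimoto as the distance. This is a minimal version under stated assumptions: fingerprints are represented as sets of "on" bit positions, the toy fingerprints are invented rather than real MACCS keys, and the slide does not say how the picked molecules were assigned to the training and test sets.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_select(fps: Sequence[FrozenSet[int]], n_pick: int, seed_index: int = 0) -> List[int]:
    """Return indices of a diverse subset chosen by the MaxMin heuristic."""
    picked = [seed_index]
    # min_dist[i] = distance from molecule i to its nearest picked molecule
    min_dist = [1.0 - tanimoto(fp, fps[seed_index]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[nxt]))
    return picked


# Toy fingerprints: two loose clusters, {1,2,3,4}-like and {7,8,9}-like.
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8}),
]
diverse = maxmin_select(toy_fps, n_pick=3)
```

On the toy data the second pick jumps to the other cluster, which is the behaviour that keeps both subsets of a partition spread across the dataset.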

  40. Structures & Descriptors • 3D structures from Concord • Minimised with MMFF94 • MOE descriptors 2D/3D • Separate analysis of 2D and 3D descriptors • QuaSAR Contingency Module (MOE) • 52 descriptors selected

  41. Multi-Linear Regression • Log S = 0.07 nHDon (±0.018) − 0.21 TPSA (±0.033) + 0.11 MAXDP (±0.022) − 0.22 n.Ct (±0.019) − 0.29 KierFlex (±0.032) − 0.59 SLOGP (±0.036) − 0.26 ATS2m (±0.026) + 0.25 RBN (±0.033) • We can do better than this with other methods ...

  42. Two More Methods of Prediction • (1) Random Forest handles both selection and regression. • (2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression. • (2b) Support Vector Machine regression was repeated with “intelligent trial and error” selection.

  43. Random Forest: Introduction • Introduced by Breiman and Cutler (2001) • Development of Decision Trees (Recursive Partitioning): • Dataset is partitioned into consecutively smaller subsets (of similar solubility) • Each partition is based upon the value of one descriptor • The descriptor used at each split is selected so as to minimise the MSE
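A single partitioning step can be sketched as an exhaustive search over descriptors and thresholds for the split that leaves the two subsets with the smallest total squared error, which is equivalent to minimising the size-weighted MSE of the children. The function names and the toy data are illustrative, not from the talk.

```python
from typing import List, Optional, Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty subset)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    X: Sequence[Sequence[float]], y: Sequence[float]
) -> Optional[Tuple[int, float, float]]:
    """Return (descriptor index, threshold, total squared error) of the best split."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            err = sse(left) + sse(right)
            if best is None or err < best[2]:
                best = (j, thr, err)
    return best


# Toy data: descriptor 0 separates soluble from insoluble; descriptor 1 is noise.
X_toy: List[List[float]] = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y_toy = [-1.0, -1.2, -4.0, -4.1]  # made-up log-solubility values
split = best_split(X_toy, y_toy)
```

Recursive partitioning applies the same search again to each of the two subsets until a stopping rule, such as a minimum node size, is reached.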

  44. Random Forest: Method • Random Forest is a collection of Decision Trees grown with the CART algorithm. • Standard Parameters: • 500 decision trees • No pruning back: Minimum node size > 5 • “mtry” descriptors tried at each split Important features: • Incorporates descriptor selection • Incorporates “Out-of-bag” validation
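Out-of-bag validation falls out of the bootstrap: each tree is grown on a resample of the training set, so the molecules left out of that resample can be predicted by that tree as if they were unseen. The sketch below is a deliberately reduced stand-in, not the CART-based method on the slide: each "tree" is a single split (a stump), mtry descriptors are drawn once per tree because there is only one split, and all names and data are invented for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple

Stump = Tuple[int, float, float, float]  # descriptor, threshold, left mean, right mean


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[float], features: Sequence[int]) -> Stump:
    """One-split regression tree restricted to the given descriptor indices."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lm, rm)
    if best is None:  # no split possible: predict the mean everywhere
        mean = sum(y) / len(y)
        return (features[0], float("inf"), mean, mean)
    return best[1:]


def oob_predictions(X, y, n_trees: int = 500, mtry: int = 1, seed: int = 0) -> List[Optional[float]]:
    """Average, for each molecule, the predictions of trees that never saw it."""
    rng = random.Random(seed)
    n, n_desc = len(X), len(X[0])
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]    # bootstrap resample
        features = rng.sample(range(n_desc), mtry)     # random descriptor subset
        j, thr, lm, rm = fit_stump([X[i] for i in boot], [y[i] for i in boot], features)
        in_bag = set(boot)
        for i in range(n):
            if i not in in_bag:                        # molecule i is out-of-bag
                sums[i] += lm if X[i][j] <= thr else rm
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data: y depends on descriptor 0 only; descriptor 1 is uninformative.
X_demo = [[float(i), float((i * 7) % 5)] for i in range(10)]
y_demo = [2.0 * row[0] for row in X_demo]
oob = oob_predictions(X_demo, y_demo, n_trees=200, mtry=1, seed=0)
```

Comparing these out-of-bag predictions with the observed values gives an error estimate without setting aside a separate validation set, which is what the (oob) statistics in the results refer to.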

  45. Random Forest: Results • Training set: RMSE(tr)=0.27, r2(tr)=0.98, Bias(tr)=0.005 • Out-of-bag: RMSE(oob)=0.68, r2(oob)=0.90, Bias(oob)=0.01 • Test set: RMSE(te)=0.69, r2(te)=0.89, Bias(te)=-0.04
