
Project Prospect and the Semantic Web

Colin Batchelor, Royal Society of Chemistry, Cambridge, UK (batchelorc@rsc.org). Topics: who we are; what we’ve done; motivation; means; the InChI and the Semantic Web; ontology development for chemistry.


Presentation Transcript


  1. Project Prospect and the Semantic Web • Colin Batchelor, Royal Society of Chemistry, Cambridge, UK • batchelorc@rsc.org

  2. Project Prospect and the Semantic Web • Who we are • What we’ve done • Motivation • Means • The InChI and the Semantic Web • Ontology development for chemistry: RXNO and MOP

  3. Who we are

  4. Royal Society of Chemistry: Advancing the Chemical Sciences • Learned and professional society • Scientific publisher • 25 journals, 8 databases and a growing book program • 8000 articles yearly • Covering a broad spectrum of chemical sciences from systems biology (Molecular BioSystems) to physical and theoretical chemistry (PCCP)

  5. What we’ve done

  6. The motivation

  7. The motivation • Scientific papers are formulaic and consistently structured (but not necessarily IMRD: see later) • There may be infinitely many possible chemical compounds BUT • Nomenclature is productive and susceptible to machine parsing

  8. The means

  9. The means: how publishing really works

  10. Data capture • Editing and proof-reading

  11. Enhanced HTML • Database • Text mining (Oscar) • Manual QA • Enhanced RSS

  12. Regular polysemy … where words stand for multiple things in a consistent way. Examples: • Brand names • Grinding • Figure–ground • Exact–class–part polysemy in chemistry Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

  13. Regular polysemy • Brand names: “Learning to buy a Renault and talk to BMW” • Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind” vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”

  14. Regular polysemy Figure–ground • Audrey Hepburn painted the door (figure) • Audrey Hepburn walked through the door (ground) • The Incredible Hulk walked through the door (ambiguous)

  15. Imidazole

  16. An imidazole

  17. The imidazole side-chain/group/ring/etc.

  18. Can ChEBI handle this? • Imidazoles (!) (CHEBI:24780) • Imidazole (CHEBI:16069) • Imidazole ring: not yet • Imidazolyl group: not yet (but methyl, benzyl, etc.) • … and there are no disambiguation cues

  19. Disambiguation • One Sense per Discourse (Gale et al. 1992): … this doesn’t hold at all • One Sense per Collocation (Yarowsky 1993): … matches our intuitions

  20. Disambiguation: toy model CLASS: • w(–1) = a, an, the, this • w(0) plural (bit of a cheat, as not a collocation) PART: • w(–1) = bridging, terminal • w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more) • w(+1)w(+2) = “building block”, “protecting group”, “side chain”
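The toy model on this slide could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the cue lists are copied from the slide (which notes there are many more w(+1) cues), while the function name `classify`, the rule that PART cues take precedence over CLASS cues, the `EXACT` fallback label and the trailing-"s" plural test are all assumptions made for the example.

```python
# Cue words taken from the slide; the slide says the w(+1) list is incomplete.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i].

    Returns 'PART', 'CLASS' or 'EXACT' (the fallback label is an assumption).
    """
    def at(j):
        # Lower-cased token at position j, or "" when j is out of range.
        return tokens[j].lower() if 0 <= j < len(tokens) else ""

    word, prev, nxt, nxt2 = at(i), at(i - 1), at(i + 1), at(i + 2)

    # Assumption: PART cues win, so "the imidazole side chain" is a part
    # even though it is preceded by a determiner.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Crude plural test (assumption): the name ends in "s".
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "EXACT"


if __name__ == "__main__":
    for text, index in [("an imidazole", 1),
                        ("the imidazole side chain", 1),
                        ("imidazoles are common", 0),
                        ("imidazole was added", 0)]:
        print(text, "->", classify(text.split(), index))
```

Whitespace tokenisation is used only to keep the sketch short; hyphenated forms such as "side-chain" would need a real tokeniser.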

  21. Why is this hard? • Coordination resolution • Part of speech ambiguity: tosylates; noun or verb?

  22. Why is this hard? How many numbered compounds are actually named in a given paper? • iloprost (1) • tributyl-1-hexynylstannane (2) • the desired 2-heptyne (3) • methyl–Pd(II) iodide 4 or 4′ • alkynylstannane 5 • the hypervalent stannate 6 • (alkynyl)(methyl)Pd(II) complex 7 • the desired methylalkyne 8 • compounds 9–14 • the stannyl precursors 15 and 16 • methylated compounds 17 and 18 • stannyl precursor 19 • iloprost methyl ester 20 • Note: “iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!

  23. Why is this hard? • For compound names: ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007), ~20% PubChem, ~20% ChemDraw • For compound numbers: ~70% author ChemDraw, ~30% editors

  24. What are we marking up? • Chemical compounds (InChI, ChEBI) • Chemical classes and parts (ChEBI) • Nanoparticles (in ChEBI from end of October) • Chemical terms from the IUPAC Gold Book • Name reactions (RXNO) • Gene products: function, process, location (GO) • Nucleotide and polypeptide sequence terms (SO) • Cell types (CL)

  25. InChI and the Semantic Web

  26. What InChI is for • Can represent complete molecules (which may be ions or radicals) of fewer than 1024 heavy (non-H) atoms. However: • Cannot yet represent metal atom geometry. • Cannot yet represent polymers. • Cannot yet represent diradicals etc.
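As a rough illustration of the size limit quoted on this slide, the heavy-atom count can be estimated from the formula layer of an InChI string (the layer straight after the version prefix). The sketch below is an assumption-laden example, not part of any InChI toolkit: the function names are hypothetical, and it assumes the second layer is a plain formula whose components are separated by "." with an optional leading multiplier.

```python
import re

MAX_HEAVY_ATOMS = 1024  # limit quoted on the slide: fewer than 1024 heavy atoms


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms using only the formula layer of an InChI."""
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    total = 0
    # Components are separated by '.'; a leading integer multiplies the
    # whole component (for example '2C2H4O2').
    for component in layers[1].split("."):
        match = re.match(r"(\d*)(.*)", component)
        multiplier = int(match.group(1)) if match.group(1) else 1
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            if element != "H":
                total += multiplier * (int(count) if count else 1)
    return total


def within_size_limit(inchi):
    """True when the molecule is under the heavy-atom limit from the slide."""
    return heavy_atom_count(inchi) < MAX_HEAVY_ATOMS


if __name__ == "__main__":
    print(heavy_atom_count("InChI=1S/CH4/h1H4"))              # methane
    print(heavy_atom_count("InChI=1S/ClH.Na/h1H;/q;+1/p-1"))  # sodium chloride
```

This checks only the size constraint; the other limitations on the slide (metal geometry, polymers, diradicals) cannot be detected from the formula layer.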

  27. What InChI is not for • Classes of molecule • Parts of molecule (these have been done in ChemBlast)

  28. InChI in RDF (We don’t like this.) We use the RSS content module. (As if articles contained molecules.) And we use info:inchi URIs. Look…

  29. Some RDF
      <content:items>
        <rdf:Bag>
          <rdf:li>
            <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
          </rdf:li>
          <rdf:li>
            <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
          </rdf:li>
        </rdf:Bag>
      </content:items>
      <content:items>
        <content:item>
          <owl:Class rdf:ID="GO_0016298">
            <rdfs:label>lipase activity</rdfs:label>
          </owl:Class>
        </content:item>
      </content:items>
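A consumer of such a feed might pull the InChIs back out of the `info:inchi` URIs along these lines. This is a minimal sketch using only the Python standard library: the function name and the `rdf:RDF` wrapper with its namespace declarations are assumptions added so that the fragment parses on its own, since the slide shows only the inner elements.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
INFO_PREFIX = "info:inchi/"

# First InChI from the slide.
EXAMPLE_INCHI = ("InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)"
                 "11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/"
                 "t11?,12-,13+/m1/s1")

# Hypothetical wrapper document: the root element and xmlns attributes
# are not on the slide.
SAMPLE = f"""<rdf:RDF
    xmlns:rdf="{NS['rdf']}"
    xmlns:content="{NS['content']}">
  <content:items>
    <rdf:Bag>
      <rdf:li><content:item rdf:about="{INFO_PREFIX}{EXAMPLE_INCHI}"/></rdf:li>
    </rdf:Bag>
  </content:items>
  <content:items>
    <content:item/>
  </content:items>
</rdf:RDF>"""


def inchis_from_items(xml_text):
    """Return the InChI strings carried by content:item rdf:about attributes."""
    root = ET.fromstring(xml_text)
    about = "{%s}about" % NS["rdf"]
    found = []
    for item in root.iter("{%s}item" % NS["content"]):
        uri = item.get(about, "")
        # Items without an info:inchi URI (such as ontology terms) are skipped.
        if uri.startswith(INFO_PREFIX):
            found.append(uri[len(INFO_PREFIX):])
    return found


if __name__ == "__main__":
    for inchi in inchis_from_items(SAMPLE):
        print(inchi)
```

A full RDF library would be the more robust choice for real feeds; plain XML parsing is used here only to keep the example dependency-free.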

  30. RXNO David Barden Colin Batchelor Celia Gitterman

  31. RXNO: the name reaction ontology (1) • Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin • They’re pretty unambiguous and well-suited to logical definitions • But what organizing principle do we use?

  32. RXNO: the name reaction ontology (2) • Sort reactions by what they do to the ‘skeleton’ of the molecule. • Skeleton-changing reactions: joinings, cleavings, rearrangements, ring formation, ring expansion • Skeleton-preserving reactions: additions, eliminations, substitutions, protections, deprotections

  33. RXNO: the name reaction ontology (3) • Quality? Subjectivity? • Get our curators to assign reactions to categories without conferring, check percentage agreement, discuss disagreements, improve guidelines, iterate to convergence.
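The "check percentage agreement" step could be computed along these lines. This is a sketch under stated assumptions: the slide does not say how agreement was measured, so simple pairwise percentage agreement is used here, and the curator names, reaction labels and category assignments in the example are invented placeholders rather than real RXNO data.

```python
from itertools import combinations


def percent_agreement(assignments):
    """Pairwise percentage agreement between curators.

    assignments maps curator -> {reaction: category}. Only reactions that
    every curator has classified are compared.
    """
    if len(assignments) < 2:
        return 0.0
    shared = set.intersection(*(set(a) for a in assignments.values()))
    agree = total = 0
    for reaction in shared:
        for first, second in combinations(assignments, 2):
            total += 1
            if assignments[first][reaction] == assignments[second][reaction]:
                agree += 1
    return 100.0 * agree / total if total else 0.0


def disagreements(assignments):
    """Reactions whose category differs between at least two curators."""
    if len(assignments) < 2:
        return []
    shared = set.intersection(*(set(a) for a in assignments.values()))
    return sorted(r for r in shared
                  if len({a[r] for a in assignments.values()}) > 1)


if __name__ == "__main__":
    # Placeholder data for illustration only.
    example = {
        "curator_1": {"reaction A": "joining", "reaction B": "ring formation",
                      "reaction C": "elimination", "reaction D": "protection"},
        "curator_2": {"reaction A": "joining", "reaction B": "joining",
                      "reaction C": "elimination", "reaction D": "protection"},
    }
    print(percent_agreement(example))
    print(disagreements(example))
```

The list returned by `disagreements` is what would feed the "discuss disagreements, improve guidelines" step before the next round.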

  34. RXNO: the name reaction ontology (4)

  35. What do people say?

  36. The spectroscopist’s tale The enriched html version came as something of a revelation and the current emphasis on links to, and through biomolecular terminology was very much a plus for us, since my colleagues and I are a mix of physical and biological chemists who are dabbling in inter-disciplinary waters. Given the steadily increasing burden of keeping up with the current literature and accessing earlier publications - a fortiori when conventional disciplinary boundaries are being crossed - the ability to 'grow a tree' from current articles (including one's own) is going to make 'targeted sleuthing' a great deal easier. John Simons, Oxford

  37. The high-throughput screener’s tale An interesting opportunity particularly for managers, students and beginners that are not that deeply immersed in the detail and the terminology. It further opens access to those who want to explore areas they are not specialists in. Great idea! Eberhard Krausz, MPI-CBG Dresden

  38. Lastly… “My only criticism would be the need for a time warning… I spent 4 hours digging about which generated at least six new research ideas printed half a ream of paper and I missed my bus home. At least it was a new excuse my wife had not heard, so another first.” An analytical chemist, The North.
