Literature Based Discovery

Literature Based Discovery. Dimitar Hristovski, dimitar.hristovski@mf.uni-lj.si, Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia. Let me introduce myself … Research and Development. BS – Biomedicina Slovenica database


  1. Literature Based Discovery Dimitar Hristovski dimitar.hristovski@mf.uni-lj.si Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

  2. Let me introduce myself … Research and Development • BS – Biomedicina Slovenica database • Research Evaluation Decision Support System • Medical Information Systems • Surgical clinics • Genetic laboratory • Biochemical laboratory • Web User Behaviour Analysis • Data warehousing and OLAP

  3. Motivation • Overspecialization • Information overload • Large databases • For many diseases the chromosomal region known, but not the exact gene

  4. Background • Literature-based discovery (Swanson): Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes) • New relation between X and Z?

  5. Biomedical Discovery Support System (BITOLA) • Goal: • discover potentially new relations (knowledge) between biomedical concepts • to be used as research idea generator and/or as • an alternative way to search Medline • System user (researcher or intermediary): • interactively guides the discovery process • evaluates the proposed relations

  6. Extending and Enhancing Literature Based Discovery • Goal: • Make literature based discovery more suitable for disease candidate gene discovery • Decrease the number of candidate relations • Method: • Integrate background knowledge: • Chromosomal location of diseases and genes • Gene expression location • Disease manifestation location

  7. Usage Scenarios • For a disease with known chromosomal location, find a candidate gene • For a gene, find a disease that might be influenced • For a disease and gene found to be related by linkage study, find the mechanism of the relation (intermediate concepts should help)

  8. System Overview • Databases (Medline, LocusLink, HUGO, OMIM, …) • Knowledge Extraction • Knowledge Base: Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …) • Discovery Algorithm • User Interface

  9. Databases • Medline: source of known relationships between biomedical concepts • Set of concepts: • MeSH (Medical Subject Headings): Controlled dictionary and thesaurus used for indexing and searching the Medline database • HUGO: official gene symbols, names and aliases • LocusLink: gene symbols, aliases and chromosomal locations • OMIM: genetic diseases • UMLS (Unified Medical Language System) • Entrez: used to search PubMed, GenBank, ... • UniGene: gene expression

  10. Knowledge Extraction • Build master set of concepts (MeSH terms and gene symbols) • Extract occurrence of concepts from each Medline record (MeSH terms from MH field, gene symbols from Title and Abstract) • Association rule mining (concept co-occurrence) • Chromosomal location extraction (from LocusLink and HUGO) • Load into knowledge base
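
The co-occurrence step above can be illustrated with a minimal sketch. It assumes each Medline record has already been reduced to a set of concept names; the function name `count_cooccurrences` and the toy records are hypothetical and are not taken from BITOLA itself.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records):
    """Count single concepts and unordered concept pairs across records.

    `records` is an iterable of concept collections, one per document.
    This is an illustrative sketch, not the extraction code used by BITOLA.
    """
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in records:
        unique = sorted(set(concepts))  # each concept counts once per record
        concept_counts.update(unique)
        # Sorted order makes (a, b) and (b, a) map to the same key.
        pair_counts.update(combinations(unique, 2))
    return concept_counts, pair_counts


# Toy records (hypothetical; real input would be MeSH terms and gene symbols).
records = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Optic Neuritis", "Interferon-beta"},
]
concept_counts, pair_counts = count_cooccurrences(records)
```

The per-concept and per-pair document counts are the only inputs needed to derive confidence and support for each rule.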

  11. Terminology Problems during Knowledge Extraction • Gene names • Gene symbols • MeSH and genetic diseases

  12. Detected Gene Symbols by Frequency (symbol|count) • type|666548 • II|552584 • III|201776 • component|179643 • CT|175973 • AT|151337 • ATP|147357 • IV|123429 • CD4|99657 • p53|89357 • MR|88682 • SD|85889 • GH|84797 • LPS|68982 • 59|67272 • E2|64616 • 82|63521 • AMP|61862 • TNF|59343 • RA|58818 • CD8|57324 • O2|56847 • ACTH|54933 • CO2|53171 • PKC|51057 • EGF|50483 • T3|49632 • MS|46813 • A2|44896 • ER|43212 • upstream|41820 • PRL|41599

  13. Gene Symbol Disambiguation • Find MEDLINE docs in which we can expect to find gene symbols • JD indexing (Susanne Humphrey) as a possible solution: • Identifies the semantic context of docs • If the semantic context is not genetic, then the gene symbol is probably a false positive • Example of a false positive: • Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390 • breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring the new drama series Life Support

  14. JD Indexing • JDs are 127 Journal Descriptors (e.g., JDs for journal Hum Mol Genet: Cytogenetics; Genetics, Medical) • Training set docs (435,000) inherit JDs from journals • Training set provides co-occurrence data between inherited JDs and: • indexing terms assigned to docs directly • words in docs • Docs having indexing terms/words that occur often with genetics JDs in the training set are assumed to have a genetics context • Extended to indexing by 134 UMLS semantic types (e.g. Gene or Genome, Gene Function, …)
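
The idea of inheriting JDs from journals and then scoring a new document by its words can be sketched roughly as follows. The scoring formula (the average, over a document's known words, of the fraction of training documents containing that word that carry a genetics JD) is an assumption made for illustration and is not necessarily the formula used in JD indexing. The function names, the toy training set and the "Ethics" descriptor are hypothetical.

```python
from collections import Counter


def train_word_scores(training_docs, genetics_jds):
    """Estimate, per word, how often it appears in genetics-context docs.

    `training_docs` is an iterable of (words, inherited_jds) pairs.
    Returns {word: fraction of docs containing it that have a genetics JD}.
    """
    word_total = Counter()
    word_genetic = Counter()
    for words, jds in training_docs:
        is_genetic = bool(set(jds) & genetics_jds)
        for word in set(words):
            word_total[word] += 1
            if is_genetic:
                word_genetic[word] += 1
    return {w: word_genetic[w] / word_total[w] for w in word_total}


def genetics_score(words, word_scores):
    """Average the scores of the words seen in training; 0.0 if none."""
    known = [word_scores[w] for w in set(words) if w in word_scores]
    return sum(known) / len(known) if known else 0.0


# Hypothetical training data: (words in doc, JDs inherited from the journal).
genetics_jds = {"Cytogenetics", "Genetics, Medical"}
training_docs = [
    (["gene", "mutation", "linkage"], ["Genetics, Medical"]),
    (["gene", "expression"], ["Cytogenetics"]),
    (["television", "drama", "ethics"], ["Ethics"]),
    (["ethics", "gene"], ["Ethics"]),
]
word_scores = train_word_scores(training_docs, genetics_jds)

genetic_doc_score = genetics_score(["gene", "mutation"], word_scores)
tv_doc_score = genetics_score(["television", "drama", "BBC1"], word_scores)
```

Under this sketch, a symbol such as BBC1 found in a document with a low genetics score would be treated as a probable false positive.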

  15. System Overview • Databases (Medline, LocusLink, HUGO, OMIM, …) • Knowledge Extraction • Knowledge Base: Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …) • Discovery Algorithm • User Interface

  16. Binary Association Rules • X → Y (confidence, support) • If X Then Y (confidence, support) • Confidence = % of docs containing Y within the X docs • Support = number (or %) of docs containing both X and Y • The nature of the relation between X and Y is not known. • Examples: • Multiple Sclerosis → Optic Neuritis (2.02, 117) • Multiple Sclerosis → Interferon-beta (5.17, 300)
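
As a worked illustration of the two measures defined above, the sketch below computes confidence and support from document counts. The function name `rule_metrics` and the counts are hypothetical; they are not the Medline counts behind the examples on the slide.

```python
def rule_metrics(docs_with_x, docs_with_x_and_y):
    """Return (confidence in %, support as a doc count) for the rule X -> Y.

    Confidence is the percentage of X docs that also contain Y;
    support is the number of docs containing both X and Y.
    """
    if docs_with_x <= 0:
        raise ValueError("X must occur in at least one document")
    if not 0 <= docs_with_x_and_y <= docs_with_x:
        raise ValueError("co-occurrence count must be between 0 and the X count")
    confidence = 100.0 * docs_with_x_and_y / docs_with_x
    support = docs_with_x_and_y
    return confidence, support


# Hypothetical counts: X in 200 docs, Y in 50 docs, both together in 5 docs.
conf_xy, sup_xy = rule_metrics(200, 5)  # rule X -> Y
conf_yx, sup_yx = rule_metrics(50, 5)   # rule Y -> X
```

The two directions of a pair share the same support but generally differ in confidence, which is why each co-occurring pair can yield two rules.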

  17. Discovery Algorithm • Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes): candidate gene? • Match: chromosomal region of X with chromosomal location of Z • Match: manifestation location of X with expression location of Z

  18. Discovery Algorithm • Let X be the starting concept of interest. • Find all Y for which X → Y. • Find all Z for which Y → Z. • Eliminate those Z for which X → Z already exists. • Eliminate those Z whose chromosomal location does not match the chromosomal region of X. • Eliminate those Z whose expression location does not match the manifestation location of X. • The remaining Z are candidates for a new relation between X and Z. • In general: X → Y1 … Yn → Z, but not X → Z • Example: X = disease; Y = (patho)physiology of X; Z = (de)regulators of Y (drugs, proteins, genes) • New relation example: Z is a candidate gene for disease X
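
The steps above can be sketched in a few lines, assuming the association rules are held as a mapping from each concept to the set of concepts it is associated with. The function name, the data layout, and the toy concepts (`DiseaseX`, `ProcessA`, `GENE1`, …) are hypothetical. Location matching is simplified here to set membership and set overlap, which is an assumption rather than BITOLA's actual matching logic.

```python
def discover_candidates(x, rules, genes_in_region, expression, manifestation):
    """Return {candidate Z: set of intermediate Y concepts linking X to Z}.

    rules           -- {concept: set of associated concepts} (X -> Y, Y -> Z)
    genes_in_region -- genes located in the chromosomal region of X
    expression      -- {gene: set of tissues where it is expressed}
    manifestation   -- set of tissues where disease X manifests
    """
    known = rules.get(x, set())
    candidates = {}
    for y in known:
        for z in rules.get(y, set()):
            if z == x or z in known:
                continue  # X -> Z already exists, so it is not new
            if z not in genes_in_region:
                continue  # chromosomal location does not match
            if not expression.get(z, set()) & manifestation:
                continue  # expression does not match manifestation
            candidates.setdefault(z, set()).add(y)
    return candidates


# Toy knowledge base (hypothetical concepts).
rules = {
    "DiseaseX": {"ProcessA", "ProcessB", "GENE4"},
    "ProcessA": {"GENE1", "GENE2", "GENE4"},
    "ProcessB": {"GENE1", "GENE3"},
}
genes_in_region = {"GENE1", "GENE2", "GENE4"}
expression = {
    "GENE1": {"brain"},
    "GENE2": {"liver"},
    "GENE3": {"brain"},
    "GENE4": {"brain"},
}
candidates = discover_candidates(
    "DiseaseX", rules, genes_in_region, expression, {"brain"}
)
```

In the toy data, GENE4 is dropped because it is already related to the disease, GENE3 because it lies outside the region, and GENE2 because its expression does not overlap the manifestation site, leaving GENE1 with two supporting intermediate concepts.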

  19. Ranking Concepts Z (diagram: X linked to intermediate concepts Y1 … Yj, which are linked to concepts Z1 … Zn)

  20. Results: Concepts in Medline • Full Medline (end 2001) analyzed (11,226,520 recs) • Looking for 19,781 MeSH terms and 22,252 human genes (14,659 from HUGO and 7,593 from LocusLink). 24,613 alias gene symbols added • Gene symbols found in 2,689,958 Medline recs. Most frequent ambiguous symbols (CT, MR, CO2,…) or format errors

  21. Results: Co-occurring Concepts in Medline • 29,851,448 distinct pairs of co-occurring concepts: • In 7,106,099 at least one gene symbol appeared • In 679,159 pairs both concepts are gene symbols • Total co-occurrence frequency: 798,366,684 • 59,702,986 association rules calculated and stored

  22. Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388) • Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution • Clinical characteristics: • Mental retardation • Epilepsy • Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process) • X-linked dominant inheritance

  23. BPP - pathogenesis • It is considered a disorder of neuronal migration (unlayered type) or a consequence of intrauterine ischemia (layered type)

  24. Finding Candidate Genes for Polymicrogyria, bilateral perisylvian

  25. 237 genes in Xq28 • Relation between semantic types Cell Movement and Gene or Gene Products: 18 gene candidates • Sublocalisation in Xq28: 15 gene candidates • Tissue-specific expression: 2 gene candidates, L1CAM and FLNA

  26. User Interface “cgi-bin” version

  27. Automatically search for supporting Medline Citations

  28. Cleft Palate – Predicting Candidate Genes

  29. Summary and Conclusions • We extend and enhance an existing discovery support system (BITOLA) • The system can be used as: • Research idea generator, or • Alternative method of searching Medline • Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery

  30. Further Work • Increase the number of concepts • Gene symbol disambiguation • Semantic relations extraction • System evaluation • Improve the Web version of the system

  31. System Availability • URL: www.mf.uni-lj.si/bitola/

  32. Related work: SemGen (Tom Rindflesch et al.) • Extract semantic predications on the genetic basis of disease • “Deletions of INK4 occur in malignant tumors” • INK4|ASSOCIATED_WITH|Malignant Tumors • Evaluation and visualization of SemGen output
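
A predication written in the pipe-delimited form shown above can be read into a small record, as in the following sketch. It assumes exactly three fields (subject, predicate, object), as in the slide's example; actual SemGen output may carry more fields. The names `Predication` and `parse_predication` are hypothetical.

```python
from typing import NamedTuple


class Predication(NamedTuple):
    subject: str
    predicate: str
    obj: str


def parse_predication(line):
    """Parse 'SUBJECT|PREDICATE|OBJECT' into a Predication.

    Assumes the three-field form shown on the slide; raises ValueError
    for anything else rather than guessing at the missing parts.
    """
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected SUBJECT|PREDICATE|OBJECT, got: {line!r}")
    return Predication(*parts)


predication = parse_predication("INK4|ASSOCIATED_WITH|Malignant Tumors")
```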

  33. Semantic Structures • <genetic phenomenon> ETIOLOGY_OF <disorder> • Predicates: CAUSE, PREDISPOSE, ASSOCIATED_WITH • Indicator terms: cause, determine, result in, control, underlie, transmit, responsible, predispose, lead to, promote, susceptibility, risk, associate, involve, link, implicate, influence, related

  34. Statistical Evaluation • Assoc. rule base divided into 2 segments: older (1990-1995) and newer (1996-1999) • The system predicts new relations based on the older segment • Predictions compared with actual new relations in the newer segment
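
The time-split evaluation described above can be sketched as a comparison of sets of concept pairs. The sketch assumes relations are represented as (X, Z) tuples and reports the share of predictions that turn out correct and the share of new relations that were predicted. The function name and the toy pairs are hypothetical, and the slides do not state that the evaluation was implemented this way.

```python
def evaluate_predictions(older_pairs, newer_pairs, predicted_pairs):
    """Compare predictions made from the older segment with the newer one.

    Returns (precision, recall) where
      precision = correct predictions / all predictions
      recall    = correct predictions / all actually new relations
    """
    new_relations = newer_pairs - older_pairs      # appear only later
    predictions = predicted_pairs - older_pairs    # ignore already-known pairs
    correct = predictions & new_relations
    precision = len(correct) / len(predictions) if predictions else 0.0
    recall = len(correct) / len(new_relations) if new_relations else 0.0
    return precision, recall


# Toy segments (hypothetical concept pairs).
older = {("X", "Y1"), ("Y1", "Z1"), ("Y1", "Z2")}
newer = {("X", "Y1"), ("X", "Z1"), ("X", "Z3")}
predicted = {("X", "Z1"), ("X", "Z2")}

precision, recall = evaluate_predictions(older, newer, predicted)
```

Tightening the rule constraints shrinks the set of predictions, which tends to raise precision at the cost of recall.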

  35. Summary Statistical Evaluation Results

  36. Statistical Evaluation Results • With no assoc. rule constraints: • predicts almost all new relations, but produces too many candidate relations • With constraints: • predicts new relations 6.9 times better than random predictions • the tighter the constraints, the better the (correct / all predictions) ratio (6.5%)
