Discovery from Linking Open Data (LOD) Annotated Datasets

Discovery from Linking Open Data (LOD) Annotated Datasets. Louiqa Raschid, University of Maryland. PAnG/PSL/ANAPSID/Manjal. Agenda: motivation, challenges, solution approaches. Emergence of biological datasets in the cloud of Linked Data.

Presentation Transcript


  1. Discovery from Linking Open Data (LOD) Annotated Datasets. Louiqa Raschid, University of Maryland. PAnG/PSL/ANAPSID/Manjal

  2. Agenda • Motivation • Challenges • Solution approaches

  3. Emergence of biological datasets in the cloud of Linked Data. Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, and the NCI Thesaurus. Links form a graph that captures meaningful knowledge. Sense making of annotation graphs can explain phenomena, identify anomalies, and potentially lead to discovery.

  4. Agenda • Motivation • Drug re-purposing • Cross ontology patterns and literature imprint • Cross genome analysis • Challenges • Solution approaches

  5. Signature: set of mRNAs whose increase or decrease in patients is significant w.r.t. the general population. Compute a similarity score in [-1, +1].

  6. Of 16,000 pairings, 2664 were significant (q < 0.05); half with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.

  7. Sirota et al. Findings • Efficacy (literature) for 2 drugs: topiramate and prednisolone. • Evaluated efficacy of cimetidine (over gefitinib) for lung adenocarcinoma. • Methodology does not provide avenues for explanation, validation or discovery.

  8. Sirota et al.: Identified anomaly in this cluster

  9. Limitations and Extensions • Sirota et al. • Anomaly in drug cluster but their methodology does not allow further investigation. • Sims et al. • Methodology is limited to co-occurrence analysis. • Cannot exploit heterogeneous evidence from LOD sources. • Cannot exploit knowledge in ontologies. • Finding patterns in graph datasets and visualization and explanation.

  10. Agenda • Challenges • Exploiting LOD to create datasets. • Knowledge captured in ontologies. • Similarity metrics/distances tuned for ontologies. • Discovering and validating patterns in graphs. • Literature imprint. • Heterogeneous evidence. • Reasoning with uncertainty.

  11. Solution Approaches • PAnG • PSL • Manjal • ANAPSID • Thanks to our collaborators / domain experts: • Olivier Bodenreider, NLM, NIH • Sherri de Coronado, NCI, NIH • Andreas Thor, University of Leipzig • Louiqa Raschid ++ at UMD • Lise Getoor ++ at UMD • Padmini Srinivasan ++ at University of Iowa • Maria Esther Vidal ++ at Universidad Simon Bolivar

  12. Solution approaches • Manjal: text mining for MEDLINE • Annotation Visualizer: visualize and explore annotations and patterns • PSL: annotation computation by knowledge propagation • PAnG (Patterns in ANnotation Graphs): pattern identification using dense subgraphs and graph summaries • Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints • The Arabidopsis Information Resource • Gene Ontology • Clinical Trials

  13. Motivation: Gene Annotation Graphs • Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms • Prediction of new annotations as hypotheses for experiments • Link prediction means predicting new functional annotations for a gene

  14. Link Prediction Framework [Figure: pipeline diagram. Labels: Tripartite Annotation Graph (TAG), Filter, Dense Subgraph, Distance Restriction, Graph Summarization, Cost Model, Graph Summary, Link Prediction, Scoring Function, Ranked List of Predicted Links] • Dense Subgraph (optional) • Focus on highly connected subgraphs • Graph summarization: • Identify basic pattern (structure) of the graph • Link Prediction • Predicted links reinforce underlying graph pattern

  15. Dense Subgraph • Motivation: a graph area that is rich or dense with annotation is an “interesting region” • Density of a subgraph = number of induced edges / number of vertices • Tripartite graph with node set (A, B, C) is converted into a bipartite graph with (A, C) • Weighted edges = number of shared b’s • Apply technique of [1] • Distance restriction for DSG possible • Hierarchically arranged ontology terms • All node pairs of A and C are within a given distance [1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010
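The conversion and the density measure on this slide can be sketched in a few lines of Python. This is only an illustration, not the algorithm of [1]: the function names `project_tripartite` and `density` are hypothetical, and counting each projected edge by its weight in the density is an assumption (the slide only says "number of induced edges").

```python
from collections import defaultdict
from itertools import product


def project_tripartite(ab_edges, bc_edges):
    """Collapse A-B and B-C edges into weighted A-C edges.

    The weight of (a, c) is the number of b's adjacent to both a and c.
    """
    a_by_b = defaultdict(set)
    c_by_b = defaultdict(set)
    for a, b in ab_edges:
        a_by_b[b].add(a)
    for b, c in bc_edges:
        c_by_b[b].add(c)

    weights = defaultdict(int)
    for b in a_by_b.keys() & c_by_b.keys():
        for a, c in product(a_by_b[b], c_by_b[b]):
            weights[(a, c)] += 1
    return dict(weights)


def density(weights, a_nodes, c_nodes):
    """Induced edge weight divided by number of vertices.

    Assumption: a projected edge counts with its weight.
    """
    a_nodes, c_nodes = set(a_nodes), set(c_nodes)
    n = len(a_nodes) + len(c_nodes)
    if n == 0:
        return 0.0
    induced = sum(w for (a, c), w in weights.items()
                  if a in a_nodes and c in c_nodes)
    return induced / n
```

With A as GO terms, B as genes, and C as PO terms, `project_tripartite` yields the weighted bipartite graph on which a dense-subgraph search could then run. The search itself, and the distance restriction, are not shown here.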

  16. Graph Summarization [Figure: example annotation graph over genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1 and PO terms PO_9006, PO_20030, PO_37, PO_20038, shown as equal to its graph summary] • Minimum description length approach [2] • Loss-free; employs cost model • Graph summary = Signature + Corrections • Signature: graph pattern / structure • Super nodes = complete partitioning of nodes • Super edges = edges between super nodes = all edges between nodes of super nodes • Corrections: edges e between individual nodes • Additions: e ∈ G but e ∉ signature • Deletions: e ∉ G but e ∈ signature [2] Navlakha et al. Graph summarization with bounded error. SIGMOD, 2008
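To make "Graph summary = Signature + Corrections" concrete, here is a minimal Python sketch that computes the additions and deletions for a given signature. It assumes the super nodes and super edges are already chosen; finding a good partition is the hard part and is not shown. The names `corrections` and `summary_cost` are hypothetical, and the cost (super edges plus corrections) is a simplified stand-in for the cost model, not necessarily the one used in [2].

```python
from itertools import product


def corrections(edges, supernodes, superedges):
    """Return (additions, deletions) for a given signature.

    edges:      iterable of (u, v) pairs in the original graph G
    supernodes: dict mapping super node id -> set of member nodes
    superedges: iterable of (s, t) super node id pairs
    """
    graph = {frozenset(e) for e in edges}

    # A super edge stands for all edges between the members of its endpoints.
    implied = set()
    for s, t in superedges:
        for u, v in product(supernodes[s], supernodes[t]):
            if u != v:
                implied.add(frozenset((u, v)))

    additions = graph - implied   # e in G but not in the signature
    deletions = implied - graph   # e in the signature but not in G
    return additions, deletions


def summary_cost(superedges, additions, deletions):
    """Simplified cost: number of super edges plus number of corrections."""
    return len(list(superedges)) + len(additions) + len(deletions)
```

The summary is loss-free because G can be rebuilt as (implied edges ∪ additions) minus deletions. One possible reading of "predicted links reinforce underlying graph pattern" is that deletions are candidate links, since they are exactly the edges the pattern expects but the data lacks. That reading is an assumption here.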

  17. [Figure: the example annotation graph (genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1; PO terms PO_9006, PO_20030, PO_37, PO_20038) shown with two outputs labeled PSL and DSG+GS]

  18. Distance metrics

  19. Distance metrics

  20. Different retrieved sets of lung cancer related clinical trials • Identify 100 clinical trials using the search keyword “lung cancer” in CONDITION. • Retrieve CT, CONDITION and INTERVENTION. • Create a dense subgraph (almost clique; highly connected subgraph). • Create a graph summary to visualize the output. • Retrieve 100 trials using “lung carcinoma” in the CONDITION field. • Retrieve 100 trials using “lung carcinoma” in any field.

  21. Retrieve 100 clinical trials using search keyword “lung cancer”. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output.

  22. 100 clinical trials using search keyword “lung carcinoma” for CONDITION.

  23. 100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.

  24. Questions? PAnG/PSL/ANAPSID/Manjal
