
Dongkyoo Shin (shindk@sejong.ac.kr) Sejong University, InCob2007

Exploitation of Structural Similarity in Semi-Structured Bioinformatics Data for Efficient Storage Construction


Presentation Transcript


  1. Exploitation of Structural Similarity in Semi-Structured Bioinformatics Data for Efficient Storage Construction Dongkyoo Shin (shindk@sejong.ac.kr) Sejong University, InCob2007

  2. Table of contents
     • Abstract
     • Background
     • Methods
     • Results
     • Conclusions
     Multimedia & Internet Laboratory, Sejong University

  3. Abstract (1)
     • Background
     • Much research addresses storing XML data
     • It aims to reduce the number of joins between tables
     • It is not well suited to microarray data, which has a distinctive hierarchy
     • Hierarchical feature of the microarray data model
     • A few core values occur repeatedly
     • New approach for capturing this feature
     • Classify elements with similar structure into a group
     • Design a common database table for each group

  4. Abstract (2)
     • Results
     • Database schema created by our approach
     • Reduces the number of table joins remarkably
     • Improves the performance of storing and loading XML-based microarray data
     • Conclusions
     • Mining the structural similarity of elements is an efficient way to improve the performance of storing and loading microarray data

  5. Background (1)
     • DTD (Document Type Definition)-dependent approach
     • Map one element into one table
       For each e ∈ E, #(S) ≥ 1 OR #(A) ≥ 1 -> define_Class(e)
       For each Se ∈ S -> Add_attributes_of_Class(e)
       Se ∈ SequenceType -> Define_multivalued_att(Se, e)

  6. Background (2)
     • Inline technique
     • Reduces the complexity of the DTD (Document Type Definition)
       For each e, #(S) == 1 AND Se ∈ SequenceType -> Add_Multi-valued_attribute_of_ParentClass(e)

  7. Background (3)
     • Drawbacks of previous approaches
     • DTD-dependent
     • The database schema has the same complexity as the DTD
     • Inline technique
     • Depends strongly on the number of omissible elements
     • New design approach for microarray databases
     • Capture similar structural features of microarray data
     • Needs a fast and simple way to mine the structural features

  8. Background (5)
     • Microarray data and MAGE (Microarray Gene Expression) standards
     • Research groups share microarray data with others and use it to answer their biological questions
     • MGED Society's standard definitions
     • MIAME (Minimum Information for the Annotation of a Microarray Experiment)
     • MAGE-OM and MAGE-ML
     • Exchange object model and format for MIAME
     • Structural feature of MAGE-OM
     • A varied set of objects that define the same data types, including complex types

  9. Background (6)
     • Decision tree
     • A simple model that makes classification rules, correlations, and effects between variables easy to understand
     • Suited to mining structural features of the MAGE-ML DTD itself (not MAGE-ML instances)
     • All elements can be classified into three levels:
     • A root, a mediators group, and a bottoms group

  10. Methods (1)
     • Classification of core features using a decision tree
     • Terminology for expressing a complexType
     • e: an element defined in the XML schema
     • E: a set of elements e
     • SE: the set of sub-elements of e
     • a: an attribute of e
     • A: the set of attributes of e
     • SA: the set of attribute sets for all sub-elements of e
     • complexType: structural information that consists of SE and/or A of e
     • LowestChild: an element without a sub-element
     • LowestParent: an element with a sub-element that is a LowestChild
     • PG (Parent Group): a set of candidate elements to be parents of a LowestChild
     • LPCG (The Lowest Parent Candidate Group): a set of candidates to be LowestParent
     • LCG (The Lowest Child Group): a set of LowestChild elements
     • LPG (The Lowest Parent Group): a set of LowestParent elements
     • ULPG (Upper Level Parent Group): a set of upper-level parents, including elements that are neither LowestChild nor LowestParent

  11. Methods (2)
     • Expression of a complexType
     • A complexType defines the structural information of elements
     • A set of arrays, including data types
     • Definition of structural similarity
       SE_{ele_x} = {e_1, e_2, …, e_n}, SA_{ele_x} = {A_{e_1}, A_{e_2}, …, A_{e_n}}
       complexType(ele_x) = {SE_{ele_x}, SA_{ele_x}}
     • ele_x and ele_y are structurally similar when complexType(ele_x) == complexType(ele_y)
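The similarity test on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Element` class, the function names, and the toy elements are hypothetical and not taken from the presentation, and the sketch treats SE and SA as unordered sets, which the slide does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical description of one schema element."""
    name: str
    sub_elements: tuple = ()    # SE: names of the direct sub-elements
    sub_attributes: tuple = ()  # SA: one tuple of attribute names per sub-element


def complex_type(e: Element):
    """Return complexType(e) = {SE, SA} as a hashable, order-insensitive value."""
    se = frozenset(e.sub_elements)
    sa = frozenset(frozenset(attrs) for attrs in e.sub_attributes)
    return (se, sa)


def structurally_similar(x: Element, y: Element) -> bool:
    """Two elements are similar when their complexTypes are equal."""
    return complex_type(x) == complex_type(y)


# Toy elements; the names are invented for illustration only.
ele_x = Element("ele_x", ("name", "value"), (("id",), ("unit", "type")))
ele_y = Element("ele_y", ("value", "name"), (("type", "unit"), ("id",)))
ele_z = Element("ele_z", ("name",), (("id",),))

print(structurally_similar(ele_x, ele_y))
print(structurally_similar(ele_x, ele_z))
```

Using frozensets makes the comparison ignore declaration order; if order matters in the schema, tuples could be compared directly instead.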

  12. Methods (3)
     • Decision tree for recognizing the core features
     • Condition 1: If rule 1 is satisfied, then e arrives at LCG. Otherwise, it arrives at PG.
     • Condition 2: If rule 2 is satisfied, then e and the elements similar to e arrive at a new LCG.
     • Condition 3: If rule 3 is satisfied, then e arrives at LPG. Otherwise, it arrives at ULPG.
     • Condition 4: If rule 4 is satisfied, then e and the elements similar to e arrive at a new LPG.

  13. Methods (4)
     • Classification rules
     • Rule 1
     • Decide whether an element belongs to group LCG or PG
       For each e_i ∈ E {
         if (number of elements in SE_{e_i} == 0) {
           e_i is classified into LCG;
         } else {
           e_i is classified into PG;
         }
       }
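Rule 1 can be sketched in Python as below. The function name `rule1`, the dictionary representation of the schema, and the toy element names are assumptions made for illustration; the presentation only gives the pseudocode.

```python
def rule1(sub_elements):
    """Split elements into LCG (no sub-elements) and PG (at least one).

    sub_elements maps each element name to the list of its sub-element names.
    """
    lcg, pg = [], []
    for name, subs in sub_elements.items():
        if len(subs) == 0:
            lcg.append(name)   # LowestChild: no sub-element
        else:
            pg.append(name)    # candidate parent
    return lcg, pg


# Toy schema; element names are invented for illustration.
schema = {
    "experiment": ["array", "sample"],
    "array": ["spot"],
    "sample": [],
    "spot": [],
}
lcg, pg = rule1(schema)
print(lcg)  # ['sample', 'spot']
print(pg)   # ['experiment', 'array']
```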

  14. Methods (5)
     • Classification rules
     • Rule 2
     • Classify multiple sets of LCG
       p = 0;
       For each e_i ∈ LCG_0 {
         Flag = 0;
         If (p > 0) {
           For q = 1 to p
             If (complexType(e_i) == complexType(element in LCG_q)) {
               e_i is classified into LCG_q; Flag = 1;
             }
         }
         If (Flag == 0) {
           For each e_j ∈ LCG_0, e_j ≠ e_i
             If (complexType(e_i) == complexType(e_j)) {
               p = p + 1;
               e_i and e_j are classified into a new group LCG_p;
               break;
             }
         }
       }
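The grouping that Rule 2 performs can be sketched in Python as follows. Note that this sketch swaps the slide's pairwise comparison for a dictionary keyed by complexType, which yields the same kind of groups in a single pass. The function name, the `signature` mapping (element name to a hashable complexType), and the toy data are hypothetical, and the sketch assumes that an element with no similar peer stays ungrouped.

```python
from collections import defaultdict


def group_by_complex_type(elements, signature):
    """Group elements that share the same complexType signature.

    elements  : iterable of element names (for example, the members of LCG_0)
    signature : dict mapping element name -> hashable complexType value
    Returns (groups, ungrouped): groups holds lists of two or more similar
    elements; ungrouped holds elements whose signature is unique.
    """
    buckets = defaultdict(list)
    for e in elements:
        buckets[signature[e]].append(e)
    groups = [members for members in buckets.values() if len(members) > 1]
    ungrouped = [members[0] for members in buckets.values() if len(members) == 1]
    return groups, ungrouped


# Toy data; names and signatures are invented for illustration.
lcg0 = ["a", "b", "c", "d"]
signature = {
    "a": ("id", "name"),
    "b": ("id", "name"),
    "c": ("value",),
    "d": ("id", "name"),
}
groups, ungrouped = group_by_complex_type(lcg0, signature)
print(groups)     # [['a', 'b', 'd']]
print(ungrouped)  # ['c']
```

Each returned group would correspond to one LCG_p and could share one database table; the same function applies unchanged to the members of an LPG.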

  15. Methods (6)
     • Classification rules
     • Rule 3
     • Separate elements in PG into two groups: LPG and ULPG
       For each e_i ∈ PG {
         if (SE_{e_i} ⊆ LCG) {
           e_i is classified into LPG;
         } else {
           e_i is classified into ULPG;
         }
       }
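A Python sketch of Rule 3 follows. It assumes the condition means that every sub-element of e_i belongs to LCG (a subset test); if the intended reading is "at least one sub-element belongs to LCG", the subset test would be replaced by a non-empty intersection. The function name, the dictionary representation, and the toy schema are hypothetical.

```python
def rule3(pg, sub_elements, lcg):
    """Split PG into LPG and ULPG.

    Assumption: an element is a LowestParent when all of its
    sub-elements are members of LCG.
    """
    lcg_set = set(lcg)
    lpg, ulpg = [], []
    for name in pg:
        if set(sub_elements[name]) <= lcg_set:   # subset test
            lpg.append(name)
        else:
            ulpg.append(name)
    return lpg, ulpg


# Toy schema; element names are invented for illustration.
schema = {
    "experiment": ["array", "sample"],
    "array": ["spot"],
    "sample": [],
    "spot": [],
}
lcg = ["sample", "spot"]        # elements without sub-elements
pg = ["experiment", "array"]    # elements with sub-elements
lpg, ulpg = rule3(pg, schema, lcg)
print(lpg)   # ['array']
print(ulpg)  # ['experiment']
```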

  16. Methods (7)
     • Classification rules
     • Rule 4
     • Classify multiple sets of LPG
       p = 0;
       For each e_i ∈ LPG_0 {
         Flag = 0;
         If (p > 0) {
           For q = 1 to p
             If (complexType(e_i) == complexType(element in LPG_q)) {
               e_i is classified into LPG_q; Flag = 1;
             }
         }
         If (Flag == 0) {
           For each e_j ∈ LPG_0, e_j ≠ e_i
             If (complexType(e_i) == complexType(e_j)) {
               p = p + 1;
               e_i and e_j are classified into a new group LPG_p;
               break;
             }
         }
       }

  17. Result (1)
     • Database design by the proposed decision tree

  18. Result (2)
     • Database space complexity
     • Time complexity

  19. Result (3)
     • Reconstructing the XML document

  20. Conclusions
     • Proposed approach
     • Mine elements with structural similarity from the XML Schema for biological information
     • Experimental result
     • Mining the structural similarity of the object model suits microarray data and is more efficient than previous approaches
     • Future work
     • Plan to extend the current classification rules to the root, LCG, LPG, and ULPG levels, respectively
