Supporting On-the-Fly Data Integration for Bioinformatics • Candidate: Xuan Zhang • Advisor: Gagan Agrawal
Road Map • Mission Statement • Motivation • Implementation • Comprehensive Examples • Future work • Conclusion
Mission Statement • Enhance information integration systems in two respects • Functionality • On-the-fly data incorporation • Flat-file data processing • Usability • Declarative interface • Low programming requirement
Motivation • Integration is essential for biological research • Biological data include • Sequences: DNA (GenBank), protein (Swiss-Prot) • Structure: RNA (RNAbase), protein (PDB) • Interaction: pathway (KEGG), regulation (GRBase) • Function: disease (OMIM) • Secondary: protein family (Pfam) • Biological data are inter-related.
Motivation • Challenges of bioinformatics integration • Data volume: overwhelming • DNA sequence: 100 gigabases (August 2005) • Data growth: exponential • (Figure provided by PDB)
Motivation • Challenges of bioinformatics integration (cont.) • Tools: many, and the number keeps growing • Service interfaces: varied • Web pages • Web services • Grid services
Motivation • Challenges of bioinformatics integration (cont.) • Inter-operability: Low • Heterogeneous data sources • Semi-structured by nature • Flat file, relational, object-oriented databases • Independently developed tools • No data exchange standard • Little Collaboration
Road Map • Mission Statement • Motivation • Implementation • Approach Overview • Advantage • Components • Future work • Conclusion
Approach Summary • Metadata • Declarative description of data • Data mining algorithms for semi-automatic metadata writing • Reusable by different requests on the same data • Code generation • Request analysis and execution separated • General modules with plug-in data module
System Overview • (Figure: Information Integration System) • Understand Data: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor, Schema Descriptor) • Process Data: User Request, Metadata Description (Layout Descriptor, Schema Descriptor), Code Generation, Request Processor, Answer
Advantages • Simple interface • At metadata level, declarative • General data model • Semi-structured data • Flat file data • Low human involvement • Semi-automatic data incorporation • Low maintenance cost • Acceptable performance • Linear scaling guaranteed
Road Map • Mission Statement • Motivation • Implementation • Approach Overview • Advantage • Components • Future work • Conclusion
System Components • Understand data • Layout mining • Schema mining • Process data • Wrapper generation • Query processing • Query processing with indices
Layout Mining • Goal 1: Separate delimiters from values • D-score: location & frequency • Goal 2: Organize delimiters and values • NFA • Pipeline (figure): Data File, Token Parser, Tokens, Delimiter Mining, Candidate Delimiters, Layout Learning, Layout Descriptor
Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Experiments
Schema Mining Goals • Ultimate goal: discover schema about an unknown flat file dataset • Immediate goal: Assign attributes with meaningful labels
Our Approach • Summarize values from the bottom up • Use knowledge from • Ontology • Heuristics • A heads-up: attribute label ≠ attribute name • What we can mine • date • What we cannot mine • Creation date, last modification date, birthday, …
Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Experiments
Schema Mining System • Major Components • Data cleaning and summarization • Score calculation • Score function • Ontology • Heuristics • Score clustering • Figure labels: Raw attribute values, Value cleaning and summarization, Attribute summaries, Score calculation, Cutoff values, Scores, Clustering algorithm, Labeling, Attribute labels
Data Summarization • Goal: reduce the amount of data • Collect frequent tokens • Approximate frequent token mining algorithm • Token categorization by profile • Token profile: an ordered list of N (numerical), A (alphabetic) and special characters • Token categories: • Word, number, else and other user-defined categories
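The two summarization steps could look roughly like the sketch below. The slides do not name the approximate frequent-token algorithm, so the Misra-Gries summary is swapped in here as one well-known option; the function names and the choice to merge a run of digits or letters into a single N or A are also assumptions made for illustration.

```python
def token_profile(token: str) -> str:
    """Collapse a token into N (digit run), A (letter run) and literal special characters."""
    profile = []
    for ch in token:
        if ch.isdigit():
            symbol = "N"
        elif ch.isalpha():
            symbol = "A"
        else:
            symbol = ch  # special characters are kept as they are
        # Assumption: consecutive digits (or letters) count as one N (or A).
        if symbol in ("N", "A") and profile and profile[-1] == symbol:
            continue
        profile.append(symbol)
    return "".join(profile)


def categorize(token: str) -> str:
    """Map a token to one of the built-in categories: word, number, else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"


def frequent_tokens(tokens, k):
    """Misra-Gries summary with at most k-1 counters.

    Every token occurring more than len(tokens)/k times is guaranteed to
    survive; the stored counts are lower bounds, not exact frequencies.
    """
    counters = {}
    for tok in tokens:
        if tok in counters:
            counters[tok] += 1
        elif len(counters) < k - 1:
            counters[tok] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For instance, `token_profile("12-SEP-1993")` yields `"N-A-N"`, the date profile used later by the heuristics.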
Score Function Template • Desired property • Simple • Adjustable trade-off between sensitivity and error tolerance
Score Clustering • Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores • Mathematically, find two scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N}, to minimize the standard deviation • N (number of attributes) is not large, so the exact answer can be found.
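Because N is small, the exact search can simply try every pair of cut points. The sketch below assumes that "minimize the standard deviation" means minimizing the total squared deviation within the three groups; the slides do not spell out the objective, and the function names are hypothetical.

```python
from itertools import combinations


def _squared_deviation(values):
    """Sum of squared distances from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def cluster_scores(scores):
    """Split scores into (low, middle, high) by exhaustive search over two cut points."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores to form three groups")
    best_cost, best_groups = None, None
    # i and j are the start indices of the middle and high groups.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(_squared_deviation(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search examines on the order of N² splits, which is cheap when N is the number of attributes in one dataset.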
Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Mining with ontology • Mining with heuristics • Experiments
Use of Ontology • An observation: a similarity between ontology and schema • Both satisfy “is-a” relation • E.g “Diabetes is a disease.” • Ontology: “diabetes” is a child of “disease” • Schema: “diabetes” is a valid instance of attribute “disease” • Common ancestors in ontology ~ attribute label
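The "common ancestor as attribute label" observation can be illustrated with a toy ontology. The terms in `PARENT` below are made up for the example and are not taken from the ontology database described in the slides.

```python
# Hypothetical toy ontology stored as child -> parent ("is-a") links.
PARENT = {
    "diabetes": "disease",
    "asthma": "disease",
    "disease": "entity",
    "mouse": "organism",
    "organism": "entity",
}


def ancestors(term):
    """Return the ancestors of a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def common_label(values):
    """Nearest ancestor shared by all values, or None if there is none."""
    chains = [ancestors(v) for v in values]
    if not chains:
        return None
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None
```

With this toy data, the values "diabetes" and "asthma" share the ancestor "disease", which would become the attribute label.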
Real-world Complications • To find an arbitrary value in an ontology • Complete and comprehensive ontology? • Selective sampling • Error-free dataset? • Adjustable sensitivity & fault tolerance • Performance
Ontology Database • Goal: to approximate a complete comprehensive ontology database • Approach • “Complete”: sample popular terms • “Comprehensive”: public ontology databases + common facts • Result • 6 major categories • 386 terms
Ontology Based Metrics (1) • Occurrence(term) = • Frequent_Count[i], if term = Frequent_Token[i] • min over i in [0, t] of Frequent_Count[i], if term = Frequent_Token[0] | … | Frequent_Token[t] • 0, otherwise • Strength(term) = Occurrence(term) + Strength(child_term)
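A minimal sketch of the two metrics follows. It assumes that a multi-token term takes the smallest count among its tokens and that Strength sums over all child terms; the counts and the `CHILDREN` table are invented for illustration.

```python
# Hypothetical frequent-token counts for one attribute.
FREQUENT_COUNT = {"diabetes": 40, "asthma": 25, "mellitus": 10}

# Hypothetical ontology fragment: term -> child terms.
CHILDREN = {"disease": ["diabetes", "asthma", "diabetes mellitus"]}


def occurrence(term):
    """Count of a term; a multi-token term gets the minimum count of its tokens."""
    tokens = term.split()
    if tokens and all(t in FREQUENT_COUNT for t in tokens):
        return min(FREQUENT_COUNT[t] for t in tokens)
    return 0


def strength(term):
    """Occurrence of the term plus the strength of every child term (assumed to be a sum)."""
    return occurrence(term) + sum(strength(child) for child in CHILDREN.get(term, []))
```

Here "disease" never appears among the values, yet its strength is high because its children do, which is what lets an ancestor emerge as the label.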
Ontology Based Metrics (2) • Two factors • Relative strength compared with other concepts • Completeness of ontology as a whole • Ontology score = product of two factors • Each modulated by the template score function
Mining With Heuristics (1) • Use token profile • “number”: {N, N.N} • “date”: {N-A-N, N/N/N} • Use frequent token counts • “identification”: Frequent_Counts[]=1 • Use other token information • “biological sequence”: length >45, or in 10’s
Mining With Heuristics (2) • Use token sequence information • “people name”: length (2~3), separator (“,” or “and”), profile (not number, date) • Again, these counts are modulated by the template function to calculate scores
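The profile and count heuristics can be sketched as raw match fractions, which the template score function would then turn into scores. That modulation step, the "in 10's" sequence rule and the people-name rule are left out here, and all names in the code are hypothetical.

```python
from collections import Counter

NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}


def token_profile(token):
    """Profile of a token: N for a digit run, A for a letter run, specials kept."""
    out = []
    for ch in token:
        symbol = "N" if ch.isdigit() else "A" if ch.isalpha() else ch
        if symbol in ("N", "A") and out and out[-1] == symbol:
            continue
        out.append(symbol)
    return "".join(out)


def heuristic_fractions(values):
    """Fraction of attribute values matching each heuristic (before score modulation)."""
    n = len(values)
    if n == 0:
        return {}
    counts = Counter(values)
    return {
        "number": sum(token_profile(v) in NUMBER_PROFILES for v in values) / n,
        "date": sum(token_profile(v) in DATE_PROFILES for v in values) / n,
        # "identification": the value occurs exactly once in the attribute.
        "identification": sum(counts[v] == 1 for v in values) / n,
        # "biological sequence": only the length > 45 rule is implemented.
        "biological sequence": sum(len(v) > 45 for v in values) / n,
    }
```

Returning fractions rather than hard labels keeps the sketch compatible with the later clustering of scores into high, middle and low groups.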
Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Experiments
Schema Mining Experiment Design • Datasets • GenBank, UniProt SWISSPROT and Pfam • Cutoff values • Exact clustering • Evaluation • Weighted Cohen’s Kappa • Compare groups most, middle and little with true labels Y (yes), P (partial) and N (no)
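For reference, weighted Cohen's Kappa can be computed as below. The sketch assumes linear disagreement weights |i - j| and an ordinal encoding in which most/Y = 0, middle/P = 1 and little/N = 2; the slides do not state which weighting scheme was used.

```python
def weighted_kappa(predicted, actual, n_categories=3):
    """Weighted Cohen's Kappa with linear disagreement weights |i - j|."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    k = n_categories
    # Confusion matrix of observed rating pairs.
    observed = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j)
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        return 1.0  # both sides used one identical category throughout
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means the mined groups match the true labels exactly, and 0 means agreement no better than chance.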
Result Summary: Kappa • (Figure: Kappa results, with bands Very good, Good, Moderate) • Attribute labels: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence
Schema Mining Summary • According to Kappa tests, results are good or very good • Possible improvement • Clustering method with better intelligence • Better ontology database • More involved language analysis • Hybrid of bottom-up and top-down approaches
System Components • Understand data • Layout mining • Schema mining • Process data • Metadata description language • Wrapper generation • Query processing • Query processing with indices
Data Process Overview • Automatic code generation approach • Input • Metadata about datasets involved • Optional: • Implicit data transformation task • Request by users • Indexing functions • Output • Executable programs • General modules • Task-specific data module
Metadata Description • Two aspects of data in flat files • Logical view of the data • Physical data organization • Two components of every data descriptor • Schema description • Layout description • Design goals • Powerful • Easy to write and interpret
Metadata Challenges • Examples of sequence formats • ALN/ClustalW format • AMPS Block file format • ClustalW • Codata • EMBL • GCG/MSF • GDE • GenBank • Fasta (Pearson) • NBRF/PIR • PDB format • Pfam/Stockholm format • Phylip • Raw • RSF • UniProtKB/Swiss-Prot • Major challenges: • Varied representations • Semi-structured data
• Example (FASTA): >FOSB_MOUSE Protein fosB. 338 bp MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
• Example (GDE): { • name "Short name for sequence" • longname "Long (more descriptive) name for sequence" • sequence-ID "Unique ID number" • creation-date "mm/dd/yy hh:mm:ss" • direction [-1|1] • strandedness [1|2] • type [DNA|RNA|PROTEIN|TEXT|MASK] • offset (-999999,999999) • group-ID (0,999) • creator "Author's name" • descrip "Verbose description" • comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote" • sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc" • }
• Example (GenBank): LOCUS MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993 • DEFINITION Mouse fosB mRNA. • ACCESSION X14897 • VERSION X14897.1 GI:50991 • KEYWORDS fos cellular oncogene; fosB oncogene; oncogene. • SOURCE Mus musculus. • ORGANISM Mus musculus • Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; • Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus. • REFERENCE 1 (bases 1 to 4145) • AUTHORS Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and • Bravo,R. 
• TITLE The product of a novel growth factor activated gene, fos B, • interacts with JUN proteins enhancing their DNA binding activity • JOURNAL EMBO J. 8 (3), 805-813 (1989) • MEDLINE 89251612 • PUBMED 2498083 • COMMENT clone=AC113-1; cell line=NIH3T3. • FEATURES Location/Qualifiers • source 1..4145 • /organism="Mus musculus" • /db_xref="taxon:10090" • CDS 1202..2218 /note="fosB protein (AA 1-338)" /codon_start=1 /protein_id="CAA33026.1" /db_xref="GI:50992" /db_xref="MGD:95575" /db_xref="SWISS-PROT:P13346" /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS LLAL" BASE COUNT 960 a 1186 c 1007 g 991 t 1 others ORIGIN 1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca 61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa 121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt 181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa 241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta 301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca 361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata 421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat 481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga 541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca 601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa 661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc 721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca 781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact 841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca 901 ccgcggcact gcccggcggg 
tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa 961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt 1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg 1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc 1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga 1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc 1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc 1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc 1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc 1441 c
• List and examples provided by EMBL-EBI
Schema Descriptors • Follow XML DTD standard for semi-structured data • Simple attribute list for relational data
• XML DTD form:
    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>
• Attribute list form:
    [FASTA]                //Schema Name
    ID = string            //Data type definitions
    DESCRIPTION = string
    SEQ = string
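To show how little machinery the attribute-list form needs, here is a small reader for it. The parsing rules (a `[name]` header, `attribute = type` lines, `//` comments) are inferred from the single FASTA example on the slide, so treat this as an illustration and not as the system's actual parser.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (schema_name, {attribute: type})."""
    name = None
    attributes = {}
    for raw_line in text.splitlines():
        # Drop trailing // comments and surrounding whitespace.
        line = raw_line.split("//", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
        elif "=" in line:
            attribute, data_type = (part.strip() for part in line.split("=", 1))
            attributes[attribute] = data_type
    return name, attributes


FASTA_SCHEMA = """
[FASTA]                //Schema Name
ID = string            //Data type definitions
DESCRIPTION = string
SEQ = string
"""
```

Calling `parse_schema(FASTA_SCHEMA)` gives the schema name together with its three string attributes, in the order they were declared.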
Layout Descriptors • Overall structure (FASTA example)
    DATASET "FASTAData" {            //Dataset name
      DATATYPE {FASTA}               //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}               //File location
    }
File Layout • Key observations on line-based biological data files • Strings of variable length • Delimiters widely used • Data fields may be divided into variables • Repetitive structures • Example: >seq1 comment1 \nASTPGHTIIYEAVCLHNDRTTIP \n>seq2 comment2 \nASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \nKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 …
Layout Descriptors • File layout (FASTA example) • Data: >seq1 comment1 \nASTPGHTIIYEAVCLHNDRTTIP \n>seq2 comment2 \nASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \nKSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 …
• Layout:
    DATASPACE LINESIZE=80 {
      < ">" ID " " DESCRIPTION
        < "\n" SEQ >
        "\n" | EOF
      >
    }
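Read as a grammar, the FASTA layout says: a record starts with ">", then an ID, a space, a DESCRIPTION, and one or more sequence lines, repeated until end of file. A hand-written reader for that layout might look like the sketch below; it stands in for what a generated data module would do and is not output of the system itself.

```python
import io


def read_fasta(stream):
    """Yield one dict per record with the schema attributes ID, DESCRIPTION and SEQ."""
    record_id = None
    description = ""
    sequence_lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if record_id is not None:
                yield {"ID": record_id, "DESCRIPTION": description,
                       "SEQ": "".join(sequence_lines)}
            record_id, _, rest = line[1:].partition(" ")
            description = rest.strip()
            sequence_lines = []
        elif record_id is not None and line.strip():
            # Sequence data may span several lines; join them into one value.
            sequence_lines.append(line.strip())
    if record_id is not None:
        yield {"ID": record_id, "DESCRIPTION": description,
               "SEQ": "".join(sequence_lines)}


SAMPLE = ">seq1 comment1 \nASTPGHTIIYEAVCLHNDRTTIP \n>seq2 comment2 \nASQKRPSQ\nKSAHKGFK\n"
RECORDS = list(read_fasta(io.StringIO(SAMPLE)))
```

The point of the descriptor approach is that this kind of reader is derived from the layout description instead of being written by hand for each format.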
System Components • Understand data • Layout mining • Schema mining • Process data • Metadata description language • Wrapper generation • Query execution • Query execution with indices
Wrapper Generation Road Map • Motivation and overview • System structure • Wrapper generation • Wrapper execution • Experiments
Wrapper Generation Motivation • Wrappers are essential for bioinformatics integration • Heterogeneous data sources • Function: transform data • Current solutions • Manually written wrappers • Scripts
Wrapper Generation Advantages • Wrappers generated automatically • Stand-alone programs for integration systems and workflows • Little human intervention. New resources can be integrated on-the-fly • Direct transformation. No unnecessary intermediate form needed • Only requires data description at metadata level, one descriptor per data source • Transfers data from flat files directly • No DB support required • No other domain or format heuristics
Wrapper Generation System Overview • (Figure: wrapper generation system and generated wrapper) • Wrapper generation system: Layout Descriptor, Schema Descriptors, Layout Parser, Mapping Generator, Mapping File, Mapping Parser, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO • Wrapper: Source Dataset, DataReader, Synchronizer, DataWriter, Target Dataset