Supporting On-the-Fly Data Integration for Bioinformatics

Candidate: Xuan Zhang • Advisor: Gagan Agrawal

Road Map • Mission Statement • Motivation • Implementation • Comprehensive Examples • Future work • Conclusion

Mission Statement • Enhance information integration systems in • Functionality • On-the-fly data incorporation • Flat-file data processing • Usability • Declarative interface • Low programming requirement

Motivation • Integration is essential for biological research • Biological data include • Sequences: DNA (GenBank), protein (Swiss-Prot) • Structure: RNA (RNAbase), protein (PDB) • Interaction: pathway (KEGG), regulation (GRBase) • Function: disease (OMIM) • Secondary: protein family (Pfam) • Biological data are interrelated.

Motivation • Challenges of bioinformatics integration • Data volume: overwhelming • DNA sequence: 100 gigabases (August 2005) • Data growth: exponential (figure provided by PDB)

Motivation • Challenges of bioinformatics integration (cont.) • Tools: many, and the number keeps growing • Service interfaces: varied • Web pages • Web services • Grid services

Motivation • Challenges of bioinformatics integration (cont.) • Interoperability: low • Heterogeneous data sources • Semi-structured by nature • Flat-file, relational and object-oriented databases • Independently developed tools • No data exchange standard • Little collaboration

Road Map • Mission Statement • Motivation • Implementation • Approach Overview • Advantage • Components • Future work • Conclusion

Approach Summary • Metadata • Declarative description of data • Data mining algorithms for semi-automatic writing of the descriptions • Reusable by different requests on the same data • Code generation • Request analysis and execution are separated • General modules with a plug-in data module

System Overview (diagram) • Understand data: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor, Schema Descriptor) • Process data: User Request, Code Generation, Request Processor, Answer • Together: Information Integration System

Advantages • Simple interface • At the metadata level, declarative • General data model • Semi-structured data • Flat-file data • Low human involvement • Semi-automatic data incorporation • Low maintenance cost • Acceptable performance • Linear scaling guaranteed

Road Map • Mission Statement • Motivation • Implementation • Approach Overview • Advantage • Components • Future work • Conclusion

System Components • Understand data • Layout mining • Schema mining • Process data • Wrapper generation • Query Process • Query Process with indices

Layout Mining • Goal 1: Separate delimiters from values • D-score: location & frequency • Goal 2: Organize delimiters and values • NFA • Pipeline (diagram): Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor

Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Experiments

Schema Mining Goals • Ultimate goal: discover the schema of an unknown flat-file dataset • Immediate goal: assign meaningful labels to attributes

Our Approach • Summarize values from the bottom up • Use knowledge from • Ontology • Heuristics • A heads-up: attribute label ≠ attribute name • What we can mine • date • What we cannot mine • Creation date, last modification date, birthday, …

Schema Mining System • Major Components • Data cleaning and summarization • Score calculation • Score function • Ontology • Heuristics • Score clustering • Diagram elements: Raw attribute values, Value cleaning and summarization, Attribute summaries, Score calculation, Scores, Clustering algorithm, Cutoff values, Labeling, Attribute labels

Data Summarization • Goal: reduce the amount of data • Collect frequent tokens • Approximate frequent token mining algorithm • Token categorization by profile • Token profile: an ordered list of N (numerical), A (alphabetic) and special characters • Token categories: • Word, number, else and other user-defined categories
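
The token profile described on this slide can be sketched as follows. This is a minimal illustration, not the system's own code: the function names `token_profile`, `categorize` and `frequent_tokens` are hypothetical, runs of digits or letters are assumed to collapse to a single N or A, and an exact counter stands in for the approximate frequent token mining algorithm, which the slide does not specify.

```python
import re
from collections import Counter

# One run of digits or one run of ASCII letters (assumption: runs collapse to one symbol).
_RUN = re.compile(r"[0-9]+|[A-Za-z]+")


def token_profile(token: str) -> str:
    """Replace digit runs with 'N' and letter runs with 'A'; keep special characters."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def categorize(token: str) -> str:
    """Map a token to one of the slide's built-in categories: word, number, else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in {"N", "N.N"}:
        return "number"
    return "else"


def frequent_tokens(tokens, k):
    """Exact top-k counts; a stand-in for the approximate algorithm named on the slide."""
    return Counter(tokens).most_common(k)
```

For example, `token_profile("12-SEP-1993")` gives `"N-A-N"`, which is the date profile used later by the heuristics.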

Score Function Template • Desired property • Simple • Adjustable trade-off between sensitivity and error tolerance

Score Clustering • Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores • Mathematically, find two scores, $score_i$ and $score_j$, from $\{score_1, score_2, score_3, \dots, score_N\}$, to minimize the standard deviation • N (the number of attributes) is not large, so the exact answer can be found.
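
Because N is small, the two cut points can be found by trying every pair. The sketch below assumes the objective is the total within-group squared deviation over the three groups, which is one reading of "minimize the standard deviation"; the slide does not give the exact objective, and the names `sse` and `cluster_scores` are hypothetical.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def cluster_scores(scores):
    """Split scores into (low, middle, high) by exhaustive search over two cut points.

    Assumption: the cost is the summed within-group squared deviation,
    and every group must be non-empty.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores to form three groups")
    best = None
    for a, b in combinations(range(1, n), 2):  # 1 <= a < b <= n - 1
        cost = sse(ordered[:a]) + sse(ordered[a:b]) + sse(ordered[b:])
        if best is None or cost < best[0]:
            best = (cost, a, b)
    _, a, b = best
    return ordered[:a], ordered[a:b], ordered[b:]
```

The search visits O(N²) pairs of cut points, which is cheap for the handful of attributes in a typical flat-file record.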

Schema Mining Road Map • Schema Mining • Overview • Mining System • Core Mining Algorithm • Mining with ontology • Mining with heuristics • Experiments

Use of Ontology • An observation: ontologies and schemas are similar • Both express an “is-a” relation • E.g., “Diabetes is a disease.” • Ontology: “diabetes” is a child of “disease” • Schema: “diabetes” is a valid instance of attribute “disease” • Common ancestors in the ontology ~ attribute label
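
The idea that a common ancestor of the observed values suggests an attribute label can be illustrated with a toy ontology. Everything here is hypothetical: the `PARENT` table, its terms beyond the slide's diabetes/disease pair, and the function names are made up for the example, and the real system scores candidates rather than picking a single ancestor.

```python
# Toy child -> parent table; only "diabetes" -> "disease" comes from the slide.
PARENT = {
    "diabetes": "disease",
    "asthma": "disease",
    "disease": "concept",
    "nucleus": "cellular component",
    "cellular component": "concept",
}


def ancestors(term):
    """Return the chain of ancestors of a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def common_label(values):
    """Return the nearest ancestor shared by all values, or None if there is none."""
    chains = [ancestors(v) for v in values]
    if not chains:
        return None
    shared = set(chains[0]).intersection(*chains[1:])
    for candidate in chains[0]:  # nearest ancestor first
        if candidate in shared:
            return candidate
    return None
```

With this table, the values "diabetes" and "asthma" share the label "disease", while a value missing from the ontology yields no label at all, which is one reason the slides stress fault tolerance.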

Real-world Complications • To find an arbitrary value in an ontology • Complete and comprehensive ontology? • Selective sampling • Error-free dataset? • Adjustable sensitivity & fault tolerance • Performance

Ontology Database • Goal: to approximate a complete comprehensive ontology database • Approach • “Complete”: sample popular terms • “Comprehensive”: public ontology databases + common facts • Result • 6 major categories • 386 terms

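Ontology Based Metrics (1)
• $\mathrm{Occurrence}(term) = \begin{cases} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[i] \\ \min_{i \in [0,t]} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[0] \mid \dots \mid \mathrm{Frequent\_Token}[t] \\ 0, & \text{else} \end{cases}$
• $\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum \mathrm{Strength}(child\_term)$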
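
The two metrics can be sketched recursively. This is an illustration under assumptions: the second case of Occurrence is read as a multi-token term whose tokens are all frequent, the sum in Strength is taken over the direct children of the term, and the dictionaries `freq` and `children` are hypothetical stand-ins for the frequent-token table and the ontology.

```python
def occurrence(term, freq):
    """Occurrence(term) from a token -> count table.

    Single frequent token: its count. Multi-token term whose tokens are all
    frequent (assumed reading of the second case): the minimum count. Else 0.
    """
    if term in freq:
        return freq[term]
    parts = term.split()
    if len(parts) > 1 and all(p in freq for p in parts):
        return min(freq[p] for p in parts)
    return 0


def strength(term, children, freq):
    """Strength(term) = Occurrence(term) + sum of Strength over its child terms."""
    return occurrence(term, freq) + sum(
        strength(child, children, freq) for child in children.get(term, [])
    )
```

For instance, with counts `{"diabetes": 5, "heart": 4, "disease": 2}` and "disease" having children "diabetes" and "heart disease", the strength of "disease" is 2 + 5 + min(4, 2) = 9.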

Ontology Based Metrics (2) • Two factors • Relative strength compared with other concepts • Completeness of ontology as a whole • Ontology score = product of two factors • Each modulated by the template score function

Mining With Heuristics (1) • Use token profile • “number”: {N, N.N} • “date”: {N-A-N, N/N/N} • Use frequent token counts • “identification”: Frequent_Counts[]=1 • Use other token information • “biological sequence”: length >45, or in 10’s
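
A rough sketch of these profile and count heuristics follows. It only computes the raw fraction of values matching each heuristic; the template score function that modulates these counts is not given in the slides and is not reproduced. The function names are hypothetical, and "in 10's" is assumed to mean alphabetic tokens of exactly ten characters.

```python
import re
from collections import Counter

_RUN = re.compile(r"[0-9]+|[A-Za-z]+")
NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}


def token_profile(token):
    """Collapse digit runs to 'N' and letter runs to 'A'; keep special characters."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def looks_like_sequence(token):
    """Assumed reading of the slide: alphabetic, longer than 45 or exactly 10 characters."""
    return token.isalpha() and (len(token) > 45 or len(token) == 10)


def heuristic_fractions(values):
    """Fraction of attribute values that match each heuristic label."""
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    counts = Counter(values)
    profiles = [token_profile(v) for v in values]
    return {
        "number": sum(p in NUMBER_PROFILES for p in profiles) / n,
        "date": sum(p in DATE_PROFILES for p in profiles) / n,
        # Values that occur exactly once behave like identifiers.
        "identification": sum(1 for c in counts.values() if c == 1) / n,
        "biological sequence": sum(looks_like_sequence(v) for v in values) / n,
    }
```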

Mining With Heuristics (2) • Use token sequence information • “people name”: length (2~3), separator (“,” or “and”), profile (not number, date) • Again, these counts are modulated by the template function to calculate scores

Schema Mining Experiment Design • Datasets • GenBank, UniProt SWISSPROT and Pfam • Cutoff values • Exact clustering • Evaluation • Weighted Cohen’s Kappa • Compare the groups (most, middle and little) with the true labels Y (yes), P (partial) and N (no)
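
Weighted Cohen's kappa for three ordered categories can be computed as below. The slides do not say which weighting scheme was used, so linear disagreement weights are an assumption here, as are the function name and the integer coding (0 = little / N, 1 = middle / P, 2 = most / Y).

```python
def weighted_kappa(predicted, actual, k=3):
    """Linearly weighted Cohen's kappa for ordinal codes 0..k-1.

    kappa = 1 - sum(w * observed) / sum(w * expected), with w[i][j] = |i - j| / (k - 1).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    observed = [[0.0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = sum(abs(i - j) / (k - 1) * observed[i][j] for i in range(k) for j in range(k))
    den = sum(abs(i - j) / (k - 1) * row[i] * col[j] for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - num / den
```

A value of 1 means the mined groups match the true labels exactly; partial matches (for example middle against Y) are penalized less than complete mismatches.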

Result Summary: Kappa (figure: kappa per attribute label, with bands marked Very good, Good and Moderate) • Attribute labels: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence

Cellular Component (O)

Date (H)

Organism Name (O)

Schema Mining Summary • According to the Kappa tests, results are good or very good • Possible improvements • A more intelligent clustering method • A better ontology database • More involved language analysis • A hybrid of bottom-up and top-down approaches

System Components • Understand data • Layout mining • Schema mining • Process data • Metadata description language • Wrapper generation • Query Process • Query Process with indices

Data Process Overview • Automatic code generation approach • Input • Metadata about datasets involved • Optional: • Implicit data transformation task • Request by users • Indexing functions • Output • Executable programs • General modules • Task-specific data module

Metadata Description • Two aspects of data in flat files • Logical view of the data • Physical data organization • Two components of every data descriptor • Schema description • Layout description • Design goals • Powerful • Easy to write and interpret

Metadata Challenges • Examples of sequence formats • ALN/ClustalW format • AMPS Block file format • ClustalW • Codata • EMBL • GCG/MSF • GDE • GenBank • Fasta (Pearson) • NBRF/PIR • PDB format • Pfam/Stockholm format • Phylip • Raw • RSF • UniProtKB/Swiss-Prot • Major challenges • Varied representations • Semi-structured data

Example records (list and examples provided by EMBL-EBI):

FASTA:
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

GDE:
{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA||PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}

GenBank:
LOCUS MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION Mouse fosB mRNA.
ACCESSION X14897
VERSION X14897.1 GI:50991
KEYWORDS fos cellular oncogene; fosB oncogene; oncogene.
SOURCE Mus musculus.
  ORGANISM Mus musculus
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 4145)
  AUTHORS Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
    Bravo,R.
  TITLE The product of a novel growth factor activated gene, fos B,
    interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL EMBO J. 8 (3), 805-813 (1989)
  MEDLINE 89251612
  PUBMED 2498083
COMMENT clone=AC113-1; cell line=NIH3T3.
FEATURES Location/Qualifiers
  source 1..4145
    /organism="Mus musculus"
    /db_xref="taxon:10090"
  CDS 1202..2218
    /note="fosB protein (AA 1-338)"
    /codon_start=1
    /protein_id="CAA33026.1"
    /db_xref="GI:50992"
    /db_xref="MGD:95575"
    /db_xref="SWISS-PROT:P13346"
    /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
    AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
    SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
    RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
    KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
    TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
    LLAL"
BASE COUNT 960 a 1186 c 1007 g 991 t 1 others
ORIGIN
1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca
61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa
121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt
181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa
241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta
301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca
361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata
421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat
481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga
541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca
601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa
661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc
721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca
781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact
841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca
901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa
961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt
1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg
1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc
1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga
1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc
1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc
1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc
1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc
1441 c

Schema Descriptors • Follow the XML DTD standard for semi-structured data • Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA] //Schema Name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string
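
To show how little machinery the simple attribute-list form needs, here is a hedged sketch of a reader for it. The function name `parse_schema` is hypothetical, and the grammar is inferred only from the FASTA example on the slide: a bracketed schema name, `name = type` lines, and `//` comments.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (schema_name, {attribute: type}).

    Assumed grammar, inferred from the slide's example:
      [NAME]          schema name
      attr = type     attribute definition
      // ...          comment to end of line
    """
    name = None
    attributes = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
        elif "=" in line:
            attr, dtype = line.split("=", 1)
            attributes[attr.strip()] = dtype.strip()
    return name, attributes


descriptor = """
[FASTA] //Schema Name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string
"""
schema_name, schema_attrs = parse_schema(descriptor)
```

Dictionaries keep insertion order in Python, so the attribute order of the descriptor (ID, DESCRIPTION, SEQ) is preserved.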

Layout Descriptors • Overall structure (FASTA example)

DATASET "FASTAData" { //Dataset name
  DATATYPE {FASTA} //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta} //File location
}

File Layout • Key observations on line-based biological data files • Strings of variable length • Delimiters widely used • Data fields may be divided into variables • Repetitive structures

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Layout Descriptors • File layout (FASTA example)

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

DATASPACE LINESIZE=80 {
  < ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}
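
Read as a grammar, the FASTA layout says: a record starts with ">", the ID runs to the first space, the DESCRIPTION runs to the end of the line, and the following lines are SEQ fragments until the next record or end of file. The hand-written reader below follows that reading. It is only an illustration of what the descriptor describes, not the code the system generates, and `read_fasta` is a hypothetical name.

```python
def read_fasta(text):
    """Split FASTA text into records with ID, DESCRIPTION and SEQ fields.

    Mirrors the layout: ">" ID " " DESCRIPTION, then repeated "\\n" SEQ lines,
    terminated by the next ">" record or end of file.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            ident, _, description = line[1:].partition(" ")
            current = {"ID": ident, "DESCRIPTION": description, "SEQ": ""}
            records.append(current)
        elif current is not None:
            current["SEQ"] += line  # sequence lines of one record are concatenated
    return records


sample = (
    ">seq1 comment1\n"
    "ASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\n"
    "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\n"
    "KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES\n"
)
fasta_records = read_fasta(sample)
```

The field names match the schema descriptor (ID, DESCRIPTION, SEQ), which is the link between the layout and the logical view of the data.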

System Components • Understand data • Layout mining • Schema mining • Process data • Metadata description language • Wrapper generation • Query execution • Query execution with indices

Wrapper Generation Road Map • Motivation and overview • System structure • Wrapper generation • Wrapper execution • Experiments

Wrapper Generation Motivation • Wrappers are essential for bioinformatics integration • Heterogeneous data sources • Function: transform data • Current solutions • Manually written wrappers • Scripts

Wrapper Generation Advantages • Wrappers are generated automatically • Stand-alone programs for integration systems and workflows • Little human intervention; new resources can be integrated on the fly • Direct transformation; no unnecessary intermediate form needed • Only requires a data description at the metadata level, one descriptor per data source • Transfers data from flat files directly • No DB support required • No other domain or format heuristics

Wrapper Generation System Overview (diagram) • Wrapper generation system: Layout Descriptor, Schema Descriptors, Layout Parser, Mapping Generator, Mapping File, Mapping Parser, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO • Wrapper: Source Dataset, DataReader, Synchronizer, DataWriter, Target Dataset
