
Text Mining the technology to convert text into knowledge



Presentation Transcript


  1. Text Mining: the technology to convert text into knowledge • Stan Matwin, School of Information Technology and Engineering, University of Ottawa, Canada • stan@site.uottawa.ca

  2. Plan • What? • Why? • How? • Who? codata 2002

  3. What? • Text Mining (TM) = Data Mining from textual data • Finding nuggets in otherwise uninteresting mountains of ore • DM = finding interesting knowledge (relationships, facts) in large amounts of data

  4. What? (cont'd) • Working with large corpora • …and little knowledge • Discovering new knowledge • …e.g. in Grimm’s fairy tales • vs. uncovering existing knowledge • …e.g. find mySQL developers with  1yr experience in a file of 5000 CVs • Has to treat data as natural language (NL)

  5. What? (cont'd) • Uncovering aspect of TM • TM = Information Extraction from text • Text -> database mapping • TM and XML

  6. Examples • Extracting information from CVs: skills, systems, technologies, etc. • Personal news filtering agent • Research in functional genomics about protein interaction

  7. Why? • Moore’s law, and… • Storage law

  8. How? A combination of • Machine learning • Linguistic analysis • Stemming • Tagging • Parsing • Semantic analysis

  9. Some TM-related tasks • Text segmentation • Topic identification and tracking • Text summarization • Language identification • Author identification

  10. Two case studies • CADERIGE • Spam detection (with AmikaNow)

  11. Caderige • « Catégorisation Automatique de Documents pour l'Extraction de Réseaux d'Interactions Géniques » (Automatic Categorization of Documents for the Extraction of Gene Interaction Networks) • Knowledge extraction from natural language texts

  12. Caderige • Objective: to extract information of interest to geneticists from on-line abstract and/or paper databases (e.g. Medline) • Ensure acceptable recall and precision

  13. Example text: "The araR gene is monocistronic, and the promoter region contains -10 and -35 regions (as determined by primer extension analysis) similar to those recognized by RNA polymerase containing the major vegetative cell sigma factor sigmaA. An insertion-deletion mutation in the araR gene leads to constitutive expression of the L-arabinose metabolic operon. We demonstrate that the araR gene codes for a negative regulator of the ara operon and that the expression of araR is repressed by its own product." • The fragment (in italics) can be selected by means of keywords

  14. Example text: "It has been proposed that Pho-P plays a key role in the activation of tuA and in the repression of tagA and tagD." • Question: "What are the proteins involved in the regulation of tagA?" • This question cannot be answered with keywords alone; semantic knowledge that repression is a type of regulation is required
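
The point of slide 14, that a query about "regulation" should also match text mentioning "repression", can be illustrated with a tiny is-a lookup. This is a minimal sketch, not the CADERIGE thesaurus: the `IS_A` table, the function names, and the choice to list "activation" under "regulation" are assumptions made for illustration.

```python
# Hypothetical mini-thesaurus mapping a concept to its parent concept.
# The entries are illustrative assumptions, not CADERIGE data.
IS_A = {
    "repression": "regulation",
    "activation": "regulation",
}


def is_kind_of(concept, target):
    """Return True if `concept` is `target` or reaches it through is-a links."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == target:
            return True
        seen.add(concept)            # guard against accidental cycles
        concept = IS_A.get(concept)  # climb one level, or stop at a root
    return False


def matches_query(sentence_concepts, query_concept):
    """True if any concept found in the sentence is a kind of the query concept."""
    return any(is_kind_of(c, query_concept) for c in sentence_concepts)
```

With this table, a sentence tagged with the concept "repression" matches a query about "regulation", while a plain keyword match on the word "regulation" would miss it.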

  15. Example text: "After determination of the nucleotide sequence and deduction of the purR reading frame, the PurR product was found to be highly similar to the purR-encoded repressor from Bacillus subtilis." • This does not answer "What are the proteins involved in the regulation of purR?" • In fact, parsing is needed to see that PurR and the purR-encoded repressor are objects of the verb "to be similar"

  16. Example text: "RNA isolated from a sigma B deletion mutant revealed that the transcription of gspA is sigmaB dependent." • Conceptual interpretation is needed to see that this is an answer to "What are the proteins involved in the regulation of gspA?" • "gspA is sigmaB dependent" is interpreted as "protein sigmaB regulates gspA"
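
The interpretation step on slide 16, reading "gspA is sigmaB dependent" as "sigmaB regulates gspA", can be sketched as a single pattern-to-fact rule. This is only an illustration under stated assumptions: the regular expression, the `interpret` function, and the triple format are hypothetical, and a real system would work on parsed text rather than on raw strings.

```python
import re

# Hypothetical surface pattern: "<target> is <regulator> dependent"
# (also accepts the hyphenated form "<regulator>-dependent").
DEPENDENT = re.compile(r"\b(\w+) is (\w+)[- ]dependent\b")


def interpret(sentence):
    """Map each '<X> is <Y> dependent' match to a (Y, 'regulates', X) triple."""
    facts = []
    for target, regulator in DEPENDENT.findall(sentence):
        facts.append((regulator, "regulates", target))
    return facts


sentence = ("RNA isolated from a sigma B deletion mutant revealed that "
            "the transcription of gspA is sigmaB dependent.")
facts = interpret(sentence)
```

Here `facts` holds one triple naming sigmaB as a regulator of gspA, which is the form a question about the regulation of gspA can be answered from.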

  17. CADERIGE Architecture (diagram; recoverable labels: MedLine abstracts, fragment selectors, fragment selection, normalization, linguistic resources, thesaurus, extraction grammars, forms matching, query, extraction)

  18. 3 steps • 1. Focusing: learned filters • 2. Linguistic analysis: lexical/syntactic/semantic • Syntax-semantics mapping • 3. Extraction

  19. Caderige: example

  20. Current stage • Step 1 done • XML for step 3 designed • Tools for step 2 chosen

  21. Email filters • Spam elimination • Automatic filing • Compliance enforcement • …

  22. Email… • The trick: cast it as a text classification problem • Build a training set • Train your favourite classifier • Deploy it

  23. State of the art • Current accuracy 80%

  24. Difficulties • multi-class problem where • classes overlap • and are hierarchical • recall vs precision
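
The "recall vs precision" trade-off on slide 24 rests on two standard ratios: precision is the share of flagged items that are truly relevant, and recall is the share of relevant items that were flagged. A minimal sketch follows; the function name and the message identifiers are made up for illustration.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted items
    against the set of truly relevant items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical spam filter: flags four messages, three are really spam,
# and only two of the flagged ones are among them.
flagged = {"m1", "m2", "m3", "m4"}
actual_spam = {"m1", "m2", "m5"}
precision, recall = precision_recall(flagged, actual_spam)
```

Flagging more messages tends to raise recall and lower precision, which is why a filter is usually tuned for one at the expense of the other.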

  25. TM: who – academically? • David Lewis • Yiming Yang – CMU • Ray Mooney – UT Austin • Nick Cercone – Waterloo • Guy Lapalme – U. de Montréal • TAMALE – University of Ottawa

  26. Who – industrially? • Google • Clearforest • AmikaNow

  27. Conclusion • Text mining – a necessity (so “!” instead of “?”) • Still in its infancy • Methods must exploit linguistic knowledge

  28. Classification • Prevalent practice: examples are represented as vectors of attribute values • Theoretical wisdom, confirmed empirically: the more examples, the better the predictive accuracy

  29. ML/DM at U of O • Learning from imbalanced classes: applications in remote sensing • A relational, rather than propositional, representation: learning the maintainability concept • Learning in the presence of background knowledge • Bayesian belief networks and how to get them; application to distributed databases

  30. Why text classification? • Automatic file saving • Internet filters • Recommenders • Information extraction • …

  31. Text classification: standard approach (bag of words) • Remove stop words and markings • Remaining words are all attributes • A document becomes a vector <word, frequency> • Train a boolean classifier for each class • Evaluate the results on an unseen sample
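
The first three bullets of slide 31 (strip markings and stop words, keep the remaining words as attributes, represent a document as word frequencies) can be sketched in a few lines. This is a minimal illustration: the stop-word list is a tiny made-up sample, and the tag-stripping regular expression is an assumption about what "markings" covers.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "that", "by"}


def bag_of_words(text):
    """Turn raw text into a <word, frequency> mapping."""
    text = re.sub(r"<[^>]+>", " ", text)             # drop markup tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer
    return Counter(t for t in tokens if t not in STOP_WORDS)


vector = bag_of_words("<b>Spam</b> and spam and the eggs")
```

Word order is discarded in this representation, which is what the "bag of words" label refers to; each distinct remaining word becomes one attribute for the classifier.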

  32. Text classification: tools • RIPPER: a rule-based learner; works well with large sets of binary features • Naïve Bayes: efficient (no search); simple to program; gives a “degree of belief”
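
To make the "simple to program" and "degree of belief" remarks about Naïve Bayes concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. It is an illustration only, not the implementation used in the work presented; the class name, method names, and toy training messages are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes over lists of words."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()

    def train(self, documents):
        """documents: iterable of (list_of_words, label) pairs."""
        for words, label in documents:
            self.class_docs[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def log_scores(self, words):
        """Log of prior times likelihood per class (higher = stronger belief)."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for w in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[w] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


# Hypothetical toy training set.
nb = NaiveBayes()
nb.train([
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("project meeting tomorrow".split(), "ham"),
])
```

Training is a single counting pass with no search over hypotheses, and the per-class scores from `log_scores` can be compared directly to see how strongly one class is preferred over another.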

  33. “Prior art” • Yang: best results using k-NN: 82.3% microaveraged accuracy • Joachims’ results using Support Vector Machine + unlabelled data • SVM insensitive to high dimensionality, sparseness of examples
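
The slide quotes a micro-averaged figure without defining it. In the usual sense, micro-averaging pools the per-category decision counts before computing a score, so frequent categories dominate; macro-averaging scores each category first and then takes the mean. The sketch below shows both for precision, under the assumption that per-category true-positive, false-positive, and false-negative counts are available; the category names and counts are invented.

```python
def micro_precision_recall(per_category):
    """per_category: dict label -> (tp, fp, fn). Pool counts, then score."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_precision(per_category):
    """Score each category separately, then take the unweighted mean."""
    values = [tp / (tp + fp) if tp + fp else 0.0
              for tp, fp, _fn in per_category.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical counts: one large, easy category and one small, hard one.
counts = {"big": (8, 2, 0), "small": (2, 0, 8)}
micro_p, micro_r = micro_precision_recall(counts)
macro_p = macro_precision(counts)
```

Comparing `micro_p` with `macro_p` on skewed category sizes shows how much a reported micro-averaged number depends on the largest categories.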

  34. SVM in text classification • SVM: maximum separation • Transductive SVM: margin for test set • Training with 17 examples in the 10 most frequent categories gives test performance of 60% on 3000+ test cases available during training

  35. Combining classifiers • Comparable to best known results (Yang)
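
The slide does not say how the classifiers were combined. One common scheme, shown here purely as an assumption, is majority voting: each classifier labels the document and the most frequent label wins. The three toy classifiers below are hypothetical stand-ins.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label chosen by most classifiers.
    On a tie, the label voted first among the tied ones wins."""
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical classifiers: each maps a list of words to a label.
def by_keyword(words):
    return "spam" if "money" in words else "ham"


def by_length(words):
    return "spam" if len(words) < 3 else "ham"


def always_ham(words):
    return "ham"


label = majority_vote([by_keyword, by_length, always_ham], ["money", "now"])
```

Voting helps most when the individual classifiers make different kinds of errors, so that a mistake by one is outvoted by the others.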
