Semiautomatic domain model building from text-data

Petr Šaloun, Petr Klimánek, Zdenek Velart • SMAP 2011, Vigo, Spain, December 1-2, 2011

Introduction and goals • The basic tasks in creating a domain model: • selection of domain and scope • consideration of reusability • finding important terms • defining classes and class hierarchy • defining properties of classes and constraints • creation of instances of classes • Goals • designing a method for semiautomatic domain model creation • different input documents • different languages • design and implementation of a tool

State of the art • Algorithms and tasks that work with a domain model • different document formats • different languages • domain model • concepts, relations • domain model creation = time-consuming • manual creation • automatic creation • semiautomatic creation

Tools and methods • natural language processing – NLP • Stanford NLP • Stanford Parser • Stanford POS tagger • Stanford Named Entity Recognizer • multi-language environment – Google Translate • WordNet (synsets) • Tool – Java, Swing, XML, JTidy, JAWS, SNLP, JUNG

Processing of text documents • <html><body><p>An integer character constant has type int.</p></body></html> • An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

Processing of text documents - extraction, cleaning, translation • input TXT, HTML, PDF • removal of occurrences of special characters using regular expressions • numeric designation of chapters and references • removal of single-letter prepositions (\\s+[^Aa\\s\\.]{1})+\\s+ • parentheses, dashes, and others • translation into English – the tools work only with English text • Google Translate
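The single-letter removal step can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the pattern is the one from the slide with the Java string escaping removed, while the function name and the choice to replace each match with one space are assumptions.

```python
import re

# Pattern from the slide, with Java string escaping removed: one or more
# isolated single characters (anything except "A", "a", whitespace or ".")
# surrounded by whitespace.
SINGLE_CHAR_TOKENS = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")


def remove_single_char_tokens(text: str) -> str:
    """Replace each run of isolated single characters with one space (assumed behaviour)."""
    return SINGLE_CHAR_TOKENS.sub(" ", text)


if __name__ == "__main__":
    print(remove_single_char_tokens("type v of x is a constant"))
```

Note that the pattern keeps the article "a" but also removes isolated digits and punctuation marks such as a free-standing dash, because the character class excludes only "A", "a", whitespace and the full stop.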

Processing of text documents - annotation • Stanford CoreNLP • Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer • machine learning over large data, maximum-entropy statistical model • trained models included • Activities • tokenization • sentence splitting • POS (part-of-speech) tagging • lemmatization • NER - Named Entity Recognition

Example • <html><body><p>An integer character constant has type int.</p></body></html> • An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

Mining concepts • tokens marked by the POS tagger as nouns are the first concept candidates • one-word or multi-word nouns • identifying a token as a concept by disambiguation from WordNet • assigning a synset – automatic, manual • using the domain term for searching • possible selection of an incorrect synset – with another meaning
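The first step, picking noun tokens as concept candidates, can be sketched on the tagged sentence from the example slide. This is a hedged illustration only: the function name is hypothetical, and merging adjacent noun tokens into one multi-word candidate is an assumption about how multi-word nouns are formed, not a description of the tool.

```python
def noun_candidates(tagged: str) -> list[str]:
    """Collect runs of adjacent noun tokens from "word/TAG" text as concept candidates."""
    candidates: list[str] = []
    current: list[str] = []
    for token in tagged.split():
        # Split on the last "/" so that words containing "/" keep their text.
        word, _, tag = token.rpartition("/")
        if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


if __name__ == "__main__":
    sentence = "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
    print(noun_candidates(sentence))
```

Each candidate would then still need a WordNet synset assigned, automatically or manually, before it counts as a concept.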

Mining relations • unoriented / oriented • unnamed / named • WordNet – concept must have a synset • hypernyms and hyponyms – IsA relations • holonyms and meronyms – partOf relations • relation orientation based on concept order • only direct relations • from text • lexical-syntactic patterns • decomposition of multi-word terms – the right part of the term corresponds to an existing concept, e.g. for the term "assignment expression": assignment expression IsA expression • sentence syntax analysis – the parser's amod relation (adjectival modifier), adjective followed by noun, e.g. integral type IsA type
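The decomposition of multi-word terms can be sketched as a suffix lookup. This is a minimal sketch under stated assumptions: the function name, the tuple representation of a relation, and the preference for the longest matching right part are illustrative choices that the slides do not specify.

```python
from typing import Optional


def isa_from_right_part(term: str, concepts: set[str]) -> Optional[tuple[str, str, str]]:
    """Propose "term IsA suffix" when the right part of a multi-word term is a known concept."""
    words = term.split()
    # Try proper suffixes from longest to shortest (assumed preference).
    for start in range(1, len(words)):
        suffix = " ".join(words[start:])
        if suffix in concepts:
            return (term, "IsA", suffix)
    return None


if __name__ == "__main__":
    known = {"expression", "type", "constant", "character constant"}
    print(isa_from_right_part("assignment expression", known))
    print(isa_from_right_part("integer character constant", known))
```

A single-word term has no proper right part, so the function returns None for it.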

Tool

Experiment • ANSI/ISO C language • comparison with existing manually created ontology • 2 experiments • all concept candidates • only first 200 candidates • 3 variants of experiment • only candidates • candidates and IsA proposals • candidates and IsA proposals and NER entities

First 30 candidates

Experiment

Experiment • Variant of the experiment without IsA relations, only with NER entities

Conclusions and further work • concepts => lightweight ontology • enables better automatic relations mining

Contacts • Petr Šaloun, FEECS, VSB–Technical University of Ostrava, petr.saloun@vsb.cz • Petr Klimánek (was: Faculty of Science, University of Ostrava), p.klimanek@gmail.com • Zdenek Velart, FEECS, VSB–Technical University of Ostrava, zdenek.velart@gmail.com
