Semiautomatic domain model building from text-data. Petr Šaloun, Petr Klimánek, Zdenek Velart. SMAP 2011, Vigo, Spain, December 1-2, 2011.
Introduction and goals
• The basic tasks in creating a domain model:
  • selection of domain and scope
  • consideration of reusability
  • finding important terms
  • defining classes and class hierarchy
  • defining properties of classes and constraints
  • creation of instances of classes
• Goals
  • designing a method for semiautomatic domain model creation
  • different input documents
  • different languages
  • design and implementation of a tool
State of the art
• Algorithms and tasks work with a domain model
  • different document formats
  • different languages
• domain model
  • concepts, relations
• domain model creation = time-consuming
  • manual creation
  • automatic creation
  • semiautomatic creation
Tools and methods
• natural language processing (NLP)
  • Stanford NLP
    • Stanford Parser
    • Stanford POS tagger
    • Stanford Named Entity Recognizer
• multi-language environment – Google Translate
• WordNet (synsets)
• Tool – Java, Swing, XML, JTidy, JAWS, SNLP, JUNG
Processing of text documents
Input: <html><body><p>An integer character constant has type int.</p></body></html>
Output: An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
Processing of text documents - extraction, cleaning, translation
• input: TXT, HTML, PDF
• removal of occurrences of special characters using regular expressions
  • numeric designation of chapters and references
  • removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  • parentheses, dashes, and others
• translation into English – the tools work only with English text
  • Google Translate
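The extraction and cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Java tool: only the single-letter pattern comes from the slide (rewritten without the Java string escaping); the other patterns, the helper names `html_to_text` and `clean_text`, and the order of the steps are assumptions. Translation is left out because it depends on an external service.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (stand-in for JTidy)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.parts if p.strip())


# Assumed pattern: chapter numbers such as "6.4.4 " at the start of a line.
CHAPTER_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+", re.MULTILINE)
# Assumed pattern: bracketed numeric references such as "[3]".
BRACKET_REFERENCE = re.compile(r"\[\d+\]")
# Assumed pattern: brackets anywhere, dashes only when standing alone.
PARENS_AND_DASHES = re.compile(r"[()\[\]{}]|(?<=\s)[-–]+(?=\s)")
# Pattern from the slide: single characters other than "A"/"a" between spaces.
SINGLE_LETTER = re.compile(r"(\s+[^Aa\s.])+\s+")
WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Apply the cleaning patterns in sequence and normalise whitespace."""
    text = CHAPTER_NUMBER.sub("", text)
    text = BRACKET_REFERENCE.sub(" ", text)
    text = PARENS_AND_DASHES.sub(" ", text)
    text = SINGLE_LETTER.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    page = "<html><body><p>An integer character constant has type int.</p></body></html>"
    print(clean_text(html_to_text(page)))
```

Note that the single-letter pattern also removes isolated digits and symbols, so it is applied after the chapter-number and reference patterns in this sketch.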
Processing of text documents - annotation
• Stanford CoreNLP
  • Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
  • machine learning over large data, statistical maximum entropy model
  • learned models included
• Activities
  • tokenization
  • sentence splitting
  • POS (part-of-speech) tagging
  • lemmatization
  • NER (Named Entity Recognition)
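To make the first two activities concrete, here is a deliberately simple rule-based stand-in for sentence splitting and tokenization. It does not use Stanford CoreNLP or any trained model; the regular expressions and the function names `split_sentences` and `tokenize` are assumptions chosen for illustration, and the sample text is only an example.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# A token is either a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at the assumed boundary pattern."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    text = ("An integer character constant has type int. "
            "A wide character constant has type wchar_t.")
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```

A statistical annotator handles abbreviations, numbers, and ambiguous punctuation far better than these two patterns; the sketch only shows what the output of each activity looks like.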
Example
Input: <html><body><p>An integer character constant has type int.</p></body></html>
Output: An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
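The tagged output in the example uses the word/TAG notation. A small helper, sketched here under the assumption that tokens are separated by whitespace and the tag follows the last slash, turns that notation into (token, tag) pairs for further processing. The name `parse_tagged` is hypothetical.

```python
def parse_tagged(tagged):
    """Convert 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # Split on the last slash so that tokens containing "/" stay intact.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


if __name__ == "__main__":
    line = "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."
    for token, tag in parse_tagged(line):
        print(token, tag)
```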
Mining concepts
• tokens marked by the POS tagger as nouns are the first concept candidates
  • one-word or multi-word nouns
• identifying a token as a concept by disambiguation against WordNet
  • assigning a synset – automatic, manual
  • using the domain term for searching
  • possible selection of an incorrect synset – one with a different meaning
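The first step, collecting nouns as concept candidates, can be sketched as follows. The sketch assumes Penn Treebank style tags (noun tags start with "NN") and treats every maximal run of adjacent nouns as one multi-word candidate; that grouping rule and the name `noun_candidates` are assumptions, and the WordNet synset assignment is not shown.

```python
def noun_candidates(tagged_tokens):
    """Return maximal runs of adjacent noun tokens as lower-case candidates."""
    candidates = []
    run = []
    for token, tag in tagged_tokens:
        if tag.startswith("NN"):
            run.append(token.lower())
            continue
        if run:
            candidates.append(" ".join(run))
            run = []
    if run:  # flush a run that reaches the end of the sentence
        candidates.append(" ".join(run))
    return candidates


if __name__ == "__main__":
    sentence = [("An", "DT"), ("integer", "NN"), ("character", "NN"),
                ("constant", "NN"), ("has", "VBZ"), ("type", "NN"),
                ("int", "NN"), (".", ".")]
    print(noun_candidates(sentence))
```

Candidates produced this way still need filtering: a run such as "type int" is grammatically a noun sequence but not necessarily a domain concept, which is where the synset assignment and manual review come in.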
Mining relations
• unoriented / oriented
• unnamed / named
• WordNet – concept must have a synset
  • hypernyms and hyponyms – IsA relations
  • holonyms and meronyms – partOf relations
  • relation orientation based on concept order
  • only direct relations
• from text
  • lexical-syntactic patterns
  • decomposition of multi-word terms – the right part of the term corresponds to an existing concept
    • example: assignment expression → assignment expression IsA expression
  • sentence syntax analysis – the parser's amod (adjectival modifier) dependency, an adjective followed by a noun
    • example: integral type IsA type
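The two text-based heuristics above can be illustrated with a short sketch. Both functions are assumptions about how such rules might be coded: `isa_from_multiword` links a multi-word term to its longest right-hand part that is already a known concept, and `isa_from_amod` approximates the parser's amod dependency by looking for an adjective tag directly followed by a noun tag, which is cruder than a real dependency parse.

```python
def isa_from_multiword(concepts):
    """Propose (term, 'IsA', suffix) when a right part of term is a concept."""
    known = set(concepts)
    relations = []
    for term in sorted(known):
        words = term.split()
        # Try the longest proper suffix first and keep only that direct link.
        for start in range(1, len(words)):
            suffix = " ".join(words[start:])
            if suffix in known:
                relations.append((term, "IsA", suffix))
                break
    return relations


def isa_from_amod(tagged_tokens):
    """Propose 'adjective noun IsA noun' for adjacent JJ + NN tag pairs."""
    relations = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            phrase = f"{word} {next_word}".lower()
            relations.append((phrase, "IsA", next_word.lower()))
    return relations


if __name__ == "__main__":
    print(isa_from_multiword(["expression", "assignment expression"]))
    print(isa_from_amod([("integral", "JJ"), ("type", "NN")]))
```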
Experiment
• ANSI/ISO C language
• comparison with an existing manually created ontology
• 2 experiments
  • all concept candidates
  • only the first 200 candidates
• 3 variants of the experiment
  • only candidates
  • candidates and IsA proposals
  • candidates, IsA proposals, and NER entities
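The slides do not state which measures were used for the comparison with the manually created ontology. One common way to compare a candidate list with a reference concept set is precision and recall, sketched below as an assumption; the function name `compare_with_reference`, the `limit` parameter (modelling the "first 200 candidates" setting), and the toy data are all hypothetical and are not results from the experiment.

```python
def compare_with_reference(candidates, reference, limit=None):
    """Return (precision, recall) of the first `limit` candidates."""
    selected = set(list(candidates)[:limit])  # limit=None keeps everything
    reference = set(reference)
    found = selected & reference
    precision = len(found) / len(selected) if selected else 0.0
    recall = len(found) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration only.
    candidates = ["type", "expression", "page", "constant"]
    reference = {"type", "expression", "statement"}
    print(compare_with_reference(candidates, reference))
    print(compare_with_reference(candidates, reference, limit=2))
```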
Experiment
• variant of the experiment without IsA relations, only with NER entities
Conclusions and further work
• concepts => lightweight ontology
• enables better automatic relation mining
Contacts
• Petr Šaloun, FEECS, VSB–Technical University of Ostrava, petr.saloun@vsb.cz
• Petr Klimánek (was: Faculty of Science, University of Ostrava), p.klimanek@gmail.com
• Zdenek Velart, FEECS, VSB–Technical University of Ostrava, zdenek.velart@gmail.com