INEX: Tools for Information Extraction. Artificial Intelligence Research Centre, Program Systems Institute, Russian Academy of Science, 152020 Pereslavl-Zalessky, Russia. Tel.: +7 48535 98065. E-mail: inex@epk.botik.ru
Information extraction. Objective: • extract meaningful information of a pre-specified type from (typically large amounts of) text for further analysis. Output: • data structures of a pre-specified format (filled scenario templates)
Examples: • Sports report: <winner>, <loser>, <score>, <location>, <date>… • Database of rental accommodation offers: <location>, <rental price>, <number of bedrooms>, <phone number>…
Possible IE application scenarios: • inference of new information (knowledge acquisition) • query formulation and answering in human-computer systems • automatic generation of abstracts and summaries • visualization of document content, etc.
The "Newsmaking" task: • <newsmaker> • <type of newsmaker> (person or organization) • <message> • <type of message> (original, cited, a reference to another newsmaker)
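The four slots of the "Newsmaking" template can be written down as a small data structure. The following is a minimal sketch, not the INEX representation: the class names, the field names and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NewsmakerType(Enum):
    PERSON = "person"
    ORGANIZATION = "organization"


class MessageType(Enum):
    ORIGINAL = "original"
    CITED = "cited"
    REFERENCE = "reference to another newsmaker"


@dataclass
class NewsmakingTemplate:
    # Every slot is optional because a template may be only partially filled.
    newsmaker: Optional[str] = None
    newsmaker_type: Optional[NewsmakerType] = None
    message: Optional[str] = None
    message_type: Optional[MessageType] = None

    def is_complete(self) -> bool:
        """True when all four slots have a value."""
        slots = (self.newsmaker, self.newsmaker_type, self.message, self.message_type)
        return all(value is not None for value in slots)


# Made-up sample values, used only to show a filled template.
filled = NewsmakingTemplate(
    newsmaker="The Ministry of Finance",
    newsmaker_type=NewsmakerType.ORGANIZATION,
    message="the budget will be revised in the autumn",
    message_type=MessageType.CITED,
)
partial = NewsmakingTemplate(newsmaker="The Ministry of Finance")
```

Keeping every slot optional lets later stages produce partially filled templates and complete them afterwards.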
Tokenisation & sentence segmentation • Tokenisation: identification of words, punctuation marks, delimiters, special characters • Sentence segmentation: recognition of sentence boundaries
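Both steps can be illustrated with a deliberately naive sketch. It assumes that a token is either a run of word characters or a single punctuation mark, and that every ".", "!" or "?" ends a sentence; abbreviations such as "e.g." would be split incorrectly. The function names are hypothetical.

```python
import re

# A token is a run of word characters or one non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final end mark
        sentences.append(current)
    return sentences


tokens = tokenize("The match ended 2:1. Who scored?")
sentences = split_sentences(tokens)
```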
Morphological analysis • maps every word-form of the input text to (a) canonical form(s) • recognizes the word's morphological properties Results are typically ambiguous.
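The ambiguity mentioned above is easy to see with a toy lexicon lookup. This sketch assumes a hand-written English lexicon with three entries; the `Analysis` tuple, the `LEXICON` table and the `analyse` function are illustrative names, not part of INEX.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str     # canonical form
    pos: str       # part of speech
    features: str  # remaining morphological properties


# Hypothetical mini-lexicon: one word-form may map to several analyses.
LEXICON = {
    "saw": [Analysis("see", "VERB", "past"), Analysis("saw", "NOUN", "singular")],
    "reports": [Analysis("report", "NOUN", "plural"),
                Analysis("report", "VERB", "3sg present")],
    "the": [Analysis("the", "DET", "")],
}


def analyse(word_form):
    """Return every analysis of a word-form; unknown words get a placeholder."""
    key = word_form.lower()
    return LEXICON.get(key, [Analysis(key, "UNKNOWN", "")])


ambiguous = analyse("saw")
```

Here `ambiguous` holds two analyses, a verb and a noun reading, which a later disambiguation step would have to choose between.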
Filtering • reduces the text to be subjected to further processing to potentially relevant portions
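One simple way to filter is to keep only the sentences that contain a trigger word. The sketch below assumes this keyword approach and a made-up trigger list for the newsmaking domain; the slides do not say which criterion INEX applies.

```python
# Hypothetical trigger words signalling that someone is making a statement.
TRIGGERS = {"said", "announced", "stated", "reported"}


def is_relevant(sentence_tokens):
    """A sentence is potentially relevant if it contains any trigger word."""
    return any(tok.lower() in TRIGGERS for tok in sentence_tokens)


def filter_sentences(sentences):
    """Keep only the sentences worth passing to the later stages."""
    return [s for s in sentences if is_relevant(s)]


sentences = [
    ["The", "minister", "announced", "a", "reform", "."],
    ["It", "rained", "all", "day", "."],
]
relevant = filter_sentences(sentences)
```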
Disambiguation is performed either • as a side effect of other processes (e.g., microsyntactic analysis), or • as a stand-alone stage
Microsyntactic analysis • identifies noun phrases (NP) • identifies some regularly formed constructions (numbers, dates, personal proper names)
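The regularly formed constructions (numbers, dates, personal names) lend themselves to pattern matching. The following sketch uses three illustrative regular expressions for English text, which is an assumption; it does not attempt noun-phrase identification. Earlier patterns take priority, so a number inside a date is not reported twice.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Ordered by priority: more specific constructions come first.
CONSTRUCTIONS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Dr)\. [A-Z][a-z]+\b")),
    ("NUMBER", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def find_constructions(text):
    """Return (label, value) pairs in text order, skipping overlapping matches."""
    found, taken = [], []
    for label, pattern in CONSTRUCTIONS:
        for m in pattern.finditer(text):
            # Skip a match that overlaps a span claimed by an earlier pattern.
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue
            taken.append(m.span())
            found.append((m.start(), label, m.group()))
    return [(label, value) for _, label, value in sorted(found)]


result = find_constructions("Dr. Ivanov paid 250 roubles on 12 March 2001.")
```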
Macrosyntactic analysis • identifies clause boundaries • constructs clause hierarchy within a sentence
Named entity recognizer • identifies proper names • assigns semantic features to certain items
Information extraction rules • a domain knowledge representation formalism (scenario templates) • a set of patterns to identify template elements in a text (covering the many possible ways to talk about the target event elements)
An IE pattern includes: • a set of rules that define how to find this pattern in a text • a set of constraints that textual elements must satisfy to fit into a particular slot of the target template
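The split between rules and constraints can be sketched as follows. The rule here is a single regular expression with named groups and each constraint is a predicate on one slot value; both are simplifying assumptions, and the `Pattern` class, its fields and the sample sentences are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Pattern:
    rule: re.Pattern                               # how to find the pattern
    constraints: Dict[str, Callable[[str], bool]]  # what each slot filler must satisfy

    def apply(self, sentence: str) -> Optional[Dict[str, str]]:
        """Return the filled slots, or None if the rule or a constraint fails."""
        match = self.rule.search(sentence)
        if match is None:
            return None
        slots = match.groupdict()
        for name, check in self.constraints.items():
            if not check(slots[name]):
                return None
        return slots


def is_capitalised(value):
    return value[:1].isupper()


def is_score(value):
    return re.fullmatch(r"\d+:\d+", value) is not None


win_pattern = Pattern(
    rule=re.compile(r"(?P<winner>\w+) beat (?P<loser>\w+) (?P<score>\S+)"),
    constraints={"winner": is_capitalised, "loser": is_capitalised, "score": is_score},
)

hit = win_pattern.apply("Spartak beat Dynamo 2:1 in Moscow")
miss = win_pattern.apply("Spartak beat Dynamo badly in Moscow")
```

The second sentence matches the rule but is rejected because "badly" does not satisfy the constraint on the score slot. A realistic system would need many such patterns to cover the different ways of describing the same event.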
Coreference Resolver • recognizes different occurrences of the same entity in a text
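As a very rough illustration, mentions of persons can be grouped when they end in the same surname. This is a crude heuristic chosen for brevity, not the method used by INEX: it ignores pronouns and would wrongly merge two people who share a surname. The function name and sample mentions are made up.

```python
def resolve_coreference(mentions):
    """Group mentions whose last token (taken as the surname) is the same."""
    chains = {}
    for mention in mentions:
        key = mention.split()[-1].lower()
        chains.setdefault(key, []).append(mention)
    return list(chains.values())


chains = resolve_coreference(
    ["Ivan Petrov", "Mr. Petrov", "Anna Sidorova", "Petrov"]
)
```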
Merging partial results • merging partially filled templates to produce a final, maximally filled template
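Merging can be sketched as a slot-by-slot union of two partial templates. The sketch assumes templates are dictionaries with `None` for empty slots and that two templates describe the same event only if no filled slot disagrees; the function name and sample data are illustrative.

```python
def merge_templates(a, b):
    """Merge two partially filled templates; return None if they conflict."""
    merged = dict(a)
    for slot, value in b.items():
        if value is None:
            continue  # nothing to contribute for this slot
        if merged.get(slot) is not None and merged[slot] != value:
            return None  # both templates fill the slot differently
        merged[slot] = value
    return merged


first = {"winner": "Spartak", "loser": None, "score": "2:1", "location": None}
second = {"winner": "Spartak", "loser": "Dynamo", "score": None, "location": "Moscow"}
merged = merge_templates(first, second)
```

Returning `None` on a conflict keeps templates about different events apart instead of silently overwriting one value with another.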