1. Named Entity Recognition Beto Boullosa
2. Introduction Presentation
Motivation
Contents
Information Extraction
Named Entity Recognition (NER)
An experiment with NER
Conclusions
3. Information Extraction “Automatic identification of selected types of entities, relations or events in free text” (GRISHMAN, 2003)
Related areas
Information Retrieval, Knowledge Extraction
IE x IR
4. Information Extraction Applications
“Processing of natural language texts for the extraction of relevant content pieces” (MARTÍ AND CASTELLÓN, 2000)
Raw texts => structured databases
Template filling
Improving search engines
Auxiliary tool for other language applications
5. IE History Early projects
Knowledge-based, rule-based
FRUMP – 1979
Newswire
LSP (Language String Project) – 1981
AMA – American Medical Association
Patient summaries
6. IE History MUC – Message Understanding Conferences (1987)
DARPA, NRAD
Standardization
Evaluation
Dissemination
DARPA’s TIPSTER Program: Document Detection, Summarization and Information Extraction – until 1998
TREC (Text Retrieval Conferences)
7. IE History MUC
Evaluation standards (for the 1st time in MUC-2)
Recall: $R = \frac{\text{correct units}}{\text{total units}}$
Precision: $P = \frac{\text{correct units}}{\text{units found}}$
F-Measure: $F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
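A minimal sketch of the three measures above, assuming the counts of correct units, units found and total units are already available. The function name and the sample counts are illustrative and do not come from any MUC scorer.

```python
def precision_recall_f(correct, found, total, beta=1.0):
    """Return (precision, recall, F) from unit counts.

    correct: units the system got right
    found:   units the system produced
    total:   units present in the answer key
    beta:    relative weight of recall against precision
    """
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Guard against 0/0 when both precision and recall are zero.
    f = (b2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f


# Made-up counts: 80 units found, 60 of them correct, 100 in the key.
p, r, f = precision_recall_f(correct=60, found=80, total=100)
print(p, r, f)
```

With beta = 1 the F-measure reduces to the harmonic mean of precision and recall.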
8. IE History MUC
Template filling
Mr. John Smith was appointed CEO of ACME last December 31.
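As a rough illustration of template filling for the sentence above, the sketch below uses one hand-written pattern. The slot names (person, position, organization, date) and the pattern itself are assumptions made for this example, not the MUC template definition.

```python
import re

# One hypothetical "appointment" pattern; a real system would have many.
APPOINTMENT = re.compile(
    r"(?:Mr\.|Mrs\.|Ms\.)\s+(?P<person>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
    r"\s+was appointed\s+(?P<position>[A-Z][A-Za-z]*)"
    r"\s+of\s+(?P<organization>[A-Z][A-Za-z]*)"
    r"\s+(?P<date>last\s+[A-Z][a-z]+\s+\d{1,2})"
)


def fill_template(sentence):
    """Return a dict of filled slots, or None if the pattern does not match."""
    match = APPOINTMENT.search(sentence)
    return match.groupdict() if match else None


template = fill_template(
    "Mr. John Smith was appointed CEO of ACME last December 31."
)
for slot, value in template.items():
    print(f"{slot}: {value}")
```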
9. IE History MUC-6 (1995)
Extraction of Named Entities
names of persons, organizations, locations
temporal expressions, currency and percentages
Extraction of Template Elements
grouping of entity attributes together into entity “objects”
Extraction of events (or Scenario Templates)
Extraction of coreferences
10. IE History MUC-6
ENAMEX (“entity name expression”) tag
people, organizations and locations
NUMEX (“numeric expression”) tag
currency and percentages
TIMEX (“time expression”) tag
temporal expressions – dates and times
11. IE History MUC-6
Andrew Johnson was appointed last Sunday president of ACME, the biggest company in Santa Barbara, California, with an estimated $300 million market capacity.
<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed <TIMEX TYPE="DATE">last Sunday</TIMEX> president of <ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>, the biggest company in <ENAMEX TYPE="LOCATION">Santa Barbara</ENAMEX>, <ENAMEX TYPE="LOCATION">California</ENAMEX>, with an estimated <NUMEX TYPE="MONEY">$300 million</NUMEX> market capacity.
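Markup of this kind is easy to read back programmatically. The sketch below assumes non-nested tags with a single TYPE attribute written with straight double quotes; the helper names are made up for the example.

```python
import re

# Matches one ENAMEX / TIMEX / NUMEX element and captures tag, type and text.
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(annotated):
    """Return a list of (text, tag, type) triples in document order."""
    return [(m.group(3), m.group(1), m.group(2)) for m in TAG.finditer(annotated)]


def strip_tags(annotated):
    """Return the plain sentence with the markup removed."""
    return TAG.sub(lambda m: m.group(3), annotated)


sample = (
    '<ENAMEX TYPE="PERSON">Andrew Johnson</ENAMEX> was appointed '
    '<TIMEX TYPE="DATE">last Sunday</TIMEX> president of '
    '<ENAMEX TYPE="ORGANIZATION">ACME</ENAMEX>'
)
for text, tag, entity_type in extract_entities(sample):
    print(tag, entity_type, text)
print(strip_tags(sample))
```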
12. IE History MUC-7 (1998)
Tasks
Named Entities (NE task)
Template Element (TE task)
Scenario Template (ST task)
Template Relation (TR task)
Coreferences (CO task)
System portability among domains
13. IE History Domains used in MUCs:
14. IE History Results in MUC-6:
15. IE History Other conferences
MET (Multilingual Entity Task Evaluation)
Japanese NEs
IREX
Japan, 1998
Organization, Person, Location, Artifact, Date, Time, Money and Percent
16. IE History Other conferences
HUB-4 and ACE (Automatic Content Extraction)
NIST (National Institute of Standards and Technology)
Spoken and printed text
CoNLL (Conference on Natural Language Learning)
Since 1997
NEs in the 2002 and 2003 editions
Multilingual
Person (PER), location (LOC), organization (ORG) and miscellaneous (MISC) classes
17. IE Techniques and tasks IE techniques:
Document indexing + text understanding
Document Indexing
Tags texts with different descriptors, giving a kind of semantic representation for its contents
Text Understanding
Builds a knowledge representation of texts
IE history:
TU => DI
More tractable perspective
18. IE Techniques and tasks FC Barcelona sold goalkeeper Valdés to Espanyol last August 14
19. IE Techniques and tasks Compare with:
20. IE Techniques and tasks Events and relations extraction
Knowledge-based techniques
Regular expressions and patterns
Knowledge-poor approaches
Machine learning, statistics
Coreferences
Anaphora resolution
Cross-document
21. IE Techniques and tasks Performance:
Events and relations extraction
x
Named entities extraction
Why?
22. Named Entity Recognition Recognition x Classification
“Name Identification and Classification”
NER as:
a tool or component of IE and IR
an input module for a robust shallow parsing engine
Component technology for other areas
Question Answering (QA)
Summarization
Automatic translation
Document indexing
Text data mining
Genetics
…
23. Named Entity Recognition NE Hierarchies
Person
Organization
Location
But also:
Artifact
Facility
Geopolitical entity
Vehicle
Weapon
Etc.
SEKINE & NOBATA (2004)
150 types
Domain-dependent
24. Named Entity Recognition Internal and external features (or evidence)
Capitalization
not all languages
speech data
trigger words
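“El senyor Balaguer vol comprar-se un cotxe nou.” (“Mr. Balaguer wants to buy himself a new car.”)
“La ciutat de Balaguer és tot un compendi de història de Catalunya.” (“The city of Balaguer is a whole compendium of the history of Catalonia.”)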
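The two Balaguer sentences show how a trigger word to the left of a name can settle its class. A minimal sketch, assuming a tiny hand-made trigger table (the three entries below are illustrative only):

```python
# Hypothetical trigger words mapped to the class they suggest.
TRIGGERS = {
    "senyor": "PERSON",
    "sr.": "PERSON",
    "ciutat de": "LOCATION",
}


def classify_by_trigger(sentence, name):
    """Guess the class of `name` from the words immediately before it."""
    lowered = sentence.lower()
    position = lowered.find(name.lower())
    if position < 0:
        return None
    left_context = lowered[:position].rstrip()
    for trigger, label in TRIGGERS.items():
        if left_context.endswith(trigger):
            return label
    return None  # no external evidence available


print(classify_by_trigger("El senyor Balaguer vol comprar-se un cotxe nou.", "Balaguer"))
print(classify_by_trigger("La ciutat de Balaguer és tot un compendi de història de Catalunya.", "Balaguer"))
```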
25. Named Entity Recognition Handcrafted systems
Knowledge (rule) based
Patterns
Gazetteers
Automatic systems
Statistical
Machine learning
Unsupervised
Analyze: char type, POS, lexical info, dictionaries
Hybrid systems
26. Named Entity Recognition Handcrafted systems
LTG
F-measure of 93.39 in MUC-7 (the best)
Ltquery, XML internal representation
Tokenizer, POS-tagger, SGML transducer
Nominator (1997)
IBM
Heavy heuristics
Cross-document co-reference resolution
Used later in IBM Intelligent Miner
27. Named Entity Recognition Handcrafted systems
LaSIE (Large Scale Information Extraction)
MUC-6 (LaSIE II in MUC-7)
Univ. of Sheffield’s GATE architecture (General Architecture for Text Engineering)
JAPE language
FACILE (1998)
NEA language (Named Entity Analysis)
Context-sensitive rules
NetOwl (MUC-7)
Commercial product
C++ engine, extraction rules
28. NER – automatic approaches Learning of statistical models or symbolic rules
Use of annotated text corpus
Manually annotated
Automatically annotated
“BIO” tagging
Tags: Begin, Inside, Outside an NE
Probabilities:
Simple:
P(tag_i | token_i)
With external evidence:
P(tag_i | token_{i-1}, token_i, token_{i+1})
“OpenClose” tagging
Two classifiers: one for the beginning, one for the end
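To make “BIO” tagging and the simple probability P(tag | token) concrete, here is a small sketch. It assumes entities are given as token-index spans and estimates the probability by relative frequency; both function names are made up for the example.

```python
from collections import Counter, defaultdict


def bio_encode(tokens, spans):
    """Turn (start, end_exclusive, label) spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def estimate_tag_given_token(tagged_sentences):
    """Relative-frequency estimate of P(tag | token) from (token, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {
        token: {tag: n / sum(tag_counts.values()) for tag, n in tag_counts.items()}
        for token, tag_counts in counts.items()
    }


tokens = "Andrew Johnson was appointed president of ACME".split()
tags = bio_encode(tokens, [(0, 2, "PERSON"), (6, 7, "ORGANIZATION")])
print(list(zip(tokens, tags)))
print(estimate_tag_given_token([list(zip(tokens, tags))])["ACME"])
```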
29. NER – automatic approaches Decision trees
Tree-oriented sequence of tests on every word
Determine probabilities of having a BIO tag
Use training corpus
Viterbi, ID3, C4.5 algorithms
Select most probable tag sequence
SEKINE et al (1998)
BALUJA et al (1999)
F-measure: 90%
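Selecting the most probable tag sequence is commonly done with Viterbi search. The sketch below is a generic first-order version, not the implementation of any system cited here; the probability tables are toy values invented for the example, and unseen events receive a small floor probability.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `tokens`."""
    if not tokens:
        return []

    def logp(p):
        # Unseen events get a small floor instead of log(0).
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-probability of the best path ending in tag t at
    # position i, previous tag on that path)
    best = [{t: (logp(start_p.get(t, 0)) + logp(emit_p[t].get(tokens[0], 0)), None)
             for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + logp(trans_p[p].get(t, 0)))
            score = (best[i - 1][prev][0] + logp(trans_p[prev].get(t, 0))
                     + logp(emit_p[t].get(tokens[i], 0)))
            column[t] = (score, prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are made up).
TAGS = ["B", "I", "O"]
START = {"B": 0.3, "I": 0.0, "O": 0.7}
TRANS = {
    "B": {"B": 0.05, "I": 0.60, "O": 0.35},
    "I": {"B": 0.05, "I": 0.40, "O": 0.55},
    "O": {"B": 0.20, "I": 0.00, "O": 0.80},
}
EMIT = {"B": {"Andrew": 0.5}, "I": {"Johnson": 0.5}, "O": {"was": 0.5}}

print(viterbi(["Andrew", "Johnson", "was"], TAGS, START, TRANS, EMIT))
```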
30. NER – automatic approaches HMM
Markov models, Viterbi
Separate statistical model for each NE category + model for words outside NEs
Nymble (1997) / IdentiFinder (1999)
Maximum Entropy (ME)
Separate, independent probabilities for each piece of evidence (external and internal features) are merged multiplicatively
MENE (NYU - 1998)
Capitalization, many lexical features, type of text
F-Measure: 89%
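The multiplicative merging used by a maximum entropy model can be sketched as follows, assuming binary features: each active (feature, tag) pair contributes a weight, the weights are multiplied, and the products are normalized over the tags. The feature names and weights are invented for the example and are not taken from MENE.

```python
def maxent_distribution(active_features, weights, tags):
    """Return P(tag | context) as a normalized product of feature weights."""
    scores = {}
    for tag in tags:
        score = 1.0
        for feature in active_features:
            # A missing (feature, tag) pair is neutral: weight 1.0.
            score *= weights.get((feature, tag), 1.0)
        scores[tag] = score
    z = sum(scores.values())  # normalization constant
    return {tag: score / z for tag, score in scores.items()}


# Hypothetical weights, as if learned from an annotated corpus.
WEIGHTS = {
    ("capitalized", "PERSON"): 3.0,
    ("capitalized", "O"): 0.5,
    ("previous_token_is_Mr.", "PERSON"): 4.0,
}

distribution = maxent_distribution(
    ["capitalized", "previous_token_is_Mr."], WEIGHTS, ["PERSON", "O"]
)
print(distribution)
```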
31. NER – other approaches Hybrid systems
Combination of techniques
IBM’s Intelligent Miner: Nominator + DB/2 data mining
WordNet hierarchies
MAGNINI et al. (2002)
Stacks of classifiers
Adaboost algorithm
Bootstrapping approaches
Small set of seeds
Memory-based ML, etc.
32. Named Entity Recognition Handcrafted systems x automatic systems
Ease of change
Portability (domains and languages)
Scalability
Language resources
Cost-effectiveness
33. NER in various languages Arabic
TAGARAB (1998)
Pattern-matching engine + morphological analysis
Lots of morphological info (no differences in orthographic case)
Bulgarian
OSENOVA & KOLKOVSKA (2002)
Handcrafted cascaded regular NE grammar
Pre-compiled lexicon and gazetteers
Catalan
CARRERAS et al. (2003b) and MÁRQUEZ et al. (2003)
Extract Catalan NEs with Spanish resources (F-measure 93%)
Bootstrap using Catalan texts
34. NER in various languages Chinese & Japanese
Many works
Special characteristics
Character or word-based
No capitalization
CHINERS (2003)
Sports domain
Machine learning
Shallow parsing technique
ASAHARA & MATSUMOTO (2003)
Character-based method
Support Vector Machine
87.2% F-measure in the IREX (outperformed most word-based systems)
35. NER in various languages Dutch
DE MEULDER et al. (2002)
Hybrid system
Gazetteers, grammars of names
Machine Learning Ripper algorithm
French
BÉCHET et al. (2000)
Decision trees
Le Monde news corpus
German
Non-proper nouns also capitalized
THIELEN (1995)
Incremental statistical approach
65% of proper names correctly disambiguated
36. NER in various languages Greek
KARKALETSIS et al. (1998)
English – Greek GIE (Greek Information Extraction) project
GATE platform
Italian
CUCCHIARELLI et al. (1998)
Merge rule-based and statistical approaches
Gazetteers
Context-dependent heuristics
ECRAN (Extraction of Content: Research at Near Market)
GATE architecture
Lack of linguistic resources: 20% of NEs undetected
Korean
CHUNG et al. (2003)
Rule-based model, Hidden Markov Model, boosting approach over unannotated data
37. NER in various languages Portuguese
SOLORIO & LÓPEZ (2004, 2005)
Adapted CARRERAS et al. (2002b) Spanish NER
Brazilian newspapers
Serbo-Croatian
NENADIC & SPASIC (2000)
Hand-written grammar rules
Highly inflective language
Lots of lexical and lemmatization pre-processing
Dual alphabet (Cyrillic and Latin)
Pre-processing stores the text in an independent format
38. NER in various languages Spanish
CARRERAS et al. (2002b)
Machine Learning, AdaBoost algorithm
BIO and OpenClose approaches
Swedish
SweNam system (DALIANIS & ÅSTRÖM, 2001)
Perl
Machine Learning techniques and matching rules
Turkish
TUR et al (2000)
Hidden Markov Model and Viterbi search
Lexical, morphological and context clues
39. Named Entity Recognition Multilingual approaches
Goals - CUCERZAN & YAROWSKY (1999)
To handle basic language-specific evidence
To learn from small NE lists (about 100 names)
To process large and small texts
To have good class scalability (to allow the definition of different classes of entities, according to the language or to the purpose)
To learn incrementally, storing learned information for future use
40. Named Entity Recognition Multilingual approaches
GALLIPPI (1996)
Machine Learning
English, Spanish, Portuguese
ECRAN (Extraction of Content: Research at Near Market)
REFLEX project (2005)
the US National Business Center
41. Named Entity Recognition Multilingual approaches
POIBEAU (2003)
Arabic, Chinese, English, French, German, Japanese, Finnish, Malagasy, Persian, Polish, Russian, Spanish and Swedish
UNICODE
Language independent architecture
Rule-based, machine-learning
Sharing of resources (dictionary, grammar rules…) for some languages
BOAS II (2004)
University of Maryland Baltimore County
Web-based
Pattern-matching
No large corpora
42. NER – other topics Character x word-based
JING et al. (2003)
Hidden Markov Model classifier
Character-based model better than word-based model
NER translation
Cross-language Information Retrieval (CLIR), Machine Translation (MT) and Question Answering (QA)
NER in speech
No punctuation, no capitalization
KIM & WOODLAND (2000)
Up to 88.58% F-measure
NER in Web pages
wrappers
43. NER: an experiment in Catalan General architecture
Common API
Segmentation module
POS-tagger
Disambiguator
Grammar module
Module for accessing the system dictionaries
44. NER: an experiment in Catalan General architecture
Typographical error detection module
Spelling error detection module
Grammatical error detection module
NER module
45. NER: an experiment in Catalan NER Module
Dictionary
Multi tokens
WORD FORM#LEMMA:TAG:FREQUENCY|WORD FORM:FREQUENCY|WORD FORM:FREQUENCY …
can#can:N5-FP:444|barbet:42|barça:23|Barceló:4
Categories
PERSON
Names and surnames
LOCATION
Common indicators
ORGANIZATION
Common indicators
UNKNOWN
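The dictionary line format shown on this slide can be read with a few string splits. This is only a sketch based on the one sample entry: it assumes `#` separates the word form from the rest, `|` separates items, and `:` separates fields, and it does not interpret what the trailing WORD FORM:FREQUENCY items mean. The key names in the result are made up.

```python
def parse_entry(line):
    """Parse 'FORM#LEMMA:TAG:FREQ|FORM:FREQ|...' into a dict."""
    word_form, rest = line.split("#", 1)
    head, *others = rest.split("|")
    lemma, tag, frequency = head.split(":")
    following = []
    for item in others:
        # rsplit keeps any ':' inside the word form intact.
        form, count = item.rsplit(":", 1)
        following.append((form, int(count)))
    return {
        "word_form": word_form,
        "lemma": lemma,
        "tag": tag,
        "frequency": int(frequency),
        "following_forms": following,
    }


entry = parse_entry("can#can:N5-FP:444|barbet:42|barça:23|Barceló:4")
print(entry)
```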
46. NER: an experiment in Catalan NER Module
Rules
Locations
“Verb_viure a location”
Exiliat novament, Macià viu a Bélgica. (“Exiled again, Macià lives in Belgium.”)
“Verb_néixer a location”
Joan neix a Barcelona (“Joan is born in Barcelona.”)
Persons
“Sr. person”
El Sr. Companys va sortir. (“Mr. Companys left.”)
“El position de location, person”
El alcalde de Barcelona, Joan Clos. (“The mayor of Barcelona, Joan Clos.”)
47. NER: an experiment in Catalan NER Module
Rules
Organizations
“El position de organization”
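El president de Cases Rives. (“The president of Cases Rives.”)
“Organization, verb_fundat el”
El club Orfeas Smyrna, fundat el 1890 per jònics que residien a la ciutat turca. (“The club Orfeas Smyrna, founded in 1890 by Ionians who lived in the Turkish city.”)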
Combinations
For persons, organizations and locations
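A few of the location and person rules can be approximated with regular expressions. This is a sketch only: the module described here presumably matches verb lemmas from the POS-tagger, whereas the verb forms listed below are a small hand-picked assumption, and the capitalized-name pattern is deliberately rough.

```python
import re

# One or more capitalized words (rough approximation, accents included).
NAME = r"[A-ZÀ-Ý]\w+(?:\s[A-ZÀ-Ý]\w+)*"

# (pattern, class) pairs; the verb forms are illustrative, not exhaustive.
RULES = [
    (re.compile(rf"\b(?:viu|viuen|vivia)\s+a\s+({NAME})"), "LOCATION"),
    (re.compile(rf"\b(?:neix|va néixer|nasqué)\s+a\s+({NAME})"), "LOCATION"),
    (re.compile(rf"\bSr\.\s+({NAME})"), "PERSON"),
]


def find_entities(sentence):
    """Return (name, class) pairs for every rule that fires."""
    found = []
    for pattern, label in RULES:
        for match in pattern.finditer(sentence):
            found.append((match.group(1), label))
    return found


for example in (
    "Exiliat novament, Macià viu a Bélgica.",
    "Joan neix a Barcelona",
    "El Sr. Companys va sortir.",
):
    print(example, "->", find_entities(example))
```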
48. NER: an experiment in Catalan NER Module
Error detection and suggestion
Pre-defined spelling rules
Inserting try characters before every letter of the word
Swapping characters one by one
Inserting try characters in their places
The NER correction as input for the Grammar module
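The three suggestion strategies above can be sketched as a candidate generator checked against a lexicon. This assumes “swapping characters one by one” means swapping adjacent characters and that “inserting try characters in their places” means replacing each letter; the try characters and the lexicon below are made up for the example.

```python
def candidates(word, try_chars):
    """Generate spelling variants of `word` using the three strategies."""
    variants = set()
    for i in range(len(word)):
        for c in try_chars:
            variants.add(word[:i] + c + word[i:])      # insert before letter i
            variants.add(word[:i] + c + word[i + 1:])  # replace letter i
    for i in range(len(word) - 1):
        # swap letters i and i + 1
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    return variants


def suggest(word, lexicon, try_chars="aeiou"):
    """Return the candidates that are known names, in alphabetical order."""
    return sorted(candidates(word, try_chars) & set(lexicon))


NAMES = {"Barcelona", "Badalona", "Balaguer"}  # hypothetical NE dictionary
print(suggest("Barcelna", NAMES))
print(suggest("Bacrelona", NAMES))
```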
49. NER: an experiment in Catalan Results
20 Catalan texts
Wikipedia, El Periòdic
10000 words
Various domains
Precision: 70%
Recall: 75%
F-Measure: 72%
Error correction and suggestions
50. Conclusions Needs better tuning
Rules
Dictionary
can#P0
can benet#Can Benet:N4BMS:9#can:N4BMS:9#benet:N4BMS:9:10000000#P1
can benet de#P0
can benet de la#P0
can benet de la prua#Can Benet de la Prua#can:N4BMS#benet:N4BMS#de:P#el:EA--FS#prua:N4BFS#P1
Test a statistics-based engine?
Treatment of gender, number
Expand to full IE system