Chinese and English Named Entity Recognition: Data Analysis and SVM Modeling

FA XIAN: Finding, Ascertaining, eXtracting, And Identifying Names • Yuan Ding and John Blitzer

Preliminaries • 250K of hand-annotated Chinese newswire • 2586 Person Names • 1422 Unique • BBN Identifinder • Trained on 650K of English Text • NYU MENE • Trained on 270K of English Text

Modeling Overview • SVM (Support Vector Machine) • Common Feature Set • Common Training/Evaluation Data

Features • Character based feature set • Chinese character (GB2312), English letter • Ci-2 Ci-1 Ci Ci+1 Ci+2 • Word boundary • Begin_of_Word (BOW), End_of_Word (EOW) • Previous tag (B,I,O) • [O]Outside PN, [I]Inside PN, [B]Beginning of PN

Data • Characters: 392795(training) + 34740(testing) • Unique Characters: 3252 • 行销全球的 <ENAMEX TYPE="ORGANIZATION:corporation">宜进实业</ENAMEX> <ENAMEX TYPE="PER_DESC">董事长</ENAMEX> <ENAMEX TYPE="PERSON">詹正田</ENAMEX>忿忿指陈：原先 <NUMEX TYPE="QUANTITY:weight">一公斤</NUMEX> <NUMEX TYPE="MONEY">一块八美金</NUMEX> 的加工丝，目前滑落至 <NUMEX TYPE="MONEY">一块三</NUMEX>

Example • ju shuo yuehan zai bin da shangxue • 据说约翰在宾大上学。 (“It is said that John studies at Penn.”) • C0 C1 C2 C3 C4 …… • O O B I O O O O O O (Tag) • B E B E BE B E B E BE (Word boundary) • O O O B I O O O O O (Previous tag) • 据说约翰在宾大上学 (Previous char)
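The feature rows in the example can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `char_features`, the feature keys, and the `<PAD>` symbol for positions outside the sentence are assumptions made here.

```python
from typing import Dict, List


def char_features(chars: List[str], boundaries: List[str],
                  prev_tags: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Collect the features for character position i."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol (an assumption).
        feats[f"C{offset:+d}"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    feats["boundary"] = boundaries[i]   # B, E, or BE for a one-character word
    feats["prev_tag"] = prev_tags[i]    # tag assigned to the previous character
    return feats


chars = list("据说约翰在宾大上学。")
tags = ["O", "O", "B", "I", "O", "O", "O", "O", "O", "O"]
boundaries = ["B", "E", "B", "E", "BE", "B", "E", "B", "E", "BE"]
# The previous-tag row is the tag row shifted right by one position.
prev_tags = ["O"] + tags[:-1]

# Features for the first character of the name (position 2).
features_for_yue = char_features(chars, boundaries, prev_tags, 2)
```

Here `features_for_yue` holds the five-character window around the first character of the name, its word-boundary flag, and the previous tag.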

Model 1: SVM (Support Vector Machine) • Search for the optimal separating hyperplane, i.e. the one that maximizes the margin [V. Vapnik 1995]

SVM - Properties • Two strong properties • – High generalization performance independent of feature dimension • – Training with combinations of multiple features by using a kernel function • Maximal margin strategy • Separate positive and negative (binary) examples with a linear hyperplane: w · x + b = 0, where w, x ∈ R^n and b ∈ R • Find the optimal hyperplane (parameters w, b) with the maximal margin
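As a small illustration of the hyperplane notation, the sketch below classifies points by the sign of w · x + b and computes the margin width 2 / ||w||, which holds under the usual convention that the closest examples satisfy |w · x + b| = 1. The values of `w` and `b` are made up for the example, not taken from the experiments.

```python
import math
from typing import List


def decision(w: List[float], b: float, x: List[float]) -> float:
    """Signed score w . x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: List[float], b: float, x: List[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision(w, b, x) >= 0 else -1


def margin_width(w: List[float]) -> float:
    """Distance between the planes w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Illustrative parameters only (assumed values).
w = [3.0, 4.0]
b = -5.0
label_pos = classify(w, b, [2.0, 1.0])
label_neg = classify(w, b, [0.0, 0.0])
width = margin_width(w)
```

Maximizing the margin is therefore the same as minimizing ||w|| subject to every training example being on the correct side.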

SVM – Kernel Function • Kernel function • K(x, y) = Φ(x) · Φ(y) • x, y are vectors in input space • Φ(x), Φ(y) are vectors in feature space • dim(feature space) >> dim(input space) • No need to compute Φ(x) explicitly • d-th polynomial kernel • K(xi, xj) = (xi · xj + 1)^d • considering combinations of up to d features
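The claim that the feature map never has to be computed explicitly can be checked numerically for a small case. The sketch below compares the degree-2 polynomial kernel with the dot product of an explicit feature map on 2-dimensional inputs; the function names are illustrative choices, and the explicit map is only written out to show the equivalence.

```python
import math
from typing import List, Sequence


def poly_kernel(x: Sequence[float], y: Sequence[float], d: int) -> float:
    """d-th polynomial kernel K(x, y) = (x . y + 1)^d."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d


def phi_degree2(x: Sequence[float]) -> List[float]:
    """Explicit feature map matching the d = 2 kernel on 2-dimensional input."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    # Constant, single features, squared features, and the pairwise product.
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x = [1.0, 0.0]
y = [1.0, 1.0]
implicit = poly_kernel(x, y, 2)
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(y)))
```

Both values agree, while the kernel only needs one dot product in the input space. The last coordinate of the explicit map is the pairwise feature combination that the kernel considers implicitly.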

SVM – Polynomial Kernel • So, the larger the d, the better? – Not necessarily! • Larger d • Virtually considering all d-grams • Higher precision, lower recall • Can amount to overfitting • Smaller d • The trained model is more general

Evaluation Setup • Features • [ES1]: { Ci-1 Ci Ci+1 } + { BOW EOW } • [ES2]: [ES1] + { Prev_Tag } • [ES3]: Ci-2 Ci-1 Ci Ci+1 Ci+2 (Ci+2 only for SVM) • Estimation of input space: 5 * 3200 = 16000 binary features
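The input-space estimate follows from giving each of the 5 window positions its own block of one-hot character features. A minimal sketch of that encoding, assuming a toy vocabulary and the hypothetical helper name `encode_window`:

```python
from typing import Dict, List


def encode_window(window_chars: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return the indices of the active binary features for a character window."""
    vocab_size = len(vocab)
    active: List[int] = []
    for slot, ch in enumerate(window_chars):
        # Each window slot owns its own block of vocab_size binary features.
        if ch in vocab:
            active.append(slot * vocab_size + vocab[ch])
    return active


# Toy vocabulary of 5 characters (the real data has roughly 3200).
vocab = {ch: k for k, ch in enumerate("据说约翰在")}
active = encode_window(list("据说约翰在"), vocab)
toy_dimension = 5 * len(vocab)
estimated_dimension = 5 * 3200
```

With the toy vocabulary the vector has 25 dimensions and exactly 5 active features; with about 3200 characters the same layout gives the 16000 binary features quoted above.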

Target of Decision • SVM Model • Binary classifier • I and O only • Any word that contains an “I” is treated as a person name.

Experiment Setup • ju-shuo | yue-han | zai | bin-da | shang-xue (据说约翰在宾大上学) • Gold: O O B I O O O O O • Model1: O B I I I O O O O • Model2: O O O O O O O O O • Use perfect word boundary in evaluation • Model1: positive<+3> gold<+1> true positive<+1> • Model2: positive<+0> gold<+1> true positive<+0> • Use automatic segmenter • Model1: positive<+1> gold<+1> true positive<+0> • Model2: positive<+0> gold<+1> true positive<+0>
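The counts on this slide can be reproduced with a short sketch of the word-level scoring rule: a word is a predicted name if any of its characters is tagged as part of a name. The function names are illustrative, and the automatic segmentation used below (`hypothetical_auto`) is an assumed example chosen to be consistent with the counts shown; the slide does not give the segmenter's actual output.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def name_words(tags: List[str], word_lengths: List[int]) -> Set[Span]:
    """Spans of words that contain at least one non-O character tag."""
    spans: Set[Span] = set()
    start = 0
    for length in word_lengths:
        end = start + length
        if any(t != "O" for t in tags[start:end]):
            spans.add((start, end))
        start = end
    return spans


def counts(pred: Set[Span], gold: Set[Span]) -> Tuple[int, int, int]:
    """Return (positive, gold, true positive)."""
    return len(pred), len(gold), len(pred & gold)


def f_score(positive: int, gold: int, true_positive: int) -> float:
    """Balanced F-score; defined as 0.0 when there are no true positives."""
    if true_positive == 0:
        return 0.0
    precision = true_positive / positive
    recall = true_positive / gold
    return 2 * precision * recall / (precision + recall)


gold_tags = "O O B I O O O O O".split()
model1 = "O B I I I O O O O".split()
model2 = ["O"] * 9

perfect = [2, 2, 1, 2, 2]            # ju-shuo | yue-han | zai | bin-da | shang-xue
gold = name_words(gold_tags, perfect)

m1_perfect = counts(name_words(model1, perfect), gold)
m2_perfect = counts(name_words(model2, perfect), gold)

# Assumed segmenter output, for illustration only.
hypothetical_auto = [1, 4, 2, 2]
m1_auto = counts(name_words(model1, hypothetical_auto), gold)
m2_auto = counts(name_words(model2, hypothetical_auto), gold)

m1_perfect_f = f_score(*m1_perfect)
```

Under the assumed segmentation, Model1's single predicted word does not line up with the gold name, so it scores no true positive even though its tags cover the name.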

Results Current English Person Name Finder: F-score around 75%

Future Work • Dynamic decoding • Integrate with word segmenter • Other named entities

Thank you! Special Thanks To BBN M. Palmer E. Loper S. Kulick T. Morton T. Joachims (Author of SVMLight)
