Shanxi HPC Research Center
NLP and Big Data
Xiaoge LI, lixg@xupt.edu.cn
WBDB2013, Xi’an, China
Introduction
• The Internet is a big knowledge base, but it is unstructured
• NLP & IE “understand” human language
• From unstructured data to structured data
Problems
• Human language changes
• “Let's Google it!”
• Net language (LOL, 给力 “awesome”)
• Compound words (JFK airport)
• Domain knowledge
• Domain-specific training sets
• Chinese tokenization, e.g. 小菊的生活很给力 (“Xiaoju's life is awesome”)
• Net word split in two: 小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
• Net word kept whole: 小菊/nr 的/u 生活/vn 很/d 给力/a
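The two segmentations above differ only in whether 给力 is known to the tokenizer. A minimal sketch of that effect, assuming a plain forward maximum-matching segmenter and a toy lexicon (both are illustrative stand-ins, not the engine described in these slides):

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (an assumption for illustration, not the authors' dictionary).
base_lexicon = {"小菊", "生活", "的", "很", "给", "力"}
sentence = "小菊的生活很给力"

print(forward_max_match(sentence, base_lexicon))
# ['小菊', '的', '生活', '很', '给', '力']
print(forward_max_match(sentence, base_lexicon | {"给力"}))
# ['小菊', '的', '生活', '很', '给力']
```

Adding the new word to the lexicon is enough to change the output, which is why new-word discovery from large corpora matters for Chinese tokenization.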
NLP needs big data
• Unsupervised (weakly supervised) learning
• Knowledge acquisition
• Relationships
• New words
• NE gazetteers
Knowledge acquisition
• Large-scale corpus from the Web
• Weakly supervised learning
• Bootstrapping technique
• MapReduce, HBase
• Precision: location NE 87.28%, new words 72.1%
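Bootstrapping starts from a few seed entities, learns the contexts they occur in, and uses those contexts to harvest further entities. The sketch below is a generic toy version on an invented English corpus; the function name, the one-word context patterns, and the corpus are assumptions, and it does not reproduce the authors' MapReduce implementation.

```python
import re


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning context patterns from known entities
    and using those patterns to harvest new entities."""
    entities = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: the word immediately before a known entity becomes a pattern.
        for sentence in corpus:
            for entity in entities:
                match = re.search(r"(\w+) " + re.escape(entity) + r"\b", sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: a capitalized word right after a learned pattern becomes an entity.
        for sentence in corpus:
            for pattern in patterns:
                regex = r"\b" + re.escape(pattern) + r" ([A-Z]\w+)"
                for found in re.findall(regex, sentence):
                    entities.add(found)
    return entities, patterns


# Invented toy corpus, for illustration only.
corpus = [
    "She lives in Beijing",
    "He works in Shanghai",
    "They flew to Shanghai",
    "We flew to Qingdao",
]
entities, patterns = bootstrap(corpus, seeds={"Beijing"})
print(sorted(entities))  # ['Beijing', 'Qingdao', 'Shanghai']
print(sorted(patterns))  # ['in', 'to']
```

A practical system would also score patterns and candidates before accepting them, since unfiltered bootstrapping tends to drift away from the seed category.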
Chinese NLP & IE engine
• Pipeline
• FST and statistical mixture model
• Input: plain text
• Output: structured XML
• MapReduce
• Speed: 500 KB/s on 10 nodes
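The plain-text-in, structured-XML-out shape of such a pipeline can be sketched as follows. The stages, the tiny gazetteer, and the XML element names are all hypothetical placeholders; the real engine's FST and statistical components and its output schema are not described in these slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical gazetteer standing in for the engine's FST and statistical stages.
GAZETTEER = {"Beijing": "location", "People's Daily": "organization"}


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_entities(text):
    """Return (surface form, type) pairs for gazetteer entries found in the text."""
    return [(name, kind) for name, kind in GAZETTEER.items() if name in text]


def to_xml(text):
    """Run the stages in order and wrap their results in one XML document."""
    doc = ET.Element("document")
    ET.SubElement(doc, "text").text = text
    tokens = ET.SubElement(doc, "tokens")
    for tok in tokenize(text):
        ET.SubElement(tokens, "token").text = tok
    entities = ET.SubElement(doc, "entities")
    for name, kind in tag_entities(text):
        ET.SubElement(entities, "entity", type=kind).text = name
    return ET.tostring(doc, encoding="unicode")


print(to_xml("People's Daily is published in Beijing."))
```

Because `to_xml` depends only on its own input document, it could serve as the map step of a MapReduce job, with one call per document.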
Information objects: Profile and Event
Example Profile
• In a concept-based profile, the attributes are filled by its participant profiles.
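One possible reading of that statement as a data structure is sketched below. The class names, fields, role names, and the example event are all assumptions made for illustration; the slides do not specify the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    name: str
    kind: str  # e.g. "person", "organization", "location", "concept"
    attributes: dict = field(default_factory=dict)


@dataclass
class Event:
    kind: str
    participants: dict  # role name -> Profile


def concept_profile(event):
    """Build a concept-based profile whose attributes are the event's participant profiles."""
    profile = Profile(name=event.kind, kind="concept")
    for role, participant in event.participants.items():
        profile.attributes[role] = participant
    return profile


# Hypothetical event with placeholder participants.
visit = Event(
    kind="visit",
    participants={
        "visitor": Profile("Person A", "person"),
        "place": Profile("City B", "location"),
    },
)
profile = concept_profile(visit)
print(profile.name, {role: p.name for role, p in profile.attributes.items()})
```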
Cross-document information fusion
• Hierarchical clustering
• MapReduce, HBase
• Half a million profiles
• Computing complexity
• P = 94.65%, R = 88.24%, F = 91.33%
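Two quick checks on these figures. The F value is the harmonic mean of the reported precision and recall, and a naive all-pairs comparison over half a million profiles shows why computing complexity is a concern. Assuming that naive all-pairs comparison is the baseline is my illustration, not a statement about the authors' algorithm.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pair_count(n):
    """Number of unordered pairs among n items, i.e. n choose 2."""
    return n * (n - 1) // 2


f = f_measure(0.9465, 0.8824)
print(f"F = {f:.4f}")        # F = 0.9133
print(pair_count(500_000))   # 124999750000
```

Roughly 1.25 × 10^11 pairwise comparisons for a single pass is a plausible motivation for distributing the clustering with MapReduce.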
Information Graph
• Multi-dimensional
• Orange: location
• Gray: organization
• Blue: person
• Source: 2012 People’s Daily
• Query: China Agricultural University
• Expand 1 level
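The query, expand, and filter operations shown on these graph slides can be sketched over a small typed graph. The class, its method names, and every edge below are invented placeholders; none of this data comes from the People's Daily corpus.

```python
from collections import defaultdict


class InfoGraph:
    """Undirected graph whose nodes carry an entity type."""

    def __init__(self):
        self.kind = {}
        self.neighbors = defaultdict(set)

    def add_edge(self, a, kind_a, b, kind_b):
        self.kind[a], self.kind[b] = kind_a, kind_b
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def expand(self, query, levels=1, keep=None):
        """Return nodes within `levels` hops of `query`, optionally filtered by type."""
        seen = {query}
        frontier = {query}
        for _ in range(levels):
            frontier = {n for node in frontier for n in self.neighbors[node]} - seen
            seen |= frontier
        result = seen - {query}
        if keep is not None:
            result = {n for n in result if self.kind[n] == keep}
        return result


# Placeholder edges, for illustration only.
g = InfoGraph()
q = "China Agricultural University"
g.add_edge(q, "organization", "Beijing", "location")
g.add_edge(q, "organization", "Person A", "person")
g.add_edge(q, "organization", "Organization B", "organization")
g.add_edge("Person A", "person", "City C", "location")

print(sorted(g.expand(q)))                       # ['Beijing', 'Organization B', 'Person A']
print(sorted(g.expand(q, keep="organization")))  # ['Organization B']
print(sorted(g.expand(q, levels=2, keep="location")))  # ['Beijing', 'City C']
```

In this sketch the type filter is applied to the result set only, so traversal still passes through nodes of every type.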
Organization-Organization Network
• Query: China Agricultural University
• Filter: organization
Location-Person Network
• Query: 青岛港 (Port of Qingdao)
• Filter: location
Person-Location Network
• Query: 金日成 (Kim Il-sung)
Future Work
• Query language
• Graph mining
• Enhance the NLP engine
• Visualization
Questions? Thank you