Introduction to CL Session 1: 7/08/2011
What is computational linguistics? • Processing natural language text by computers • for practical applications • ... or linguistic research • Among practical applications • Sometimes the computer only needs to classify or transform the text • ... but sometimes it needs to “understand” • Ex: Watson: winner of ‘Jeopardy’ • CL vs. NLP (natural language processing)
NLP applications • Automatic speech recognition (ASR): speech → text • Machine translation (MT): L1 → L2 • Information retrieval (IR): Query + documents → a subset of the documents • Information extraction (IE): document → “database”
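To make the IR mapping (query + documents to a subset of the documents) concrete, here is a minimal sketch that ranks documents by how many query words they share. The function names and the overlap score are illustrative assumptions, not part of the slides; real IR systems use weighting schemes such as TF-IDF.

```python
def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    return [w.strip(".,!?").lower() for w in text.split()]

def retrieve(query, documents):
    """Return the documents that share at least one word with the query,
    ordered by decreasing word overlap (a toy relevance score)."""
    query_words = set(tokenize(query))
    scored = []
    for doc in documents:
        overlap = len(query_words & set(tokenize(doc)))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored]

docs = [
    "He visited New York in 2003.",
    "Time flies like an arrow.",
    "John saw a man with a telescope.",
]
print(retrieve("who visited New York", docs))
```

Only the first document shares words with the query, so it is the only one returned.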
NLP applications (cont) • Question answering (QA): Question + documents → Answer • Summarization: documents → summary • Natural language generation (NLG): representation → text
Other Applications • Call Center • Spam filter • Spell checker • Sentiment analysis: product reviews • Bio-NLP: processing clinical data • ….
Basic NLP tasks: Shallow processing • Tokenization: • He visited New York in 2003. • Morphological analysis: • visited → visit + -ed • Part-of-speech tagging • He/Pron visited/V New/?? York/N in/Prep 2003/CD • Named-entity tagging • He visited [LOCATION New York] in [YEAR 2003] • Chunking • [NP He] [V visited] [NP New York] in [NP 2003]
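The first three shallow-processing steps can be sketched in a few lines. This is a toy illustration only: the regular expression, the suffix rule, and the tiny `LEXICON` are assumptions made up for the slide's example sentence, and each would break on ordinary text (the suffix rule mishandles "loved", for instance).

```python
import re

def tokenize(sentence):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def analyze(word):
    """Naive morphological analysis: strip a final -ed from longer words."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2], "-ed"
    return word, None

# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {"he": "Pron", "visited": "V", "york": "N", "in": "Prep", ".": "."}

def tag(tokens):
    """Lexicon-lookup tagger: numbers get CD, unknown words get '??'."""
    tagged = []
    for token in tokens:
        if token.isdigit():
            tagged.append((token, "CD"))
        else:
            tagged.append((token, LEXICON.get(token.lower(), "??")))
    return tagged

tokens = tokenize("He visited New York in 2003.")
print(tokens)
print(analyze("visited"))
print(tag(tokens))
```

"New" comes out as `??` because a word-by-word lookup has no way to know that it belongs to the name "New York"; that is the kind of case named-entity tagging and chunking are meant to handle.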
Basic NLP tasks: Deep processing • Parsing • (S (NP (PRON he)) (VP (V visited) ….) • Semantic analysis • Semantic tagging: [AGENT He] visited [DEST New York] …. • Meaning: visit (he, New-York) • Discourse • Co-reference: “He” refers to “John” • Discourse structure • Dialogue • Generation
Ambiguity • Phonological ambiguity: (ASR) • “too”, “two”, “to” • “ice cream” vs. “I scream” • “ta” in Mandarin: he, she, or it • Morphological ambiguity: (morphological analysis) • unlockable: [[un-lock]-able] vs. [un-[lock-able]] • Syntactic ambiguity: (parsing) • John saw a man with a telescope. • Time flies like an arrow.
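Syntactic ambiguity can be made visible by counting parse trees. The sketch below runs a CKY-style chart over a tiny grammar in binary form; the grammar and lexicon are assumptions written just for "John saw a man with a telescope", not a grammar from the slides.

```python
from collections import defaultdict

# Toy grammar: (parent, left child, right child). Hypothetical, minimal.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP modifies the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP modifies the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "John": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` under the toy grammar."""
    n = len(words)
    # chart[(i, j)][A] = number of trees with label A spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

print(count_parses("John saw a man".split()))                    # 1
print(count_parses("John saw a man with a telescope".split()))   # 2
```

The two trees for the longer sentence correspond to attaching "with a telescope" to the verb (John used the telescope) or to the noun (the man had the telescope).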
Ambiguity (cont) • Lexical ambiguity: (WSD) • Ex: “bank”, “saw”, “run” • Semantic ambiguity: (semantic representation) • Ex: every boy loves his mother • Ex: John and Mary bought a house • Discourse ambiguity: • Susan called Mary. She was sick. (coreference resolution) • It is pretty hot here. (intention resolution) • Machine translation: • “brother”, “cousin”, “uncle”, etc.
Ambiguity resolution • Rule-based or knowledge-based: • Parsing: • I saw a man with a hat • I saw a man with a telescope (in my hand) • WSD: • “bank” • MT: • “brother”, “cousin”, “uncle” • Statistical approach: • Requires training data • Builds a statistical model • Knowledge and rules can be incorporated into the model as features etc.
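A knowledge-based resolution of the two parsing examples could look like the sketch below. The `INSTRUMENTS` table is a hypothetical, hand-written knowledge base, and the rule (attach the PP to the verb only if its noun is a known instrument of that verb) is one assumed heuristic among many.

```python
# Hypothetical knowledge base: nouns that can serve as instruments of a verb.
INSTRUMENTS = {
    "saw": {"telescope", "binoculars"},
}

def attach_pp(verb, pp_noun):
    """Decide where 'with <pp_noun>' attaches: to the verb or to the object noun."""
    if pp_noun in INSTRUMENTS.get(verb, set()):
        return "verb"   # e.g. seeing by means of a telescope
    return "noun"       # e.g. a man who has a hat

print(attach_pp("saw", "hat"))        # noun
print(attach_pp("saw", "telescope"))  # verb
```

The rule only encodes a default preference. "I saw a man with a telescope" stays ambiguous without further context such as "in my hand", which is one reason statistical models trained on data are often preferred.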
Major approaches to NLP • Rule-based approach • Statistical approach • Supervised learning • Semi-supervised learning • Unsupervised learning
Supervised learning algorithms • Hidden Markov Model (HMM) • Decision tree • Decision list • Naïve Bayes • Transformation-based Learning (TBL) • Maximum Entropy (MaxEnt) • Support Vector Machine (SVM) • Conditional Random Field (CRF) • …
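As one instance from this list, here is a minimal Naïve Bayes sketch for disambiguating "bank", using the words in the sentence as features. The four training sentences and the sense labels are invented for illustration, and add-one smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (sense of "bank", context sentence).
TRAIN = [
    ("finance", "deposit money in the bank account"),
    ("finance", "the bank approved the loan"),
    ("river", "fishing on the bank of the river"),
    ("river", "the river bank was muddy"),
]

def train(examples):
    """Collect sense counts, per-sense word counts, and the vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, text in examples:
        sense_counts[sense] += 1
        for word in text.split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab

def classify(text, model):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denominator = sum(word_counts[sense].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TRAIN)
print(classify("loan from the bank", model))   # finance
print(classify("muddy river bank", model))     # river
```

Supervised learning here means that every training sentence carries a gold sense label; the other algorithms in the list differ in the model they fit to such labeled data.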
Data • Raw text: • Monolingual: English/Chinese/Arabic Gigaword • Parallel data: UN data, EuroParl • Treebank: • Syntactic treebanks: a set of parse trees • Proposition Bank • Discourse Treebank • Dictionaries • WordNet • FrameNet • …
[Figure: diagram relating Applications, tasks (Task1, Task2, …, Task_i), ML methods (ML1, ML2, …, ML_m), and data (D1, D2, …, D_n)]
The role of linguistic knowledge in NLP • Suppose an NLP system is language-independent. • Good or bad? • Good: it can be ported to many languages without any changes. • Bad: it cannot take advantage of properties of certain languages. • How to incorporate (linguistic) knowledge in statistical systems? • the design of models • as features • as filters • … • Building a treebank is an effective way.