410 likes | 592 Views
LING/C SC/PSYC 438/538. Lecture 1 Sandiway Fong. Syllabus. Details: 538: there are no formal pre-requisites for this class 438: LING 388 or a course in one of the following: formal languages, syntax, data structures, or compilers. Instructor: Sandiway Fong,
E N D
LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong
Syllabus • Details: • 538: there are no formal pre-requisites for this class • 438: LING 388 or a course in one of the following: formal languages, syntax, data structures, or compilers. • Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science • Office: Douglass 311 (ph. 626-6567) • Hours: • by appt. or walk-in • after class (best if you have Qs about the class) • Email: sandiway@email.arizona.edu • Meet here in computer lab • 11:00am–12:15pm • Social Sciences 224 • No class • Thursday November 24th (Thanksgiving) • Course objectives: • introduction to computational linguistics • survey a range of topics • introduction to programming • hands-on experience: computer lab exercises • Expected learning outcomes: • acquire ability to write short programs • familiarity with basic concepts, techniques and applications • be equipped to take more advanced classes in computational linguistics • Grading • 438 • homeworks 65% • (note: all homeworks are required but not necessarily graded • homework may also be in the form of readings: (graded) pop quiz next lecture) • midterm exam 35% • 538 • homeworks 50% • (homeworks, will be a superset of exercises for 438) • midterm exam 25% • term report and presentation 25%
Syllabus • Absences • tell me ahead of time so we can make special arrangements • ok: religious, dean-sanctioned, medical, etc... • Required text • Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall 2008 • Special equipment • none • all software required for the course is freely available off the net • Classroom etiquette • you may ask questions • use your own laptop or lab computer • you may collaborate on classroom exercises • switch off cellphones • no communication during an exam • Homeworks • you may discuss the question with other students • however, you must write it up yourself (in your own words) • cite (web) references and your classmates (in the case of discussion) • Student Code of Academic Integrity: plagiarism etc. • http://dos.web.arizona.edu/uapolicies • Other stuff • See http://web.arizona.edu/%7Epolicy/Undergraduate%20Course%20Syllabus%20Policy.pdf • Revisions to the syllabus • “the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.”
Textbook (J&M) 2008 (2nd edition) Nearly 1000 pages (full year’s worth…) 25 chapters Divided into 5 parts Words Speech – not this course Syntax Semantics and Pragmatics Applications
Syllabus • Coverage • Intro to programming • we’re going to use Perl • (Python is another popular language) • Topics: selected chapters from J&M • Chapters 1–6, skip Speech part (7–11), 12–25
Book chapters • 1. Introduction • 1.1. Knowledge in Speech and Language Processing • 1.2. Ambiguity • 1.3. Models and Algorithms • 1.4. Language, Thought, and Understanding • 1.5. The State of the Art • 1.6. Some Brief History • 1.6.1. Foundational Insights: 1940s and 1950s • 1.6.2. The Two Camps: 1957–1970 • 1.6.3. Four Paradigms: 1970–1983 • 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993 • 1.6.5. The Field Comes Together: 1994–1999 • 1.6.6. The Rise of Machine Learning: 2000–2008 • 1.6.7. On Multiple Discoveries • 1.6.8. A Final Brief Note on Psychology • 1.7. Summary • Bibliographical and Historical Notes • I. Words • 2. Regular Expressions and Automata • 2.1. Regular Expressions • 2.1.1. Basic Regular Expression Patterns • 2.1.2. Disjunction, Grouping, and Precedence • 2.1.3. A Simple Example • 2.1.4. A More Complex Example • 2.1.5. Advanced Operators • 2.1.6. Regular Expression Substitution, Memory, and ELIZA • 2.2. Finite-State Automata • 2.2.1. Use of an FSA to Recognize Sheeptalk • 2.2.2. Formal Languages • 2.2.3. Another Example • 2.2.4. Non-Deterministic FSAs • 2.2.5. Use of an NFSA to Accept Strings • 2.2.6. Recognition as Search • 2.2.7. Relation of Deterministic and Non-Deterministic Automata • 2.3. Regular Languages and FSAs • 2.4. Summary • Bibliographical and Historical Notes • Exercises • 3. Words and Transducers • 3.1. Survey of (Mostly) English Morphology • 3.1.1. Inflectional Morphology • 3.1.2. Derivational Morphology • 3.1.3. Cliticization • 3.1.4. Non-Concatenative Morphology • 3.1.5. Agreement • 3.2. Finite-State Morphological Parsing • 3.3. Construction of a Finite-State Lexicon • 3.4. Finite-State Transducers • 3.4.1. Sequential Transducers and Determinism • 3.5. FSTs for Morphological Parsing • 3.6. Transducers and Orthographic Rules • 3.7. The Combination of an FST Lexicon and Rules • 3.8. Lexicon-Free FSTs: The Porter Stemmer • 3.9. Word and Sentence Tokenization • 3.9.1. Segmentation in Chinese • 3.10. Detection and Correction of Spelling Errors • 3.11. Minimum Edit Distance • 3.12. Human Morphological Processing • 3.13. Summary • Bibliographical and Historical Notes • Exercises • 4. N-Grams • 4.1. Word Counting in Corpora • 4.2. Simple (Unsmoothed) N-Grams • 4.3. Training and Test Sets • 4.3.1. N-Gram Sensitivity to the Training Corpus • 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks • 4.4. Evaluating N-Grams: Perplexity • 4.5. Smoothing • 4.5.1. Laplace Smoothing • 4.5.2. Good-Turing Discounting • 4.5.3. Some Advanced Issues in Good-Turing Estimation • 4.6. Interpolation • 4.7. Backoff • 4.7.1. Advanced: Details of Computing Katz Backoffα and P* • 4.8. Practical Issues: Toolkits and Data Formats • 4.9. Advanced Issues in Language Modeling • 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing • 4.9.2. Class-Based N-Grams • 4.9.3. Language Model Adaptation and Web Use • 4.9.4. Using Longer-Distance Information: A Brief Summary • 4.10. Advanced: Information Theory Background • 4.10.1. Cross-Entropy for Comparing Models • 4.11. Advanced: The Entropy of English and Entropy Rate Constancy • 4.12. Summary • Bibliographical and Historical Notes • Exercises
Lecture slides • Download from my homepage • http://dingo.sbs.arizona.edu/~sandiway/#courses • available from class time (and afterwards) • In .pptx and .pdf formats
Homework: Reading • Chapter 1 from JM • introduction and history • available online • http://www.cs.colorado.edu/%7Emartin/SLP/Updates/1.pdf • Whole book is available as an e-book • www.coursesmart.com
Homework: Perl • Install Perl on your laptop • If you don’t already have it, it’s freely available here • http://www.activestate.com/ (free and for-fee versions)
Learning Perl • Learn Perl • Books… • Online resources • http://learn.perl.org/ • Next time, we begin with ... • http://perldoc.perl.org/perlintro.html
Language and Computers • Enormous amounts of data stored • world-wide web (WWW) • corporate databases • your own hard drive • Major categories of data • numeric • Language: words, text, sound • pictures, video
Language and Computers • We know what we want from computer software • “killer applications” • those that can make sense of language data • retrieve language data: (IR) • summarize knowledge contained in language data • sentiment analysis from online product reviews • answer questions (QA), make logical inferences • translate from one language into another • recognize speech: transcribe • etc...
Language and Computers • In other words, we’d like computers to be smart about language • possess “intelligence” • pass the Turing Test …
Language and Computers • In other words, we’d like computers to be smart about language • possess intelligence • well, perhaps not too smart… From 2001… (HAL)
Language and Computers • (Un)fortunately, we’re not there yet… • gap between what computers can do and • what we want them to be able to do Often quoted (but not verified): "The spirit is strong, but the flesh is weak" was translated into Russian and then back to English, the result was "The vodka is good, but the meat is rotten." but with Google translate or babelfish, it’s not difficult to find (funny) examples…
Language and Computers • and how can we tell if the translation is right anyway? • http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg
Language and Computers • In language analysis, errors compound http://tashian.com/multibabel/
Language and Computers • Google Translate • two real experiences….
Language and Computers • 4,000,000 yen or 40,000 yen?
Language and Computers • Non-compositionality Puzzle 40,000 yen + less than
Language and Computers • Non-compositionality Puzzle
Language and Computers • What happened? • 4万円 + 以下 • 40,000 yen + less than • 万円 = million (error)
Language and Computers • August 2011 • bug has been fixed • sort of…
Language and Computers • I paid a visit to the Peabody Essex Museum (Salem MA) • Yin Yu Tang: A Chinese Home … so what do those 3 characters (Yin Yu Yang) – the name of the house actually mean?
Language and Computers … excuse my poor attempt at handwriting input
Language and Computers First character 蔭 translates as Yam. Capitalization is important: it’s not an edible tuber. It’s returning a name. But this is not a translation! Even if it’s just romanization, shouldn’t it be Yin? BTW, Yam here suggests Cantonese pronunciation Why?
Language and Computers • almost useful..? Gee, seems like one needs to already know Chinese to get Google translate to yield a translation of the first character 蔭 as shade. (I cheated by entering the collocation.) All 3 characters gets us back to square zero 1st two characters returns title/name
Language and Computers • almost useful…? • Is it possible to use Google translate to figure out the name of the house means • “Mansion of plentiful shade” can you come up with a hypothesis for why this happens? i.e. what technology do you think Google translate uses?
Applications • technology is still in development • even if we are willing to pay... • machine translation has been worked on since after World War II (1950s) • still not perfected today • why? • what are the properties of human languages that make it hard?
Natural Language Properties • which properties are going to be difficult for computers to deal with? • grammar (Rules for putting words together into sentences) • How many rules are there? • 100, 1000, 10000, more … • Portions learnt or innate • Do we have all the rules written down somewhere? • lexicon(Dictionary) • How many words do we need to know? • 1000, 10000, 100000 … • meaning and inference (semantic interpretation, commonsense world knowledge)
Computers vs. Humans • Knowledge of language • Computers are way faster than humans • They kill us at arithmetic and chess • But human beings are so good at language, we often take our ability for granted • Processed without conscious thought • Do pretty complex things and now Jeopardy as well …
Examples • Knowledge • Which report did you file without reading? • (Parasitic gap sentence) We take for granted this process of “fillingin” or recovering the missing information
Examples • Changes in interpretation • John is too stubborn to talk to • John is too stubborn to talk to Bill
Examples • Ambiguity • where can I see the bus stop? • stop: verb or part of the noun-noun compound bus stop • Context (Discourse or situation)
Examples • Ungrammaticality • *Which book did you file the report without reading? • * = ungrammatical Colorless green ideas sleep furiously. Furiously sleep ideas green colorless. • ungrammatical vs. incomprehensible
Examples • The human parser has quirks • Ian told the man that he hired a story • Ian told the man that he hired a secretary • Garden-pathing: a temporary ambiguity • tell: someone something vs. … • More subtle differences • The reporter who the senator attacked admitted the error • The reporter who attacked the senator admitted the error