An introduction to Finite State Methods in NLP, covering topics such as finite state acceptors, regular expressions, determinization, minimization, and their application in modeling grammaticality.
Finite-State Methods 600.465 - Intro to NLP - J. Eisner
Finite state acceptors (FSAs) • Things you may know about FSAs: • Equivalence to regexps • Union, Kleene *, concat, intersect, complement, reversal • Determinization, minimization • Pumping, Myhill-Nerode • [Figure: FSA with arcs labeled a, c, ε] Defines the language a? c* = {a, ac, acc, accc, …, ε, c, cc, ccc, …}
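A minimal sketch of an acceptor for a? c*, assuming a two-state deterministic machine. The state names and the transition table are illustrative choices, not taken from the slide's figure.

```python
# Hypothetical two-state DFA for the language a? c*.
# "start": nothing read yet; "cs": an optional a has been consumed, only c may follow.
TRANSITIONS = {
    ("start", "a"): "cs",
    ("start", "c"): "cs",
    ("cs", "c"): "cs",
}
FINAL_STATES = {"start", "cs"}  # the empty string is in the language too


def accepts(string: str) -> bool:
    """Return True if the DFA ends in a final state after reading the string."""
    state = "start"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in FINAL_STATES
```

Both states are final because every prefix of a string in a? c* is itself in the language.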
n-gram models not good enough • Want to model grammaticality • A “training” sentence known to be grammatical: BOS mouse traps catch mouse traps EOS (the trigram model must allow these trigrams) • Resulting trigram model has to overgeneralize: • allows sentences with 0 verbs: BOS mouse traps EOS • allows sentences with 2 or more verbs: BOS mouse traps catch mouse traps catch mouse traps catch mouse traps EOS • Can’t remember whether it’s in subject or object (i.e., whether it’s gotten to the verb yet)
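The overgeneralization can be checked mechanically. The sketch below assumes an unsmoothed trigram model that simply allows any sentence whose trigrams all occur in the training sentence; the function names are illustrative.

```python
def trigrams(tokens):
    """Return the set of all consecutive token triples."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


TRAINING = "BOS mouse traps catch mouse traps EOS".split()
ALLOWED = trigrams(TRAINING)


def trigram_model_allows(sentence: str) -> bool:
    """True if every trigram of the sentence was seen in training."""
    return trigrams(sentence.split()) <= ALLOWED


zero_verbs = "BOS mouse traps EOS"
many_verbs = ("BOS mouse traps catch mouse traps catch "
              "mouse traps catch mouse traps EOS")
print(trigram_model_allows(zero_verbs))  # True
print(trigram_model_allows(many_verbs))  # True
```

Both ungrammatical sentences pass because each of their trigrams already appears in the single training sentence.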
Finite-state models can “get it” • Want to model grammaticality: BOS mouse traps catch mouse traps EOS • Finite-state can capture the generalization here: Noun+ Verb Noun+ • [Figure: FSA with Noun arcs before and after a single Verb arc; preverbal states still need a verb to reach the final state; in postverbal states verbs are no longer allowed] • Allows arbitrarily long NPs (just keep looping around for another Noun modifier). Still, never forgets whether it’s preverbal or postverbal! (Unlike a 50-gram model)
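A minimal sketch of the Noun+ Verb Noun+ acceptor, assuming a toy word-to-tag lexicon and numbered states. Both are hypothetical and only serve to run the example sentences.

```python
# Hypothetical lexicon: every word has exactly one tag.
TAGS = {"mouse": "Noun", "traps": "Noun", "catch": "Verb"}

# States 0-1 are preverbal, states 2-3 are postverbal; only 3 is final.
TRANSITIONS = {
    (0, "Noun"): 1,
    (1, "Noun"): 1,   # loop: another preverbal noun
    (1, "Verb"): 2,
    (2, "Noun"): 3,
    (3, "Noun"): 3,   # loop: another postverbal noun
}
FINAL_STATES = {3}


def grammatical(sentence: str) -> bool:
    """Accept exactly the tag sequences matching Noun+ Verb Noun+."""
    state = 0
    for word in sentence.split():
        state = TRANSITIONS.get((state, TAGS.get(word)))
        if state is None:
            return False
    return state in FINAL_STATES
```

Unlike the trigram model, this machine rejects the sentence with no verb and the sentence with several verbs, because no Verb arc leaves a postverbal state.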
How powerful are regexps / FSAs? • More powerful than n-gram models • The hidden state may “remember” arbitrary past context • With k states, can remember which of k “types” of context it’s in • Equivalent to HMMs • In both cases, you observe a sequence and it is “explained” by a hidden path of states. The FSA states are like HMM tags. • Appropriate for phonology and morphology • Word = Syllable+ = (Onset Nucleus Coda?)+ = (C+ V+ C*)+ = ( (b|d|f|…)+ (a|e|i|o|u)+ (b|d|f|…)* )+
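The syllable pattern (C+ V+ C*)+ can be tried out directly with Python's `re` module. This sketch assumes lowercase ASCII spelling stands in for phonemes and treats every letter other than a, e, i, o, u (including y) as a consonant.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"  # assumption: all non-vowel letters, y included
VOWEL = "[aeiou]"

# Word = (C+ V+ C*)+
WORD = re.compile(rf"(?:{CONSONANT}+{VOWEL}+{CONSONANT}*)+")


def is_word(string: str) -> bool:
    """True if the whole string can be split into Onset Nucleus Coda? syllables."""
    return WORD.fullmatch(string) is not None
```

Because the onset is C+, this pattern rejects vowel-initial strings such as "apple"; a model of real English would presumably make the onset optional.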
How powerful are regexps / FSAs? • But less powerful than CFGs / pushdown automata • Can’t do recursive center-embedding • Hmm, humans have trouble processing those constructions too … • Finite-state can handle this pattern (can you write the regexp?): This is the rat that ate the malt. / This is the cat that bit the rat that ate the malt. / This is the dog that chased the cat that bit the rat that ate the malt. • But not this pattern, which requires a CFG: This is the malt that the rat ate. / This is the malt that the rat that the cat bit ate. / This is the malt that [the rat that [the cat that [the dog chased] bit] ate].
How powerful are regexps / FSAs? • But less powerful than CFGs / pushdown automata • More important: Less explanatory than CFGs • A CFG without recursive center-embedding can be converted into an equivalent FSA – but the FSA will usually be far larger • Because FSAs can’t reuse the same phrase type in different places • [Figure: S = NP Verb NP, with NP defined by a Noun loop, is more elegant – using nonterminals like this is equivalent to a CFG; converting to FSA copies the NP twice, giving duplicated structure on each side of Verb]
Strings vs. String Pairs • FSA = “finite-state acceptor” • Describes a language (which strings are grammatical?) • FST = “finite-state transducer” • Describes a relation (which pairs of strings are related?) • (underlying form, surface form) • (sentence, translation) • (original, edited) • …
Example: Edit Distance • Cost of best path relating these two strings? • [Figure: lattice of paths aligning the upper string “clara” with the lower string “caca”; axes are position in upper string (0 to 5) and position in lower string (0 to 4); arcs are labeled upper:lower, with ε on one side for an insertion or deletion]
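The best-path question can be answered with the standard dynamic program over the (upper position, lower position) grid. This sketch assumes the two strings in the figure are clara and caca and that insertion, deletion, and substitution each cost 1; the slide itself does not give the costs.

```python
def edit_distance(upper: str, lower: str,
                  ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Cost of the cheapest path through the alignment lattice.

    best[i][j] is the cheapest way to relate upper[:i] to lower[:j].
    """
    rows, cols = len(upper) + 1, len(lower) + 1
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # delete everything read so far
        best[i][0] = best[i - 1][0] + dele
    for j in range(1, cols):          # insert everything written so far
        best[0][j] = best[0][j - 1] + ins
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0 if upper[i - 1] == lower[j - 1] else sub
            best[i][j] = min(
                best[i - 1][j] + dele,       # arc x:ε
                best[i][j - 1] + ins,        # arc ε:y
                best[i - 1][j - 1] + match,  # arc x:y
            )
    return best[-1][-1]


print(edit_distance("clara", "caca"))  # 2 under unit costs
```

Under these assumed costs one cheapest path deletes l and substitutes c for r.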
Example: Morphology VP [head=vouloir,...] V[head=vouloir,tense=Present,num=SG, person=P3] ... veut
Example: Unweighted transducer (slide courtesy of L. Karttunen, modified) • VP [head=vouloir,...] V[head=vouloir,tense=Present,num=SG, person=P3] ... veut • Finite-state transducer: canonical form plus inflection codes (vouloir +Pres +Sing +P3) on one side, inflected form (veut) on the other • The relevant path: v o u l o i r +Pres +Sing +P3 / v e u t
Example: Unweighted transducer (slide courtesy of L. Karttunen) • Finite-state transducer: canonical form plus inflection codes (vouloir +Pres +Sing +P3) on one side, inflected form (veut) on the other • The relevant path: v o u l o i r +Pres +Sing +P3 / v e u t • Bidirectional: generation or analysis • Compact and fast • Xerox sells for about 20 languages including English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, ... • Research systems for many other languages, including Arabic, Malay
Regular Relation (of strings) • Relation: like a function, but multiple outputs ok • Regular: finite-state • Transducer: automaton w/ outputs • b → {b} • a → {} • aaaaa → {ac, aca, acab, acabc} • Invertible? • Closed under composition? • [Figure: transducer with arcs a:ε, b:b, a:a, a:c, ?:c, b:b, ?:b, ?:a, b:ε]
Regular Relation (of strings) • Can weight the arcs: vs. • b → {b} • a → {} • aaaaa → {ac, aca, acab, acabc} • How to find best outputs? • For aaaaa? • For all inputs at once? • [Figure: transducer with arcs a:ε, b:b, a:a, a:c, ?:c, b:b, ?:b, ?:a, b:ε]
Acceptors (FSAs) vs. Transducers (FSTs): function from strings to ...
• Unweighted acceptor (arcs c, a, ε): {false, true}
• Unweighted transducer (arcs c:z, a:x, ε:y): strings
• Weighted acceptor (arcs c/.7, a/.5, ε/.5; .3): numbers
• Weighted transducer (arcs c:z/.7, a:x/.5, ε:y/.5; .3): (string, num) pairs
Sample functions
• Unweighted acceptor (strings to {false, true}): Grammatical?
• Unweighted transducer (strings to strings): Markup, Correction, Translation
• Weighted acceptor (strings to numbers): How grammatical? Better, how likely?
• Weighted transducer (strings to (string, num) pairs): Good markups, Good corrections, Good translations
Terminology (acceptors) • [Diagram relating Regular language, Regexp, FSA, and String, with arrows labeled: defines, recognizes, compiles into, implements, accepts (or generates), matches, matches]
Terminology (transducers) • [Diagram relating Regular relation, Regexp, FST, and String pair, with arrows labeled: defines, recognizes, compiles into, implements, accepts (or generates) (or, transduces one string of the pair into the other), matches, matches ?]
Perspectives on a Transducer • Remember these CFG perspectives: 3 views of a context-free rule • generation (production): S → NP VP (randsent) • parsing (comprehension): S ← NP VP (parse) • verification (checking): S = NP VP • Similarly, 3 views of a transducer: • Given 0 strings, generate a new string pair (by picking a path) • Given one string (upper or lower), transduce it to the other kind • Given two strings (upper & lower), decide whether to accept the pair • [Figure: path v o u l o i r +Pres +Sing +P3 / v e u t] • FST just defines the regular relation (mathematical object: set of pairs). What’s “input” and “output” depends on what one asks about the relation. The 0, 1, or 2 given string(s) constrain which paths you can use.
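The three perspectives can be illustrated by treating the relation as a plain set of (upper, lower) pairs. This is only a sketch: a real FST represents the set implicitly as paths, and the P1 and P2 entries below are added for illustration and do not appear on the slides.

```python
import random

# Toy regular relation as an explicit set of (upper, lower) pairs.
RELATION = {
    ("vouloir +Pres +Sing +P1", "veux"),  # illustrative addition
    ("vouloir +Pres +Sing +P2", "veux"),  # illustrative addition
    ("vouloir +Pres +Sing +P3", "veut"),
}


def generate(rng=random):
    """Given 0 strings: pick some pair in the relation."""
    return rng.choice(sorted(RELATION))


def transduce_down(upper):
    """Given the upper string: return all related lower strings."""
    return {lo for up, lo in RELATION if up == upper}


def transduce_up(lower):
    """Given the lower string: return all related upper strings."""
    return {up for up, lo in RELATION if lo == lower}


def accepts(upper, lower):
    """Given both strings: decide whether the pair is in the relation."""
    return (upper, lower) in RELATION
```

Here `transduce_up("veux")` returns two analyses, which shows why the object is a relation and not a function.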
Functions • [Figure: functions f and g drawn as arrows over the strings ab?d, abcd, abcd]
Functions • Function composition: f ∘ g [first f, then g – intuitive notation, but opposite of the traditional math notation] • [Figure: the composed function drawn as a single arrow over the strings ab?d and abcd]
From Functions to Relations • [Figure: relations f and g drawn as weighted arrows over the strings ab?d, abcd, abed, abd, abjd, ...; arc weights 3, 4, 2, 2, 8, 6]
From Functions to Relations • Relation composition: f ∘ g • [Figure: weighted arrows over the strings ab?d, abcd, abed, abd, ...; arc weights 3, 4, 2, 2, 8, 6]
From Functions to Relations • Relation composition: f ∘ g • [Figure: the composed relation pairs ab?d with abcd, abed, abd, ...; composed weights 3+4, 2+2, 6+8]
From Functions to Relations • Pick min-cost or max-prob output • [Figure: ab?d paired with abed at cost 2+2] • Often in NLP, all of the functions or relations involved can be described as finite-state machines, and manipulated using standard algorithms.
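A minimal sketch of weighted relation composition followed by picking the min-cost output. It assumes costs add along a path and the cheapest path wins, and it represents each relation as a dictionary from string pairs to costs. The particular pairing of strings and weights is an assumed reading of the figure, chosen so the composed costs come out as 3+4, 2+2, and 6+8.

```python
# Assumed reading of the figure: f and g as {(input, output): cost}.
F = {("ab?d", "abcd"): 3, ("ab?d", "abed"): 2, ("ab?d", "abjd"): 6}
G = {("abcd", "abcd"): 4, ("abed", "abed"): 2, ("abjd", "abd"): 8}


def compose(f, g):
    """First f, then g: add costs, keep the cheapest path per (x, z) pair."""
    out = {}
    for (x, y), cost_f in f.items():
        for (y2, z), cost_g in g.items():
            if y == y2:
                total = cost_f + cost_g
                if total < out.get((x, z), float("inf")):
                    out[(x, z)] = total
    return out


def best_output(relation, x):
    """Return the (output, cost) pair with minimum cost for input x."""
    candidates = [(cost, z) for (x2, z), cost in relation.items() if x2 == x]
    cost, z = min(candidates)
    return z, cost


FG = compose(F, G)
print(best_output(FG, "ab?d"))  # ('abed', 4)
```

With finite-state machines the same operation is carried out on the automata themselves, so the relations never have to be enumerated pair by pair as they are here.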
Building a lexical transducer (slide courtesy of L. Karttunen, modified) • Regular Expression Lexicon: big | clear | clever | ear | fat | ... • Compiler: Regular Expression Lexicon compiles into a Lexicon FSA; Regular Expressions for Rules compile into Composed Rule FSTs • Composition of the Lexicon FSA with the Composed Rule FSTs gives the Lexical Transducer (a single FST) • One path: b i g +Adj +Comp / b i g g e r • [Figure: Lexicon FSA with letter arcs spelling the lexicon entries]
Building a lexical transducer (slide courtesy of L. Karttunen, modified) • Regular Expression Lexicon: big | clear | clever | ear | fat | ... • Actually, the lexicon must contain elements like big +Adj +Comp • So write it as a more complicated expression: (big | clear | clever | fat | ...) +Adj (ε | +Comp | +Sup) [adjectives] | (ear | father | ...) +Noun (+Sing | +Pl) [nouns] | ... • Q: Why do we need a lexicon at all? • [Figure: Lexicon FSA with letter arcs spelling the lexicon entries]
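The more complicated lexicon expression can be approximated as a Python regexp. This sketch assumes tags are separated by single spaces, lists only the words named on the slide, and covers only the upper (lexical) side; mapping big +Adj +Comp to bigger would still need the rule transducers.

```python
import re

# Upper-side lexicon: adjectives take an optional degree tag,
# nouns take an obligatory number tag.
LEXICON = re.compile(
    r"(?:big|clear|clever|fat) \+Adj(?: \+Comp| \+Sup)?"
    r"|(?:ear|father) \+Noun (?:\+Sing|\+Pl)"
)


def in_lexicon(entry: str) -> bool:
    """True if the whole entry is a well-formed lexical string."""
    return LEXICON.fullmatch(entry) is not None
```

The optional group `(?: \+Comp| \+Sup)?` plays the role of the empty alternative in the slide's expression.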
Inverting Relations • [Figure: relations f and g drawn as weighted arrows over the strings ab?d, abcd, abed, abd, abjd, ...; arc weights 3, 4, 2, 2, 8, 6]
Inverting Relations • g⁻¹ and f⁻¹ • [Figure: the same weighted arrows with directions reversed, over the strings ab?d, abcd, abed, abd, abjd, ...; arc weights 3, 4, 2, 2, 8, 6]
Inverting Relations • (f ∘ g)⁻¹ = g⁻¹ ∘ f⁻¹ • [Figure: the inverted composed relation over the strings ab?d, abcd, abed, abd, ...; composed weights 3+4, 2+2, 6+8]
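The inversion identity can be checked on small unweighted relations represented as sets of pairs. The string pairs below are an assumed reading of the figure, and `compose` follows the slides' convention of first f, then g.

```python
def compose(f, g):
    """First f, then g: pair x with z whenever some y links them."""
    return {(x, z) for (x, y) in f for (y2, z) in g if y == y2}


def invert(relation):
    """Swap the two sides of every pair."""
    return {(y, x) for (x, y) in relation}


# Assumed reading of the figure (weights omitted).
F = {("ab?d", "abcd"), ("ab?d", "abed"), ("ab?d", "abjd")}
G = {("abcd", "abcd"), ("abed", "abed"), ("abjd", "abd")}

left = invert(compose(F, G))           # (f ∘ g)⁻¹
right = compose(invert(G), invert(F))  # g⁻¹ ∘ f⁻¹
print(left == right)  # True
```

For a transducer, inversion amounts to swapping the upper and lower label on every arc, which is why the same machine can be run in either direction.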
Weighted French Transducer (slide courtesy of L. Karttunen, modified) • Weighted version of transducer: Assigns a weight to each string pair • “upper language”: être+IndP+SG+P1, suivre+IndP+SG+P1, suivre+IndP+SG+P2, suivre+Imp+SG+P2, payer+IndP+SG+P1 • “lower language”: suis, paie, paye • [Figure: weighted arcs between upper and lower strings, with weights 4, 19, 20, 12, 50, 3]
Composition Cascades • You can build fancy noisy-channel models by composing transducers … • Examples: • Phonological/morphological rewrite rules? • English orthography → English phonology → Japanese phonology → Japanese orthography • e.g. ??? goruhubaggu • Information extraction
FASTUS – Information Extraction (Appelt et al, 1992-?) • Input: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with … • Output: Relationship: TIE-UP • Entities: “Bridgestone Sports Co.”, “A local concern”, “A Japanese trading house” • Joint Venture Company: “Bridgestone Sports Taiwan Co.” • Amount: NT$20000000
FASTUS: Successive Markups (details on subsequent slides) • Tokenization .o. Multiwords .o. Basic phrases (noun groups, verb groups …) .o. Complex phrases .o. Semantic Patterns .o. Merging different references
FASTUS: Tokenization • Spaces, hyphens, etc. • wouldn’t → would not • their → them ’s • company. → company . but Co. → Co.
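A rough sketch of these tokenization rewrites, written as ordinary Python string processing and not as a transducer. The abbreviation list and the exact regexps are assumptions for illustration; this is not how FASTUS itself is implemented.

```python
import re

ABBREVIATIONS = {"Co."}  # hypothetical list of tokens that keep their period


def tokenize(text: str) -> list:
    """Apply the slide's three example rewrites, then split on whitespace."""
    # wouldn’t -> would not (naive: would also turn can’t into "ca not")
    text = re.sub(r"(\w+)n[’']t\b", r"\1 not", text)
    # their -> them ’s
    text = re.sub(r"\btheir\b", "them ’s", text)
    tokens = []
    for chunk in text.split():
        # company. -> company .   but   Co. -> Co.
        if chunk.endswith(".") and len(chunk) > 1 and chunk not in ABBREVIATIONS:
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens


print(tokenize("Co. wouldn’t sell their company."))
```

Each rewrite here is a regular relation, so in principle the three steps could be compiled into transducers and composed into one.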
FASTUS: Multiwords • “set up” • “joint venture” • “San Francisco Symphony Orchestra,” “Canadian Opera Company” … use a specialized regexp to match musical groups. • ... what kind of regexp would match company names?
FASTUS: Basic phrases • Output looks like this (no nested brackets!): … [NG it] [VG had set_up] [NP a joint_venture] [Prep in] … • Company Name: Bridgestone Sports Co. • Verb Group: said • Noun Group: Friday • Noun Group: it • Verb Group: had set up • Noun Group: a joint venture • Preposition: in • Location: Taiwan • Preposition: with • Noun Group: a local concern
FASTUS: Noun Groups • Build FSA to recognize phrases like • approximately 5 kg • more than 30 people • the newly elected president • the largest leftist political force • a government and commercial project • Use the FSA for left-to-right longest-match markup • What does FSA look like? See next slide …
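A minimal sketch of left-to-right longest-match markup, assuming the input is already tagged with one-letter part-of-speech codes (D, A, N, V) and that a noun group is simply D? A* N+. Both the tag set and the pattern are hypothetical simplifications of the FSA the slide has in mind.

```python
import re

# One character per token, so regexp positions line up with token positions.
NOUN_GROUP = re.compile(r"D?A*N+")


def mark_noun_groups(tagged):
    """Bracket each leftmost, longest noun group in a list of (word, tag) pairs."""
    tags = "".join(tag for _, tag in tagged)
    words = [word for word, _ in tagged]
    out, pos = [], 0
    for match in NOUN_GROUP.finditer(tags):
        out.extend(words[pos:match.start()])  # tokens outside any group
        out.append("[NG " + " ".join(words[match.start():match.end()]) + "]")
        pos = match.end()
    out.extend(words[pos:])
    return " ".join(out)


sentence = [("the", "D"), ("largest", "A"), ("leftist", "A"),
            ("political", "A"), ("force", "N"), ("won", "V")]
print(mark_noun_groups(sentence))
# [NG the largest leftist political force] won
```

For this particular pattern Python's greedy matching coincides with longest match; with alternations that is not guaranteed in general, which is one reason to run the markup on a real FSA.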
FASTUS: Noun Groups • Described with a kind of non-recursive CFG … (a regexp can include names that stand for other regexps) • NG → Pronoun | Time-NP | Date-NP • NG → (Det) (Adjs) HeadNouns • … • Adjs → sequence of adjectives maybe with commas, conjunctions, adverbs • … • Det → DetNP | DetNonNP • DetNP → detailed expression to match “the only five, another three, this, many, hers, all, the most …” • …
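Because the grammar is non-recursive, each name can be replaced by its definition until only a plain regexp remains. The sketch below assumes a tiny hypothetical rule set and word lists; only the names NG, Det, Adjs, and HeadNouns come from the slide.

```python
import re

# Hypothetical rules: {Name} stands for another rule's regexp.
RULES = {
    "NG": "(?:{Det} )?(?:{Adjs} )?{HeadNouns}",
    "Det": "the|a|this|another",
    "Adjs": "{Adj}(?: {Adj})*",
    "Adj": "local|joint|Japanese",
    "HeadNouns": "{Noun}(?: {Noun})*",
    "Noun": "venture|concern|trading|house",
}


def expand(name, rules):
    """Substitute rule bodies for names; terminates because no rule is recursive."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: "(?:" + expand(m.group(1), rules) + ")",
        rules[name],
    )


NG = re.compile(expand("NG", RULES))
print(bool(NG.fullmatch("a Japanese trading house")))  # True
```

The expanded regexp copies a rule's body at every place the name is used, which is the size blow-up the earlier slide attributes to converting such a grammar into an FSA.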
FASTUS: Semantic patterns • BusinessRelationship = NounGroup(Company/ies) VerbGroup(Set-up) NounGroup(JointVenture) with NounGroup(Company/ies) | … • ProductionActivity = VerbGroup(Produce) NounGroup(Product) • NounGroup(Company/ies) NounGroup & … is made easy by the processing done at a previous level • Use this for spotting references to put in the database.
Composition Cascades • You can build fancy noisy-channel models by composing transducers … • … now let’s turn to how you might build the individual transducers in the cascade. • We’ll use a variety of operators that combine simpler transducers and acceptors into more complex ones. • Composition is just one example.