Accurate Parsing
('they worry that air the shows , drink too much , whistle johnny b. goode and watch the other ropes , whistle johnny b. goode and watch closely and suffer through the sale', 2.1730387621600077e-11)
David Caley, Thomas Folz-Donahue, Rob Hall, Matt Marzilli
Accurate Parsing: Our Goal
• Given a grammar and a sentence S, return the parse tree with the maximum probability conditioned on S:
arg max_{t ∈ T} P(t | S), where T is the set of possible parse trees of sentence S
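Once every candidate tree has a probability, the goal reduces to picking the argmax. A minimal sketch, assuming candidate parses arrive as (tree, probability) pairs; the structure and function name are illustrative, not the parser's actual code.

def best_parse(candidates):
    """Return the (tree, probability) pair maximizing P(t | S)."""
    return max(candidates, key=lambda pair: pair[1])

# e.g. best_tree, prob = best_parse([(tree_a, 1.2e-9), (tree_b, 3.4e-8)])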
Talking Points
• Using the Penn Treebank
• Reading in n-ary trees
• Finding head tags within n-ary productions
• Converting to binary trees
• Inducing a CFG grammar
• Probabilistic CYK
• Handling unary rules
• Dealing with unknowns
• Dealing with run times
• Beam search, limiting the depth of unary rules, further optimizations
• Example parses and trees
• Lexicalization attempts
Using the Penn Treebank: Our Training Data
• Contains POS-tagged data and n-ary trees from a Wall Street Journal corpus.
• Contains some information unneeded by the parser.
• Some tagging is questionable, e.g. (JJ the).
• Example follows.
Using the Penn Treebank: Handling n-ary Trees
( (S (NP-SBJ-1 (NNS Consumers) ) (VP (MD may) (VP (VB want) (S (NP-SBJ (-NONE- *-1) ) (VP (TO to) (VP (VB move) (NP (PRP$ their) (NNS telephones) ) (ADVP-DIR (NP (DT a) (RB little) ) (RBR closer) (PP (TO to) (NP (DT the) (NN TV) (NN set) ))))))))))
• Functional tags such as NP-SBJ-1 are ignored; we simply call this an NP.
• -NONE- tags are used for traces; these are ignored as well.
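As a concrete illustration, a minimal sketch of this normalization, assuming trees are read in as nested (label, children) tuples; the function names and tree representation are illustrative, not the project's actual code.

import re

def normalize_label(label):
    """Strip functional and coreference suffixes, e.g. 'NP-SBJ-1' -> 'NP'."""
    return re.split(r'[-=]', label)[0] or label

def strip_annotations(tree):
    """Drop trace subtrees (-NONE-) and simplify labels.
    A tree is (label, children) with children a list, or (tag, word) at the leaves."""
    label, rest = tree
    if label == '-NONE-':
        return None                          # traces are removed entirely
    if isinstance(rest, str):                # preterminal: (tag, word)
        return (normalize_label(label), rest)
    children = [c for c in (strip_annotations(ch) for ch in rest) if c is not None]
    if not children:
        return None                          # node emptied by trace removal
    return (normalize_label(label), children)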
Using the Penn Treebank: Head-Tag Finding Algorithm
• For a context-free rule X -> Y1 … Yn, we can use a function to determine the "head" of the rule; the head could be any of Y1 … Yn. The head is the most important child tag.
• Head-tag rules as outlined in Collins' thesis.
• Allows us to determine the head tags that will be used later for binary tree conversion.
Using the Penn Treebank: Head-Tag Finding Algorithm
• If nothing is found during a list traversal, the head tag becomes the leftmost or rightmost element.
Using the Penn Treebank: Head-Rule Finding Algorithm
• The rules for NPs are a bit different (a sketch of the procedure follows this list):
• If the last word is tagged POS, return the last word.
• Else, search from right to left for the first child which is in the set {NN, NNP, NNPS, NNS, NX, POS, JJR}.
• Else, search from left to right for the first child which is an NP.
• Else, search from right to left for the first child which is in the set {$, ADJP, PRN}.
• Else, do the same with the set {CD}.
• Else, do the same with the set {JJ, JJS, RB, QP}.
• Else, return the last word.
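A minimal sketch of the NP head rules above, assuming each child is a (tag, subtree) pair; the function name and representation are illustrative, not the project's actual code.

def find_np_head(children):
    """Return the index of the head child of an NP, following the rule list above."""
    tags = [tag for tag, _ in children]
    if tags and tags[-1] == 'POS':
        return len(tags) - 1
    searches = [
        ({'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}, 'right-to-left'),
        ({'NP'}, 'left-to-right'),
        ({'$', 'ADJP', 'PRN'}, 'right-to-left'),
        ({'CD'}, 'right-to-left'),
        ({'JJ', 'JJS', 'RB', 'QP'}, 'right-to-left'),
    ]
    for tagset, direction in searches:
        order = range(len(tags)) if direction == 'left-to-right' else range(len(tags) - 1, -1, -1)
        for i in order:
            if tags[i] in tagset:
                return i
    return len(tags) - 1                     # default: the last word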
Using the Penn Treebank: Binary Tree Conversion
• Now we put the head tags to use.
• Necessary for using the CFG grammar with probabilistic CYK.
• A general n-ary rule has the form R -> Li Li-1 … L1 L0 H R0 R1 … Ri-1 Ri, where H is the head child.
• On the right side of the head we recursively split off the last element to make a new binary rule (left recursive): R -> [Li … L0 H R0 … Ri-1] Ri, and so on.
• On the left side we do the same by removing the first element (right recursive): [Li … L0 H] -> Li [Li-1 … L0 H], down to H.
• A sketch of this head-outward binarization follows.
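A minimal sketch of that conversion, assuming (label, children) trees and a head_index function such as the NP head finder above; intermediate nodes here simply reuse the parent label, whereas the real grammar gives them annotated labels (visible in the example trees later).

def binarize(tree, head_index):
    """Head-outward binarization: right siblings are peeled off from the
    outside in, left siblings attach closest to the head."""
    label, children = tree
    if isinstance(children, str):            # preterminal (tag, word): nothing to do
        return tree
    kids = [binarize(child, head_index) for child in children]
    if len(kids) <= 2:
        return (label, kids)
    h = head_index(children)
    left, node, right = kids[:h], kids[h], kids[h + 1:]
    for sib in reversed(left):               # attach L0, L1, ..., Li around the head
        node = (label, [sib, node])
    for sib in right:                        # then wrap R0, R1, ..., Ri around the result
        node = (label, [node, sib])
    return node                              # topmost split peels off Ri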
Using the Penn Treebank: Grammar Induction Procedure
• Once we have binary trees we can easily identify rules and record their frequencies.
• Identify every production and save it into a Python dictionary.
• Frequencies are cached in a local file for later use and read back in on subsequent executions.
• No immediate smoothing is done on the probabilities; the grammar is later trimmed to help with performance.
• A sketch of the counting and relative-frequency estimation follows.
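A minimal sketch of the induction step, assuming binarized (label, children) trees; the relative-frequency estimate P(lhs -> rhs) = count(lhs -> rhs) / count(lhs) is standard, but the function names are illustrative.

from collections import defaultdict

def count_rules(tree, counts):
    """Record every production lhs -> rhs found in the tree."""
    label, children = tree
    if isinstance(children, str):            # preterminal: tag -> word
        counts[(label, (children,))] += 1
        return
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_rules(child, counts)

def estimate_probabilities(counts):
    """P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

# counts = defaultdict(int)
# for tree in binarized_trees:               # hypothetical training corpus
#     count_rules(tree, counts)
# grammar = estimate_probabilities(counts)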
Probabilistic CYK: The Parsing Step
• We use a probabilistic CYK implementation to parse with our CFG grammar and assign probabilities to the final parse trees.
• Useful for producing multiple parses and disambiguating sentences.
• New concerns (a sketch of the core recurrence follows):
• Unary rules and the length of unary chains
• Runtime (a result of the incredibly large grammar)
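A minimal sketch of the probabilistic CYK recurrence over binary rules only; unary closure and pruning are treated separately below. The lexical and binary_rules lookup structures are illustrative assumptions, not the project's actual data structures.

def cyk(words, lexical, binary_rules):
    """Fill a chart where chart[i][j][A] is the best probability of A spanning words[i:j].
    lexical maps word -> [(tag, prob)]; binary_rules maps (B, C) -> [(A, prob)]."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in lexical.get(w, []):
            chart[i][i + 1][tag] = p
            back[i][i + 1][tag] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                       # split point
                for B, pb in chart[i][k].items():
                    for C, pc in chart[k][j].items():
                        for A, pr in binary_rules.get((B, C), []):
                            p = pr * pb * pc
                            if p > chart[i][j].get(A, 0.0):
                                chart[i][j][A] = p
                                back[i][j][A] = (k, B, C)   # backpointer for tree recovery
    return chart, back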
Probabilistic CYK: Handling Unary Rules within the Grammar
• Unary rules of the form X -> Y or X -> a are ubiquitous in our grammar.
• The closure of a constituent is needed to determine all the unary productions that can lead to that constituent.
• Def: Closure(X) = ∪ { Closure(Y) | Y -> X }, i.e. all nonterminals that are reachable, by unary rules, from X.
• We implement this iteratively, maintain a closed list, and limit the depth to prevent possible infinite recursion. (A sketch follows.)
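A minimal sketch of the iterative, depth-limited closure described above, assuming unary_rules maps a child X to the (parent Y, prob) pairs of rules Y -> X; the structure and the depth-limit value are illustrative.

from collections import deque

def unary_closure(x, unary_rules, max_depth=3):
    """Return {nonterminal: best probability of reaching it from x via unary rules}."""
    best = {x: 1.0}
    queue = deque([(x, 1.0, 0)])
    while queue:
        child, prob, depth = queue.popleft()
        if depth >= max_depth:                   # depth limit on unary chains
            continue
        for parent, p in unary_rules.get(child, []):
            new_prob = prob * p
            if new_prob > best.get(parent, 0.0): # closed list keyed by best score
                best[parent] = new_prob
                queue.append((parent, new_prob, depth + 1))
    return best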
Probabilistic CYK: Dealing with Run Times
• Beam search: limit the number of nodes saved in each cell of the CYK dynamic-programming table.
• Using beam width k, the entries in a cell are kept sorted and the k best are saved for the next iteration (list size <= k).
• Experiences with k = 100, 200, 1000. (A sketch of the pruning step follows.)
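A minimal sketch of the per-cell beam pruning, assuming a chart cell is a dict mapping nonterminals to probabilities as in the CYK sketch above; the function name is illustrative.

import heapq

def prune_cell(cell, beam_width):
    """Keep only the beam_width highest-probability entries of a chart cell."""
    if len(cell) <= beam_width:
        return cell
    return dict(heapq.nlargest(beam_width, cell.items(), key=lambda item: item[1]))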
Probabilistic CYK: Dealing with Run Times
• Another optimization was to remove all production rules with frequency < fc; we used fc = 1, 2, …
• We also limited the depth when calculating the unary closure of a constituent present in our CYK table.
• Extensive unary rules were found to greatly slow down our parser.
• Long chains of unary productions also have extremely low probabilities, so they are commonly pruned by beam search anyway.
• A sketch of the frequency cutoff follows.
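A minimal sketch of that frequency cutoff, assuming the rule counts produced in the grammar-induction sketch above; the cutoff name fc is taken from the slide, the rest is illustrative.

def trim_grammar(counts, fc=2):
    """Drop every production observed fewer than fc times."""
    return {rule: c for rule, c in counts.items() if c >= fc}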
Probabilistic CYK: Random Sentences and Example Trees
• Some random sentences generated from our grammar, with their associated probabilities:
('buy jam , cocoa and other war-rationed goodies', 0.0046296296296296294)
('cartoonist garry trudeau refused to impose sanctions , including petroleum equipment , which go into semiannual payments , including watches , including three , which the federal government , the same company formed by mrs. yeargin school district would be confidential', 2.9911073159300768e-33)
('33 men selling individual copies selling securities at the central plaza hotel die', 7.4942533128815141e-08)
Probabilistic CYK: Random Sentences and Example Trees
('young people believe criticism is led by south korea', 1.3798001044090654e-11)
('the purchasing managers believe the art is the often amusing , often supercilious , even vicious chronicle of bank of the issue yen-support intervention', 7.1905882731776209e-1)
Two views of the parse of 'buy jam , cocoa and other war-rationed goodies': the binarized tree with rule annotations, and the corresponding lexicalized tree with head words.

S:-(VP)
+--VP
   +--VP:-(VB)-NP
      +--VP:-(VB)
      |  +--VB
      |     +--buy
      +--NP
         +--NP:-(NP)-NP
            +--NP:-(NP)-CC
            |  +--NP:-(NP)-NP
            |  |  +--NP:-(NP)-,
            |  |  |  +--NP:-(NP)
            |  |  |  |  +--NP
            |  |  |  |     +--NP:-(NN)
            |  |  |  |        +--NN
            |  |  |  |           +--jam
            |  |  |  +--,
            |  |  |     +--,
            |  |  +--NP
            |  |     +--NP:-(NN)
            |  |        +--NN
            |  |           +--cocoa
            |  +--CC
            |     +--and
            +--NP
               +--NP:-(NNS)JJ-NNS
                  +--JJ
                  |  +--other
                  +--NP:-(NNS)JJ-NNS
                     +--JJ
                     |  +--war-rationed
                     +--NP:-(NNS)
                        +--NNS
                           +--goodies

S(buy)
+--VP(buy)
   +--VB(buy)
   |  +--buy
   +--NP(jam)
      +--NP(jam)-NP(goodies)
      |  +--NP(jam)-CC(and)
      |  |  +--NP(jam)-NP(cocoa)
      |  |  |  +--NP(jam)
      |  |  |  |  +--NN(jam)
      |  |  |  |     +--jam
      |  |  |  +--,(,)
      |  |  |     +--,
      |  |  +--NP(cocoa)
      |  |     +--NN(cocoa)
      |  |        +--cocoa
      |  +--CC(and)
      |     +--and
      +--NP(goodies)
         +--JJ(other)
         |  +--other
         +--NP(goodies):JJ(other)-
            +--JJ(war-rationed)
            |  +--war-rationed
            +--NNS(goodies)
               +--goodies
Accurate Parsing: Conclusion
• Massive lexicalized grammar
• Working probabilistic parser
• Future work: handle sparsity, smooth probabilities