LING 408/508: Computational Techniques for Linguists

Lecture 26, 10/26/2012

Outline
• Chomsky Normal Form (CNF)
• Cocke-Kasami-Younger (CKY) algorithm
• Long assignment #7

Chomsky Normal Form (CNF)
• http://en.wikipedia.org/wiki/Chomsky_normal_form
• Can convert any CFG to a weakly equivalent one in Chomsky Normal Form
  • Weakly equivalent: generates the same set of strings
• In CNF, all productions are either:
  • Binary branching nonterminal rules: A -> B C
  • Unary terminal rules: A -> w
• Also: epsilon can only be generated by the start symbol: S -> ε
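The rule shapes listed above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical representation in which each production is a (lhs, rhs) pair, rhs is a tuple of symbols, and the nonterminals are given as an explicit set. The name `is_cnf` is illustrative, not a library function.

```python
def is_cnf(productions, nonterminals, start="S"):
    """Return True if every production has one of the CNF shapes."""
    for lhs, rhs in productions:
        if len(rhs) == 2 and all(sym in nonterminals for sym in rhs):
            continue  # binary branching nonterminal rule: A -> B C
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            continue  # unary terminal rule: A -> w
        if len(rhs) == 0 and lhs == start:
            continue  # epsilon, allowed only for the start symbol
        return False
    return True


nonterminals = {"S", "A", "B", "C", "X"}
ternary = [("S", ("A", "B", "C")), ("A", ("a",)), ("B", ("b",)), ("C", ("c",))]
binary = [("S", ("A", "X")), ("X", ("B", "C")),
          ("A", ("a",)), ("B", ("b",)), ("C", ("c",))]

print(is_cnf(ternary, nonterminals))  # False: S -> A B C has three symbols
print(is_cnf(binary, nonterminals))   # True
```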

Examples of CFGs in CNF
• G1:
    S -> A B C
    A -> a
    B -> b
    C -> c
• G1 in CNF:
    S -> A X
    X -> B C
    A -> a
    B -> b
    C -> c
• G2:
    S -> a b
• G2 in CNF:
    S -> A B
    A -> a
    B -> b
• G3:
    S -> A
    A -> a
• G3 in CNF:
    S -> a
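Two of the transformations used in these examples can be sketched in code: splitting long rules (as for G1) and replacing terminals inside longer rules (as for G2). This is only a partial sketch under stated assumptions: productions are hypothetical (lhs, rhs) tuples, the generated names `X1`, `X2`, ... and `T_a` are an arbitrary naming scheme assumed not to clash with existing symbols, and unit rules (as in G3) and epsilon rules are not handled.

```python
def lift_terminals(productions, nonterminals):
    """In rules with two or more right-hand symbols, replace each
    terminal w by a new preterminal T_w and add the rule T_w -> w."""
    out, added = [], {}
    for lhs, rhs in productions:
        if len(rhs) >= 2:
            new_rhs = []
            for sym in rhs:
                if sym in nonterminals:
                    new_rhs.append(sym)
                else:
                    name = "T_" + sym  # hypothetical naming scheme
                    added[name] = sym
                    new_rhs.append(name)
            rhs = tuple(new_rhs)
        out.append((lhs, rhs))
    out.extend((name, (sym,)) for name, sym in added.items())
    return out


def binarize(productions):
    """Split rules with more than two right-hand symbols into binary rules."""
    out, counter = [], 0
    for lhs, rhs in productions:
        while len(rhs) > 2:
            counter += 1
            new = "X%d" % counter  # hypothetical fresh nonterminal
            out.append((lhs, (rhs[0], new)))
            lhs, rhs = new, rhs[1:]
        out.append((lhs, rhs))
    return out


g1 = [("S", ("A", "B", "C")), ("A", ("a",)), ("B", ("b",)), ("C", ("c",))]
g2 = [("S", ("a", "b"))]

for rule in binarize(g1):
    print(rule)
for rule in lift_terminals(g2, {"S"}):
    print(rule)
```

On G1 this yields S -> A X1 and X1 -> B C, which matches the slide up to the name of the new nonterminal.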

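Example: balanced brackets
• Language of strings such as:
  • [], [[[[]]]], [[[]][]], [[[][]][[]]]
• Can’t have: ][, [[], [][[
• Original CFG:
    S -> [S] | SS | ε
• In CNF:
    S -> AB | AC | SS | ε
    C -> SB
    A -> [
    B -> ]
• (Figure: parse tree for [[]] under the CNF grammar. S -> A C, where A -> [ and C -> S B; the inner S -> A B yields [], and the final B -> ].)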

Key issues in parsing
• Handle ambiguity
  • Want to be able to find all possible parses of a sentence
  • (LING 539: find the single best parse)
• Avoid re-parsing constituents
• Efficiency
  • Don’t take forever
  • An algorithm is efficient if its complexity is polynomial: O(n^k) for some constant k

CKY: Cocke-Kasami-Younger
• CKY parsing algorithm
• It is a chart parser (chart parsers in general can combine top-down and bottom-up parsing; CKY fills its chart bottom-up)
• More generally, it is a dynamic programming algorithm
• Requires a CFG in Chomsky Normal Form
• Complexity: O(N^3), where N is the length of the sentence
• Discovers all possible parses
• Parses each constituent only once
• Independently invented by:
  • Cocke
  • Kasami & Younger
• Sometimes called “CYK”

Dynamic programming parsing
• Problem: exponential number of parses for a sentence
  • Don’t want to enumerate every single parse
  • Want to reuse parsed constituents in subsequent parses
  • (These are problems of top-down and bottom-up parsing)
• Store parses of constituents in a table
  • When parsing a constituent, look up its parse in the table rather than parsing it again (just like memoized Fibonacci)
• Different parses can be computed from the table
• O(N^3) time to fill the table, which compactly represents all possible parses of a sentence
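For reference, the memoized Fibonacci analogy looks roughly like the sketch below, which uses the standard-library `functools.lru_cache` as the lookup table. The second function, `num_bracketings` (a hypothetical name), is added only to illustrate how quickly the number of binary-branching tree shapes grows with sentence length; it counts shapes, not parses licensed by any particular grammar.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Each fib(n) is computed once; later calls are table lookups."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


@lru_cache(maxsize=None)
def num_bracketings(n):
    """Number of binary-branching tree shapes over n words."""
    if n == 1:
        return 1
    # Choose a split point k, then combine the counts for both halves.
    return sum(num_bracketings(k) * num_bracketings(n - k) for k in range(1, n))


print(fib(30))             # 832040
print(num_bracketings(7))  # 132
```

The same idea drives CKY: the result for each span is stored once and reused by every larger span that contains it.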

Spans of constituents
• Indicate span of each constituent through beginning and ending indices
  • i [0:1]
  • shot [1:2]
  • an elephant [2:4]
  • shot an elephant [1:4]
  • i shot an elephant in my pajamas [0:7]
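The span notation lines up with Python list slicing: a span [i:j] covers the tokens from position i up to, but not including, position j. A small sketch, assuming the sentence is tokenized by whitespace:

```python
tokens = "i shot an elephant in my pajamas".split()

# Each span (i, j) selects tokens[i:j].
spans = [(0, 1), (1, 2), (2, 4), (1, 4), (0, 7)]
for i, j in spans:
    print((i, j), " ".join(tokens[i:j]))
```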

Well-formed substring table
• Records all constituents in the sentence
• Size N x N upper triangular table/chart/matrix
  • Vertical axis: starting position of constituent
  • Horizontal axis: ending position of constituent
• Cell [i, j] in table:
  • Contains a list of nonterminals for constituent(s) that begin at index i and end at index j
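One possible in-memory layout for such a table is a grid of sets, sketched below. As a simplifying assumption, the grid here is (N+1) x (N+1) so that positions 0..N can be used directly as indices; only the cells with i < j are ever filled. The entries are recorded by hand purely to show the indexing.

```python
tokens = "i shot an elephant in my pajamas".split()
n = len(tokens)

# table[start][end] holds the set of nonterminals spanning tokens[start:end].
table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]

# Record a few constituents by hand.
table[0][1].add("NP")  # i
table[1][2].add("V")   # shot
table[2][4].add("NP")  # an elephant
table[1][4].add("VP")  # shot an elephant

# List the non-empty cells of the upper triangle.
for i in range(n + 1):
    for j in range(i + 1, n + 1):
        if table[i][j]:
            print((i, j), " ".join(tokens[i:j]), sorted(table[i][j]))
```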

Grammar:
    S -> NP VP
    PP -> P NP
    NP -> DT N | DT X | 'i'
    X -> N PP
    VP -> V NP | VP PP
    DT -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'

Chart (row = start position, column = end position):
   1 i        2 shot     3 an       4 elephant 5 in       6 my       7 pajamas
0  NP
1             V                     VP
2                        DT         NP
3                                   N
4                                              P                     PP
5                                                         DT         NP
6                                                                    N

• Span of V is (1, 2), span of NP is (2, 4), so span of VP is (1, 4)
• Span of P is (4, 5), span of NP is (5, 7), so span of PP is (4, 7)

Chart (row = start position, column = end position):
   1 i        2 shot     3 an       4 elephant 5 in       6 my       7 pajamas
0  NP                               S
1             V                     VP
2                        DT         NP
3                                   N                                X
4                                              P                     PP
5                                                         DT         NP
6                                                                    N

• Span of NP is (0, 1), span of VP is (1, 4), so span of S is (0, 4)
• Span of N is (3, 4), span of PP is (4, 7), so span of X is (3, 7)

Chart (row = start position, column = end position):
   1 i        2 shot     3 an       4 elephant 5 in       6 my       7 pajamas
0  NP                               S
1             V                     VP
2                        DT         NP                               NP
3                                   N                                X
4                                              P                     PP
5                                                         DT         NP
6                                                                    N

• Span of DT is (2, 3), span of X is (3, 7), so span of NP is (2, 7)

Chart (row = start position, column = end position):
   1 i        2 shot     3 an       4 elephant 5 in       6 my       7 pajamas
0  NP                               S
1             V                     VP                               VP
2                        DT         NP                               NP
3                                   N                                X
4                                              P                     PP
5                                                         DT         NP
6                                                                    N

The VP in cell [1, 7] can be built in two ways:
• Span of V is (1, 2), span of NP is (2, 7), so span of VP is (1, 7)
• Span of VP is (1, 4), span of PP is (4, 7), so span of VP is (1, 7)

S in upper-right corner indicates existence of parse of entire sentence

Chart (row = start position, column = end position):
   1 i        2 shot     3 an       4 elephant 5 in       6 my       7 pajamas
0  NP                               S                                S
1             V                     VP                               VP
2                        DT         NP                               NP
3                                   N                                X
4                                              P                     PP
5                                                         DT         NP
6                                                                    N

• Span of NP is (0, 1), span of VP is (1, 7), so span of S is (0, 7)

CKY: another illustration
• Sentence: The man ate a cookie
• First, add the lexical edges (length 1)
• Then, for each w, add edges of length w, using results from previous iterations
• (Figure: chart over positions 0 to 5 with axes i and j, showing the edges added for w = 2, w = 3, w = 4, and w = 5.)

Pseudocode for CKY
1. Base case: lower diagonal of table
     For all i, add A to table[i, i+1] for each rule A -> w_i
     (w_i = the word spanning positions i to i+1)
2. Inductive case: rest of table
     for w = 2 to N:
       for i = 0 to N-w:
         for k = 1 to w-1:
           if: A -> B C and B ∈ table[i, i+k] and C ∈ table[i+k, i+w]
           then: add A to table[i, i+w]
3. If S ∈ table[0, N], return True
• N = length of input sentence
• w = width of constituent we are looking for
• i = starting location of possible constituent
• k = offset of the split point between two sub-constituents
• 3 loops, O(N^3)
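The pseudocode translates fairly directly into Python. The following is a sketch rather than a reference implementation: the function name `cky_recognize` and the grammar encoding (a dict from words to sets of preterminals, plus a list of (A, B, C) triples for rules A -> B C) are assumptions made for this example. Note that the innermost loop over rules adds a grammar-size factor on top of the three loops over w, i, and k.

```python
def cky_recognize(tokens, lexical, binary, start="S"):
    """Fill the CKY table; return (accepted, table)."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]

    # 1. Base case: cells of width 1 get the preterminals of each word.
    for i, word in enumerate(tokens):
        table[i][i + 1] = set(lexical.get(word, ()))

    # 2. Inductive case: widths 2..N, all start positions, all split points.
    for w in range(2, n + 1):
        for i in range(0, n - w + 1):
            for k in range(1, w):
                for a, b, c in binary:
                    if b in table[i][i + k] and c in table[i + k][i + w]:
                        table[i][i + w].add(a)

    # 3. Accept if the start symbol spans the whole sentence.
    return start in table[0][n], table


lexical = {"i": {"NP"}, "an": {"DT"}, "my": {"DT"}, "elephant": {"N"},
           "pajamas": {"N"}, "shot": {"V"}, "in": {"P"}}
binary = [("S", "NP", "VP"), ("PP", "P", "NP"), ("NP", "DT", "N"),
          ("NP", "DT", "X"), ("X", "N", "PP"), ("VP", "V", "NP"),
          ("VP", "VP", "PP")]

accepted, table = cky_recognize("i shot an elephant in my pajamas".split(),
                                lexical, binary)
print(accepted)     # True
print(table[1][7])  # {'VP'}
```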

Return value
• 3. If S ∈ table[0, N], return True
• The algorithm as presented so far only tells us that there exists a successful parse for the entire sentence.
• Want to return the parse tree
  • Need backpointers (next time)
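As a preview, one possible way to add backpointers is sketched below. This is an assumption about how it could be done, not necessarily the version presented next time: for every (start, end, nonterminal) entry, the hypothetical `cky_parse` records how it was built, and trees are then read off recursively as nested tuples.

```python
def cky_parse(tokens, lexical, binary, start="S"):
    """Return all parse trees of tokens as nested tuples."""
    n = len(tokens)
    # back[(i, j, A)] lists the ways A was built over span (i, j):
    # either the word itself (width 1) or a (split, B, C) triple.
    back = {}
    for i, word in enumerate(tokens):
        for a in lexical.get(word, ()):
            back[(i, i + 1, a)] = [word]
    for w in range(2, n + 1):
        for i in range(n - w + 1):
            j = i + w
            for k in range(i + 1, j):
                for a, b, c in binary:
                    if (i, k, b) in back and (k, j, c) in back:
                        back.setdefault((i, j, a), []).append((k, b, c))

    def trees(i, j, a):
        # Follow every backpointer for A over (i, j).
        for entry in back.get((i, j, a), []):
            if isinstance(entry, str):
                yield (a, entry)
            else:
                k, b, c = entry
                for left in trees(i, k, b):
                    for right in trees(k, j, c):
                        yield (a, left, right)

    return list(trees(0, n, start))


lexical = {"i": {"NP"}, "an": {"DT"}, "my": {"DT"}, "elephant": {"N"},
           "pajamas": {"N"}, "shot": {"V"}, "in": {"P"}}
binary = [("S", "NP", "VP"), ("PP", "P", "NP"), ("NP", "DT", "N"),
          ("NP", "DT", "X"), ("X", "N", "PP"), ("VP", "V", "NP"),
          ("VP", "VP", "PP")]

parses = cky_parse("i shot an elephant in my pajamas".split(), lexical, binary)
print(len(parses))  # 2
for tree in parses:
    print(tree)
```

The two trees correspond to the two ways of building the VP over span (1, 7): attaching the PP inside the object NP, or attaching it to the VP.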

Long assignment #7
• Due 11/5
