
Lambert Schomaker


Presentation Transcript


  1. KI2 – 5 Grammar inference Lambert Schomaker Kunstmatige Intelligentie / RuG

  2. Grammar inference (GI)
  • methods aimed at uncovering the grammar which underlies an observed sequence of tokens
  • two variants:
    • explicit, formal GI: deterministic token generators
    • implicit, statistical GI: stochastic token generators

  3. Grammar inference
  AABBCCAA..(?).. what’s next?
  ABA → 1A 1B 1A
  AABBAA → 2A 2B 2A
  or → AAB (+ mirror-symmetric) → (2A B) (+ mirrored)
  • operations: repetition, mirroring, insertion, substitution

  4. Strings of tokens
  • DNA: ACTGAGGACCTGAC…
  • output of speech recognizers
  • words from an unknown language
  • tokenized patterns in the real world

  5. Strings of tokens (same bullets as slide 4, with a figure: a real-world pattern tokenized into the sequence A B A)

  6. Strings of tokens (same bullets, with the figure now annotated: A B A → Symm(B,A))

  7. GI
  • induction of structural patterns from observed data
  • representation by a formal grammar
  versus:
  • emulating the underlying grammar without making the rules explicit (NN, HMM)

  8. GI, the engine (diagram): Data → [Grammar Induction] → grammatical rules
  • aaabbb → (seq (repeat 3 a) (repeat 3 b))
  • ab → (seq a b)
  • abccba → (symmetry (repeat 2 c) (seq a b))
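As a concrete illustration of the Data → rules arrow above (not part of the original slides), here is a minimal Python sketch that induces only the simplest rule type, repetition, and emits it in the (seq (repeat n x) …) notation shown on the slide; symmetry detection is omitted, and the function name induce_repeats is invented for this example.

```python
from itertools import groupby

def induce_repeats(tokens):
    """Describe a token string as (seq (repeat n x) ...), covering only
    run-length structure (no symmetry, insertion or substitution)."""
    parts = []
    for token, run in groupby(tokens):
        n = len(list(run))
        parts.append(token if n == 1 else f"(repeat {n} {token})")
    return f"(seq {' '.join(parts)})"

print(induce_repeats("aaabbb"))  # (seq (repeat 3 a) (repeat 3 b))
print(induce_repeats("ab"))      # (seq a b)
```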

  9. The hypothesis behind GI (diagram): a generator process G0 produces the Data (aaabbb, ab, abccba); Grammar Induction turns the Data into G'. Find G' ≈ G0.

  10. The hypothesis behind GI (same diagram): find G' ≈ G0. It is not claimed that G0 actually ‘exists’.

  11. Learning
  • Until now it was implicitly assumed that the data consists of positive examples
  • A very large amount of data is needed to induce an underlying grammar
  • It is difficult to find a good approximation to G0 if there are no negative examples: e.g. “aaxybb does NOT belong to the grammar”

  12. Learning… (diagram): convergence G0 = G* is assumed for infinite N
  sample1 → G1
  sample2 → G1+2
  sample3 → G1+2+3
  …
  sampleN → G*

  13. Learning… (Convergence G0 = G* is assumed for infinite N.) More realistic: a PAC, probably approximately correct, G* (same diagram as the previous slide).

  14. PAC GI
  L(G0): the language generated by G0
  L(G*): the language explained by G*
  P[ p(L(G0) ⊕ L(G*)) < ε ] > (1 − δ)

  15. PAC GI
  L(G0): the language generated by G0
  L(G*): the language explained by G*
  The probability that “the probability of finding elements in L(G0) xor L(G*) is smaller than ε” will be larger than 1 − δ:
  P[ p(L(G0) ⊕ L(G*)) < ε ] > (1 − δ)
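The p(·) in this criterion is the probability mass, under the distribution that produces the data, of strings on which the two languages disagree. As a toy Monte-Carlo illustration (not from the slides: the two example languages, the uniform string distribution and all names below are invented), that quantity can be estimated by sampling:

```python
import random
import re

# Stand-ins for the 'true' generator G0 and the induced grammar G*:
# membership tests for two (arbitrarily chosen) regular languages.
in_L0    = lambda s: re.fullmatch(r"ab*a", s) is not None        # L(G0)
in_Lstar = lambda s: re.fullmatch(r"ab{0,3}a", s) is not None    # L(G*)

rng = random.Random(0)

def sample_string(max_len=8):
    """Draw a string over {a, b} with a random length from 0 to max_len."""
    return "".join(rng.choice("ab") for _ in range(rng.randrange(max_len + 1)))

# Estimate p(L(G0) xor L(G*)): the fraction of sampled strings on which
# the two membership tests disagree.
N = 100_000
disagree = sum(in_L0(s) != in_Lstar(s) for s in (sample_string() for _ in range(N)))
print(disagree / N)   # small value: G* approximates G0 well under this distribution
```

In PAC terms, ε bounds this disagreement probability and 1 − δ is the confidence with which the induction procedure must achieve it.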

  16.–19. Example
  • S+ = { aa, aba, abba, abbba }
  (diagrams: a finite-state acceptor for S+ is built up incrementally, one sample string at a time)
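The acceptor drawings did not survive the transcript; as a rough stand-in (not the slides' own construction), the sketch below builds a prefix-tree acceptor for the same positive sample S+ and tests membership. The generalization step that grammar induction would add, e.g. merging states so that every string of the form a b…b a is accepted, is deliberately left out; the helper names are invented.

```python
def build_prefix_tree(samples):
    """Prefix-tree acceptor: each state is a dict of outgoing transitions;
    the states reached at the end of a sample string are accepting."""
    root, accepting = {}, set()
    for word in samples:
        state = root
        for token in word:
            state = state.setdefault(token, {})   # follow or create the transition
        accepting.add(id(state))
    return root, accepting

def accepts(acceptor, word):
    root, accepting = acceptor
    state = root
    for token in word:
        if token not in state:
            return False
        state = state[token]
    return id(state) in accepting

pta = build_prefix_tree(["aa", "aba", "abba", "abbba"])
print(accepts(pta, "abba"))    # True: in S+
print(accepts(pta, "abbbba"))  # False: accepting it would require generalization (state merging)
```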

  20. Many GI approaches are known (Dupont, 1997)

  21. Second group: Grammar Emulation
  • statistical methods, aiming at producing token sequences with the same statistical properties as the generator grammar G0
  • 1: recurrent neural networks
  • 2: Markov models
  • 3: hidden Markov models
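As a minimal sketch of the statistical route (variant 2 above in its simplest, first-order form; this is illustrative code, not material from the course), a Markov model estimated from a training sequence already reproduces the bigram statistics of its generator:

```python
import random
from collections import defaultdict, Counter

def train_bigram(sequence):
    """Count first-order transitions: how often each token follows each other token."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    return counts

def emulate(counts, start, length, rng=random.Random(0)):
    """Generate a sequence by repeatedly sampling the next token from the learned transitions."""
    out = [start]
    for _ in range(length - 1):
        options = counts.get(out[-1])
        if not options:
            break
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return "".join(out)

counts = train_bigram("aaabbbaaabbbaaabbb")
print(emulate(counts, "a", 12))   # a string with run statistics similar to the training data
```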

  22. Grammar emulation, training (figure): a context window slides over the token sequence ABGBABGACTVYAB … <x>; the grammar emulator must predict the next token x from the window.

  23. Recurrent neural networks for grammar emulation
  • Major types:
    • Jordan (output-layer recurrence)
    • Elman (hidden-layer recurrence)

  24. Jordan MLPs
  • Assumption: the current state is represented by the output-unit activation at the previous time step(s) and by the current input
  (figure: input and state units feed the network; the output at time t is fed back as the state)

  25. Elman MLPs
  • Assumption: the current state is represented by the hidden-unit activation at the previous time step(s) and by the current input
  (figure: input and state units feed the network; the hidden activation at time t is fed back as the state)
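A minimal numpy sketch of the Elman recurrence described above (forward pass only, with untrained random weights; all layer sizes and names are chosen for illustration). A Jordan network would differ only in feeding back the output vector instead of the hidden vector h.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 4                          # one-hot tokens in, token scores out
W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))      # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))     # state (previous hidden) -> hidden
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))     # hidden -> output

def elman_forward(inputs):
    """Elman MLP: the hidden activation of the previous time step acts as the
    'state' that is combined with the current input at every step."""
    h = np.zeros(n_hid)
    outputs = []
    for x in inputs:
        h = np.tanh(W_ih @ x + W_hh @ h)   # current input + previous hidden state
        outputs.append(W_ho @ h)           # scores for the next token
    return outputs

# toy sequence of one-hot coded tokens (A, A, B, B, A)
sequence = [np.eye(n_in)[i] for i in (0, 0, 1, 1, 0)]
print(elman_forward(sequence)[-1].round(3))
```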

  26. Markov variants
  • Shannon: a fixed 5-letter window over English text to predict the next letter
  • Variable-Length Markov Models (VLMM) (Guyon & Pereira): the width of the context window used to predict the next token is variable and depends on the statistics of the sequence
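A rough sketch of the variable-length idea (an illustration of the principle only, not Guyon & Pereira's algorithm; the pruning threshold and function names are invented): keep counts for contexts of every length up to a maximum, prune rarely seen contexts, and predict from the longest context that survives.

```python
from collections import defaultdict, Counter

def train_contexts(text, max_len=5, min_count=2):
    """Count next-letter frequencies for every context of 0..max_len letters,
    then keep only contexts seen at least min_count times."""
    table = defaultdict(Counter)
    for i in range(len(text)):
        for k in range(max_len + 1):
            if i - k < 0:
                break
            table[text[i - k:i]][text[i]] += 1
    return {ctx: c for ctx, c in table.items() if sum(c.values()) >= min_count}

def predict(table, history, max_len=5):
    """Back off to the longest suffix of the history that is a known context."""
    for k in range(min(max_len, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in table:
            return table[ctx].most_common(1)[0][0]
    return None

table = train_contexts("the cat sat on the mat ")
print(predict(table, "cat sat on th"))   # 'e': backs off to the longest reliable context, "th"
```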

  27. Results
  • Example output of a letter VLMM, trained on news item texts (250 MB training set):
  “liferator member of flight since N. a report the managical including from C all N months after dispute. C and declaracter leaders first to do a lot of though a ground out and C C pairs due to each planner of the lux said the C nailed by the defender begin about in N. the spokesman standards of the arms responded victory the side honored by the accustomers was arrest two mentalisting the romatory accustomers of ethnic C C the procedure.”
