
2010/2/4 Yi-Ting Huang

Zanzotto, F. M. & Moschitti, A. Automatic Learning of Textual Entailments with Cross-pair Similarities. ACL 2006. Pennacchiotti, M. & Zanzotto, F. M. Learning Shallow Semantic Rules for Textual Entailment. Recent Advances in Natural Language Processing (RANLP 2007).


Presentation Transcript


  1. Zanzotto, F. M. & Moschitti, A. Automatic Learning of Textual Entailments with Cross-pair Similarities. ACL 2006. Pennacchiotti, M. & Zanzotto, F. M. Learning Shallow Semantic Rules for Textual Entailment. Recent Advances in Natural Language Processing (RANLP 2007). 2010/2/4 Yi-Ting Huang

  2. Recognizing Textual Entailment (RTE) • What is RTE: • To determine whether or not a text T entails a hypothesis H. • Example: • T1: “At the end of the year, all solid companies pay dividends.” • H1: “At the end of the year, all solid insurance companies pay dividends.” • H2: “At the end of the year, all solid companies pay cash dividends.” • Why RTE is important: • It allows us to model more accurate semantic theories of natural language and to design important applications (QA, IE, etc.)

  3. Idea… (1/2) • T3 → H3? • T3: “All wild animals eat plants that have scientifically proven medicinal properties.” • H3: “All wild mountain animals eat plants that have scientifically proven medicinal properties.” • T1: “At the end of the year, all solid companies pay dividends.” • H1: “At the end of the year, all solid insurance companies pay dividends.” • Yes! T3 is structurally (and somewhat lexically) similar to T1, and H3 is more similar to H1 than to H2. Thus, from T1 → H1 we may extract rules to derive that T3 → H3.

  4. Idea… (2/2) • We should rely not only on an intra-pair similarity between T and H, but also on a cross-pair similarity between two pairs (T’, H’) and (T’’, H’’). [Figure: intra-pair similarity vs. cross-pair similarity]

  5. Research purpose • In this paper, we define a new cross-pair similarity measure based on text and hypothesis syntactic trees, and we use this similarity together with traditional intra-pair similarities to define a novel semantic kernel function. • We experimented with this kernel using Support Vector Machines on the test sets of the Recognizing Textual Entailment (RTE) challenges.

  6. Term definition • Given a word wt in the text T and a word wh in the hypothesis H, an anchor is a pair (wt, wh), e.g. (companies, companies). • The co-indexes that mark anchored words in the syntactic trees are called placeholders.

  7. Intra-pair similarity (1/3) • Intra-pair similarity: to anchor the content words in the hypothesis WH to words in the text WT. • Each word wh in WH is connected to all words wt in WT that have the highest similarity simw(wt, wh). As a result, we have a set of anchors and the subset of words in T connected with a word in H. • We select the final anchor set as the bijective relation between WH and this subset of WT that best satisfies a locality criterion: whenever possible, words of a constituent in H should be related to words of a single constituent in T.
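
A minimal sketch of this anchoring step (not from the slides; the simw stub and the helper name find_anchors are hypothetical), in Python:

    # Sketch of intra-pair anchoring: each content word wh in H is linked
    # to the text words wt with the highest similarity simw(wt, wh).

    def simw(wt, wh):
        # Stub: exact surface match only; the full cascade (lemmas, WordNet,
        # edit distance) is described on the next two slides.
        return 1.0 if wt == wh else 0.0

    def find_anchors(text_words, hyp_words):
        anchors = []
        for wh in hyp_words:
            scores = [(simw(wt, wh), wt) for wt in text_words]
            best = max(score for score, _ in scores)
            if best > 0:
                # Keep all text words tied at the maximum; a later step picks
                # the bijective subset that best satisfies the locality criterion.
                anchors.extend((wt, wh) for score, wt in scores if score == best)
        return anchors

    print(find_anchors(
        "all solid companies pay dividends".split(),
        "all solid insurance companies pay dividends".split()))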

  8. Intra-pair similarity (2/3) • Two words are maximally similar if they have the same surface form sw or the same lemmatized form lw. • Otherwise, we use one of the WordNet (Miller, 1995) similarities, indicated with d(lw, lw’) (Corley and Mihalcea, 2005). We adopted the wn::similarity package (Pedersen et al., 2004) to compute the Jiang & Conrath (J&C) distance (Jiang and Conrath, 1997). • We also use WordNet 2.0 (Miller, 1995) to extract different relations between words, such as the lexical entailment between verbs (Ent) and the derivational relation between words (Der). • We use the edit distance measure lev(wt, wh) to capture the similarity between words missed by the previous analyses because of misspellings or derivational forms not coded in WordNet.
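
A sketch of this similarity cascade (assumptions: the WordNet J&C step is stubbed out, since the slides use the external wn::similarity package; lev is a standard Levenshtein distance):

    def lev(a, b):
        # Standard Levenshtein edit distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def wordnet_jc(lw1, lw2):
        return 0.0  # stub for the Jiang & Conrath similarity (wn::similarity)

    def simw(wt, wh, lemma=str.lower):
        if wt == wh or lemma(wt) == lemma(wh):
            return 1.0                    # same surface form or lemma
        d = wordnet_jc(lemma(wt), lemma(wh))
        if d > 0:
            return d                      # WordNet-based similarity
        m = max(len(wt), len(wh))
        return 1 - lev(wt, wh) / m        # fallback: edit-distance similarity

    print(simw("dividends", "dividnds"))  # catches a misspelling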

  9. Intra-pair similarity (3/3) • The above word similarity measure can be used to compute the similarity between T and H. In line with (Corley and Mihalcea, 2005), the score is the idf-weighted average of the anchor similarities: s(T, H) = Σ(wt,wh)∈A simw(wt, wh) · idf(wh) / Σwh∈WH idf(wh), where idf(w) is the inverse document frequency of the word w. • A selected portion of the British National Corpus is used to compute the inverse document frequency (idf). We assigned the maximum idf to words not found in the BNC.
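
A sketch of this idf-weighted score, following the reconstruction above (the idf values below are made up for illustration):

    # Intra-pair score s(T, H): anchor similarities weighted by the idf of
    # the hypothesis word, normalized by the total idf mass of H.
    MAX_IDF = 10.0  # words missing from the corpus get the maximum idf

    def s(anchors, idf, hyp_words):
        # anchors: list of ((wt, wh), simw) pairs; idf: dict word -> idf value
        num = sum(sim * idf.get(wh, MAX_IDF) for (wt, wh), sim in anchors)
        den = sum(idf.get(wh, MAX_IDF) for wh in hyp_words)
        return num / den if den else 0.0

    idf = {"companies": 4.2, "pay": 3.0, "dividends": 6.1}
    anchors = [(("companies", "companies"), 1.0), (("pay", "pay"), 1.0),
               (("dividends", "dividends"), 1.0)]
    print(s(anchors, idf, ["companies", "pay", "dividends"]))  # 1.0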

  10. Cross-pair syntactic kernels • Goal: to capture the number of common subtrees between the texts (T’, T’’) and between the hypotheses (H’, H’’) that share the same anchoring scheme. • This requires deriving the best mapping between the two placeholder sets. • The result is a cross-pair similarity between (T’, H’) and (T’’, H’’).

  11. The best mapping • Let A’ and A’’ be the placeholders of (T’, H’) and (T’’, H’’), with |A’| ≥ |A’’|; we align a subset of A’ to A’’. • Let C be the set of all bijective mappings from a’ ⊆ A’, with |a’| = |A’’|, to A’’; an element c ∈ C is a substitution function. • The best alignment maximizes the summed tree similarity: KS((T’, H’), (T’’, H’’)) = maxc∈C ( KT(t(H’, c), t(H’’, i)) + KT(t(T’, c), t(T’’, i)) ), where (i) t(S, c) returns the syntactic tree of the text S with placeholders replaced by means of the substitution c, (ii) i is the identity substitution, and (iii) KT(t1, t2) is a function that measures the similarity between the two trees t1, t2.
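
A brute-force sketch of this maximization (the toy score function stands in for the two KT terms; all names here are illustrative):

    # Brute-force search for the best placeholder substitution c in C.
    from itertools import permutations

    def best_alignment(A1, A2, score):
        # A1, A2: placeholder lists with len(A1) >= len(A2);
        # score(c) stands in for KT(t(H', c), t(H'', i)) + KT(t(T', c), t(T'', i)).
        best, best_c = float("-inf"), None
        for subset in permutations(A1, len(A2)):
            c = dict(zip(subset, A2))    # a bijective mapping a' -> A''
            if score(c) > best:
                best, best_c = score(c), c
        return best_c, best

    # Toy score: reward mapping placeholders whose anchored words match.
    words1 = {1: "companies", 2: "pay", 3: "dividends"}
    words2 = {"a": "animals", "b": "eat", "c": "pay"}
    score = lambda c: sum(words1[p] == words2[q] for p, q in c.items())
    print(best_alignment([1, 2, 3], ["a", "b", "c"], score))

Note that |C| grows factorially, which is acceptable only because each pair has few placeholders.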

  12. Example • [Figure: the best alignment between the placeholders of two pairs, obtained with the substitution c; notation as defined on the previous slide.]

  13. Cross-pair similarity • As KT(t1, t2) we use the tree kernel function defined in (Collins and Duffy, 2002): KT(t1, t2) = Σn1∈Nt1 Σn2∈Nt2 Δ(n1, n2), where Nt1 and Nt2 are the sets of the t1’s and t2’s nodes, respectively, and Δ(n1, n2) = Σi λ^l(fi) Ii(n1) Ii(n2). • Given a subtree space F = {f1, f2, …}, the indicator function Ii(n) is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise. • In turn, l(fi) is the number of levels of the subtree fi. • Thus λ^l(fi) assigns a lower weight to larger fragments. When λ = 1, Δ is equal to the number of common fragments rooted at nodes n1 and n2.
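
A compact sketch of this kernel over simple tuple trees (label, children…); a simplified reading of the Collins & Duffy recursion, not the authors’ implementation:

    # Sketch of the Collins & Duffy (2002) tree kernel. Trees are nested
    # tuples: (label, child1, child2, ...); leaves are bare word strings.

    def delta(n1, n2, lam=0.4):
        # Weighted count of common fragments rooted at n1 and n2.
        if n1[0] != n2[0] or len(n1) != len(n2):
            return 0.0
        prod = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, str) or isinstance(c2, str):
                if c1 != c2:
                    return 0.0   # terminal symbols are part of the production
            elif c1[0] != c2[0]:
                return 0.0       # productions differ: no common fragment
            else:
                prod *= 1 + delta(c1, c2, lam)
        return prod

    def nodes(t):
        if not isinstance(t, str):
            yield t
            for c in t[1:]:
                yield from nodes(c)

    def KT(t1, t2, lam=0.4):
        return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

    t1 = ("S", ("NP", "companies"), ("VP", "pay", ("NP", "dividends")))
    t2 = ("S", ("NP", "animals"),   ("VP", "eat", ("NP", "plants")))
    print(KT(t1, t2))  # only the S -> NP VP production matches: 0.4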

  14. Example • [Figure: computing KT(t1, t2); Ii(n) = 1 if the fragment fi is rooted at node n, 0 otherwise.]

  15. Kernel function in SVM • The KT function has been proven to be a valid kernel, i.e. its associated Gram matrix is positive semidefinite. • Some basic operations on kernel functions, e.g. the sum, are closed with respect to the set of valid kernels. • Hence the cross-pair similarity is a valid kernel, and we can use it in kernel-based machines like SVMs. • We developed SVM-light-TK (Moschitti, 2006), which encodes the basic tree kernel function, KT, in SVM-light (Joachims, 1999). We used this software to implement the Ks, K1, K2, and Ks + Ki kernels (i ∈ {1, 2}).
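
A rough stand-in for this setup using scikit-learn’s precomputed-kernel interface (an assumption; the authors used SVM-light-TK), showing that summed Gram matrices are still valid kernels:

    # Sketch: the sum of two valid kernels is a valid kernel, so a Gram
    # matrix G = G_intra + G_cross can be fed to any kernel machine.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))        # toy pair representations
    y = rng.integers(0, 2, size=20)     # entails / does-not-entail labels

    def gram(X, kernel):
        return np.array([[kernel(a, b) for b in X] for a in X])

    G_intra = gram(X, lambda a, b: float(a @ b))                          # stand-in for Ks
    G_cross = gram(X, lambda a, b: float(np.exp(-((a - b) ** 2).sum()))) # stand-in for KT

    clf = SVC(kernel="precomputed").fit(G_intra + G_cross, y)
    print(clf.predict((G_intra + G_cross)[:3]))  # predictions for 3 training pairs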

  16. Limit • The limitation of the cross-pair similarity measure is that placeholders do not convey the semantic knowledge needed in cases such as the above, where the semantic relation between connected verbs is essential.

  17. Adding semantic information • Defining anchor types: a valuable source of relation types among words is WordNet. • similar: when words are similar according to the WordNet similarity measure; it captures synonymy and hyperonymy. • surface matching: when words or lemmas match; it captures semantically equivalent words.

  18. Augmenting placeholders with anchor types (1/2) • Typed anchor model (ta): anchor types augment only the pre-terminal nodes of the syntactic tree. • Propagated typed anchor model (tap): anchor types climb up in the syntactic tree according to specific climbing-up rules, similarly to what is done for placeholders. • Climbing-up rules: anchor types climb up in the tree so that constituent nodes take the placeholder (and type) of their semantic heads.
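
A sketch of this climbing-up step on tuple trees (the rightmost-child head rule here is a naive stand-in for real head rules):

    # Propagating typed anchors up a syntactic tree: each constituent
    # node takes the anchor type of its semantic head child.

    def propagate(tree, types):
        # tree: (label, children...) with string leaves; types maps anchored
        # words to a type, e.g. "=" (surface match) or "~" (WordNet-similar).
        if isinstance(tree, str):
            return {"label": tree, "type": types.get(tree), "children": []}
        children = [propagate(c, types) for c in tree[1:]]
        return {"label": tree[0],
                "type": children[-1]["type"],   # climb up from the head child
                "children": children}

    t = ("S", ("NP", "companies"), ("VP", "pay", ("NP", "dividends")))
    out = propagate(t, {"companies": "=", "pay": "=", "dividends": "~"})
    print(out["type"], out["children"][1]["type"])  # S and VP inherit "~"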

  19. Propagated typed anchor model (tap) • If two fragments have the same syntactic structure S(NP, VP(AUX, NP)), and there is a semantic equivalence (=) on all constituents, then entailment holds.

  20. Augmenting placeholders with anchor types (2/2) • New rule: if two typed anchors climb up to the same node, give precedence to the one with the highest ranking in the ordered set of types.

  21. Experiment I • Data sets: D1, T1 and D2, T2 are the development and test sets of the first (Dagan et al., 2005) and second (Bar-Haim et al., 2006) challenges. • Positive examples constitute 50% of the data. • ALL is the union of D1, D2, and T1, which we also split 70%-30%. • D2(50%)’ and D2(50%)’’ is a random split of D2 (a homogeneous split). • Tools: the Charniak parser (Charniak, 2000) and the morpha lemmatiser (Minnen et al., 2001) carry out the syntactic and morphological analysis.

  22. Results in Experiment I • [Table of accuracy results; the columns referenced on the next slide refer to this table.]

  23. Findings in Experiment I • The dramatic improvement observed in (Corley and Mihalcea, 2005) on the dataset “Train:D1-Test:T1” comes from the idf rather than from the use of the J&C similarity (second vs. third columns). • Our approach (last column) is significantly better than all the other methods, as it provides the best result for each combination of training and test sets. • Comparing the averages over all datasets, our system improves on all the methods by at least 3 absolute percentage points. • The accuracy produced by syntactic trees with placeholders is higher than the one obtained with syntactic trees only.

  24. Experiment II • We compare our ta and tap approaches with other strategies for RTE: lexical overlap, syntactic matching, and entailment triggering. • Data set: D2, T2 (Bar-Haim et al., 2006) from the RTE-2 challenge. • Here we adopt 4-fold cross-validation.

  25. Experiment II • Variables: • tree: the first algorithm. • lex: lexical overlap similarity (Corley and Mihalcea, 2005). • synt: syntactic matching. synt(T, H) computes the score by comparing all the substructures of the dependency trees of T and H: synt(T, H) = KT(T, H)/|H|, where |H| is the number of subtrees in H. • lex+trig: • SVO, which tests whether T and H share a similar subject-verb-object construct; • Apposition, which tests whether H is a sentence headed by the verb to be and T contains an apposition that states H; • Anaphora, which tests whether the SVO sentence in H has a similar wh-sentence in T whose wh-pronoun may be resolved in T with a word similar to the object or the subject of H.
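
A self-contained sketch of the |H| normalizer, shown on a constituency tree for simplicity (the fragment count uses the usual recursion f(n) = ∏j (1 + f(childj)); KT would be the tree kernel sketched earlier):

    # synt(T, H) = KT(T, H) / |H|, where |H| is the number of subtree
    # fragments in H. Only the fragment counter is shown here.

    def fragments_rooted_at(t):
        if isinstance(t, str):
            return 0                    # bare words root no fragments
        prod = 1
        for c in t[1:]:
            prod *= 1 + fragments_rooted_at(c)
        return prod

    def count_subtrees(t):
        if isinstance(t, str):
            return 0
        return fragments_rooted_at(t) + sum(count_subtrees(c) for c in t[1:])

    H = ("S", ("NP", "companies"), ("VP", "pay", ("NP", "dividends")))
    print(count_subtrees(H))  # 10 fragments: the normalizer |H|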

  26. Results in Experiment II • [Table of accuracy results; the percentage gains cited on the next slides refer to this table.]

  27. Findings in Experiment II • Syntax structure: this demonstrates that syntax is not enough, and that lexical-semantic knowledge, and in particular the explicit representation of word-level relations, plays a key role in RTE (slide callouts: +4.19%, +1.12%).

  28. Findings in Experiment II • Also, tap outperforms lex, supporting a complementary conclusion: lexical-semantic knowledge alone does not cover the entailment phenomenon, but needs some syntactic evidence. • The use of cross-pair similarity together with lexical overlap (lex + tree) is successful, as accuracy improves by +1.87% and +2.33% over the related basic methods (lex and tree, respectively). (Slide callout: +0.66%.)

  29. Conclusions • We have presented a model for the automatic learning of rewrite rules for textual entailment from examples. • For this purpose, we devised a novel, powerful kernel based on cross-pair similarities. • We showed how to effectively integrate semantic knowledge into textual entailment recognition. • We experimented with this kernel using Support Vector Machines on the RTE test sets.

  30. More information • TREE KERNELS IN SVM-LIGHT, which implements the first algorithm, has been released on the website.
