Annotating and analysing IS in corpora of historical English Berlin, 13-14 November 2009. Coreferencing Treebank data using CESAC. Contents. Coreferencing using Cesac. Overview. CESAC - Goals - Coreference types: operationalizing IS - Input and output Inter-rater agreement Example
Annotating and analysing IS in corpora of historical English
Berlin, 13-14 November 2009
Coreferencing Treebank data using CESAC

Contents: overview
• CESAC
  - Goals
  - Coreference types: operationalizing IS
  - Input and output
• Inter-rater agreement
• Example
• Summary and conclusion

CESAC goals
• Overall goal
  - Referring from any one constituent to any other constituent
• More specifically
  - Source and destination: IP/phrase/node or DP/lexeme/endnode
  - Attributes
    • Coreference type
    • Distance measure
    • (NP type: definite, indefinite, etc.)
    • (Animacy)

CESAC coreference types: operationalizing IS
• Two basic rules
  - Do not omit possible coreference information
  - A source should be linked to the nearest possible destination
• Labels encoding different forms of anaphoricity
  - Identity: ‘Jacqueline plays the cello. She is an amazing musician’
  - Cross Speech: ‘John said to Paul: “Why don’t you play the guitar?”’
  - Inferred: ‘Do you see that house? They say the kitchen is extremely spacious’
  - World knowledge (separate category): ‘According to Burt Reynolds, all dogs go to heaven’
• Cross Speech > Identity > Inferred
• Encoding facts vs. encoding interpretations: objective data

CESAC input 1: Penn-Treebank
• Standard Penn-Treebank format
  - Collection of <nodes>
  - Each <node> consists of
    • Brackets: (…)
    • Label: (NP …)
    • Other node: (NP (N …) )
    • Lexeme: (N man)
    • Possibly <lexeme>+<node>: (P to(NP him))
  - Attributes in label
    • (NP-ACC (PRO^A hine))
  - Extra-textual data in CODE nodes
    • (CODE <TEXT: +tyl+aste>)

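The node structure listed above (brackets, a label, then nested nodes and/or lexemes) can be read with a small recursive parser. The following is a minimal sketch, not CESAC's own reader: the function names `tokenize`, `parse`, and `lexemes` are hypothetical, and the sketch assumes that atoms never contain parentheses and that input is well formed (malformed input raises an error).

```python
def tokenize(text):
    """Split a bracketed string into '(', ')' and whitespace-separated atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one node that starts at tokens[pos] == '('.

    Returns (node, next_pos). A node is a dict with a 'label' (None for an
    unlabelled wrapper node) and 'children'; each child is either a lexeme
    string or a nested node dict.
    """
    assert tokens[pos] == "("
    pos += 1
    label = None
    if tokens[pos] not in ("(", ")"):  # the first atom after '(' is the label
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)  # nested node
        else:
            child, pos = tokens[pos], pos + 1  # lexeme
        children.append(child)
    return {"label": label, "children": children}, pos + 1


def lexemes(node):
    """Collect all lexeme strings below a node, in text order."""
    found = []
    for child in node["children"]:
        if isinstance(child, str):
            found.append(child)
        else:
            found.extend(lexemes(child))
    return found


tree, _ = parse(tokenize("(PP (P to) (NP-DAT (N^D scipe)))"))
print(tree["label"], lexemes(tree))
```

Because a child may be a string or a dict, the mixed `<lexeme>+<node>` case `(P to(NP him))` needs no special handling: `to` becomes a string child and `(NP him)` a nested node.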
CESAC input 2: Penn-Treebank
Slide callouts: matching brackets, Node, Label, EndNode

( (CODE <T06080009600,11.4>)
  (IP-MAT (CONJ And)
          (NP-NOM (D^N +t+at) (N^N folc))
          (NP-ACC (PRO^A hine))
          (ADVP-TMP (ADV^T +ta))
          (PP (P mid)
              (NP-DAT (ADJ^D unasecgendlicre) (N^D wur+dmynte)))
          (PP (P to)
              (NP-DAT (N^D scipe)))
          (VBDI gel+addon)
          (. ,))
  (ID coapollo,ApT:11.4.183))

( (IP-MAT (CONJ and)
          (NP-NOM (NR^N Apollonius))
          (NP-ACC-1 (PRO^A hi))
          (VBDI b+ad)
          (IP-INF (NP-ACC-SBJ *ICH*-1)
                  (QP-ACC (Q^A ealle))
                  (VB $gretan)))
  (ID coapollo,ApT:11.4.184))

CESAC output 1: enriched Penn-Treebank
• Penn-Treebank format
• Enriched with coreference information
  - Source node ID
  - Destination node ID
  - Coreference type
  - Coreference distance (derivable)
• Destination node example
  (NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))
• Source node example
  (NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>) (PRO hem) )

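Since the coreference information sits in CODE nodes as `Name="value"` pairs joined by underscores, it can be pulled out of the enriched text with two regular expressions. This is a rough sketch under the assumption that every coreference tag has the shape shown in the two examples above; the helper name `coref_attributes` is hypothetical and not part of CESAC.

```python
import re

# One <Coref_...> tag; the group captures everything up to the closing '>'.
TAG_RE = re.compile(r"<Coref_([^>]*)>")
# One Name="value" pair; names are assumed to be purely alphabetic.
ATTR_RE = re.compile(r'([A-Za-z]+)="([^"]*)"')


def coref_attributes(tree_text):
    """Return one attribute dict per coreference tag, in text order."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in TAG_RE.finditer(tree_text)]


source = '(NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>) (PRO hem) )'
destination = '(NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))'

print(coref_attributes(source))
print(coref_attributes(destination))
```

A tag that carries only `Id` is a pure destination; a tag that also carries `Ref`, `Type`, and `NdDist` is a source pointing at the node whose `Id` equals its `Ref`.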
CESAC output 2: enriched Penn-Treebank
[Tree diagrams: each node before and after a CODE node is added]
• Destination node: (NP-SBJ (NPR Crist)) becomes (NP-SBJ (CODE <Coref Id="310">) (NPR Crist))
• Source node: (NP-OB1 (PRO hem)) becomes (NP-OB1 (CODE <Coref Id="340" Ref="310" Type="Identity">) (PRO hem))
• <node> = one-or-more <node> OR <lexeme>

CESAC output 3: enriched Penn-Treebank
Slide callout: <lexeme> + <node>

( (IP-MAT (CONJ and)
          (NP-NOM *con*
                  (CODE <Coref_Id="1488"_Ref="1489"_Type="Identity"_NdDist="12"_/>))
          (VBD l+adde)
          (NP-ACC (CODE <Coref_Id="1476"_Ref="1477"_Type="Identity"_NdDist="8"_/>)
                  (PRO^A hine))
          (PP (P mid)
              (NP-DAT-RFL (CODE <Coref_Id="1487"_Ref="1488"_Type="Identity"_NdDist="6"_/>)
                          (PRO^D him)))
          (PP (P to)
              (NP-DAT (PRO$ his
                            (CODE <Coref_Id="1486"_Ref="1487"_Type="Identity"_NdDist="5"_/>))
                      (N^D huse))))
  (ID coapollo,ApT:12.16.209))

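Because each source points to one destination, the `Id`/`Ref` pairs in this sentence form chains that can be followed link by link. The sketch below uses the four pairs visible in the tree above; the function name `chain` is hypothetical, and the nodes 1489 and 1477 are assumed to be defined in earlier sentences that are not shown here.

```python
def chain(start, refs):
    """Follow Ref links from `start` until an id with no recorded Ref."""
    path = [start]
    seen = {start}
    while path[-1] in refs:
        nxt = refs[path[-1]]
        if nxt in seen:  # stop rather than loop if the links form a cycle
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# Id -> Ref pairs copied from the CODE nodes of the sentence above.
refs = {1488: 1489, 1476: 1477, 1487: 1488, 1486: 1487}

print(chain(1486, refs))  # his -> him -> *con* -> node 1489
print(chain(1476, refs))  # hine -> node 1477
```

Read this way, `his`, `him`, and the empty subject `*con*` belong to one referent chain, while `hine` belongs to another.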
CESAC output 4: enriched Penn-Treebank
[Tree diagrams: the source node before and after a CODE node is added]
• Source node: (PP (P to) (NP-DAT (PRO$ his))) becomes (PP (P to) (NP-DAT (PRO$ his (CODE <Coref Id="20" Ref="21" Type="Identity">))))
• <node> = one-or-more <node> OR <node> <lexeme> OR <lexeme>

Inter-rater agreement 1
• Two features measured
  - Coreference destination (node ID)
  - Coreference type
• Adapted version of Cohen’s kappa: κ > .6
• Two important problems
  - Identity vs. cross speech
  - Omission of link
• Solutions
  - Create new rule(s)
  - Adapt/specify existing rule(s)

Inter-rater agreement 2
• Tool used to calculate inter-rater agreement concerning
  - Coreference destination (feature 1): κ = .67
  - Coreference type (feature 2): κ = .66

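For orientation, the standard (unadapted) Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The slides do not say how CESAC's version was adapted, so the sketch below shows only the textbook statistic; the function name `cohen_kappa` and the two label lists are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Standard Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented coreference-type labels for four source nodes.
rater_1 = ["Identity", "Identity", "CrossSpeech", "Inferred"]
rater_2 = ["Identity", "CrossSpeech", "CrossSpeech", "Inferred"]
print(round(cohen_kappa(rater_1, rater_2), 2))
```

The same function applies to feature 1 if the destination node IDs chosen by each annotator are passed in as the labels.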
Example 1

Example 2
• Clean text fragment with translation
  Ant warshipe hire easkeđ. Hweonene cumest tu fearlac deađes munegunge. Ich cume he seiđ of helle.
  And Worship him asked, ‘From where come you, Fearlac, death’s reminder?’ ‘I come’, he said, ‘from hell.’
• Text fragment in CESAC coreference file
  170.64 [2031 ant[2033 warschipe][2035 hire] easkeđ. Hweonene[2042[2043 ] cumest [2045 tu][2047 fearlac[2049 deađes munegunge]]] .]
  170.65 [2053[2054 Ich] cume[2057[2058 he] seiđ] of[2063 helle] .]
• Text fragment in Penn-Treebank file

( (IP-MAT (CONJ ant)
          (NP-SBJ (N warschipe))
          (NP-OB1 (PRO hire))
          (VBP easke+d)
          (, .)
          (CP-QUE-SPE (WADVP-1 (WADV Hweonene))
                      (IP-SUB-SPE (ADVP-DIR *T*-1)
                                  (VBP cumest)
                                  (NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>)
                                          (PRO tu))
                                  (NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>)
                                          (N fearlac)
                                          (NP-PRN (CODE <Coref_Id="49"_Ref="50"_Type="Identity"_NdDist="2"_/>)
                                                  (N$ dea+des)
                                                  (N munegunge)))))
          (E_S .))
  (ID CMSAWLES,170.64))

( (IP-MAT-SPE (NP-SBJ (CODE <Coref_Id="347"_Ref="346"_Type="CrossSpeech"_NdDist="9"_/>)
                      (PRO Ich))
              (VBP cume)
              (IP-MAT-PRN (NP-SBJ (CODE <Coref_Id="348"_Ref="347"_Type="CrossSpeech"_NdDist="4"_/>)
                                  (PRO he))
                          (VBP sei+d))
              (PP (P of)
                  (NP (NPR helle)))
              (E_S .))
  (ID CMSAWLES,170.65))

Summary and conclusion
• Annotation program CESAC
  - Input: standard Penn-Treebank
  - Output: relatively easy to analyse
  - Inter-rater agreement measured
• Operationalizing IS
  - 4 coreference types
  - As objective as possible: facts vs. interpretations
• Plans
  - Fixed set of coreference types
  - Larger corpus of coreferenced texts

Thank you for your attention!