
Prague Dependency Treebank 1.0

Prague Dependency Treebank 1.0. CD-ROM presentation, Dec 18, 2000. Functional Generative Description.


Presentation Transcript


  1. Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000

  2. Prague Dependency Treebank 1.0 Functional Generative Description CD-ROM PRESENTATION Dec 18, 2000

  3. Functional Generative Description • theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School • methodological requirements of a formal description • levels: • tectogrammatical (underlying) representations (TRs) with dependency-based syntax • morphemics • phonemics and phonetics • TRs (see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way)

  4. Dependency tree • My younger brother arrived there yesterday. • Linearized form, one-to-one relation: ((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
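
The linearized notation on this slide can be reproduced from a small tree structure. The following is a minimal sketch, assuming that left dependents are written as `(subtree)Functor` and right dependents as `(Functor subtree)`, as the example suggests; the `Node` class and `linearize` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # lexical label plus grammatemes, e.g. "arrive.Pret.Indic"
    functor: str = ""               # relation to the head; empty for the root
    left: List["Node"] = field(default_factory=list)   # dependents preceding the head
    right: List["Node"] = field(default_factory=list)  # dependents following the head


def linearize(node: Node) -> str:
    """Write a tree in bracketed form, with each functor on the side facing the head."""
    parts = [f"({linearize(d)}){d.functor}" for d in node.left]
    parts.append(node.label)
    parts += [f"({d.functor} {linearize(d)})" for d in node.right]
    return " ".join(parts)


brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
tree = Node(
    "arrive.Pret.Indic",
    left=[brother],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
print(linearize(tree))
```

Because every bracket pair corresponds to exactly one dependent subtree, the string and the tree determine each other, which is the one-to-one relation the slide refers to.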

  5. Dependency Tree • labels - lexical meanings (abstract symbols) with indices • functors • subscripts at parentheses oriented towards head • grammatemes - values of morphological categories • Tense, Modality, Number, Definiteness, etc. • projectivity • valency • arguments (inner participants) and adjuncts (circumstantials or 'free modifications') • obligatory and optional with a given head • deletable or not

  6. Dependency Tree • adjuncts • Locative, several Directional and Temporal modifications • Condition, Means, Manner, etc. • participants (arguments) of verbs • Actor/Bearer (underlying subject) • Objective (Patient, underlying direct object) • Addressee (underlying indirect object) • Effect ('second' object: to choose so. as sth.) • Origin (to make sth. out of sth.)

  7. Dependency Tree • Complementations dependent mainly on nouns • inner participants • Material (Partitive): two baskets of sth. • Identity: the river Danube; the notion of operator • free modifications • Possession (Appurtenance): my table; Jim's brother • Restrictive: rich man • Descriptive: the Swedes, who are a Scandinavian nation

  8. Dependency Tree • syntactic grammatemes • Loc, Dir - in, on, under, between... • Regard - with, without • operational (testable) criteria for distinguishing arguments from adjuncts and from each other • deletability (dialogue test)

  9. Simplified valency frames • brother N Appurt • man N • glass N Material • full A Material • read V Act Addr Obj • change V Act Obj Orig Eff • give V Act Addr Obj • (obligatory complementations in blue)
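
The frames on this slide can be held in a simple lookup table. The sketch below is illustrative only: the colour marking that distinguished obligatory from optional complementations is not available in this transcript, so every functor is treated as a plain frame slot, and the `FRAMES` table and `unfilled_slots` function are hypothetical names.

```python
# Valency frames as listed on the slide: (lemma, part of speech) -> functors.
# Assumption: obligatory vs. optional slots are not distinguished here.
FRAMES = {
    ("brother", "N"): ["Appurt"],
    ("man", "N"): [],
    ("glass", "N"): ["Material"],
    ("full", "A"): ["Material"],
    ("read", "V"): ["Act", "Addr", "Obj"],
    ("change", "V"): ["Act", "Obj", "Orig", "Eff"],
    ("give", "V"): ["Act", "Addr", "Obj"],
}


def unfilled_slots(lemma, pos, present_functors):
    """Return the frame slots of a head that no dependent in the tree realizes."""
    frame = FRAMES[(lemma, pos)]
    return [functor for functor in frame if functor not in present_functors]


# A "give" node with only an Actor and an Objective still has an open Addressee slot.
print(unfilled_slots("give", "V", {"Act", "Obj"}))
```

A check of this kind is one way to spot candidates for restored (deleted) items, though deciding whether a missing slot is obligatory still needs the marking from the original frames.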

  10. Topic-focus articulation • contextual boundness • main verb CB/NB (T/F) • dependents to the left/right • communicative dynamism • left-right (mother, sisters, transitive) • partial ordering • underlying word order • left-right • linear ordering • left-to-right order of nodes together with the index T or (prototypically) F indicates the TFA of the sentence (of the TR) • [tree diagram; node labels: T, there, young]

  11. Topic-focus articulation • TFA - one of the basic aspects of underlying structures • [tree diagram; node labels: T, F, there, yesterday, young]

  12. Complex sentence • a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause • My brother, whom you know, arrived there yesterday.

  13. Complex sentence • function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs • Martin came there late, since he had to accompany his sick mother.

  14. Complex sentence • Martin arrived late to the session, since he had to accompany his sick mother. • schematically (morphemes): Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother • dot - close connection of morphemes ('semes')

  15. deleted items restored • order of items - difference between 'underlying' and surface (morphemic) word order • transductive components - Panevová, Oliva, Borota • coordination (multidimensional) • Jim and Mary, who have two children, went to Boston. • the linearized notation is adequate: • ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston) • structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

  16. Prague Dependency Treebank - corpus annotation • an intermediate level - 'analytical' representations • dependency trees, not always projective • nodes for all word tokens, even for punctuation marks • tectogrammatical tree: coordinating conjunction as the head
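
Since the analytical trees are not always projective, it can help to have a concrete test for the property. The following is a minimal sketch using the usual definition (an edge is projective if every word between the head and the dependent is dominated by that head); the head-array encoding and the function name are assumptions made for illustration, not part of the PDT tools.

```python
def is_projective(heads):
    """heads[i] is the 0-based index of the head of word i, or -1 for the root."""

    def dominated_by(node, ancestor):
        # Walk up from node towards the root, looking for ancestor.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly between head and dependent must hang under the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# "My younger brother arrived there yesterday": My, younger -> brother;
# brother, there, yesterday -> arrived (the root).
print(is_projective([2, 2, 3, -1, 3, 3]))

# A made-up four-word tree in which word 1 depends on word 3 across the root (word 2).
print(is_projective([2, 3, -1, 2]))
```

The first call reports a projective tree and the second a non-projective one, because the root lies between word 1 and its head without being dominated by that head.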

  17. Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000

  18. Prague Dependency Treebank 1.0 Morphological Layer CD-ROM PRESENTATION Dec 18, 2000

  19. ACKNOWLEDGEMENTS

  20. ANNOTATED CORPORA • PDT version 1.0, 2000 (1996 - 2000) • Penn Treebank, release 3, 1999 (1989 - 1999)

  21. TAG SETs • Czech - ambiguous inflective language: nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, … novější, novějšího, novějšímu, novějším, … nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, … • English - language with poor inflection: work, works, worked, working

  22. Prague Dependency Treebank 1.0

  23. TEXT SOURCES • PDT: Lidové noviny, Mladá Fronta Dnes, Vesmír, Českomoravský Profit ...taken from Czech National Corpus • Penn Treebank: '88, '89 WSJ articles, Air Travel Information System transcripts, Brown Corpus, Switchboard transcripts

  24. ANNOTATION STRATEGY - Penn Treebank • TEXT • Ken Church's stochastic tagger, Eric Brill's transformation tagger • corrections by annotator (GNU Emacs Lisp based package)

  25. ANNOTATION STRATEGY - PDT • Automatic Morphological Analyzer (AMA) • two independent annotators; Linux, Win tools • differences resolved by third annotator • comparison with the current AMA; manual resolution; Win tools

  26. INTERNAL FORMAT • SGML coding, csts dtd • word/tag(|tag)*

  27. SAMPLES • <s id="ln95040:020-p1s1"> <f>Pokus<l>pokus<t>NNIS1-----A---- <f>o<l>o<t>RR--4---------- <f>zázrak<l>zázrak<t>NNIS4-----A---- <d>.<l>.<t>Z:------------- • The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
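
The PDT sample above packs form, lemma and tag into `<f>`/`<d>`, `<l>` and `<t>` elements. The sketch below shows one way to pull these fields out with a regular expression. It is not one of the CD-ROM tools; the function names are hypothetical, and the names given to the first five tag positions are an assumption taken from the PDT positional tagset rather than from the slides.

```python
import re

# Matches one token: <f> (word form) or <d> (punctuation), then lemma and tag.
TOKEN_RE = re.compile(r"<(f|d)>([^<]*)<l>([^<]*)<t>(\S+)")

# Names of the first five tag positions (assumption, see the note above).
POSITIONS = ["POS", "SubPOS", "Gender", "Number", "Case"]

SAMPLE = (
    "<f>Pokus<l>pokus<t>NNIS1-----A---- "
    "<f>o<l>o<t>RR--4---------- "
    "<f>zázrak<l>zázrak<t>NNIS4-----A---- "
    "<d>.<l>.<t>Z:-------------"
)


def parse_tokens(text):
    """Return one dict per token with its kind, form, lemma and tag."""
    return [
        {"kind": kind, "form": form, "lemma": lemma, "tag": tag}
        for kind, form, lemma, tag in TOKEN_RE.findall(text)
    ]


def describe(tag):
    """Label the first five positions of a positional tag."""
    return dict(zip(POSITIONS, tag))


for token in parse_tokens(SAMPLE):
    print(token["form"], token["lemma"], describe(token["tag"]))
```

Each character of the tag fills one fixed position, so the two noun tokens in the sample differ only in the fifth position (`1` versus `4`).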

  28. CONVERSION • SGML coding -> word/tag • SGML coding -> word/lemma/tag • pdt2wsj.pl, pdt2wsjFLT.pl
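
To make the two target formats concrete, here is a minimal sketch that writes already-parsed tokens as `word/tag` or `word/lemma/tag`. It does not reproduce `pdt2wsj.pl` or `pdt2wsjFLT.pl`, whose exact behaviour is not described on the slide; the function name and the `with_lemma` switch are hypothetical.

```python
def to_slash_format(tokens, with_lemma=False):
    """Join (form, lemma, tag) triples into a slash-separated token line."""
    out = []
    for form, lemma, tag in tokens:
        fields = (form, lemma, tag) if with_lemma else (form, tag)
        out.append("/".join(fields))
    return " ".join(out)


tokens = [
    ("Pokus", "pokus", "NNIS1-----A----"),
    ("o", "o", "RR--4----------"),
]
print(to_slash_format(tokens))
print(to_slash_format(tokens, with_lemma=True))
```

A real converter would also need a policy for forms that themselves contain a slash, which this sketch leaves out.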

  29. DATA SIZE

  30. DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

  31. TOOLS • Automatic Morphological Analyser/Generator of Czech: HMAnalyze.pl, HMGenerate.pl • Dictionary: CZE_a • Remote Access • Czech Taggers: HMM, Exponential

  32. Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000

  33. Prague Dependency Treebank 1.0 Analytical Layer in PDT CD-ROM PRESENTATION Dec 18, 2000

  34. Introduction • Input: morphologically tagged sentences • Graph Editor: "user-friendly" software • Output: ATS structure • "surface" syntax tree structure • nodes labelled by the analytical functions

  35. Two stages (chronologically) • (A) manual "analytic" annotation (ATS) • training data for (B)(a) • (B) • (a) semiautomatic procedure (Collins' parser) • (b) manual correction of (B)(a)

  36. Constraints and limitations • any string has a node of its own • word-form, punctuation mark, etc. • AuxV, AuxP, AuxC, AuxX, AuxG… • reflecting the coordination and apposition relations • so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
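
The suffix convention on this slide (X_Co, X_Ap, X_Pa) can be unpacked mechanically. The sketch below splits a label into its base analytic function and the relation it takes part in; the function name is hypothetical, and reading `Pa` as parenthesis is an assumption drawn from PDT usage, since the slide itself names only coordination and apposition.

```python
# Suffix -> relation marked on the node. "Pa" as parenthesis is an assumption.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(afun):
    """Split e.g. 'Sb_Co' into ('Sb', 'coordination'); plain labels get None."""
    base, sep, suffix = afun.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, SUFFIXES[suffix]
    return afun, None


for label in ["Sb_Co", "Obj_Ap", "Adv", "Ex_D"]:
    print(label, split_afun(label))
```

Labels whose final part is not one of the three suffixes, such as `Ex_D`, are returned unchanged.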

  37. Constraints and limitations • no missing nodes (on the surface) can be added • analytic function Ex_D is used • relations between semi-automatic and manual procedure • 80% of edges are established correctly by the automatic procedure

  38. Project organization • team consisting of 5-6 annotators • handbook for ATS structure annotation • 1999: 100,000 sentences annotated with ATS • tectogrammatical annotation follows

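  39. První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang. (The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.) • [tree diagram; highlighted analytic functions: AuxT, Adv]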

  40. Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000

  41. Prague Dependency Treebank 1.0 From the Analytical towards the Tectogrammatical layer CD-ROM PRESENTATION Dec 18, 2000

  42. Introduction • ATS annotation • nodes: word forms, punctuation, graphical symbols • edges: surface relations • TGTS annotation • nodes: autosemantic words, deletions • edges: deep layer functions

  43. Annotation process • Input: Czech sentence • Tokenization • Morphological tagging and lexical disambiguation • Syntactic parsing and analytic function assignment -> ATS (PDT 1.0) • Tree structure pruning • Attribute assignments -> TGTS

  44. Transition procedure • deterministic procedure operating on trees • macro language for Graph Editor (C++ like) • automatic changes & tools for annotators • Requirements • new attributes for tectogrammatical layer • ATS is recoverable from TGTS • automated to the highest possible degree

  45. New attributes • trlemma - lemma of the original node or lemma composed of joined nodes • morphological grammatemes • gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality • position of the node • functor, topic-focus articulation, syntactic grammateme • type of relation (dependency, coordination, apposition) • phraseme, deletion, quoted word, direct speech • coreference, antecedent

  46. Tree Structure Pruning • U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný. • For those who really start from zero, the tax yield for the state is not substantial.

  47. Tree Structure Pruning • U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný. • For those who really start from zero, the tax yield for the state is not substantial. • [tree diagram; functor label: REG]

  48. Verbal Nodes • … podnikatelé by měli mít daně … • … entrepreneurs should have (their) taxes … • [tree diagram; PRED node with verbmod=CDN, deontmod=HRT]

  49. Attribute Assignments • prepositions stored as fw attribute • quoted words • clause in quotes -> DSP • one pair of quotes in the sentence -> DSPP • string in quotes -> QUOT • gender, number, tense, degcmp, aspect • default values

  50. Macros for Annotators • keyboard shortcuts (in Graph Editor) • structure changes • hide/recover nodes • merge nodes • add new nodes • functor assignments
