Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words

Gertrud Faaß (faaszgd@ims.uni-stuttgart.de), Ulrich Heid (heid@ims.uni-stuttgart.de), Elsabé Taljard (elsabe.taljard@up.ac.za), DJ Prinsloo (danie.prinsloo@up.ac.za)


Presentation Transcript


  1. Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words • Gertrud Faaß faaszgd@ims.uni-stuttgart.de • Ulrich Heid heid@ims.uni-stuttgart.de • Elsabé Taljard elsabe.taljard@up.ac.za • DJ Prinsloo danie.prinsloo@up.ac.za

  2. This Talk • Prologue • Challenges for tagging Sotho texts • Objectives • Descriptive state of the art for tagging of Sotho texts • Tools • Tagsets • The ambiguity problem • Methodology • Results • Conclusions & future work

  3. Nine Official Bantu Languages of SA • Sotho Group • Northern Sotho / Sepedi • Tswana • Southern Sotho • Nguni Group • Zulu • Swati • Xhosa • Ndebele • Outside these two groups • Venda and Tsonga

  4. Noun class system

  5. Concordial agreement – Northern Sotho Taljard and Bosch (2005)

  6. Challenges for tagging • Ambiguity, for example: • function words: -a- is 9-way ambiguous, -go- up to 30 (11, 6, 5, …)-way • Unknown words (N+V) • noun derivation: toropo (town) -> toropong (in/at/to town) • verb derivation: next slides

  7. Challenges: unknown words • Agglutinating languages: extensive use of affixes • Example: rekišeditšwe ‘was / were sold for’ < rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)

  8. Examples of suffixes and combinations for a single verb • ROOTetšane, ROOTetšanwa, ROOTetšanwe, ROOTiša, ROOTišitše, ROOTišwa, ROOTišitšwe, ROOTišana, ROOTišane, ROOTišanwa, ROOTišanwe, ROOTišega, ROOTišegile, ROOTišetša, ROOTišeditše, ROOTišetšwa, ROOTišeditšwe, ROOTišetšana, ROOTišetšane, ROOTišetšanwa, ROOTišetšanwe, ROOTišiša, ROOTišišitše, ROOTišišwa, ROOTišišitšwe, ROOTišišana, ROOTišišane, ROOTišišanwa, ROOTišišanwe, ROOToga, ROOTogile, ROOTogwa, ROOTogilwe, ROOTogana, ROOTogane, ROOToganwa, ROOToganwe, ROOTogela, ROOTogetše, ROOTogelwa, ROOTogetšwe, ROOTola, ROOTotše, ROOTolwa, ROOTotšwe, ROOTolana, ROOTolane, ROOTolanwa, ROOTolanwe, ROOTolega, ROOTolegile, ROOTolela, ROOToletše, ROOTolelwa, ROOToletšwe, ROOTolelana, ROOTolelane, ROOTolelanwa, ROOTolelanwe, ROOTolla, ROOTolotše, ROOTollwa, ROOTolotšwe, ROOTollana, ROOTollane, ROOTollanwa, ROOTollanwe, ROOTollega, ROOTollegile, ROOTollela, ROOTolletše, ROOTollelwa, ROOTolletšwe, ROOTollelana, ROOTollelane, ROOTollelanwa, ROOTollelanwe, ROOTolliša, ROOTollišitše, ROOTollišwa, ROOTollišitšwe, ROOTollišana, ROOTollišane, ROOTollišanwa, ROOTollišanwe, ROOTologa, ROOTologile, ROOTologana, ROOTologane, ROOTologanwa, ROOTologanwe, ROOTološa, ROOTološitše, ROOTološwa, ROOTološitšwe, ROOTološana, ROOTološane, ROOTološanwa, ROOTološanwe, ROOTološetša, ROOTološeditše, ROOTološetšwa, ROOTološeditšwe, ROOTološetšana, ROOTološetšane, ROOTološetšanwa, ROOTološetšanwe, ROOToša, ROOTošitše, ROOTošwa, ROOTošitšwe, ROOTošetša, ROOTošeditše, ROOTošetšwa, ROOTošeditšwe, ROOTošetšana, ROOTošetšane, ROOTošetšanwa, ROOTošetšanwe

  9. Solution for unknown verbs and nouns • Verb guesser: detection of • longest match suffix combinations • occurrences in corpora • Noun guesser: matching of • singular/plural-forms • nominal suffixes • occurrences in corpora
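The longest-match idea behind the verb guesser can be sketched roughly as follows. This is a toy illustration, not the authors' tool: `SUFFIXES` is a small hand-picked subset of the combinations listed on slide 8, and `KNOWN_ROOTS` and `guess_verb` are hypothetical names standing in for the corpus lookup mentioned on the slide.

```python
# Toy sketch of a longest-match suffix guesser (not the authors' implementation).
# SUFFIXES is a small subset of the suffix combinations listed on slide 8.
SUFFIXES = ["iša", "išitše", "išwa", "išitšwe", "išeditšwe", "ologa", "ola"]

# Hypothetical set of verb roots attested in a corpus or lexicon.
KNOWN_ROOTS = {"rek"}


def guess_verb(word, suffixes=SUFFIXES, known_roots=KNOWN_ROOTS):
    """Return (root, suffix) for the longest matching suffix, or None."""
    # Try longer suffix combinations first so the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            # Accept the analysis only if the remaining root is attested.
            if root in known_roots:
                return root, suffix
    return None
```

With this toy data, `guess_verb("rekišeditšwe")` splits the example from slide 7 into the root `rek` and the suffix combination `išeditšwe`, while a noun such as `toropo` is rejected because no listed suffix matches.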

  10. Objectives • Tagging with a detailed tagset: class numbers • Nouns, adjectives, pronouns, concords, demonstratives • Disambiguation • Motivation: tagging used as preprocessing for: • Chunking, parsing • Lexicography (tag relatively large corpora, e.g. PSC) • Detailed linguistic research (e.g. grammar development) • Information extraction

  11. State of the art for tagging: Sotho languages • Comparison of tagsets and tools is hardly possible • Different applications of tagged material (linguistic description, lexicography, parsing, etc.) • Different number of tags • Differences in granularity

  12. Descriptive State of the Art: tagsets and tools

  13. Descriptive State of the Art for tagging: Sotho languages • Tools: • Full • De Schryver and de Pauw (2007): Northern Sotho tagger (statistical) • Partial • Kotzé (several publications, e.g. 2008): verbal and nominal segment (finite state)

  14. Descriptive state of the art for tagging: Sotho languages • Applications of tagsets: • De Schryver and de Pauw (2007): used for lexicography • Van Rooy and Pretorius (2003): linguistic description of Setswana • Taljard et al. (2008): morphosyntactic and general linguistic description

  15. The ambiguity problem • -a-, -go-: see handout for possible readings • Local context may not identify noun class of subject concord: • (Masogana) … A nwa bjalwa • gloss: CS06 drink beer • (Young men) … “They drink beer.”

  16. The ambiguity problem: possible solutions • Dependent on objectives • Flat tagset ignoring irrelevant details (cf. handout for -go-) • Layered tagset: granularity

  17. Tagset (cf. Handout) • Level 1 • Noun (N) • Subject concord (CS), Object concord (CO) • Pronouns (PRO) • Level 2 • emphatic (only for pronouns): EMP • possessive (only for pronouns): POSS • Level 3 • Classes -> N.01a, N.01, N.02, N.03, … , PERS, etc. • Examples: • noun of class 1 = N.01 • possessive pronoun of class 6 = PRO.POSS.06
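To make the layering concrete, here is a minimal sketch of how such dot-separated tags could be split into their levels. It assumes only what the slide shows (level 1 category, optional level 2 `EMP`/`POSS`, level 3 class); `LayeredTag`, `parse_tag` and `coarse_tag` are hypothetical helper names, not part of the authors' tooling.

```python
from typing import NamedTuple, Optional


class LayeredTag(NamedTuple):
    category: str              # level 1, e.g. "N", "CS", "CO", "PRO"
    subtype: Optional[str]     # level 2, e.g. "EMP" or "POSS" (pronouns only)
    noun_class: Optional[str]  # level 3, e.g. "01", "01a", "06", "PERS"


# Level 2 labels named on the slide; the real tagset may contain more.
LEVEL2 = {"EMP", "POSS"}


def parse_tag(tag: str) -> LayeredTag:
    """Split a tag such as 'PRO.POSS.06' into its three levels."""
    parts = tag.split(".")
    category, rest = parts[0], parts[1:]
    subtype = rest.pop(0) if rest and rest[0] in LEVEL2 else None
    noun_class = rest.pop(0) if rest else None
    if rest:
        raise ValueError(f"unexpected extra levels in tag: {tag!r}")
    return LayeredTag(category, subtype, noun_class)


def coarse_tag(tag: str) -> str:
    """Collapse a layered tag to its level 1 category (a 'flat' view)."""
    return parse_tag(tag).category
```

Collapsing with `coarse_tag` illustrates the trade-off from slide 16: a flat view drops the class information, whereas the layered form keeps it available for applications that need it.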

  18. RF tagger technology (cf. Schmid and Laws (2008)) • Hidden Markov Model (HMM) Tagger • Additional external lexicon • Large, fine-grained tagsets • Several levels of description: e.g. German articles: ART.Definiteness.Case.Number.Gender • Calculates joint (product) probabilities
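The "joint (product) probabilities" bullet can be illustrated with a small sketch: the probability of a fine-grained tag is treated as the product of the probabilities of its attributes, each conditioned on the attributes before it. The numbers and attribute names below are invented for illustration only, and `joint_probability` is a hypothetical helper, not an RF-Tagger API.

```python
from math import prod


def joint_probability(attribute_probs):
    """Multiply per-attribute conditional probabilities into one tag score."""
    return prod(attribute_probs.values())


# Invented values for the reading PRO.POSS.06 in some context.
example = {
    "category=PRO": 0.5,
    "subtype=POSS | PRO": 0.4,
    "class=06 | PRO, POSS": 0.25,
}
score = joint_probability(example)  # 0.5 * 0.4 * 0.25
```

Because each factor is estimated over a single attribute rather than over the whole tag, rare full tags can still receive a usable estimate as long as their individual attributes are reasonably frequent.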

  19. Training corpus • 45,000 manually annotated tokens (word forms) from two text types • Not balanced: 25,000 tokens from a novel, 2 × 10,000 tokens from dissertations

  20. Comparing taggers on manually annotated data • Tree-Tagger (Schmid 1994) • TnT Tagger (Brants 2000) • MBT Tagger (Daelemans et al. 2007) • RF-Tagger (Schmid and Laws 2008)

  21. Effects of size of training corpus • Adding more training data is not necessary

  22. Effects of highly polysemous function words • Distribution problem • Probability guesses for scarce labels become unreliable • a : • PART (45) vs. CS.01 (1,182) • 91% incorrect labeling of PART. • Detailed discussion: • Handout: -a- refer to pages 2, 4
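A small sketch of the two quantities behind this slide: how skewed the label distribution is, and how a per-label error rate is computed. The counts for `a` are taken from the slide; `per_label_error` is a hypothetical helper, and the slide does not say how the authors computed their 91% figure.

```python
from collections import Counter


def per_label_error(gold, predicted):
    """Fraction of tokens mislabelled, broken down by gold label."""
    totals, errors = Counter(), Counter()
    for g, p in zip(gold, predicted):
        totals[g] += 1
        if g != p:
            errors[g] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Counts for 'a' reported on the slide (only these two readings).
counts = {"PART": 45, "CS.01": 1182}
part_share = counts["PART"] / sum(counts.values())
```

Here `part_share` is 45 / 1,227, so PART makes up under 4% of these two readings of `a`. A tagger that almost always chooses CS.01 therefore scores well overall on `a` while mislabelling most PART tokens, which is the distribution problem named above.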

  23. Alternative proposal: hybrid taggers (Spoustová et al. 2007) • Combine rule-based tagging with statistical tagging • For Northern Sotho: • Contextual disambiguation works fine with RF-tagger if unambiguous indicators are available • Disambiguating macros (using the same indicators) hence have little effect • Ambiguous contexts hard to account for either way: need for parsing?

  24. Results: 10-fold cross validation • Without guessers (to simulate similar conditions for TnT and MBT) • RF-tagger: 91.00% • TnT tagger: 91.01% • MBT: 87.68% • With guessers (several thousand nouns and verbs as part of the lexicon) • Tree-tagger: 92.46% • RF-tagger: 94.16%
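For readers unfamiliar with the evaluation setup, 10-fold cross validation can be sketched as below. The slides do not say how the folds were drawn, so the interleaved split here is an assumption, and `k_fold_splits` and `accuracy` are hypothetical helper names.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    for i in range(k):
        # Assumption: interleaved folds (every k-th item goes to fold i).
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

In such a setup a tagger is trained on each `train` portion and scored with `accuracy` on the matching `test` portion, and the reported figure is typically the mean over the ten folds.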

  25. Conclusions • Different intended uses lead to different tagsets (granularity, number of tags) • Including noun class information is essential for general linguistic research, e.g. grammar development, applications of chunking/parsing • RF-Tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens): over 94% correct • Ambiguous contexts combined with the sparse data problem lead to a high error rate for statistical parsing, which is not likely to be solvable with macros • Chunking / parsing might lead to a more adequate solution for this problem

  26. Future work • Apply RF-tagger to the PSC corpus • Evaluate results • Instead of preprocessing rules, a partial postprocessing may make sense (e.g. chunking, parsing)
