Using Finite State Technology in a Tool for Linguistic Exploration

Kemal Oflazer, Mehmet Erbaş, Müge Erdoğmuş • Sabancı University, Istanbul, Turkey

Background and Motivation • LingBrowser is an active and interactive tool for students of (introductory) linguistics to explore linguistic information on real text as opposed to canned examples. • A showcase for the natural language processing resources and technology (for Turkish in this case) • A testbed for the use of NLP in (native and foreign) language learning.

Background and Motivation • Joint work with UC Berkeley, recently funded by US and Turkish NSF as a 3-year joint project, as a follow-up project to: • TELL – Turkish Electronic Living Lexicon (US NSF) • A Unified Electronic Lexicon Of Turkish (US and Turkish NSF)

Turkish • Agglutinative morphology with many morphophonological processes • e.g., vowel harmony • pronunciation (phoneme selection/stress position) is a function of morphological structure and function, and lexical semantics • lots of derivational processes • semi/non-lexicalized collocations • free constituent order

LingBrowser Functionality (Current Prototype) • Access to linguistic information in arbitrary Turkish Web content and text • Lexical information • Phonological: phonemes, syllables, stress position • Morphological: lexical and surface morpheme structure, morphological features encoded • Semantic: dictionary access, WordNet access, root word translation

LingBrowser Functionality (On-going Work and Future) • Access to linguistic information in arbitrary Turkish Web content and text • Multi-word constructs • Named-entity identification • Surface syntax • NP extraction and structure display • Surface syntactic relations • Lexical Translation/Paraphrasing • Phrasal translation

LingBrowser Prototype

LingBrowser Prototype • Morphological Analysis

LingBrowser Prototype • Surface Morpheme Structure

LingBrowser Prototype • Lexical Morpheme Structure

LingBrowser Prototype • Aligned Lexical Surface Structure

LingBrowser Prototype • Pronunciation Representation (SAMPA) • Interleaved • Parallel

LingBrowser Prototype • WordNet Lookups (via aligned Turkish and English Wordnets) • English translations/glosses of the root word • Turkish Synonyms

LingBrowser Prototype • Word Concordances • Morphological Concordance • All forms with the selected root / POS combination are listed in context • one can see possible objects of a verb regardless of the inflected/derived form it appears in • Much more meaningful for languages like Turkish, Finnish, etc.
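The morphological concordance idea can be illustrated with a minimal Python sketch. It assumes every token has already been analyzed into a (surface form, root, POS) triple; the token list, the hand-written analyses and the function names `build_index` and `concordance` are hypothetical and are not part of LingBrowser.

```python
from collections import defaultdict

# Hypothetical, hand-analyzed tokens: (surface form, root, POS).
# A real system would obtain root and POS from the morphological analyzer.
TOKENS = [
    ("evinde", "ev", "Noun"),
    ("kitap", "kitap", "Noun"),
    ("okudu", "oku", "Verb"),
    ("evimizdekiler", "ev", "Noun"),
    ("kitabı", "kitap", "Noun"),
    ("okuyor", "oku", "Verb"),
]


def build_index(tokens):
    """Map each (root, POS) combination to the token positions where it occurs."""
    index = defaultdict(list)
    for position, (_, root, pos_tag) in enumerate(tokens):
        index[(root, pos_tag)].append(position)
    return index


def concordance(tokens, index, root, pos_tag, window=1):
    """List every occurrence of root/POS with `window` tokens of context on each side."""
    lines = []
    for position in index.get((root, pos_tag), []):
        lo = max(0, position - window)
        hi = min(len(tokens), position + window + 1)
        lines.append(" ".join(surface for surface, _, _ in tokens[lo:hi]))
    return lines


INDEX = build_index(TOKENS)
for line in concordance(TOKENS, INDEX, "oku", "Verb"):
    print(line)
```

Because the lookup key is the root and POS rather than the surface string, differently inflected forms such as okudu and okuyor land in the same concordance.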

(Prototype) Implementation • LingBrowser (indirectly) employs almost all the finite state language resources we have built over the last 10 years • All built using Xerox xfst, lexc and twolc • Indirectly via a database interface

Finite State Transducers Employed • Total of 750 xfst regular expressions + 100K root words (mostly proper names) over about 50 files • [Diagram: transducer stack, in slide order: Stress Computation Transducer; Syllabification Transducer; Exceptional Phonology Transducer; SAMPA Mapping Transducer; Filters; Modified Inverse Two-Level Rule Transducer; Filters; Filters (Duplicating); Lexicon and Morphotactic Constraints Transducer; Two-Level Rule Transducer; Surface form]

Finite State Transducers Employed • Two-Level Morphological Analyzer: 1M states, 1.6M transitions • Surface form evinde → feature forms ev+Noun+A3sg+P3sg+Loc and ev+Noun+A3sg+P2sg+Loc • [Diagram: the same transducer stack]

Finite State Transducers Employed • Lexical Morphemes Transducer: ~400K states, 1M transitions • Surface form evinde → lexical morpheme sequences ev+sH+ndA and ev+Hn+DA • [Diagram: the same transducer stack]

Finite State Transducers Employed • Surface Morphemes Transducer: ~560K states, 1.4M transitions • Surface form evinde → surface morpheme sequences ev+i+nde and ev+in+de • [Diagram: the same transducer stack]
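To make the relation between lexical and surface morpheme sequences concrete, here is a toy Python sketch that realizes the lexical symbols H, A and D and the buffer s of +sH. The rules are deliberately simplified assumptions for illustration (a few harmony and deletion cases only); they are not the two-level rules of the actual analyzer, and the function name `realize` is hypothetical.

```python
VOWELS = set("aeıioöuü")
BACK_VOWELS = set("aıou")
VOICELESS = set("çfhkpsşt")
# High vowel chosen for lexical H, keyed by the preceding vowel (toy vowel harmony).
HIGH = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
        "o": "u", "u": "u", "ö": "ü", "ü": "ü"}


def realize(lexical):
    """Turn a lexical morpheme sequence such as 'ev+Hn+DA' into surface morphemes."""
    out = []  # surface symbols produced so far, '+' boundaries included
    for i, sym in enumerate(lexical):
        letters = [c for c in out if c != "+"]
        last = letters[-1] if letters else ""
        last_vowel = next((c for c in reversed(letters) if c in VOWELS), "e")
        if sym == "H":
            # H is a harmonizing high vowel; in this toy it drops after a vowel.
            if last not in VOWELS:
                out.append(HIGH[last_vowel])
        elif sym == "A":
            # A is a low vowel agreeing in backness with the preceding vowel.
            out.append("a" if last_vowel in BACK_VOWELS else "e")
        elif sym == "D":
            # D surfaces as t after a voiceless consonant, otherwise as d.
            out.append("t" if last in VOICELESS else "d")
        elif sym == "s" and i > 0 and lexical[i - 1] == "+" and lexical[i + 1:i + 2] == "H":
            # The buffer s of +sH surfaces only after a vowel.
            if last in VOWELS:
                out.append("s")
        else:
            out.append(sym)
    return "".join(out)


print(realize("ev+Hn+DA"))   # ev+in+de
print(realize("ev+sH+ndA"))  # ev+i+nde
```

Both lexical sequences from the slides collapse to the same surface string evinde once the boundaries are removed, which is why the analyzer returns two readings for it.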

Finite State Transducers Employed • Pronunciation Lexicon Transducer: ~6.5M states, 8.5M transitions • Surface form evinde → pronunciation e - v i n - "d e (syllables separated by -, stressed syllable marked with ") • [Diagram: the same transducer stack]
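The shape of the pronunciation output can be reproduced for simple words with a short Python sketch. It assumes a basic syllabification rule (one vowel per syllable, with a single preceding consonant taken as the onset) and default final stress. It performs no SAMPA symbol mapping and ignores exceptional stress, both of which the real transducers handle; the function names are hypothetical.

```python
VOWELS = set("aeıioöuü")


def syllabify(word):
    """Split a word into syllables: one vowel each, at most one onset consonant."""
    starts = [0]
    prev_vowel = None
    for i, ch in enumerate(word):
        if ch not in VOWELS:
            continue
        if prev_vowel is not None:
            # The consonant directly before this vowel (if any) opens the syllable.
            starts.append(i - 1 if i - 1 > prev_vowel else i)
        prev_vowel = i
    bounds = starts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


def pronounce(word):
    """Render syllables in the slide's notation, stressing the final syllable."""
    parts = [" ".join(syllable) for syllable in syllabify(word)]
    parts[-1] = '"' + parts[-1]  # default final stress: a simplifying assumption
    return " - ".join(parts)


print(syllabify("evinde"))
print(pronounce("evinde"))
```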

Finite State Transducers Employed • Aligned pairs transducer • Input is the surface form • Output is a representation of the aligned lexical-surface feasible pairs; e.g., for evinde we want to produce • ev+Hn+DA aligned with ev0in0de • ev+sH+nDA aligned with ev00i0nde

Aligned-pairs Transducer • We use a modified version of the two-level rule transducer • Feasible pair a:b is replaced with "a-b":b • A rule like a:b => LC _ RC is rewritten as "a-b":b => LC' _ RC', where the contexts are in terms of the new feasible pairs • Let’s call this the AlignedTwoLevelTransducer

Aligned-pairs Transducer • A MapToPairs transducer maps each lexical symbol in the original grammar to the representations of the feasible pairs in the original grammar in which it is the lexical side • e.g., if we have A:a, A:e and A:0 as three feasible pairs with A on the lexical side, • then MapToPairs maps A to "A-a", "A-e" and "A-0"
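As a rough illustration of the MapToPairs mapping, the following Python sketch builds the table from a list of feasible pairs. It uses a plain dictionary in place of a finite state transducer, and the function name `map_to_pairs` is hypothetical.

```python
# Feasible pairs from the slide's example, written as (lexical, surface); "0" is the empty symbol.
FEASIBLE_PAIRS = [("A", "a"), ("A", "e"), ("A", "0")]


def map_to_pairs(feasible_pairs):
    """Map each lexical symbol to the pair symbols in which it is the lexical side."""
    table = {}
    for lexical, surface in feasible_pairs:
        table.setdefault(lexical, []).append(f"{lexical}-{surface}")
    return table


print(map_to_pairs(FEASIBLE_PAIRS))
```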

Aligned-pairs Transducer • [Diagram: the Lexicon Transducer, relating Feature Symbols to Lexical Symbols]

Aligned-pairs Transducer • The new transducer accepts all lexical symbol sequences allowed by the morphotactic constraints. • [Diagram: Extract Lower Side applied to the Lexicon Transducer (Feature Symbols / Lexical Symbols) gives Lexicon Transducer.l (Lexical Symbols / Lexical Symbols)]

Aligned-pairs Transducer • This transducer maps lexical symbol sequences to the possible feasible-pair sequences. • [Diagram: MapToPairs combined with Lexicon Transducer.l, relating Lexical Symbols to Feasible-pair symbols]

Aligned-pairs Transducer • This transducer accepts all potentially valid feasible-pair sequences. • [Diagram: Extract Upper Side applied to the combination of MapToPairs and Lexicon Transducer.l gives the Feasible-pair sequence Recognizer (Feasible-pair symbols / Feasible-pair symbols)]

Aligned-pairs Transducer • This transducer maps surface forms to feasible-pair sequences subject to morphographemic and morphotactic constraints. • [Diagram: the Feasible-pair sequence Recognizer combined with the AlignedTwoLevelTransducer gives the Feasible-pair sequence Transducer, relating Feasible-pair symbols to Surface Symbols]

Aligned-pairs Transducer • Input evinde is passed through the AlignedTwoLevelTransducer and the Feasible-pair sequence Transducer • Output alignments (lexical over surface):
    ev+Hn+DA    ev+sH+nDA
    ||||||||    |||||||||
    ev0in0de    ev00i0nde
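The alignment for evinde can be reproduced with a brute-force Python sketch. It assumes a tiny hand-written set of feasible pairs and a two-entry list of lexical forms (spelled as in the alignment above), and it simply enumerates surface choices per lexical symbol. It does not apply two-level rule contexts and is not how the composed transducers compute the result; all names in it are hypothetical.

```python
from itertools import product

# Toy feasible pairs: lexical symbol -> possible surface symbols ("0" is the empty symbol).
FEASIBLE_PAIRS = {
    "e": ["e"], "v": ["v"], "n": ["n"],
    "+": ["0"],
    "s": ["s", "0"],
    "H": ["i", "0"],
    "D": ["d", "t"],
    "A": ["a", "e"],
}

# Lexical symbol sequences standing in for what the lexicon transducer would allow.
LEXICAL_FORMS = ["ev+Hn+DA", "ev+sH+nDA"]


def align(surface, lexical_forms=LEXICAL_FORMS, pairs=FEASIBLE_PAIRS):
    """Return (lexical, aligned surface) pairs whose surface side spells `surface`."""
    results = []
    for lexical in lexical_forms:
        # Try every combination of surface realizations, one per lexical symbol.
        for choice in product(*(pairs[sym] for sym in lexical)):
            aligned = "".join(choice)
            if aligned.replace("0", "") == surface:
                results.append((lexical, aligned))
    return results


for lexical, aligned in align("evinde"):
    print(lexical)
    print("|" * len(lexical))
    print(aligned)
```

Keeping the 0 symbols in the output is what makes the two strings the same length, so each lexical symbol can be read directly above the surface symbol it pairs with.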

Implementation • Other resources used • Turkish WordNet aligned with the English WordNet • Current prototype was implemented in 4 months as a senior project, on the MS .NET platform • Now being ported to Java

Implementation • Text is annotated in the background with multiple threads • All text items are reverse indexed on relevant features (morphemes, features, syllables, phonemes, etc.) for fast search, e.g., • Find all bi-syllabic words with an open syllable ending in “a” • Find all words with bi-syllabic roots with a long root-final vowel • Find all finite verbs in future tense with 3rd person plural agreement • Find all words using the lexical morpheme “+sHz” • Find all words in which lexical “+sH” is aligned with surface “00u” • Find all words with syllables with multiconsonant codas
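A reverse index of this kind might look like the following Python sketch, in which each feature points to the set of words carrying it and a query is an intersection of those sets. The annotated word list, the feature names and the functions `build_reverse_index` and `find` are hypothetical stand-ins for the tool's database-backed index.

```python
from collections import defaultdict

# Hypothetical annotations; a real system would fill these in from the transducers.
ANNOTATED = {
    "evinde": {"syllables": ["e", "vin", "de"], "morphemes": ["ev", "+sH", "+ndA"]},
    "evsiz": {"syllables": ["ev", "siz"], "morphemes": ["ev", "+sHz"]},
    "araba": {"syllables": ["a", "ra", "ba"], "morphemes": ["araba"]},
    "masa": {"syllables": ["ma", "sa"], "morphemes": ["masa"]},
}


def build_reverse_index(annotated):
    """Map each (feature name, value) pair to the set of words that carry it."""
    index = defaultdict(set)
    for word, info in annotated.items():
        index[("syllable_count", len(info["syllables"]))].add(word)
        for syllable in info["syllables"]:
            index[("syllable", syllable)].add(word)
        for morpheme in info["morphemes"]:
            index[("lexical_morpheme", morpheme)].add(word)
    return index


def find(index, *features):
    """Return the words that carry every one of the requested features."""
    sets = [index.get(feature, set()) for feature in features]
    return set.intersection(*sets) if sets else set()


INDEX = build_reverse_index(ANNOTATED)
print(find(INDEX, ("syllable_count", 2), ("lexical_morpheme", "+sHz")))
print(sorted(find(INDEX, ("lexical_morpheme", "ev"))))
```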

Future Functionality • Lexical paraphrasing • evimizdekiler → (those things) in our house • Gets nasty when multiple derivations are present • Finlandiyalılaştıramadıklarımızdanmışsınızcasına → (behaving) as one of those who we could not convert into a Finn(ish citizen) • Tree transducers • Extensive explanatory feedback • Morphographemics (why is lexical s deleted?) • show triggering contexts in addition to the rule • Pronunciation (why is this syllable stressed?) • show exceptional stress morphemes and explain their interaction

Future Functionality • Drills • Generate surface form from lexical form • Segment into surface morphemes • Identify morphosyntactic features encoded by morphemes • Generate surface form from a set of features

Future Functionality • Surface syntactic relations • Sample text: Eski Mısır kültüründe, çocuğa akıllı küçük denilmekteydi. Küçük yetişkin deyimi geleneksel toplumların çocuğu yetişkin yaşamına teşvik eden işleriyle kabul gördü. Ortaçağ'da ise, Avrupa'da çocuklara küçük hayvanlar denildi. Sanayileşme bu kültürel ayırımı hayata geçirerek çocuğu yetişkin yaşamından kopardı. Çocukluğu yetişkinlikten ayrı bir döneme indirgemek, çocukların geleceğe uyumlarını güçleştirecektir. Kaldı ki, bilgi toplumunda öylesi bir soyutlamanın, yani çocukluğun yetişkinlikten ayrı tutulmasının, imkansız denecek hale geldiği ise, açık bir gerçektir... • Example relation: sanayi+Noun+..^DB+Verb+Become..^DB+Noun+Inf+..+Nom (Sanayileşme) is the Subject of kop+Verb^DB+Verb+Caus+Past+A3sg (kopardı)

Planned Deployment • We expect to have a version to be tested in Sharon Inkelas’ Linguistics course at Berkeley, by Fall 2006.

Summary • LingBrowser is an active and interactive tool for linguistic exploration on real (Turkish) text • Query • Search • See explanations • Extensive use of finite state language resources • Being extended to include additional functionality.
