
Standards, Use and Prospects for Language Resource Management

Standards, Use and Prospects for Language Resource Management. Key-Sun Choi 16 Aug. 2008 TII, Moscow. MOTIVATION. Wikipedia. Web-based collaborative authoring multi-lingual encyclopedia 8.29 M pages/ 253 languages (2007/9) 2.0 M pages/ English (2007/9) ~ now 5.0 M pages. Computer science.


Presentation Transcript


  1. Standards, Use and Prospects for Language Resource Management Key-Sun Choi 16 Aug. 2008 TII, Moscow

  2. MOTIVATION

  3. Wikipedia • Web-based, collaboratively authored multilingual encyclopedia • 8.29 M pages / 253 languages (2007/9) • 2.0 M pages / English (2007/9) ~ now 5.0 M pages • [Figure: category classification. The category "Computer science" has the subcategories Databases, Computer scientists and Algorithms; category pages include Divide & Conquer, SQL, Parallel database, Martin Kay and Robert Watson]

  4. Problem: IS-A Relation Extraction from Wikipedia • Relation classification from the category system • By term formation rule and Wikipedia structure (Ponzetto & Strube, 2007) • Relation classification: each upper-lower level category relation is either an IS-A relation or not an IS-A relation • [Figure: the links from Computer science to its subcategories Databases, Computer scientists and Algorithms, each labeled IS-A or Not IS-A]

  5. Relation Extraction by Pattern • (Ryu & Choi, 2007) • http://cseight.kaist.ac.kr:8080/RelExt • [Figure: IS-A link between "Computer display mode" and "Text mode"]

  6. Problem: IS-A Relation Extraction from Wiktionary • Web-based collaborative multilingual dictionary • 617,639 entries / 401 languages • IS-A relation extraction from definition patterns • http://cseight.kaist.ac.kr:8080/Wiktionary

  7. Problem: IS-A Relation Extraction from WordNet • Semantic word net (English) • 117,798 nouns, 82,115 synsets (Ver. 3.0) • IS-A relation extraction through IS-A links between synsets • [Figure: Synset #12 {engineering, applied science} connected by IS-A links to Synset #22 {electrical engineering}, Synset #23 {computer science, computing} and Synset #33 {chemical engineering}]
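
The synset-level extraction on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synset numbers and members are copied from the slide's figure, the direction of the links (#22, #23 and #33 are each a kind of #12) is inferred from the figure, and the names `SYNSETS`, `HYPERNYM` and `isa_term_pairs` are hypothetical. It does not use the real WordNet database.

```python
# Toy synset table; IDs and members mirror the slide, not real WordNet offsets.
SYNSETS = {
    12: {"engineering", "applied science"},
    22: {"electrical engineering"},
    23: {"computer science", "computing"},
    33: {"chemical engineering"},
}

# Assumed IS-A links between synsets: child synset -> parent synset.
HYPERNYM = {22: 12, 23: 12, 33: 12}


def isa_term_pairs(synsets, hypernym):
    """Expand each synset-level IS-A link into term-level (lower, upper) pairs."""
    pairs = set()
    for child, parent in hypernym.items():
        for lower in synsets[child]:
            for upper in synsets[parent]:
                pairs.add((lower, upper))
    return pairs


pairs = isa_term_pairs(SYNSETS, HYPERNYM)
for lower, upper in sorted(pairs):
    print(f"{lower} IS-A {upper}")
```

Every member of a lower synset is paired with every member of the upper synset, so one synset link yields several term-level IS-A relations.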

  8. Lexical Markup Framework LMF

  9. Wikipedia:IS-A Annotation IS-A (Entry, Term in Page) IS-A (Term in Page, Term in Page) Synonymy(Entry, Term in Page)

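  10. What is the common representation? • Graph structure • [Figure: annotation graph over the string "New York-based" with nodes labeled cat: ADJP, cat: NP, cat: JJ, cat: NNP, cat: PUNC (type: hyphen) and cat: VBG, alternative analyses marked role: alt; formats A, B, C and D, E, F connected through a PIVOT]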

  11. Linguistic Annotation Framework • ISO-GrAF: Graph Structure-based Annotation • GrAF XML schema type hierarchy • graphElementType; Attributes: ID, type • edgeType extends graphElementType • nodeType extends graphElementType • spanType extends nodeType; Attributes: start, end • graphElementSetType • edgeSetType extends graphElementSetType • nodeSetType extends graphElementSetType • featureStructureType • featureType • annotationSetType
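
To make the type hierarchy more concrete, here is a minimal Python sketch that mirrors the `extends` relations listed above with dataclasses. It is not the GrAF XML schema itself: the class names are shortened, the `source` and `target` fields on `Edge` are an assumption (the slide lists only ID and type as attributes), and the category labels and character offsets for "New York-based" are illustrative.

```python
from dataclasses import dataclass


@dataclass
class GraphElement:          # graphElementType: attributes ID, type
    id: str
    type: str = ""


@dataclass
class Node(GraphElement):    # nodeType extends graphElementType
    pass


@dataclass
class Span(Node):            # spanType extends nodeType: attributes start, end
    start: int = 0
    end: int = 0


@dataclass
class Edge(GraphElement):    # edgeType extends graphElementType
    source: str = ""         # assumption: endpoints are not listed on the slide
    target: str = ""


TEXT = "New York-based"

# Stand-off spans over the primary text (offsets are character positions).
spans = [
    Span("s1", "NNP", 0, 8),
    Span("s2", "PUNC", 8, 9),
    Span("s3", "VBG", 9, 14),
]

# One non-terminal node dominating the three spans.
adjp = Node("n1", "ADJP")
edges = [Edge(f"e{i}", "constituent", adjp.id, s.id)
         for i, s in enumerate(spans, 1)]

covered = [TEXT[s.start:s.end] for s in spans]
print(covered)
```

Because spans only point into the text by offset, several annotation layers can share the same primary data without modifying it.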

  12. Problem: Causality between Terms • Causal relations between terms • Term clustering based on inter-term causality • Terms with similar causality tend to denote similar concepts • Realization & evaluation • Example: [Skin cancer] usually appears in adulthood, but it is caused by [sun exposure] and [sunburns] that began in childhood. • [Figure terms: TG, Stat5, Interleukin-2, IL-2, Egr-1, IFN-gamma]

  13. Is it true? Terms with similar causality tend to denote similar concepts. • The oral bacteria that cause gum disease appear to be the culprit. • Cigarette smoking and use of smokeless tobacco products may also cause gum disease. • Gum disease is the second most common cause of toothache. • Periodontal disease can lead to toothache. • Cigarette smoking is the number one environmental risk for periodontal disease. • [Figure: causal graph over oral bacteria, cigarette, smokeless tobacco product, gum disease, periodontal disease and toothache]

  14. What to do • Is it true? • Terms with similar causality tend to denote similar concepts • We test term clustering based on causal information • Show that causality is an effective feature for term clustering • Focus on • Causal NP pair extraction (Chang and Choi, 2004) • Causal term pair extraction • Term clustering based on causal similarity • Term clustering evaluation

  15. Features on term clustering (1/3) • Useful features for term clustering • Internal feature • Word lexicon/structure in terms • (Bourigault and Jacquemin, 1999): POS sequences including insertion • NPDNInsAj = Noun1 ((Adv? Adj)^{0-3} Prep Det? (Adv? Adj)^{0-3}) Noun3 • 93~98% precision • Outer-term feature • Structural modifier/modifiee of term • Some words near the term • (Maynard et al., 2000) • Hand-made semantic frame information
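
The NPDNInsAj pattern above reads naturally as a regular expression over a part-of-speech sequence. The following Python sketch shows one way to check a tag sequence against it. The tag names (`Noun`, `Adj`, `Adv`, `Prep`, `Det`) and the function name `is_term_variant` are assumptions, and "0-3" is read as "repeated zero to three times"; the slide does not specify a tag set or an implementation.

```python
import re

# Noun1 ((Adv? Adj){0,3} Prep Det? (Adv? Adj){0,3}) Noun3, over space-separated tags.
NPDNINSAJ = re.compile(
    r"Noun "
    r"(?:(?:Adv )?Adj ){0,3}"   # up to three (optionally adverb-modified) adjectives
    r"Prep (?:Det )?"           # preposition with an optional determiner
    r"(?:(?:Adv )?Adj ){0,3}"   # up to three more adjectives
    r"Noun"
)


def is_term_variant(tags):
    """Return True if the whole tag sequence matches the NPDNInsAj pattern."""
    return NPDNINSAJ.fullmatch(" ".join(tags)) is not None


print(is_term_variant(["Noun", "Prep", "Noun"]))
print(is_term_variant(["Noun", "Adj", "Prep", "Det", "Adv", "Adj", "Noun"]))
print(is_term_variant(["Noun", "Noun"]))
```

Matching on tags rather than words keeps the pattern independent of the vocabulary, which is what makes it usable as an internal feature across domains.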

  16. Feature Structure Representation (1) Employee • {<SEX, female>, <NAME, Sandy Jones>, <AGE, 30>} (2) Sound segment /p/ • {<CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->, <CONTINUANT, ->} (3) Grammatical features of the verb ‘love’ • {<POS, verb>, <VALENCE, transitive>, <SEMANTIC_RELATION, loving>}
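
A flat feature structure of this kind is a set of attribute-value pairs, which maps directly onto a dictionary. The Python sketch below is only an illustration: the `render` helper is hypothetical and simply prints a dictionary back in the slide's set-of-pairs notation.

```python
def render(fs):
    """Render a flat feature structure in the <ATTRIBUTE, value> set notation."""
    return "{" + ", ".join(f"<{attr}, {val}>" for attr, val in fs.items()) + "}"


employee = {"SEX": "female", "NAME": "Sandy Jones", "AGE": 30}
segment_p = {"CONSONANTAL": "+", "ANTERIOR": "+", "VOICED": "-", "CONTINUANT": "-"}
verb_love = {"POS": "verb", "VALENCE": "transitive", "SEMANTIC_RELATION": "loving"}

for fs in (employee, segment_p, verb_love):
    print(render(fs))
```

Each attribute occurs at most once per structure, which is why a dictionary (rather than a list of pairs) is a reasonable fit.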

  17. FSR: Graph vs. Matrix Notation

  18. Related Works on term clustering (3/3) • Discussion • Causal information is a kind of “long-distance contextual information” • Example: Cigarette smoking and use of smokeless tobacco products may also cause gum disease. • [Figure: dependency path linking "Smokeless tobacco product" to "Gum disease" through "use" and "cause"]

  19. Event & ternary extraction • Input: Skin cancer usually appears in adulthood, but it is caused by sun exposure and sunburns that began in childhood. • Steps: dependency structure, NP chunking, verb selection, reference finding, cue phrase filtering • Causal event pair candidate: <cause event, cue phrase, effect event> • Skin cancer – RNP caused by CNP – sun exposure • Skin cancer – RNP caused by CNP – sunburns • [Figure: dependency trees of the sentence before and after NP chunking]
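
The last step of this pipeline, turning a cue phrase and its neighbouring noun phrases into <cause event, cue phrase, effect event> triples, can be sketched as follows. This is a deliberately simplified stand-in, not the method on the slide: it skips dependency parsing, assumes noun phrases are already bracketed as on slide 12, assumes the pronoun "it" has already been resolved to its antecedent by hand, and handles only the single cue phrase "caused by". All function names are hypothetical.

```python
import re

CUE = "caused by"
NP = re.compile(r"\s*\[\s*(.+?)\s*\]")      # one bracketed noun phrase
CONJ = re.compile(r"\s*(?:,|and\b)")        # coordination between noun phrases


def coordinated_nps(text):
    """Collect the bracketed NPs at the start of text, joined by ',' or 'and'."""
    nps, pos = [], 0
    while True:
        np_match = NP.match(text, pos)
        if not np_match:
            break
        nps.append(np_match.group(1))
        conj_match = CONJ.match(text, np_match.end())
        if not conj_match:
            break
        pos = conj_match.end()
    return nps


def causal_triples(bracketed_sentence, effect):
    """Return (cause, cue phrase, effect) triples for '<effect> ... caused by NP and NP'."""
    _, found, tail = bracketed_sentence.partition(CUE)
    if not found:
        return []
    return [(cause, CUE, effect) for cause in coordinated_nps(tail)]


sentence = ("[ Skin cancer ] usually appears in adulthood, but it is caused by "
            "[ sun exposure ] and [ sunburns ] that began in childhood .")
triples = causal_triples(sentence, effect="Skin cancer")
print(triples)
```

Stopping at the first token that is neither a noun phrase nor a conjunction keeps later noun phrases in the sentence from being picked up as causes.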

  20. Representation Scheme • Morpho-syntactic Annotation Framework • Syntactic Annotation Framework

  21. Morpho-Syntactic Annotation Framework: MAF • <token id="t1">to</token> • <token id="t2">eventually</token> • <token id="t3">decide</token> • <wordForm lemma="to_decide" tokens="t1 t3"/> • <wordForm lemma="eventually" tokens="t2"/>
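
The example shows why MAF separates tokens from word forms: the word form "to_decide" points at the non-adjacent tokens t1 and t3. A short Python sketch of resolving those references is given below. The wrapping `<maf>` root element is an assumption added only so the fragment parses as one document; the slide shows the elements without a container.

```python
import xml.etree.ElementTree as ET

# The <maf> wrapper is hypothetical; tokens and word forms are from the slide.
MAF = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""

root = ET.fromstring(MAF)

# Index tokens by id, then resolve each word form's token references.
tokens = {tok.get("id"): tok.text for tok in root.iter("token")}
word_forms = {
    wf.get("lemma"): [tokens[ref] for ref in wf.get("tokens").split()]
    for wf in root.iter("wordForm")
}

print(word_forms)
```

A split infinitive such as "to eventually decide" is thus annotated as one lemma without reordering or duplicating the surface tokens.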

  22. MAF: token <token id="t1">The</token> <token id="t2">victim</token> <token id="t3">’s</token> <token id="t4">friends</token> <token id="t5">told</token> <token id="t6">police</token> <token id="t7">that</token> <token id="t8">Krueger</token> <token id="t9">drove</token> <token id="t10">into</token> <token id="t11">the</token> <token id="t12">quarry</token> <token id="t13">and</token> <token id="t14">never</token> <token id="t15">surfaced</token> <token id="t16">.</token>

  23. Syntactic Annotation Framework

  24. Semantic Annotation Framework: TimeML • no more than 60 days • <TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS"> no more than 60 days </TIMEX3> • the dawn of 2000 • <TIMEX3 tid="t2" type="DATE" value="2000" mod="START"> the dawn of 2000 </TIMEX3>
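
Reading a TIMEX3 annotation back into a usable value is straightforward because the `value` attribute follows ISO 8601. The Python sketch below parses the first example and converts its duration to a number of days. The helper `duration_days` is hypothetical and handles only day-based durations of the form `PnD`; other ISO 8601 durations are out of scope here.

```python
import re
import xml.etree.ElementTree as ET

TIMEX = ('<TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS">'
         'no more than 60 days</TIMEX3>')


def duration_days(value):
    """Return the day count of a 'PnD' duration, or None for any other form."""
    match = re.fullmatch(r"P(\d+)D", value)
    return int(match.group(1)) if match else None


timex = ET.fromstring(TIMEX)
days = duration_days(timex.get("value"))

# The 'mod' attribute carries the "no more than" part of the expression.
print(timex.get("type"), days, timex.get("mod"))
```

The normalized value (60 days) and the modifier (EQUAL_OR_LESS) are stored separately, so the surface phrase "no more than" does not have to be re-parsed downstream.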

  25. ONTOLOGY EXTRACTION/LEARNING AND QUESTION-ANSWERING

  26. Word Segmentation

  27. MULTILINGUAL INFORMATION FRAMEWORK

  28. IT Ontology / IT Core Ontology • [Figure: ontology graph with the relations consists_of and reside_on over the nodes vendor, platform, system software, Dev. Env., OS, Middleware, App. Program, Embedded system, Embedded software, browser, Embedded OS, Media player, Comm. middleware, appliance, Digital Camera, Non-real-time embed. OS, Real-time Embed. OS, RTOS, MP3 player, DVD player, Set-top box, VRTX, VxWorks, pSOS, WinCE, Microsoft, Wind River]

  29. A Scenario • Components: Rule Reasoner, Control Server, Ontology Reasoner, User • What is the best RTOS vendor? Do you know? No • What is RTOS? Real-time Operating System • What are instances? VxWorks • Vendor? Wind River ... Microsoft • Which is better?

  30. Dialogue acts Well-known examples of communicative functions (“core dialogue acts”): • question • WH-question • YN-question • check/verification • statement/inform • answer (WH-answer, YN-answer) • confirmation, disconfirmation • request • instruct • promise • acknowledgement • greeting

  31. General-purpose functions Applicable in any dimension are: • Information-seeking functions WH-question, YN-question, Alternatives-question, Check,.. • Information-providing functions Inform, WH-Answer, YN-Answer, Confirmation, Disconfirmation, Agreement, Correction,.. • Commissive functions Offer, Promise, AcceptRequest,.. • Directive functions Instruct, Request, Suggest,..

  32. DiaML concrete syntax <diaML id='d2' speaker='s' addressee='a' markable='m1' commfunctions='cfs1'> <sourceText id='m1' ='sb1'..'se1' blabla 'sb3'..'se3' blabla> <cfs id='cfs1' taskFun='f1' feedbackFun='f2'> <comfun id='f1' function='answer' respTo='d1'> <comfun id='f2' function='positive' respTo='d1'> </cfs> </diaML>

  33. From sentence to ontologies artifact contents device ontology ··· camera video (camera, ISA, device) (camera, hasPropertyOf, that AND (take video)) Triplets extraction Dependency analysis camera is device takes that video Term recognition A [camera] is a [device] that [take]s [video]. Sentence A camera is a device that takes video.
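
The path from the sentence to the two triples can be illustrated with a small Python sketch. It replaces the term recognition and dependency analysis steps with a single regular expression for sentences of the form "A X is a Y that Zs W.", so it is a toy stand-in rather than the pipeline on the slide; the function name `sentence_to_triples` and the tuple encoding of the property are assumptions.

```python
import re

# Matches only definitions of the form "A <term> is a <class> that <verb>s <object>."
DEFINITION = re.compile(r"An? (\w+) is an? (\w+) that (\w+?)s (\w+)\.")


def sentence_to_triples(sentence):
    """Extract an ISA triple and a property triple from a simple definition."""
    match = DEFINITION.fullmatch(sentence)
    if not match:
        return []
    term, parent, verb, obj = match.groups()
    return [
        (term, "ISA", parent),
        (term, "hasPropertyOf", (verb, obj)),
    ]


triples = sentence_to_triples("A camera is a device that takes video.")
for triple in triples:
    print(triple)
```

The ISA triple places the term in the ontology hierarchy, while the property triple records what distinguishes it from its siblings.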

  34. Standards for language processing • Access protocols [CORBA, SOAP] • Primary resources (text, dialogues): structural mark-up, basic annotations [TEI, MPEG7, TMX (XHTML…), etc.] • Knowledge structures: hierarchies of types, relations between concepts (subjects/topics etc.), links to primary resources [Topic Maps, OIL, RDF] • Links • NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references, etc. [Eagles/ISLE, CES, MATE, …] • Lexical structures (language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica [TBX, OLIF, Eagles/ISLE (Genelex)] • Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]

  35. Context • ISO TC37 - Terminology and other language resources • SC3 - Computer applications in terminology • ISO 12200 - MARTIF • Latest version of TEI Terminology chapter • ISO 12620 - Data categories • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework) • SC4 - Language resources

  36. TC37/SC4 details • Scope: Platform for designing and implementing linguistic resource formats and processes • Multi-layer annotation of linguistic resources • Exchange of information between NLP modules • General strategy • Involve a wide community from academia and industry • Identification of experts in the various work items • Involvement through national standardizing bodies • Agenda • Current: identification of possible work items and working groups • Constituency meeting and technical workshop at LREC (May 2002)

  37. Organization • Chair: • Laurent Romary, France • Secretary: • Key-Sun Choi, Korea • International Advisory Committee • Chair: Prof. Antonio Zampolli, Italy

  38. SC4 and other standardizing bodies • TEI: text representation; reference for primary sources, e.g. text archives • Oscar Text • W3C: basic protocols and formats; XML (Schemas), XPath, XPointer, + RDF, SVG, SMIL, SOAP • ISO TC37/SC4: language resources, NLP perspective, e.g. linguistic annotations, lexical formats • Technical background • What about gestures? Kinetic in the TEI? SMIL? • MPEG: multimedia, XML based, e.g. MPEG7-4 • Word and phone lattices • Audio/Speech

  39. TC37/SC4 Work Items • WG1/WI-0: Terminology of Language Resources • WG1/WI-1: Linguistic annotation framework • WG1/WI-2: Meta-data for multimodal and multilingual information • WG2/WI-3: Structural content representation scheme • WG2/WI-4: Multimodal content representation scheme • WG2/WI-5: Discourse level representation scheme

  40. TC37/SC4 Work Items - cont. • WG3/WI-6a: Multilingual text representation • WG4/WI-7: NLP Lexica • WG5/WI-8: Net-based distributed cooperative work for the creation of LRs

  41. WI-0 • Terminology of Language Resources • Basic terminology of the various sub-fields of language resources and general methodology • Project leader: Klaus-Dirk Schmitz • Sources: • ISO 1087 • LREC proceedings + KAIST • English dictionaries in Linguistics? • Support from GTW

  42. WI-1 • Linguistic annotation framework • Basic mechanisms and data structures for linguistic annotation and representation [data architecture] • Methods and principles for the design of an annotation scheme • Structural nodes and information units, Data category specification • Linking and pointing mechanisms, Feature Structures, Meta-Markup • « Stand-off » and « in-line » views - equivalences, combining levels. • Administrative data categories

  43. WI-1 - cont. • Project leader: Nancy Ide (TBC) • Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary… • Possible sources: • TMF, iso12620-revised, Mate (general methodology) • TEI (Linking mechanisms, feature structures) • Link with Linguistic DS

  44. WI-2 • Meta-data for multimodal and multilingual information • Description of a meta-data representation scheme to document linguistic information structures and processes • General content description • Local content description • Project leader: Peter Wittenburg, MPI (Nijmegen, NL) • Participants: Steven Bird, TEI aware person • Possible sources: • OLAC, Mile, TEI Header • Liaison: TC46 (SC9), MPEG7/MDS, SCORM

  45. WI-3 • Structural content representation scheme • Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes • Meta-model for morpho-syntactic annotation • Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures) • + corresponding Data category registries • Project leader: John Carroll ?? • Participants: Nuria Bell • Possible sources: • Eagles, TAGML, Linguistic DS • SIGPARSE • Working group with representatives from existing TreeBanks initiatives

  46. WI-4 • Multimodal meaning representation scheme • Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural) • Meta-model for content representation (Events, participants, etc.) • Data category registry for multimodal content • Project leader: Harry Bunt (id=“1”) • Possible sources: • SIGSEM working group on semantic content • Chair: #1 • « Liaison » • Semantic web activities

  47. WI-5 • Discourse level representation scheme • Meta-model for discourse and dialogue representation • Meta-model for discourse level annotation (e.g. reference annotation) • + corresponding DatCat registry • Possible sources: • SIGDIAL • DRI - Discourse Resource Initiative • Mate

  48. WI-6 • Multilingual text representation scheme • Framework for representing language specific and multi-lingual textual information • Translation Memory • Alignment – Parallel Corpora • Word count algorithms (characters, words, segments) • Possible sources: • TMX for translation memories • TEI based linking mechanism (or see WI-1) for Parallel texts

  49. WI 6A • Translation Memory, Alignment of parallel corpora • Sources: • OSCAR/TMX for translation memories • TEI based linking mechanism (or see WI-1) for Parallel texts
