Developing Asian Language Corpora: standards and practice

Richard Xiao, Tony McEnery, Paul Baker, Andrew Hardie
Lancaster University
ALR04 - Sanya, China

An overview of the talk
• Corpus development standards
• The EMILLE (Enabling Minority Language Engineering) Corpus
• The Lancaster Corpus of Mandarin Chinese (LCMC)
• XML-aware, Unicode-compliant corpus exploration tools
• Software demonstration

Corpus development standards (1)
• Why is standardization important?
  • To be compliant with major international standards
  • To facilitate electronic data exchange
  • To foster cooperation and coordination between different centres and projects
  • To meet the requirements of corpus validation
• The ALR Committee is working in the right direction

Corpus development standards (2)
• Corpus constituents
  • Corpus manifest
    • Type (paper document, computer file, audio/video recording, etc.)
    • Carrier (computer file name and location, document title, etc.)
    • Status (integral part of corpus, descriptive metadata, associated annotation, documentation, etc.)
    • Digital components and the storage format (character encoding, binary format, record structure, etc.)
  • Primary data: corpus files
  • Ancillary data: corpus documentation
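
One way to picture a corpus manifest is as a list of records with the four properties named above. The following is a minimal sketch only: the class name, field names, file paths, and status strings are all hypothetical and are not prescribed by any of the standards mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One corpus constituent (hypothetical record layout)."""
    kind: str            # type: "computer file", "audio recording", ...
    carrier: str         # file name and location, or document title
    status: str          # "integral part of corpus", "documentation", ...
    storage_format: str  # character encoding, binary format, record structure


# Illustrative entries; the paths are invented for this example.
manifest = [
    ManifestEntry("computer file", "texts/A01.xml",
                  "integral part of corpus", "XML, UTF-8"),
    ManifestEntry("computer file", "doc/manual.pdf",
                  "documentation", "PDF"),
]


def primary_data(entries):
    """Return the entries that are corpus files rather than ancillary data."""
    return [e for e in entries if e.status == "integral part of corpus"]


print([e.carrier for e in primary_data(manifest)])
```

Keeping the status as an explicit field makes it straightforward to separate primary data from ancillary data when validating or packaging a corpus.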

Corpus development standards (3)
• Data formats
  • Primary data
    • Text files: XML/SGML conforming to a standard or supplied DTD or schema
    • Audio: MP3 or WAV
    • Video: MPEG or QuickTime
    • Image files: PNG or JPG
  • Ancillary data
    • Documentation: PDF, HTML, or XML

Corpus development standards (4)
• File structure, markup and annotation
  • Corpus header
    • Providing metadata about the corpus file
    • TEI/CES-compliance
  • Corpus body
    • Containing the corpus data
    • TEI/CES-compliance
    • Markup for paragraphs and sentences
    • Preferably annotated with various levels of linguistic analysis (POS tagging…)
• Character encoding
  • Unicode-compliance (UTF-8/16)
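
Two of the requirements above, Unicode encoding and well-formed XML, can be checked mechanically. The sketch below uses only the Python standard library and is an illustration, not a validation tool from the projects described here. The function name is hypothetical, and the check covers well-formedness only: it does not validate against a DTD or schema, and it does not test TEI/CES compliance.

```python
import xml.etree.ElementTree as ET


def check_corpus_file(raw: bytes) -> list[str]:
    """Return a list of problems found in one corpus file (empty if none)."""
    problems = []
    # Step 1: the bytes must decode as UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")
        return problems
    # Step 2: the content must be well-formed XML.
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")
    return problems


print(check_corpus_file(b"<text><p><s>sample</s></p></text>"))
print(check_corpus_file(b"<text><p></text>"))
```

The first call reports no problems; the second reports a well-formedness error because the `<p>` element is never closed.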

The EMILLE project
• The EMILLE project
  • Funded by the UK EPSRC (Grant references GR/N19106, GR/M70735, GR/N28542 and GR/R42429/01)
  • Research partners: Lancaster University, Sheffield University, and the Central Institute of Indian Languages (CIIL) in Mysore, India
• Three main goals
  • To build corpora of South Asian languages
  • To extend the GATE (General Architecture for Text Engineering) LE architecture
  • To develop basic LE tools
• Project site: http://www.emille.lancs.ac.uk/
• GATE: http://gate.ac.uk/sale/tao/index.html#x1-550002.26

The EMILLE Corpus: An overview
• Three components
  • Monolingual, Annotated, and Parallel
• 14 South Asian languages
• Spoken data for five languages
• Monolingual corpora contain more than 96 million words
• Spoken data amount to over 2.6 million words
• The Urdu corpus is POS tagged
• Part of the Hindi corpus is annotated for anaphora
• Parallel corpus covers English and five South Asian languages
• Corpus building tools: Uni-codify, Uni-viewer, Uni-editor

The EMILLE Monolingual Corpora

Language       Written     Spoken       Total
Assamese     2,620,000          0   2,620,000
Bengali      5,520,000    442,000   5,962,000
Gujarati    12,150,000    564,000  12,714,000
Hindi       12,390,000    588,000  12,978,000
Kannada      2,240,000          0   2,240,000
Kashmiri     2,270,000          0   2,270,000
Malayalam    2,350,000          0   2,350,000
Marathi      2,210,000          0   2,210,000
Oriya        2,730,000          0   2,730,000
Punjabi     15,600,000    521,000  16,121,000
Sinhala      6,860,000          0   6,860,000
Tamil       19,980,000          0  19,980,000
Telugu       3,970,000          0   3,970,000
Urdu         1,640,000    512,000   2,152,000
Total       93,530,000  2,627,000  96,157,000

The EMILLE Annotated Corpora
• POS tagging
  • The whole monolingual Urdu corpus
  • The Urdu component of the EMILLE Parallel Corpora
• Anaphoric annotation
  • Around 100,000 words of news material (20 excerpts from the Ranchi Express data) from the Hindi Monolingual Corpus

The EMILLE Parallel Corpus
• 75 advice leaflets published by the UK government
• Approximately 200,000 words of English originals with accompanying translations in five South Asian languages
  • Hindi, Bengali, Punjabi, Gujarati, and Urdu
• Covering a range of term-rich domains

The EMILLE corpus building tools
• Uni-codify
  • Allows users to convert 30 (or so) different 8-bit encodings of South Asian scripts into 16-bit little-endian Unicode format
  • Compiled program accompanied by documentation
• Uni-Viewer
  • Allows users to view Unicode texts
• Uni-Editor
  • Allows users to edit Unicode texts
• Urdu POS tagger
  • POS tagging Unicode-encoded Urdu texts
  • Accompanied by the tagset and the user manual
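
The kind of conversion Uni-codify performs, from an 8-bit encoding to 16-bit little-endian Unicode, can be sketched with a lookup table. This is not Uni-codify's code: the mapping table below is a toy invented for illustration (three byte values mapped to Devanagari letters), and the function name is hypothetical. Real legacy encodings may also need many-to-one mappings or character reordering, which this sketch ignores.

```python
# Hypothetical 8-bit mapping: NOT a real legacy encoding, illustration only.
TOY_MAP = {
    0xA1: "\u0915",  # DEVANAGARI LETTER KA
    0xA2: "\u0916",  # DEVANAGARI LETTER KHA
    0xA3: "\u0917",  # DEVANAGARI LETTER GA
}


def to_utf16le(data: bytes, table: dict[int, str]) -> bytes:
    """Map each input byte to a Unicode character, then encode as UTF-16LE."""
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))     # ASCII range passes through unchanged
        elif byte in table:
            chars.append(table[byte])   # legacy byte -> Unicode character
        else:
            raise ValueError(f"unmapped byte 0x{byte:02X}")
    return "".join(chars).encode("utf-16-le")


converted = to_utf16le(b"\xa1\xa3 ok", TOY_MAP)
print(converted.decode("utf-16-le"))
```

Raising an error on unmapped bytes, rather than silently dropping them, makes gaps in a mapping table visible during corpus building.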

The EMILLE Corpus: Availability
• The full release of the EMILLE Corpus and tools is distributed free of charge for use in non-profit-making research
• Digital sound files will also be released soon
• An indexed version for use with Xara will be available soon
• Corpus download site
  • http://www.ling.lancs.ac.uk/corplang/emille

The LCMC Corpus: Aims
• Built for the ESRC project "Contrasting tense and aspect in English and Chinese" (Grant Ref. RES-000-220135)
• A Chinese match for the FLOB and Frown corpora of British and American English
• A publicly available balanced corpus of Mandarin Chinese
• Distributed free of charge for use in non-profit-making research

LCMC: Profile
• One million words
• 1990-1993
• 15 text categories
• 500 text samples
• Major text provider: SSReader Digital Library in China
• Unicode (UTF-8)
• XML-conformant mark-up
• Marked for paragraphs and sentences
• POS-tagged (precision rate 98%+)
• Standard character and Romanized Pinyin versions

Major Chinese corpus resources

Corpus    POS  Bal.  Channel  Variety   Contr.
LCMC      Yes  Yes   Written  Mainland  E – C
Sinica    Yes  Yes   Mixed    Taiwan    No
PH        No   No    Written  Mainland  No
PFR       Yes  No    Written  Mainland  No
LIVAC     No   No    Written  Mixed     C – C
SCCSD     No   Yes   Spoken   Mainland  No
TREC      No   No    Written  Mainland  No
Gigaword  No   No    Written  Mainland  No
Callhome  No   ?     Spoken   Mixed     No

LCMC: Sampling frame

Code  Text category              Samples  Proportion
A     Press reportage                 44        8.8%
B     Press editorials                27        5.4%
C     Press reviews                   17        3.4%
D     Religion                        17        3.4%
E     Skills/trades/hobbies           38        7.6%
F     Popular lore                    44        8.8%
G     Biographies/essays              77       15.4%
H     Miscellaneous                   30          6%
J     Science                         80         16%
K     General fiction                 29        5.8%
L     Mystery/detective fiction       24        4.8%
M     Science fiction                  6        1.2%
N     Western/adventure fiction       29        5.8%
P     Romantic fiction                29        5.8%
R     Humor                            9        1.8%
      Total                          500        100%
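
The proportions in the sampling frame follow directly from the sample counts. As a small worked check, the sketch below recomputes them from the counts given in the table; the variable names are arbitrary.

```python
# Sample counts per text category, copied from the sampling frame.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total = sum(SAMPLES.values())

# Proportion of each category as a percentage of all samples.
proportions = {code: round(100 * n / total, 1) for code, n in SAMPLES.items()}

for code, share in proportions.items():
    print(f"{code}: {SAMPLES[code]:3d} samples, {share}%")
print(f"Total: {total} samples")
```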

LCMC: Markup

Level  Code  Gloss                   Attribute  Value
1      text  Text type               TYPE       As per Table 2, Text category
                                     ID         As per Table 2, Code
2      file  Corpus file             ID         Text ID plus file number starting from 01
3      p     Paragraph               ---        ---
4      s     Sentence                n          Starting from 0001 onwards
5      w     Word                    POS        Part-of-speech tags as per the LCMC tagset
       c     Punctuation and symbol
       gap   Omission                ---        ---
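
The markup scheme above nests text, file, paragraph, sentence, and word elements, so word and tag pairs can be read with any XML parser. The sketch below assumes a structure built from the element and attribute names in the table; the sample content and the tags on it are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample following the element names in the markup table.
SAMPLE = """
<text TYPE="Press reportage" ID="A">
  <file ID="A01">
    <p>
      <s n="0001"><w POS="r">我</w><w POS="v">喜欢</w><w POS="n">书</w><c>。</c></s>
    </p>
  </file>
</text>
"""

root = ET.fromstring(SAMPLE)

# Collect (sentence number, word, POS tag) for every <w> element.
rows = []
for sentence in root.iter("s"):
    for word in sentence.iter("w"):
        rows.append((sentence.get("n"), word.text, word.get("POS")))

for number, form, tag in rows:
    print(number, form, tag)
```

Because punctuation is held in `<c>` rather than `<w>`, iterating over `<w>` alone yields words only.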

LCMC: Annotation
• Segmentation
• POS tagging
  • Applying the Peking University tagset
    • 26 Level 1 POS tags
    • 50 Level 2 POS tags
• ICTCLAS (Chinese Lexical Analysis System)
  • Developed by the Institute of Computing Technology, Chinese Academy of Sciences (Zhang & Liu 2002)
  • A frequency dictionary of 80,000 words
  • Based on a multi-layer hidden Markov model
  • Applying the n-shortest paths method
  • Automatic tagging with a precision rate of 97.16%
• Post-editing improved the precision to over 98%
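
To give a feel for path-based segmentation, the sketch below finds the segmentation with the fewest words over a dictionary. It is a deliberately simplified illustration, not ICTCLAS: it keeps only one path (n = 1), counts words instead of using frequency-based weights, and involves no hidden Markov model. The lexicon and function name are hypothetical.

```python
def shortest_segmentation(text: str, lexicon: set[str]) -> list[str]:
    """Split text into the fewest pieces, each a lexicon word or a single character."""
    n = len(text)
    # best[i] holds (word count, words) for the best split of text[:i].
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = text[i:j]
            # Single characters are always allowed, so a path always exists.
            if j == i + 1 or piece in lexicon:
                candidate = (best[i][0] + 1, best[i][1] + [piece])
                if best[j] is None or candidate[0] < best[j][0]:
                    best[j] = candidate
    return best[n][1]


# Toy lexicon, invented for this example.
LEXICON = {"中文", "语料", "语料库", "库"}
print(shortest_segmentation("中文语料库", LEXICON))
```

Here the two-word split 中文 / 语料库 is preferred over the three-word split 中文 / 语料 / 库. Keeping several candidate paths instead of one would let a later tagging stage choose among them.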

LCMC: Potential use
• Monolingual study
  • Studying Mandarin Chinese as a whole
  • Exploring variation across text categories
• Contrastive study (in conjunction with FLOB/Frown)
  • Contrasting Chinese and BrE/AmE
  • Contrasting text categories in Chinese and English

LCMC: Availability
• Distributed free of charge for use in non-profit-making research
• Accompanied by the user manual
• Online search available via WebConc
• The LCMC website
  • http://www.ling.lancs.ac.uk/corplang/lcmc
• The Chinese mirror site (Chinese Academy of Social Science)
  • http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

Corpus exploration tools
• XML-aware, Unicode-compliant corpus exploration tools
• WordSmith Tools version 4
  • Presently under beta test
  • Beta version available
    • http://www.lexically.net/wordsmith/version4/index.htm
• Xara (XML-aware Sara)
  • Sara: SGML-Aware Retrieval Application
    • For use with the British National Corpus (BNC)
  • For either local or remote access
  • Presently under beta test
  • Documentation available at http://www.oucs.ox.ac.uk/rts/xara/
  • A tutorial available at the LCMC website

Software demonstration
• Using Xara for local access to LCMC
  • Query types: quick query, word query (pattern), POS query, pattern query (regex), Query builder (e.g. a-n vs. a-de-n), etc.
  • Display mode: KWIC mode vs. sentence mode
  • Display format: plain vs. XML
  • Status bar: reference
  • Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
• Using Xara for local access to EMILLE
• Using WebConc to access LCMC
  • http://www.ling.lancs.ac.uk/corplang/lcmc
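
The a-n vs. a-de-n query mentioned above contrasts an adjective directly followed by a noun with an adjective linked to a noun by 的. The sketch below shows the idea over a list of (word, tag) pairs. It is not how Xara implements queries: the function names, the pattern notation, and the sample tokens are all invented for illustration.

```python
def matches(token, item):
    """Check one (word, tag) token against one ("pos" | "word", value) item."""
    word, pos = token
    field, value = item
    return (pos if field == "pos" else word) == value


def find_pattern(tokens, pattern):
    """Return the text of every token sequence that matches the pattern."""
    hits = []
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(matches(t, p) for t, p in zip(window, pattern)):
            hits.append("".join(word for word, _ in window))
    return hits


# Invented sample: "new book and pretty clothes".
TOKENS = [("新", "a"), ("书", "n"), ("和", "c"),
          ("漂亮", "a"), ("的", "u"), ("衣服", "n")]

A_N = [("pos", "a"), ("pos", "n")]
A_DE_N = [("pos", "a"), ("word", "的"), ("pos", "n")]

print(find_pattern(TOKENS, A_N))
print(find_pattern(TOKENS, A_DE_N))
```

Mixing tag constraints and word constraints in one pattern is what lets the two constructions be counted separately and compared.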

And… Thank you!
Richard Xiao
z.xiao@lancaster.ac.uk
