
N-gram Tokenization for Indian Language Text Retrieval



  1. N-gram Tokenization for Indian Language Text Retrieval Paul McNamee paul.mcnamee@jhu.edu 13 December 2008

  2. Talk Outline • Introduction • Monolingual Experiments from CLEF 2000-2007 • Words • Stemmed words (Snowball) • Character n-grams (n=4,5) • N-gram stems • Automatically segmented words (Morfessor algorithm) • Skipgrams (n-grams with skips) • Why are n-grams effective? • Bilingual Experiments (CLEF) • FIRE Results • Summary

  3. Morphological Processes • Inflection • box, boxes (plural); actor (male), actress (female) • Conjugation • write, written, writing; swim, swam, swum • Derivation • sleep, sleepy; play (verb), player (noun), playful (adjective) • Word Formation • Compounding: news + paper = newspaper; air + port = airport • Clipping: professor -> prof; facsimile -> fax • Acronyms: GOI = Government of India

  4. Why Do We Normalize Text? • It seems desirable to group related words together for query/document processing • Why? • To make lexicographers happy? • To improve system performance? • If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes, or not

  5. Rule-Based Stemming: Snowball • Applicable to alphabetic languages • An approximation to lemmatization • Identifies a root morpheme by chopping off prefixes and suffixes • Used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish • Snowball rulesets also exist for Hungarian and Portuguese • No Indian language support • Most stemmers are rule-based, e.g.: -ing => ε (juggling => juggl); -es => ε (juggles => juggl); -le => -l (juggle => juggl) • The Snowball project provides high-quality, rule-based stemmers for many European languages: http://snowball.tartarus.org/
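The suffix-stripping idea can be sketched in a few lines. This toy stemmer applies rules like the juggling/juggles/juggle examples above; it is illustrative only (real Snowball stemmers have many more rules and context conditions, and the final-e rule here is a simplification of -le => -l):

```python
# Toy rule-based stemmer: try each suffix rule in order and strip the
# first one that matches, keeping at least a 3-letter stem.
RULES = [("ing", ""), ("es", ""), ("e", "")]  # longest suffixes first

def stem(word):
    for suffix, replacement in RULES:
        stripped = word[:-len(suffix)]
        if word.endswith(suffix) and len(stripped) >= 3:
            return stripped + replacement
    return word

print(stem("juggling"), stem("juggles"), stem("juggle"))  # all -> juggl
```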

  6. N-gram Tokenization • Represent text as overlapping substrings • A fixed length of n = 4 or 5 is effective in alphabetic languages • For text of length m, there are m-n+1 n-grams • Advantages: simple, addresses morphology, surrogate for short phrases, robust against spelling & diacritical errors, language-independent • Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk usage penalties
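The overlapping-substring tokenization above is a one-liner; a minimal sketch (the '_' word-boundary padding follows the convention used in the frequency examples on the next slide):

```python
def char_ngrams(text, n=5):
    """All m - n + 1 overlapping character n-grams of a length-m string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Word-boundary markers ('_') distinguish affix-adjacent n-grams
# from word-internal ones.
print(char_ngrams("_juggling_", 5))
```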

  7. Single N-gram Stemming • Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words • Negative effects from over- and under-conflation • Example 4-gram frequencies, Hungarian: _hun (20547), hung (4329), unga (1773), ngar (1194), gari (2477), aria (11036), rian (18485), ian_ (49777); Bulgarian: _bul (10222), bulg (963), ulga (1955), lgar (1480), gari (2477), aria (11036), rian (18485), ian_ (49777) • Short n-grams covering affixes occur frequently; those around the root morpheme tend to occur less often. This motivates the following approach: (1) for each word, choose its least frequently occurring character 4-gram (using a 4-gram index); (2) this keeps the benefits of n-grams with the run-time efficiency of stemming • Continues work from Mayfield and McNamee, 'Single N-gram Stemming', SIGIR 2003
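The least-frequent-4-gram selection can be sketched as follows; the real system consults its full 4-gram index, so the tiny vocabulary here is a hypothetical stand-in:

```python
from collections import Counter

def grams(word, n=4):
    padded = "_" + word + "_"            # word-boundary markers
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_index(vocab):
    counts = Counter()                   # 4-gram frequencies over the vocabulary
    for word in vocab:
        counts.update(grams(word))
    return counts

def ngram_stem(word, counts):
    # the least frequent 4-gram tends to sit on the root morpheme
    return min(grams(word), key=lambda g: counts[g])
```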

  8. Statistical Segmentation • Morfessor Algorithm • Given a dictionary list, learns to split words into segments • A form of statistical stemming based on Minimum Description Length (MDL) • > 70% of world languages have concatenative morphology • Creutz & Lagus, ACL-2002 • http://www.cis.hut.fi/projects/morpho • 2007 Morphology Challenge • Successful on an IR task • Multiple segments per word are generated • Examples • affect+ion+ate • author+ized • juggle+d • juggle+r+s • sea+gull+s See McNamee, Nicholas, & Mayfield, ‘Don’t Have a Stemmer? Be un+concern+ed’, SIGIR 2008

  9. Character Skipgrams • Character n-grams: robust matching technique • Skipgrams: super robust matching • Some letters are omitted (essentially a wildcard match) • sw*m matches swim / swam / swum • f**t matches foot / feet • Skip bi-grams for fuzzy matching • Pirkola et al. (2002): learning cross-lingual translation mappings in related languages • Mustafa (2004): monolingual Arabic retrieval • Example: 4,2 skipgrams for Hopkins • 4 letters, 2 skips • hkin, hpin, hpkn, hoin, hokn, hopn • oins, okns, okis, opns, opis, opks • Note: more skipgrams than plain n-grams • Slight gains in Czech, Hungarian, Persian • Application to OCR’d docs?
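A generator matching the Hopkins example above; this follows my reading of the slide's (4,2) pattern, in which the endpoints of each (n+k)-character window are fixed and k of the inner characters are dropped:

```python
from itertools import combinations

def skipgrams(word, n=4, k=2):
    """(n, k) skipgrams: for each window of n+k characters, keep the first
    and last character and choose n-2 of the inner ones in order."""
    out = set()
    width = n + k
    for i in range(len(word) - width + 1):
        window = word[i:i + width]
        inner = window[1:-1]
        for keep in combinations(range(len(inner)), n - 2):
            out.add(window[0] + "".join(inner[j] for j in keep) + window[-1])
    return out
```

For "hopkins" this produces exactly the twelve forms listed on the slide, confirming that each word yields more skipgrams than plain n-grams.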

  10. Generating Indexing Terms

  11. JHU/APL HAIRCUT System • The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT) • Uses a statistical language model for retrieval • Ponte & Croft, 'A Language Modeling Approach to Information Retrieval', SIGIR-98 • Miller, Leek, and Schwartz, 'A Hidden Markov Model Information Retrieval System', SIGIR-99 • Typically sets λ to 0.5 • Language-neutral • Supports large dictionaries • Used at TREC (10x), CLEF (9x), NTCIR (2x)
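HAIRCUT's internals aren't shown here, but the cited language-modeling approach scores a document by a λ-weighted mixture of document and collection term probabilities; a minimal sketch with λ = 0.5 as on the slide:

```python
import math

def lm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(query | doc) under Jelinek-Mercer smoothing: each term's
    document probability is mixed with the collection model at weight lam."""
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len    # P(t | D)
        p_coll = coll_tf.get(t, 0) / coll_len  # P(t | C), smooths zeros
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```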

  12. CLEF Ad Hoc Test Sets (2000 – 2007)

  13. Tokenization Alternatives • Stemming • Effective in Romance languages • Not always available • N-grams • Language-neutral • Large gains in complex languages • Other techniques • Statistical stemming beats words • Segmentation • Single n-gram stems • No run-time penalty

  14. Monolingual Tokenization

  15. IR & Language Family • 5-gram Gains • Tied to morphological complexity • Small improvements in the Romance family • Estimating Complexity • Mean word length: Spearman rho = 0.77 • Information-theoretic approach: Spearman rho = 0.67 • Kettunen et al.; Juola • [Charts omitted: per-language results for HU, FI, CS, BG, DE, SV, RU and HU, FI, CS, DE, SV, NL]

  16. Why are N-grams Effective? • (1) Spelling • N-grams localize single letter spelling errors • In news about 1 in 2000 words is misspelled • (2) Phrasal Clues • Word spanning n-grams hint at phrases • Only slight differences observed
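To illustrate the spelling point: a misspelling that defeats an exact word match still leaves most character n-grams shared. The Dice coefficient below is just one similarity measure one might use for the comparison:

```python
def ngram_set(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

good, typo = ngram_set("government"), ngram_set("goverment")
# exact word match fails, but three 4-grams still match
print(sorted(good & typo), round(dice(good, typo), 2))
```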

  17. (3) Because of Morphological Variation? • N-grams might gain their power by controlling for morphological variation • N-grams focused on root morphemes tend to match across inflected forms • Juola (1998) and Kettunen (2006) did experiments ‘removing’ morphology from language • Such as replacing each surface form with a 6-digit number • I compared words and 5-grams under normal and permuted letter conditions • golfer: legfro • golfed: dofegl • golfing: ligfron

  18. Source of N-gram Power • Idea: remove morphology from a language • Letter order of words was randomly permuted • golfer -> legfro, team -> eamt • golfing, golfer, golfed no longer share a morpheme • 4 conditions: {words, 5-grams} x {normal, shuffled}
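One simple way to implement the shuffled condition (my sketch; the experiments' exact procedure may differ) is a per-word-type shuffle seeded by the word itself, so every token of a word maps to the same scrambled form throughout the corpus:

```python
import random

def permute_letters(word):
    """Scramble a word's letters deterministically per word type."""
    rng = random.Random(word)   # seeding with the word makes it repeatable
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)
```

Because the shuffle is per type, inflected forms like golfing/golfer/golfed no longer share a prefix, which is exactly the property the experiment needs.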

  19. Corpus-Based Translation • Given aligned parallel texts and a particular term to translate: • Find the set of documents (sentences) in the source language containing the term • Examine the corresponding foreign documents • Extract 'good' candidate(s) • Goodness can be based on term similarity measures (Dice, MI, IBM Model 1, etc.) • Historical aside: the Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt; British physicist Thomas Young determined that cartouches were names of royalty; in 1821 Jean François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek • Example parallel pair: 'The price of oil increased yesterday. The economy reacted sharply …' / 'El precio del petróleo aumentó ayer. La economía reaccionó agudamente …'
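The Dice-based variant of this extraction can be sketched over sentence-aligned text; the toy corpus in the usage example below is invented for illustration:

```python
from collections import Counter

def dice_candidates(term, src_sents, tgt_sents, top_k=3):
    """Rank target-language terms by Dice coefficient with `term`
    over aligned sentence pairs."""
    n_src, co, tgt_df = 0, Counter(), Counter()
    for src, tgt in zip(src_sents, tgt_sents):
        tgt_terms = set(tgt.split())
        tgt_df.update(tgt_terms)            # target sentence frequencies
        if term in src.split():
            n_src += 1
            co.update(tgt_terms)            # co-occurrence with the term
    scores = {t: 2 * co[t] / (n_src + tgt_df[t]) for t in co}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, with the aligned pair from the slide ("the price of oil increased" / "el precio del petroleo aumento") plus one more sentence containing each term, "oil" ranks "petroleo" first.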

  20. N-gram Translations • Character n-grams can be statistically translated, just like words • N-grams (such as n=4,5) are smaller than words • May capture affixes and morphological roots • ‘work’ (from working) maps to ‘abaj’ (as in trabajaba) • ‘yrup’ (from syrup) maps to ‘rabe’ (as in jarabe) • Suitable with Proper Nouns • ‘therl’ (from Netherlands) to ‘ses b’ (as in Países Bajos)

  21. Parallel Sources

  22. Effectiveness & Corpus Size • English queries translated using the Europarl corpus • Corpus sub-sampled from 1% to 100%

  23. Effectiveness by size (2)

  24. FIRE Index Characteristics • Vocabulary size in ILs seems abnormally small • Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)

  25. Tokenization for FIRE 2008 • Difficult to interpret results with anomalous vocabulary • Need Failure Analysis • Performance using words in ILs seems quite depressed • Hindi 5-gram run had good relative performance • Difference vs. 4-grams much larger than typically seen

  26. Relative Gains w/ Relevance Feedback • Query expansion using top 10 documents • 50 terms (words), 150 terms (4/5-grams), 400 terms (sk41) • Fairly effective: 20-40% gains
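The expansion step above can be sketched as simple pseudo-relevance feedback: take the most frequent terms from the top-ranked documents that are not already in the query. This is a simplification; the system's actual term scoring may differ:

```python
from collections import Counter

def expansion_terms(top_docs, query_terms, k=50):
    """The k most frequent terms in the feedback docs, minus query terms."""
    counts = Counter()
    for doc in top_docs:
        counts.update(doc.split())
    for t in query_terms:
        counts.pop(t, None)      # never re-add original query terms
    return [t for t, _ in counts.most_common(k)]
```

With n-gram indexing the same routine applies, just with n-gram terms and the larger k values quoted above (150 for 4/5-grams, 400 for skipgrams).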

  27. In Conclusion • Compared several forms of representing text • In European languages n-grams obtain a 20% gain over words • Rule-based stemming is good in Romance languages • Morfessor segments and n-gram stems beat words, but not the Snowball stemmer • N-gram gains • Greatest in morphologically richer languages • Lost when morphology is 'removed' from the language • FIRE • N-grams and RF are also effective in ILs • Must resolve the vocabulary issue • Difficulty finding parallel text, but would like to investigate bilingual retrieval
