Language and Document Models in Information Retrieval




  1. Language and Document Models in Information Retrieval ZhuoRan Chen 2006-2-8

  2. Table of Contents • Definitions • Applications • Evaluations • SLM for IR • Burstiness

  3. What is an SLM? • A Statistical Language Model (SLM) is a probability distribution over sequences of words. • An example: P(“Rose is red”) > P(“Red is Rose”) > 0 • Another: P(“color around” | “It might be nice to have a little more”) = ?
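
As a hedged illustration of why P(“Rose is red”) > P(“Red is Rose”) > 0 can hold, here is a minimal bigram model with add-one smoothing; the tiny corpus and function names are made up for this example:

```python
from collections import Counter

# Tiny made-up training corpus; real models are trained on far more text.
corpus = ["rose is red", "the rose is red", "red rose", "the sky is blue"]

tokens = [w for sent in corpus for w in ("<s> " + sent + " </s>").split()]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = len(unigrams)

def p_bigram(prev, word):
    # Add-one (Laplace) smoothed bigram probability, so unseen pairs stay > 0.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

def p_sentence(sentence):
    words = ("<s> " + sentence + " </s>").split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence("rose is red"))   # higher
print(p_sentence("red is rose"))   # lower, but still greater than zero
```

The add-one smoothing is what keeps the unlikely word order strictly above zero rather than impossible.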

  4. Two Stories of SLM • The story of the document model • Given a document (def: a sequence of words), how good is that document (the odds that it was composed by a person)? • The judgment may draw on words and other sources, e.g. syntax, burstiness, hyperlinks, etc. • The story of generation (used in SR and IR) • Given a training set (def: a collection of sequences), how can we generate a sequence that is consistent with the training set? • In speech recognition: generating the next word; in IR: generating a query from a document

  5. What can an SLM do? • Speech recognition • Machine Translation • Information Retrieval • Handwriting recognition • Spell checking • OCR • …

  6. How can an SLM do that? • Compare the probabilities of candidate word sequences and pick the one that “looks” most likely. • The actual question depends on the specific field • MT: given a bag of words, what is the best permutation that yields a sentence? • Speech recognition: given the preceding words, what is the next word? • IR: given a document, what is the query?

  7. Challenges in SLM • Long sequences • Partial independence assumption • Sparseness • Smoothing methods • Distributions • Is there really one?
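
The “long sequences” and “partial independence assumption” bullets refer to the usual chain-rule factorization and its Markov truncation; a standard formulation (the slide itself showed no formula here):

```latex
P(w_1 \dots w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1}) \quad \text{(trigram assumption)}
```

Sparseness then follows because most trigrams never occur in the training data, which is what the smoothing methods later in the talk address.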

  8. Evaluation of SLMs • Indirect evaluation • Compare the outcomes of the application, be it MT, SR, IR, or others • Issues: slow; depends on the dataset, the other components, etc. • Direct evaluation • Perplexity • Cross entropy

  9. Evaluation of SLM: Perplexity • Definition: perplexity is the geometric average of the inverse probability of the test words • Formula (from Joshua Goodman): reconstructed below • Usually the lower the better, but … • Limits: • the LM must be normalized (sum to 1) • the probability of every term must be > 0.
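
The formula on the slide was an image; a standard reconstruction consistent with the definition above, for a test sequence of N words:

```latex
\mathrm{PP}(w_1 \dots w_N) \;=\; \left(\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}\right)^{1/N}
\;=\; 2^{\,-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \dots w_{i-1})}
```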

  10. Evaluation of SLM: Cross entropy • Cross entropy = log2(perplexity), i.e. perplexity = 2^(cross entropy) • Example (worked through below)
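
A small Python sketch of the relation on this slide, using hypothetical per-word probabilities (the example image from the slide is not reproduced here):

```python
import math

# Hypothetical per-word probabilities assigned by some model to a test sequence.
word_probs = [0.2, 0.1, 0.05, 0.25, 0.1]

n = len(word_probs)
cross_entropy = -sum(math.log2(p) for p in word_probs) / n   # bits per word
perplexity = 2 ** cross_entropy

print(f"cross entropy = {cross_entropy:.3f} bits/word")
print(f"perplexity    = {perplexity:.3f}")
# Sanity check of the slide's identity: cross entropy == log2(perplexity).
assert abs(cross_entropy - math.log2(perplexity)) < 1e-9
```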

  11. The Poisson Model – Bookstein and Swanson • Intuition: content-bearing words cluster in relevant documents; non-content words occur randomly. • Method: linear combination of Poisson distributions • The two-Poisson model, surprisingly, could account for the occupancy distribution of most words.

  12. Poisson Mixtures – Church & Gale • Enhancements for the 2-Poisson model: Poisson mixtures, negative binomial, … • Problems: parameter estimation and overfitting (from Church & Gale 1995)

  13. Formulas (from Church & Gale; the general shape is sketched below)
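
The formulas on this slide were shown as an image. As a hedged reconstruction of their general shape, Church & Gale model the count k of a word in a document with a mixture of Poissons; the K-component form is

```latex
P(k) \;=\; \sum_{j=1}^{K} \alpha_j \, \frac{e^{-\lambda_j}\,\lambda_j^{\,k}}{k!},
\qquad \sum_{j=1}^{K} \alpha_j = 1,
```

with the two-Poisson model as the special case K = 2, and the negative binomial arising as a continuous (Gamma-weighted) mixture of Poissons.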

  14. SLM for IR – Ponte & Croft • Tells a story different from the 2-Poisson model • Does not rely on Bayes’ theorem • Conceptually simple and parameter-free, leaving room for further improvement

  15. SLM for IR – Lafferty and Zhai • A framework that incorporates Bayesian decision theory, Markov chains, and language modeling through a “loss function” • Supports query expansion

  16. SLM for IR – Liu and Croft • The query likelihood model: generate the query from the document • arg max_D P(D|Q) = arg max_D P(Q|D)P(D) • P(D) is assumed to be uniform. There are many ways to model P(Q|D): multivariate Bernoulli, multinomial, tf-idf, HMM, noisy channel, risk minimization (KL divergence), and all the smoothing methods.
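
A minimal sketch of query-likelihood retrieval with Jelinek-Mercer smoothing, one of the options listed above; the toy corpus, the λ value, and the function name are illustrative assumptions, not from the slide:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = {
    "d1": "language model for information retrieval".split(),
    "d2": "poisson model of word burstiness".split(),
}
query = "language model".split()

collection = Counter()
for tokens in docs.values():
    collection.update(tokens)
total_tokens = sum(collection.values())

lam = 0.7  # Jelinek-Mercer weight on the document model vs. the collection model

def log_query_likelihood(query, tokens):
    """log P(Q|D) under a unigram document model smoothed with the collection model."""
    tf = Counter(tokens)
    score = 0.0
    for w in query:
        p_doc = tf[w] / len(tokens)
        p_coll = collection[w] / total_tokens   # assumes query words occur somewhere in the collection
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# With P(D) uniform, ranking by P(Q|D) is the same as ranking by P(D|Q).
ranking = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 should rank above d2 for this query
```

Smoothing with the collection model is what keeps a document that is missing one query term from scoring log 0.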

  17. SLM + Syntactic • Chelba and Jelinek • Construct n-grams from syntactic analysis, e.g. “The contract ended with a loss of 7 cents after trading as low as 89 cents.” (ended (with (…))) after → ended_after • Headwords carry long-distance information when predicting with n-grams • Left-to-right incremental parsing strategy: usable for speech recognition

  18. Smoothing Strategies • No smoothing (Maximum Likelihood) • Interpolation • Jelinek-Mercer • Good-Turing • Absolute discounting

  19. Smoothing Strategies – maximum likelihood • Formula: P(z|xy) = C(xyz)/C(xy) • The name comes from the fact that it wastes no probability mass on unseen events and maximizes the probability of the observed events. • Cons: zero probabilities for unseen n-grams, which propagate into P(D).

  20. Smoothing Strategies – interpolation • Formula: P(z|xy) = w1*C(xyz)/C(xy) + w2*C(yz)/C(y) + (1-w1-w2)*C(z)/C • Combines unigram, bigram, and trigram estimates • Search for w1, w2 on a training set and pick the best values • Hint: allow enough training data for each parameter • Good in practice
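
A small sketch of the interpolated estimate above; the toy token stream and the fixed weights w1, w2 are illustrative (in practice they would be tuned on held-out data, as the slide suggests):

```python
from collections import Counter

# Toy token stream; real estimates need far more data and tuned weights.
tokens = "the cat sat on the mat the cat ate".split()

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

w1, w2 = 0.6, 0.3   # trigram and bigram weights; the unigram gets 1 - w1 - w2

def p_interp(x, y, z):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / total
    return w1 * p3 + w2 * p2 + (1 - w1 - w2) * p1

print(p_interp("the", "cat", "sat"))   # seen trigram: large
print(p_interp("the", "cat", "on"))    # unseen trigram: small but nonzero
```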

  21. Smoothing Strategies – Jelinek-Mercer • Formula: P(z|xy) = w1*C(xyz)/C(xy) + (1-w1)*C(yz)/C(y) • w1 is usually trained using EM. • Also known as “deleted interpolation”
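
A hedged sketch of training the mixing weight with EM on held-out data, here for a two-component (bigram/unigram) version of Jelinek-Mercer; the toy training and held-out text are made up:

```python
from collections import Counter

train = "the cat sat on the mat the dog sat".split()
heldout = "the cat sat on the dog mat".split()

uni = Counter(train)
bi = Counter(zip(train, train[1:]))
total = len(train)

def p_bi(y, z):
    return bi[(y, z)] / uni[y] if uni[y] else 0.0

def p_uni(z):
    return uni[z] / total

w1 = 0.5
for _ in range(20):
    # E-step: for each held-out bigram, the probability it came from the bigram model.
    resp = []
    for y, z in zip(heldout, heldout[1:]):
        num = w1 * p_bi(y, z)
        den = num + (1 - w1) * p_uni(z)
        if den > 0:
            resp.append(num / den)
    # M-step: the new weight is the average responsibility.
    w1 = sum(resp) / len(resp)

print(round(w1, 3))   # converges to a value strictly between 0 and 1 here,
                      # since one held-out bigram is unseen in training
```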

  22. Example for Good-Turing smoothing (from Joshua Goodman) Imagine you are fishing and you have caught 5 carp, 3 tuna, 1 trout, and 1 bass. How likely is it that your next fish is none of the four species? (2/10) How likely is it that your next fish is a tuna? (less than 3/10)

  23. Smoothing Strategies – Good-Turing • Intuition: the unseen events together receive a total probability mass equal to that of the events that occur exactly once; the probabilities of the other events are adjusted downward accordingly. • Formula: with n_r the number of types that occur r times and N the total number of tokens in the corpus, a word w seen r times gets p(w) = ((r+1)/N) * (n_{r+1}/n_r); note the maximum likelihood estimate for w is r/N.
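
A sketch applying the Good-Turing idea to the fishing example on the previous slide; the proportional reassignment at the end is a simplification, noted in the comments, rather than the exact r* adjustment:

```python
from collections import Counter

# Catch from the fishing example: 5 carp, 3 tuna, 1 trout, 1 bass (10 fish total).
counts = {"carp": 5, "tuna": 3, "trout": 1, "bass": 1}
N = sum(counts.values())              # 10 tokens
n = Counter(counts.values())          # n[r] = number of species caught exactly r times

# Good-Turing reserves n1/N for unseen species: 2/10, as on the previous slide.
p_unseen = n[1] / N
print("P(new species) =", p_unseen)   # 0.2

# The remaining 8/10 is shared among the seen species, so every seen species'
# probability drops below its maximum-likelihood estimate. With counts this
# sparse (n[2] = n[4] = 0) the raw r* = (r+1)*n[r+1]/n[r] formula cannot be
# applied directly; practical implementations smooth the n[r] values first
# (Simple Good-Turing). Here we just rescale proportionally to show the effect.
for word, c in counts.items():
    print(f"P({word}) = {(c / N) * (1 - p_unseen):.2f}")   # tuna: 0.24 < 3/10
```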

  24. Smoothing Strategies – Absolute discounting • Intuition: lower the probability mass of seen events by subtracting a constant D. • Formula: Pa(z|xy) = max{0, C(xyz) - D} / C(xy) + w * Pa(z|y), with w = D*T/N, where N is the number of tokens observed after the context xy and T is the number of distinct types observed after it. • Rule of thumb: D = n1/(n1 + 2*n2) • Works well except when counts equal 1
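
A bigram-level sketch of absolute discounting (the slide's formula is the trigram case, backing off to a discounted bigram model; here we back off to a plain unigram model for brevity). The toy token stream is illustrative:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

# Rule-of-thumb discount from the slide, computed over bigram counts.
n1 = sum(1 for c in bi.values() if c == 1)
n2 = sum(1 for c in bi.values() if c == 2)
D = n1 / (n1 + 2 * n2)

def p_abs(y, z):
    """Bigram probability with absolute discounting, backing off to unigrams."""
    types_after_y = sum(1 for (a, _b) in bi if a == y)   # distinct continuations of y
    backoff_weight = D * types_after_y / uni[y]          # w = D*T/N for this context
    p_uni = uni[z] / len(tokens)
    return max(bi[(y, z)] - D, 0) / uni[y] + backoff_weight * p_uni

print(p_abs("the", "cat"))   # seen bigram: discounted count plus backoff mass
print(p_abs("cat", "the"))   # unseen bigram: gets only the backed-off mass
```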

  25. The Study of Burstiness

  26. Burstiness of Words • The definitions of word frequency • Term frequency (TF): count of occurrences in a given document • Document frequency (DF): count of documents in the corpus in which a word occurs • Generalized document frequency (DFj): like DF, but the word must occur at least j times • DF/N: given a word, the chance of seeing it in a document (the p in Church 2000) • ∑TF/N: given a word, the average number of times it occurs in a document • Given that we have seen a word in a document, what is the chance that we will see it again?
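
A small sketch of these frequency notions on a toy three-document corpus; the documents and the word choice are made up for illustration:

```python
from collections import Counter

docs = [
    "noriega linked to noriega inquiry".split(),
    "markets fall on inquiry news".split(),
    "markets rally".split(),
]
N = len(docs)
word = "noriega"

tf_per_doc = [Counter(d)[word] for d in docs]     # term frequency in each document
df = sum(1 for c in tf_per_doc if c >= 1)         # document frequency
df2 = sum(1 for c in tf_per_doc if c >= 2)        # generalized DF with j = 2

print("TF per doc:", tf_per_doc)            # [2, 0, 0]
print("DF/N       =", df / N)               # chance of seeing the word in a document
print("sum(TF)/N  =", sum(tf_per_doc) / N)  # average count per document
print("DF2/DF     =", df2 / df)             # rough chance of seeing it again, given it was seen once
```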

  27. Burstiness: the question • What are the chances of seeing one, two, and three “Noriegas” within a document? • Traditional assumptions • Poisson mixture, 2-Poisson model • Independence of words • The first occurrence depends on DF, but the second does not! • The adaptive language model (used in SR) • The degree of adaptation depends on lexical content and is independent of frequency. “Word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph” -- Church & Gale

  28. Count in the adaptations • Church’s formulas • Cache model: Pr(w) = λ·Pr_local(w) + (1-λ)·Pr_global(w) • History/test division; positive and negative adaptations: Pr(+adapt) = Pr(w in test | w in history), Pr(-adapt) = Pr(w in test | w not in history); observation: Pr(+adapt) >> Pr(prior) > Pr(-adapt) • Generalized DF: df_j = number of documents with j or more instances of w.
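
A hedged sketch of measuring adaptation by splitting each document into a history half and a test half, in the spirit of Church's setup; the toy documents and the half/half split are assumptions for illustration:

```python
docs = [
    "noriega said noriega will appeal".split(),
    "markets fell as traders said inquiry widens".split(),
    "noriega trial opens".split(),
    "weather was mild and traders stayed home".split(),
]
word = "noriega"

in_hist = both = test_not_hist = 0
for d in docs:
    history, test = d[: len(d) // 2], d[len(d) // 2:]
    h, t = word in history, word in test
    in_hist += h
    both += h and t
    test_not_hist += (not h) and t

p_pos_adapt = both / in_hist if in_hist else 0.0                          # Pr(w in test | w in history)
p_neg_adapt = test_not_hist / (len(docs) - in_hist) if in_hist < len(docs) else 0.0
p_prior = sum(word in d for d in docs) / len(docs)                        # rough prior (DF/N)

# On real corpora Church observes Pr(+adapt) >> Pr(prior) > Pr(-adapt) for bursty words.
print(p_pos_adapt, p_prior, p_neg_adapt)
```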

  29. Experimental results – 1 • High adaptation words (based on Pr(+adapt2)) • a 14218 13306 • and 14248 13196 • ap 15694 14567 • i 12178 11691 • in 14533 13604 • of 14648 13635 • the 15183 14665 • to 14099 13368 ----------------------------------------- • agusta 18 17 • akchurin 14 14 • amex 20 20 • apnewsalert 137 131 • barghouti 11 11 • berisha 18 17

  30. Experimental results – 2 • Low adaptation words • asia 9560 489 • brit 12632 18 • ct 15694 7 • eds 5631 11 • english 15694 529 • est 15694 72 • euro 12660 261 • lang 15694 24 • ny 15694 370 ---------------------------------------------- • accuses 177 3 • angered 155 2 • attract 109 2 • carpenter 117 2 • confirm 179 3 • confirmation 114 2 • considers 102 2 • There are many more low-adaptation words than high-adaptation ones

  31. Experimental results – 3 • Words with low frequency and high burstiness (many) alianza, andorra, atl, awadallah, ayhan, bertelsmann, bhutto, bliss, boesak, bougainville, castel, chess, chiquita, cleopa, coolio, creatine, damas, demobilization • Words with high frequency and high burstiness (few) a, and, as, at, by, for, from, has, he, his, in, iraq, is, it, of, on, reg, said, that, the, to, u, was, were, with

  32. Experimental results – 4 • Words with low frequency and low burstiness (lots) accelerating, aga, aida, ales, annie, ashton, auspices, banditry, beg, beveridge, birgit, bombardments, bothered, br, breached, brisk, broadened, brunet, carrefour, catching, chant, combed, communicate, compel, concede, constituency, corpses, cushioned, defensively, deplore, desolate, dianne, dismisses • Words with high frequency and low burstiness (few) adc, afri, ams, apw, asiangames, byline, edits, engl, filter, indi, mest, miscellaneous, ndld, nw, prompted, psov, rdld, recasts, stld, thld, ws, wstm

  33. Detection of bursty words from a stream of documents • Idea: Find features that occur with high intensity over a limited period of time • Method: infinite-state automaton. Bursts appear as state transitions -- Kleinberg, Bursty and Hierarchical Structure in Streams. Proc. 8th ACM SIGKDD, 2002

  34. Detecting Bursty Words • Term w occurs in a sequence of text at positions u1, u2, … → events happen with positive time gaps x1, x2, …, where x1 = u2 - u1, x2 = u3 - u2, etc. • Assume the events are emitted by a probabilistic infinite-state automaton, each state associated with an exponential density function f(x) = a·e^(-ax), where a is the “rate” parameter (the expected gap is 1/a)

  35. Finding the state transitions (from J. Kleinberg, Bursty and Hierarchical Structure in Streams, 8th ACM SIGKDD, 2002) • Optimal sequence: fewer state transitions while keeping the rates in close agreement with the observed gaps.
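
A minimal two-state sketch of the idea: a Viterbi-style dynamic program over the gaps that trades goodness of fit of the exponential rates against a cost for switching states. The rate values, switch cost, and gap data are illustrative assumptions, not Kleinberg's exact parameterization (which uses a geometric ladder of rates and a transition cost that scales with the number of gaps):

```python
import math

# Gaps between successive occurrences of a term; small gaps mark a bursty stretch.
gaps = [5.0, 6.0, 5.5, 0.5, 0.4, 0.6, 0.5, 5.0, 6.5]

mean_gap = sum(gaps) / len(gaps)
rates = [1.0 / mean_gap, 4.0 / mean_gap]   # state 0: baseline rate, state 1: burst rate
switch_cost = 1.0                          # penalty for each state transition

def neg_log_exp(rate, x):
    # -log of the exponential gap density f(x) = rate * exp(-rate * x)
    return -math.log(rate) + rate * x

# Viterbi-style DP: cost[s] = best cost of a state sequence over the gaps seen so
# far that ends in state s; back[i][s] = best previous state for that choice.
cost = [neg_log_exp(rates[s], gaps[0]) for s in range(2)]
back = []
for x in gaps[1:]:
    step_cost, step_back = [], []
    for s in range(2):
        moves = [cost[p] + (switch_cost if p != s else 0.0) for p in range(2)]
        best_prev = 0 if moves[0] <= moves[1] else 1
        step_cost.append(moves[best_prev] + neg_log_exp(rates[s], x))
        step_back.append(best_prev)
    back.append(step_back)
    cost = step_cost

# Trace back; runs of state 1 in the path are the detected bursts.
state = 0 if cost[0] <= cost[1] else 1
path = [state]
for step_back in reversed(back):
    state = step_back[state]
    path.append(state)
path.reverse()
print(path)   # expected: [0, 0, 0, 1, 1, 1, 1, 0, 0]
```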

  36. Sample Results • From database conferences (SIGMOD, VLDB, 1975–2001): • data, base, application: 1975–1979/1981/1982 • relational: 1975–1989 • schema: 1975–1980 • distributed: 1977–1985 • statistical: 1981–1984 • transaction: 1987–1992 • object-oriented: 1987–1994 • parallel: 1989–1996 • mining: 1995– • web: 1998– • xml: 1999–

  37. Sample Results • From AI conferences (AAAI, IJCAI, 1980–2001): • an: 1980–1982 • language: 1980–1983 • image: 1980–1987 • prolog: 1983–1987 • reasoning: 1987–1988 • decision: 1992–1997 • agents: 1998– • agent: 1994– • mobile: 1996– • web: 1996– • bayesian: 1996–1998 • auctions: 1998– • reinforcement: 1998–

  38. THE END Discussion?
