Explore how Part-of-Speech Tagging enhances text ranking and aids in retrieving relevant documents using advanced algorithms such as Hidden Markov Model and Viterbi Algorithm.
Recall General Framework
• Crawl
  • Use XML structure
  • Follow links to get new pages
• Retrieve relevant documents
• Today
  • Rank
    • PageRank, HITS
  • Rank Aggregation
Relevant documents
• Usually: relevant with respect to a keyword, a set of keywords, a logical expression..
• Closely related to ranking
  • "How relevant" a document is can be considered another measure
  • Usually done as a separate step
• Recall the online vs. offline issue..
  • But some techniques are reusable
Defining Relevant Documents
• Common strategy: treat text documents as a "bag of words" (BOW)
  • Denote BOW(D) for a document D
  • Bag rather than set (i.e. multiplicity is kept)
• Words are typically stemmed
  • Reduced to root form
  • Loses structure, but simplifies life
• Simple definition:
  • A document D is relevant to a keyword W if W is in BOW(D)
Cont.
• Simple variant
  • The level of relevance of D to W is the multiplicity of W in BOW(D)
  • Problem: bias towards long documents
  • So divide by the document length |BOW(D)|
  • This is called term frequency (TF)
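A minimal Python sketch (not part of the original slides) of the bag-of-words and term-frequency definitions above; tokenization is naive whitespace splitting and stemming is omitted for brevity.

```python
# Sketch of BOW(D) and TF(W, D) as defined on the slides; names are illustrative.
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """BOW(D): a multiset of lower-cased tokens (no stemming in this sketch)."""
    return Counter(document.lower().split())

def term_frequency(word: str, document: str) -> float:
    """TF(W, D): multiplicity of W in BOW(D), divided by the document length |BOW(D)|."""
    bow = bag_of_words(document)
    total = sum(bow.values())
    return bow[word.lower()] / total if total else 0.0

print(term_frequency("beatles", "The Beatles played Liverpool and the crowd loved the Beatles"))
```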
A different angle
• Given a document D, what are the "most important" words in D?
• Clearly high term frequency should be considered
• Rank terms according to TF?
Ranking according to TF
  A          2022
  Is         1023
  He          350
  ...
  Liverpool    25
  Beatles      12
IDF
• Observation: if W is rare in the document set, but appears many times in a document D, then W is "important" for D
• IDF(W) = log(|Docs| / |Docs'|)
  • Docs is the set of all documents in the corpus; Docs' is the subset of documents that contain W
• TFIDF(D,W) = TF(W,D) * IDF(W)
  • "Correlation" of D and W
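A minimal sketch (not from the slides) of the IDF and TFIDF formulas above, reusing the hypothetical `bag_of_words` and `term_frequency` helpers from the previous sketch.

```python
# Sketch of IDF(W) and TFIDF(D, W) as defined on the slide.
import math

def inverse_document_frequency(word: str, docs: list[str]) -> float:
    """IDF(W) = log(|Docs| / |Docs'|), where Docs' are the documents containing W."""
    containing = sum(1 for d in docs if word.lower() in bag_of_words(d))
    if containing == 0:
        return 0.0  # convention for a word that appears in no document
    return math.log(len(docs) / containing)

def tfidf(word: str, document: str, docs: list[str]) -> float:
    """TFIDF(D, W) = TF(W, D) * IDF(W)."""
    return term_frequency(word, document) * inverse_document_frequency(word, docs)
```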
Inverted Index
• For every term we keep a list of all documents in which it appears
• The list is sorted by TFIDF scores
• Scores are also kept
• Given a keyword it is then easy to give the top-k
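A minimal sketch (not from the slides) of such an inverted index, keyed by term and with each posting list sorted by the TFIDF score computed in the previous sketch; the scores are recomputed naively here for clarity rather than efficiency.

```python
# Sketch of an inverted index: term -> list of (TFIDF score, doc_id), best first.
from collections import defaultdict

def build_inverted_index(docs: list[str]) -> dict[str, list[tuple[float, int]]]:
    index: dict[str, list[tuple[float, int]]] = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in bag_of_words(doc):
            index[term].append((tfidf(term, doc, docs), doc_id))
    for postings in index.values():
        postings.sort(reverse=True)  # highest TFIDF first
    return index

def top_k(index: dict[str, list[tuple[float, int]]], keyword: str, k: int) -> list[tuple[float, int]]:
    # Answering a single-keyword query is just reading the head of one posting list.
    return index.get(keyword.lower(), [])[:k]
```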
Ranking
• Now assume that these documents are web pages
• How do we return the most relevant?
• How do we combine with other rankings? (e.g. PageRank?)
• How do we answer boolean queries?
  • X1 AND (X2 OR X3)
Rank Aggregation
• To combine TFIDF, PageRank..
• To combine TFIDF with respect to different keywords
Part-of-Speech Tagging
• So far we have considered documents only as bags-of-words
  • Computationally efficient, easy to program, BUT
  • We lost structure that may be very important:
    • E.g. perhaps we are interested (more) in documents for which W is often the sentence subject?
• Part-of-speech tagging
  • Useful for ranking
  • For machine translation
  • For word-sense disambiguation
  • …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: a tagged corpus
  • Most famous is the Brown Corpus, with about 1M words
• The goal is to learn a model from the training set, and then perform tagging of untagged text
  • Performance is tested on a test set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: ignores context
  • "Dogs" will always be tagged as a noun…
  • "Can" will always be tagged as a verb
• Still, achieves around 80% correctness on real-life test sets
  • Goes up to as high as 90% when combined with some simple rules
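A minimal sketch (not from the slides) of this baseline tagger; the training corpus is assumed here to be a flat list of (word, tag) pairs.

```python
# Sketch of the "most popular tag per word" baseline.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus: list[tuple[str, str]]) -> dict[str, str]:
    """For each word, remember its most frequent tag in the training set."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_sentence(model: dict[str, str], sentence: list[str], default_tag: str = "NN") -> list[tuple[str, str]]:
    # Unknown words fall back to a default tag (noun) -- one example of a "simple rule".
    return [(w, model.get(w.lower(), default_tag)) for w in sentence]
```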
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic process
• In particular, a Markov Chain whose states correspond to parts of speech
• Transitions are probabilistic
• In each state a word is emitted
  • The output word is again chosen probabilistically, based on the state
HMM
• An HMM is:
  • A set of N states
  • A set of M symbols (words)
  • An N×N matrix of transition probabilities Ptrans
  • A vector of size N of initial state probabilities Pstart
  • An N×M matrix of emission probabilities Pout
• "Hidden" because we see only the outputs, not the sequence of states traversed
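The components listed above can be held in a small container like the sketch below (not from the slides; names are illustrative); the later Viterbi and forward sketches reuse it.

```python
# Sketch of an HMM as plain Python dictionaries.
from dataclasses import dataclass

@dataclass
class HMM:
    states: list[str]                      # the N states (POS tags)
    symbols: list[str]                     # the M symbols (words)
    p_start: dict[str, float]              # Pstart[k]: probability of starting in state k
    p_trans: dict[str, dict[str, float]]   # Ptrans[k][k']: probability of moving k -> k'
    p_out: dict[str, dict[str, float]]     # Pout[k][x]: probability of emitting word x in state k
```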
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence
   • This is tagging
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
• Problem: exponentially many possible sequences!
Viterbi Algorithm
• Dynamic Programming
• Vt,k is the probability of the most probable state sequence
  • generating the first t+1 observations (X0, .., Xt)
  • and terminating at state k
• V0,k = Pstart(k) * Pout(k, X0)
• Vt,k = Pout(k, Xt) * maxk' { Vt-1,k' * Ptrans(k', k) }
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax
  • Combine them to form a sequence
• What about top-k?
Complexity
• O(T * |S|^2)
• Where T is the sequence (= sentence) length and |S| is the number of states (= number of possible tags)
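A minimal sketch (not from the slides) of the Viterbi recurrence and back-pointer bookkeeping described above, using the hypothetical `HMM` container from the earlier sketch; it runs in O(T * |S|^2).

```python
# Sketch of Viterbi: most likely state sequence for an observed word sequence.
def viterbi(hmm: HMM, observations: list[str]) -> tuple[list[str], float]:
    # V[t][k] = probability of the best sequence emitting X0..Xt and ending in state k
    V = [{k: hmm.p_start[k] * hmm.p_out[k].get(observations[0], 0.0) for k in hmm.states}]
    back = [{}]  # back[t][k] = argmax over the previous state k'
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for k in hmm.states:
            prev, score = max(
                ((kp, V[t - 1][kp] * hmm.p_trans[kp][k]) for kp in hmm.states),
                key=lambda pair: pair[1],
            )
            V[t][k] = hmm.p_out[k].get(observations[t], 0.0) * score
            back[t][k] = prev
    # Follow the back-pointers from the best final state to recover the path itself.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]
```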
Computing the probability of a sequence
• Forward probabilities: αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities: βt(k) is the probability of seeing the sequence Xt+1…Xn, given that the Markov process is at state k at time t
Computing the probabilities
Forward algorithm
• α0(k) = Pstart(k) * Pout(k, X0)
• αt(k) = Pout(k, Xt) * Σk' { αt-1(k') * Ptrans(k', k) }
• P(X0, …, Xn) = Σk αn(k)
Backward algorithm
• βt(k) = P(Xt+1 … Xn | state at time t is k)
• βt(k) = Σk' { Ptrans(k, k') * Pout(k', Xt+1) * βt+1(k') }
• βn(k) = 1 for all k
• P(X0, …, Xn) = Σk Pstart(k) * Pout(k, X0) * β0(k)
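A minimal sketch (not from the slides) of the forward recurrence above, again using the hypothetical `HMM` container; it returns the probability of the whole observation sequence.

```python
# Sketch of the forward algorithm: P(X0, ..., Xn) under the model.
def sequence_probability(hmm: HMM, observations: list[str]) -> float:
    # alpha[k] = probability of emitting X0..Xt and being in state k at time t
    alpha = {k: hmm.p_start[k] * hmm.p_out[k].get(observations[0], 0.0) for k in hmm.states}
    for x in observations[1:]:
        alpha = {
            k: hmm.p_out[k].get(x, 0.0)
               * sum(alpha[kp] * hmm.p_trans[kp][k] for kp in hmm.states)
            for k in hmm.states
        }
    return sum(alpha.values())
```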
Learning the HMM probabilities
• Expectation-Maximization (EM) Algorithm
1. Start with initial probabilities
2. Compute Eij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
3. Set the probability of transition from i to j to be Eij / (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Estimating the expectancies
• By sampling
  • Re-run a random execution of the model 100 times
  • Count transitions
• By analysis
  • Use Bayes' rule on the formula for the sequence probability
  • Called the Forward-Backward algorithm
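A rough sketch (not from the slides) of the sampling option in its simplest form: run the model several times, count the transitions taken, and normalize the counts into new transition probabilities. A full EM step would condition on the observed sentence, which the analytic forward-backward computation handles; the emission update would be analogous.

```python
# Sketch of sampling-based estimation of the transition expectancies.
import random
from collections import Counter, defaultdict

def sample_run(hmm: HMM, length: int) -> list[str]:
    """Simulate one random execution of the model and return the visited states."""
    def draw(dist: dict[str, float]) -> str:
        return random.choices(list(dist), weights=list(dist.values()))[0]
    states = [draw(hmm.p_start)]
    for _ in range(length - 1):
        states.append(draw(hmm.p_trans[states[-1]]))
    return states

def estimate_transitions(hmm: HMM, length: int, runs: int = 100) -> dict[str, dict[str, float]]:
    """Eij estimated by counting; new Ptrans[i][j] = Eij / sum_k Eik."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for _ in range(runs):
        states = sample_run(hmm, length)
        for i, j in zip(states, states[1:]):
            counts[i][j] += 1
    return {i: {j: c / sum(cnt.values()) for j, c in cnt.items()} for i, cnt in counts.items()}
```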
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown corpus
  • Trained on one half and tested on the other half
• Compare with the 80-90% of the trivial algorithm
• The hard cases are few, but they are very hard..
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
• Including stemming, POS tagging and much more
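A minimal usage sketch (not from the slides): NLTK's off-the-shelf POS tagger applied to one of the ambiguous example sentences from earlier. It requires installing NLTK and downloading its tokenizer and tagger resources; the exact resource names can vary between NLTK versions.

```python
# Sketch of POS tagging with NLTK; output is a list of (word, tag) pairs
# using Penn Treebank tag names.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))
```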