Cmput 650 Final Project

Probabilistic Spelling Correction for Search Queries Cmput 650 Final Project

Overview • Motivation • Problem Statement • Noisy Channel Model • EM Background • EM for Spelling Correction

Search Query Spelling Correction • Motivation • Over 700 M search queries made every day • 10% misspelled • Problems • Queries are often not found in a dictionary • Many possible candidate corrections for any given misspelled query

Possible Approaches • Naïve Method • search a dictionary for the closest match, using levenshtein edit distance • return closest match • Better method • search a dictionary for closest matches • use levenshtein edit distance and word unigram probability to select best match

Noisy Channel Model • Basic Noisy Channel Model • Given v, find best w • argmax n P(wn) = argmax n P(v|wn) * P(wn) • error model: P(v|w); language model P(w) • Why not just use Levenshtein Distance? • eg. britny -> briny vs britney • Further Improvement • Use probabalistic edit distance (error model) and N-gram probability (language model)

Error Model P(v|w) • Standard (Levenshtein) Edit Distance • algorithm, ins,del,sub costs, example n = length (target) m = length(source) for i = 0 to n for j = 0 to m d[i,j] = MIN(d[i-1,j] + ins-cost(targeti), d[i-1,j-1] + sub-cost(sourcej, targeti), d[i,j-1] + del-cost(sourcej) )

Better Error Model P(v|w) • Probabilistic Edit Distance • ED proportional to probability of the edit • Different probability/cost for each edit pair • eg. P(e->i) > P(e->z) • How do we relate edit distance (lower is “better”) and probability (higher is “better”)? • d(v,w) = -log(P(v|w))

What we want • Error Model (Unknown) • P(v|w) • P(w): Language Model (known) • P(w) = c(w) / Σwc(w) • Use query logs and the language model to determine the error model

Probabilistic Edit Distance • Determining the probabilistic edit model • Expectation Maximization • For each query v • Determine the most likely “corrections” using the existing edit distance model and language model • for each word within ED(x) • candidates = args max n P(v|wn)P(wn) • one candidate may be the word itself • Update the edit distance model • What is EM?

Clustering and EM • Hard Clustering (K-means)

Hard and Soft Clustering • Soft Clustering (EM)

Expectation Maximization • E-Step • Assign each data point to each cluster in proportion to how well it fits the cluster • M-Step • Update the cluster centers to reflect the addition of the point

EM for Spelling Correction • For a given query v • Find all candidate words w within ED(x); • E-Step • For each candidate word • E[zvw] =P(w|v)= P(v|w)P(w)/ Σw P(v|w)P(w) • P(v|w) = ΠP(ecij) • P(ecij) is the Probability of edit [letter i-> letter j]

EM for Spelling Correction • M-Step • Given P(v) = P(e1...en|w)P(w) • each ei is a single ins, del, or sub of two letters • want to adjust P(e1).. P(e2) accordingly • f(ei) += P(w) • P(ei) += f(ei) / N • N total number of edit operations for that letter • D(ei) = -log(P(ei))

M-Step • E and M-Step working together E-Step Edit Sequences, P(ES|D) D = -log(P(l1,l2))

Results • Example • Robert is a frequent search term, Qbert is not. • Atari makes a comeback...

Revenge of Qbert

Cmput 650 Final Project

Cmput 650 Final Project

Presentation Transcript

Final Project

MIS 650 Project Seminar

Final Project

Final Project

Final Project

Final Project

Final project

Final Project

Final Project

Final Project

Final Project

Final Project

FINAL PROJECT

FINAL PROJECT

Final Project

Final project

Final Project

Final Project