Cmput 650 Final Project

pkeeton

Presentation Transcript


  1. Probabilistic Spelling Correction for Search Queries Cmput 650 Final Project

  2. Overview • Motivation • Problem Statement • Noisy Channel Model • EM Background • EM for Spelling Correction

  3. Search Query Spelling Correction • Motivation • Over 700 M search queries made every day • 10% misspelled • Problems • Queries are often not found in a dictionary • Many possible candidate corrections for any given misspelled query

  4. Possible Approaches • Naïve Method • search a dictionary for the closest match, using Levenshtein edit distance • return the closest match • Better method • search a dictionary for the closest matches • use Levenshtein edit distance and word unigram probability to select the best match

  5. Noisy Channel Model • Basic Noisy Channel Model • Given v, find best w • argmax_n P(w_n|v) = argmax_n P(v|w_n) * P(w_n) • error model: P(v|w); language model: P(w) • Why not just use Levenshtein Distance? • e.g. britny -> briny vs. britney • Further Improvement • Use probabilistic edit distance (error model) and N-gram probability (language model)
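The ranking rule on this slide can be sketched in a few lines of Python. The probability tables below are made-up toy values chosen only for illustration (they are not from the presentation), and `best_correction` is a hypothetical helper name.

```python
# Toy models with invented numbers, for illustration only.
language_model = {"briny": 1e-7, "britney": 1e-4}  # P(w)
error_model = {                                    # P(v|w)
    ("britny", "briny"): 0.01,
    ("britny", "britney"): 0.008,
}

def best_correction(v, candidates):
    """Return the candidate w that maximizes P(v|w) * P(w)."""
    return max(candidates, key=lambda w: error_model[(v, w)] * language_model[w])

print(best_correction("britny", ["briny", "britney"]))
```

With these toy numbers the language model outweighs the slightly higher error-model score of "briny", so "britney" is selected.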

  6. Error Model P(v|w) • Standard (Levenshtein) Edit Distance • algorithm; ins, del, sub costs; example
      n = length(target)
      m = length(source)
      d[0,0] = 0
      for i = 1 to n: d[i,0] = d[i-1,0] + ins-cost(target_i)
      for j = 1 to m: d[0,j] = d[0,j-1] + del-cost(source_j)
      for i = 1 to n
        for j = 1 to m
          d[i,j] = MIN(d[i-1,j] + ins-cost(target_i),
                       d[i-1,j-1] + sub-cost(source_j, target_i),
                       d[i,j-1] + del-cost(source_j))
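A minimal runnable version of the recurrence, assuming uniform unit costs for insertion, deletion, and substitution (the slide allows per-letter costs; the function and parameter names here are illustrative):

```python
def levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance from source to target with uniform operation costs."""
    n, m = len(target), len(source)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: build the target from an empty source, or delete the whole source.
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + ins_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[j - 1] == target[i - 1]
            d[i][j] = min(
                d[i - 1][j] + ins_cost,                       # insert target[i-1]
                d[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
                d[i][j - 1] + del_cost,                       # delete source[j-1]
            )
    return d[n][m]

print(levenshtein("britny", "briny"), levenshtein("britny", "britney"))
```

Both candidates for "britny" are one edit away, which illustrates why plain Levenshtein distance cannot choose between them.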

  7. Better Error Model P(v|w) • Probabilistic Edit Distance • Edit cost is determined by the probability of the edit • Different probability/cost for each edit pair • e.g. P(e->i) > P(e->z) • How do we relate edit distance (lower is “better”) and probability (higher is “better”)? • d(v,w) = -log(P(v|w))
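The negative-log mapping can be illustrated as follows. The edit probabilities are invented for the example; only the relation d = -log(P) comes from the slide.

```python
import math

# Invented edit probabilities, for illustration only.
edit_prob = {("e", "i"): 0.05, ("e", "z"): 0.001}

def edit_cost(src, dst):
    """Cost of the edit src -> dst: a likelier edit gets a lower cost."""
    return -math.log(edit_prob[(src, dst)])

print(edit_cost("e", "i"), edit_cost("e", "z"))
```

Because log turns products into sums, adding the costs of independent edits corresponds to multiplying their probabilities, so the usual dynamic program over summed costs still applies.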

  8. What we want • Error Model (unknown) • P(v|w) • Language Model (known) • P(w) = c(w) / Σ_w' c(w'), where c(w) is the count of w • Use query logs and the language model to determine the error model
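The unigram estimate P(w) = c(w) / Σ c(w') is a relative frequency. A small sketch, assuming the counts come from a list of observed words (the sample list is made up):

```python
from collections import Counter

def unigram_model(words):
    """Relative-frequency estimate: P(w) = c(w) / sum of all counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

model = unigram_model(["robert", "robert", "robert", "qbert"])
print(model)
```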

  9. Probabilistic Edit Distance • Determining the probabilistic edit model • Expectation Maximization • For each query v • Determine the most likely “corrections” using the existing edit distance model and language model • for each word w_n within edit distance x of v • candidates = argmax_n P(v|w_n)P(w_n) • one candidate may be the word itself • Update the edit distance model • What is EM?

  10. Clustering and EM • Hard Clustering (K-means)

  11. Hard and Soft Clustering • Soft Clustering (EM)

  12. Expectation Maximization • E-Step • Assign each data point to each cluster in proportion to how well it fits the cluster • M-Step • Update the cluster centers to reflect the weighted assignments of the points
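One soft-clustering iteration might look like the sketch below. It assumes one-dimensional points, equally weighted Gaussian clusters, and a fixed shared width `sigma`; those modelling choices are assumptions for the example and are not stated on the slides.

```python
import math

def em_step(points, centers, sigma=1.0):
    """One EM iteration for 1-D soft clustering with fixed-width Gaussians."""
    # E-step: responsibility of each cluster for each point, normalized per point.
    # (Toy data only: a point far from every center could underflow to zero weight.)
    resp = []
    for x in points:
        weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
        total = sum(weights)
        resp.append([w / total for w in weights])
    # M-step: each center becomes the responsibility-weighted mean of the points.
    new_centers = []
    for k in range(len(centers)):
        mass = sum(r[k] for r in resp)
        new_centers.append(sum(r[k] * x for r, x in zip(resp, points)) / mass)
    return resp, new_centers

resp, centers = em_step([0.0, 1.0, 9.0, 10.0], [2.0, 8.0])
print(centers)
```

Unlike K-means, every point contributes to every center, just with a weight that shrinks rapidly with distance.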

  13. EM for Spelling Correction • For a given query v • Find all candidate words w within edit distance x • E-Step • For each candidate word w • E[z_vw] = P(w|v) = P(v|w)P(w) / Σ_w' P(v|w')P(w') • P(v|w) = Π P(ec_ij) • P(ec_ij) is the probability of the edit [letter i -> letter j]
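The E-step quantities can be sketched as two small functions. The edit sequences and probability values are invented for the example, and the function names are illustrative.

```python
import math

def sequence_likelihood(edits, edit_prob):
    """P(v|w) as the product of the probabilities of the individual edits."""
    return math.prod(edit_prob[e] for e in edits)

def posteriors(likelihood, prior):
    """P(w|v) for each candidate w, given likelihood[w] = P(v|w) and prior[w] = P(w)."""
    joint = {w: likelihood[w] * prior[w] for w in likelihood}
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

# Invented numbers: the query matches "qbert" exactly but "robert" is far more frequent.
post = posteriors({"robert": 0.01, "qbert": 0.9}, {"robert": 0.75, "qbert": 0.25})
print(post)
```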

  14. EM for Spelling Correction • M-Step • Given P(v,w) = P(e_1...e_n|w)P(w) • each e_i is a single ins, del, or sub of two letters • want to adjust P(e_1) ... P(e_n) accordingly • f(e_i) += P(w) • P(e_i) = f(e_i) / N • N is the total number of edit operations for that letter • D(e_i) = -log(P(e_i))
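A hedged sketch of the update: accumulate a weight for every edit in every candidate's edit sequence, normalize per source letter, and convert to distances. The slide increments by P(w); this sketch takes the weight as an input so that either P(w) or the E-step posterior P(w|v) can be supplied. The data layout and the name `m_step` are assumptions made for the example.

```python
import math
from collections import defaultdict

def m_step(weighted_edit_sequences):
    """Re-estimate edit probabilities from (weight, [(src, dst), ...]) pairs."""
    f = defaultdict(float)       # weighted count of each edit src -> dst
    totals = defaultdict(float)  # weighted count of all edits applied to src
    for weight, edits in weighted_edit_sequences:
        for src, dst in edits:
            f[(src, dst)] += weight
            totals[src] += weight
    prob = {e: f[e] / totals[e[0]] for e in f}          # P(e) = f(e) / N
    dist = {e: -math.log(p) for e, p in prob.items()}   # D(e) = -log(P(e))
    return prob, dist

prob, dist = m_step([(0.75, [("e", "i")]), (0.25, [("e", "z")])])
print(prob, dist)
```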

  15. M-Step • E and M-Step working together • [Diagram: the E-Step produces edit sequences, P(ES|D); D = -log(P(l1,l2))]

  16. Results • Example • Robert is a frequent search term, Qbert is not. • Atari makes a comeback...

  17. Revenge of Qbert
