
Spelling Correction for Advertising: How “Noise” Can Help

NISS Workshop on Computational Advertising, November 2009. Silviu Cucerzan, Microsoft Research (Text Mining, Search and Navigation).



  1. NISS Workshop on Computational Advertising, November 2009. Spelling Correction for Advertising: How “Noise” Can Help. Silviu Cucerzan, Microsoft Research (Text Mining, Search and Navigation).

  2. Buying Cheap(er) on eBay. Listings: “Canon 30d” vs. “Cannon 30d”. Not good for the sellers. Not good for most buyers. Not good for the middle man.

  3. Good Ads for Bad Queries. Examples: espresso machines; cingular wireless.

  4. Is a Trusted Dictionary Enough?
     • Search: max payne chats and codes; new humwee pics
     • Music: selin dion color of my love; cristina aquillara
     • Shopping: pansonic dvd reorders; brita water filer
     • Help and Support: printer divers for window vista; insert flash flies into power point
     Intended words shown on the slide: cheats, celine, colour, christina aguilera, panasonic, recorders, filter, drivers, windows, files, powerpoint.

  5. Web Query Logs as Corpora
     • Web Search: on the order of 1 billion queries per day!
     • 10-15% of the queries contain spelling errors
     • highly dynamic domain: many new names and concepts become popular every day (e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx), so it is extremely difficult to maintain a high-coverage lexicon
     • difficult to define what a valid web query is
     Slide callouts: “The problem”, “The solution”.

  6. Problems To Be Handled
     • Context-sensitive correction of out-of-lexicon words: power crd → power cord; video crd → video card
     • Context-sensitive correction of in-lexicon words: chicken sop → chicken soup; sop opera → soap opera
     • Concatenate and split: cheese cake factory → cheesecake factory; chat inspanich → chat in spanish
     • Recognize out-of-lexicon valid words: amd processors → amd processors
     • Change in-lexicon words to out-of-lexicon words: gun dam fighter → gundam fighter

  7. An HMM Architecture for Spelling Correction
     Input query: brita water filer
     States: all alternative spellings from the query log
     • brita: brita, brit, brit., brits, briat, rita
     • water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
     • filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
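The slide gives only the state space, so the following is a minimal sketch of how a best path through such candidate columns could be picked with plain Viterbi decoding. The counts in `UNIGRAM` and `BIGRAM`, the add-one smoothing, and the function names are all assumptions made for illustration; the sketch also omits any emission (error-model) term, which a real speller would need.

```python
import math

# Hypothetical toy counts; a real system would estimate these from a query log.
UNIGRAM = {"brita": 50, "brit": 30, "water": 900, "waiter": 120,
           "filer": 10, "filter": 400, "file": 800}
BIGRAM = {("brita", "water"): 40, ("water", "filter"): 60,
          ("brit", "waiter"): 1, ("waiter", "file"): 1}


def log_transition(prev, word):
    """Add-one smoothed bigram log-probability (an assumed scoring choice)."""
    num = BIGRAM.get((prev, word), 0) + 1
    den = UNIGRAM.get(prev, 0) + len(UNIGRAM)
    return math.log(num / den)


def viterbi(columns):
    """columns[i] lists the candidate spellings for the i-th query word."""
    # best maps each candidate in the current column to (log score, path so far)
    best = {w: (math.log(UNIGRAM.get(w, 0) + 1), [w]) for w in columns[0]}
    for column in columns[1:]:
        nxt = {}
        for w in column:
            score, path = max(
                ((s + log_transition(p, w), p_path)
                 for p, (s, p_path) in best.items()),
                key=lambda t: t[0],
            )
            nxt[w] = (score, path + [w])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


if __name__ == "__main__":
    columns = [["brita", "brit"], ["water", "waiter"], ["filer", "filter", "file"]]
    print(" ".join(viterbi(columns)))
```

With these made-up counts the decoder prefers "brita water filter", because the bigram evidence outweighs the alternatives; different counts would of course give a different path.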

  8. What about terrible misspellings?
     • input: arnol shwartzeggar
     • desired output: arnold schwarzenegger
     Unweighted edit distance: 5
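For reference, unweighted edit distance here is taken to mean standard Levenshtein distance (insertions, deletions, and substitutions each cost 1). The sketch below is a textbook dynamic-programming implementation, not code from the talk; the function name is an assumption.

```python
def levenshtein(a, b):
    """Unweighted edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("shwartzeggar", "schwarzenegger"))
    print(levenshtein("arnol", "arnold"))
```

Under this definition the surname pair alone is at distance 5, and the first name needs one further insertion, which is far beyond the one or two edits that single-pass spellers typically allow per word.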

  9. An Iterative Approach
     Misspelled query: arnol shwartzeggar
     First iteration: arnold schwartzneggar
     Second iteration: arnold schwartzenegger
     Third iteration: arnold schwaxrzenegger
     Fourth iteration: arnold schwarzenegger
     Speller output: no more changes

  10. Some Intuition. Search query log statistics: hunny moon → (iterative spelling correction process) → honeymoon

  11. Basic Assumptions about the “Noise”
     • query logs contain a lot of different misspellings for most words
     • the better spelled a word form, the more frequent it is
     • the correct forms are much more frequent than their misspellings
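These assumptions are what make an iterative scheme plausible: if better-spelled forms are more frequent, each pass can move a word to a more frequent form a small edit distance away, and the process stops when nothing changes. The sketch below illustrates that idea only. The counts in `LOG_COUNTS`, the per-step distance limit of 2, and the greedy word-by-word choice are assumptions for illustration, not the method's actual parameters or search procedure.

```python
# Hypothetical query-log counts chosen to mirror the slide's example chain.
LOG_COUNTS = {
    "arnold": 5000,
    "schwartzneggar": 20,
    "schwartzenegger": 300,
    "schwarzenegger": 4000,
}


def edit_distance(a, b):
    """Standard unweighted Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct_word(word, counts, max_dist=2):
    """Pick the most frequent logged form within max_dist edits of word."""
    best, best_count = word, counts.get(word, 0)
    for cand, count in counts.items():
        if count > best_count and edit_distance(word, cand) <= max_dist:
            best, best_count = cand, count
    return best


def iterate(query, counts, max_dist=2):
    """Re-correct the query until a pass makes no change; return every step."""
    history = [query]
    while True:
        nxt = " ".join(correct_word(w, counts, max_dist)
                       for w in history[-1].split())
        if nxt == history[-1]:
            return history
        history.append(nxt)


if __name__ == "__main__":
    for step in iterate("arnol shwartzeggar", LOG_COUNTS):
        print(step)
```

With these toy counts no single step exceeds two edits per word, yet the query still reaches a form five edits away from the original surname, using intermediate misspellings from the log as stepping stones.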

  12. Another Example

  13. Concatenation and Splitting. Store word unigrams and bigrams in the same searchable trie structure. Find alternative spellings for the input words in this common structure.
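One way to picture the shared structure is a character trie in which a bigram is stored as a single string containing a space, so that a fuzzy lookup treats the space like any other character. The sketch below follows that reading; the class and function names, the sample entries, and the choice of a Levenshtein-bounded trie walk are assumptions, not details confirmed by the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.entry = None  # the stored unigram or bigram ending at this node


def insert(root, entry):
    node = root
    for ch in entry:
        node = node.children.setdefault(ch, TrieNode())
    node.entry = entry


def search(root, text, max_dist):
    """Return (entry, distance) pairs within max_dist edits of text."""
    results = []

    def walk(node, ch, prev_row):
        # One Levenshtein DP row per trie edge; a space is an ordinary character.
        row = [prev_row[0] + 1]
        for j in range(1, len(text) + 1):
            row.append(min(row[j - 1] + 1, prev_row[j] + 1,
                           prev_row[j - 1] + (text[j - 1] != ch)))
        if node.entry is not None and row[-1] <= max_dist:
            results.append((node.entry, row[-1]))
        if min(row) <= max_dist:  # prune branches that cannot get close enough
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(text) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results, key=lambda r: (r[1], r[0]))


if __name__ == "__main__":
    root = TrieNode()
    for e in ["cheesecake", "cheesecake factory", "in spanish", "gundam", "gun", "dam"]:
        insert(root, e)
    print(search(root, "cheese cake", 1))  # concatenation
    print(search(root, "inspanich", 2))    # split plus substitution
```

Because unigrams and bigrams live in one trie, a two-word input such as "cheese cake" can match the single word "cheesecake", and a one-word input such as "inspanich" can match the bigram "in spanish", with the same lookup routine.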

  14. Avoid Changing the User’s Intent
     Query: brita water filer. Alternatives called out on the slide: brit, file, waiter.
     • brita: brita, brit, brit., brits, briat, rita
     • water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
     • filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier

  15. Modified Viterbi Search – Fringes. E.g.: water filer → waiter file (in-lexicon words). Slide labels: k1, k2, k1+k2 paths.

  16. Modified Viterbi Search – Stop words. E.g.: lord of teh rigs → lord of the rings

  17. Evaluation

  18. A Closer Look at the Results
     • 81.8% overall agreement with the annotators
     • annotator inter-agreement rate: 91.3%
     • Errors:
       • alternative queries for valid queries: many false positives are reasonable suggestions, e.g. cowboy robes → cowboy ropes
       • alternative queries for misspelled queries: some suggestions could be valid (user’s intent not known), e.g. massanger → massager / messenger

  19. Evaluation – When we “know” user’s intent (368 queries)
     • (audio flie, audio file) → audio file
     • (bueavista, buena vista) → buena vista
     • (carrabean nooms, carrabean rooms) → caribbean rooms

  20. Learning Curve. Reference: Silviu Cucerzan and Eric Brill, “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004.
