Netflix and Beyond: Tuning Solr for great results. Walter Underwood http://wunderwood.org/most_casual_observer/
Typical Web Query Mix • informational • navigational (known-site) • transactional (known-item) (Andrei Broder, AltaVista, 2002)
Top Queries October 2006 • finding neverland • bridget jones • closer • the incredibles • incredibles • ladder 49 • fat albert • being julia • ray • national treasure • alfie • spanglish • star wars • meet the fockers • final cut • hotel rwanda • neverland • after the sunset • million dollar baby • hitch
Netflix Queries • 92% movie titles • 5% genres and categories • 3% people • Known-item queries make up 95% of Netflix traffic.
Problematic User Behavior • One or two words? • Partial words • Misspellings
Partial Words • People don’t like to make mistakes: • rat, rata, ratat • apoc • koyaanisq • Phonetic encoding (soundex) assumes complete words
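To illustrate why phonetic encoding struggles with partial words, here is a minimal sketch of American Soundex in plain Python. The function name `soundex` and the sample inputs are ours, chosen from the partial queries on the slide; this is not the encoder Netflix or Solr used.

```python
def soundex(word):
    """American Soundex: first letter plus three digits (illustrative sketch)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if digit and digit != prev:
            result += digit
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if ch not in "hw":
            prev = digit
    return (result + "000")[:4]


for query in ("rat", "ratat", "ratatouille", "apoc", "apocalypto"):
    print(query, soundex(query))
# rat R300
# ratat R330
# ratatouille R334
# apoc A120
# apocalypto A124
```

Each partial query gets a different code from the finished title, so an exact match on the phonetic code misses the movie until the word is complete.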
Autocomplete Finishes Words • Load movie titles and popular people • 10% improvement in search quality (MRR) • 10X as much traffic as search queries • Dedicated Solr with RAMDirectory • Front-end HTTP cache, 1 hour lifetime, 80% hit rate
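The idea of finishing a partial word against a list of titles can be sketched as a prefix lookup over a sorted list. This is a plain-Python illustration, not the dedicated Solr setup described on the slide; the names `build_index` and `complete`, and the popularity numbers, are hypothetical.

```python
from bisect import bisect_left


def build_index(titles_with_popularity):
    """Return a sorted list of (lowercased title, popularity, original title)."""
    return sorted((title.lower(), pop, title)
                  for title, pop in titles_with_popularity)


def complete(index, prefix, limit=5):
    """Return up to `limit` titles starting with `prefix`, most popular first."""
    prefix = prefix.lower()
    # A 1-tuple sorts before any longer tuple with the same first element,
    # so this finds the first entry whose key is >= prefix.
    start = bisect_left(index, (prefix,))
    matches = []
    for key, pop, title in index[start:]:
        if not key.startswith(prefix):
            break
        matches.append((pop, title))
    matches.sort(key=lambda m: -m[0])
    return [title for _, title in matches[:limit]]


# Hypothetical popularity scores, for illustration only.
index = build_index([
    ("Ratatouille", 90),
    ("Rat Race", 40),
    ("Apocalypto", 60),
    ("Apocalypse Now", 80),
    ("Koyaanisqatsi", 10),
])

print(complete(index, "rat"))        # ['Ratatouille', 'Rat Race']
print(complete(index, "apoc"))       # ['Apocalypse Now', 'Apocalypto']
print(complete(index, "koyaanisq"))  # ['Koyaanisqatsi']
```

Because the same few prefixes are typed over and over, responses like these cache well, which is consistent with the front-end cache hit rate quoted on the slide.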
Some Misspellings • shakespear • the incredables • seven samarai • breakfast at tiffiney • blazing sadles • selen • scorupko • taeku • christopherwalkin • return to lonsom dove • teh matrix • comdytv • pirhana • dungens and dragons • pufiyami • al pachino • incredables • gundan seed mobile suit • chatterluy • white fany • to the rsecue • meet the faulkers • brigettejoes diary • oh brother where are thou? • pirartes of the carr
Switch from Phonetic to Fuzzy • Tested a dozen algorithms with users • 250K queries per test cell • Jaro-Winkler slightly better than Levenshtein • Jaro-Winkler at 0.7 is a very, very broad match • “koyaanisqatsi” matches “koy” (yuck!) • but “1048” matches “1408”
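The two examples on the slide can be checked with a textbook Jaro-Winkler similarity. The sketch below is a plain-Python version with the usual defaults (prefix scale 0.1, prefix capped at 4 characters); it is an assumption that these match what was tested, and Lucene's own implementation may differ in detail.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("koy", "koyaanisqatsi"), 3))  # 0.821
print(round(jaro_winkler("1048", "1408"), 3))          # 0.925
```

Both pairs score above 0.7. The short prefix "koy" clears the bar because all of its characters match and the shared prefix adds a boost, which is the overly broad behavior the slide complains about; the transposed digits in "1048" score higher still.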
Problematic Corpus Behavior • Missing movies • Ollie Hopnoodle’s Haven of Bliss • CJ7 • Hard-to-spell names • Ratatouille • Coraline • Inglourious Basterds • Hard-to-remember names • Click • Apocalypto • Seven Up Plus Seven
Metrics: MRR • Mean Reciprocal Rank • Weighted clickthrough, measured on site traffic • #1 is a full click • #2 is a half click • #3 is one third click • etc. • Daily, weekly, and seasonal variations • Overall customer satisfaction • Good for A/B tests, weak for finding bugs
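The weighting above (a click on result #1 counts 1, #2 counts 1/2, #3 counts 1/3) can be sketched as follows. The input format is hypothetical, and counting a search with no click as zero is our assumption; the slide does not say how abandoned searches were handled.

```python
def mean_reciprocal_rank(clicked_ranks):
    """Average of 1/rank over all searches.

    `clicked_ranks` holds the 1-based rank of the clicked result for each
    search, or None when nothing was clicked (assumed to score 0).
    """
    if not clicked_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in clicked_ranks if rank is not None)
    return total / len(clicked_ranks)


# Four searches: clicks on #1, #2, and #3, plus one with no click.
print(round(mean_reciprocal_rank([1, 2, 3, None]), 3))  # 0.458
```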
Per-query Metrics • Useful for finding problems • MRR • Clickthrough percent • Most-clicked rank (#1 is good) • Percentage of clicks on most-clicked • known-item queries are over 80% • categories are under 50%
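A sketch of how these per-query numbers could be computed from a click log. The log format, a list of (query, clicked rank or None) pairs, and the sample data are hypothetical; searches without a click are assumed to count toward the denominator.

```python
from collections import Counter, defaultdict


def per_query_metrics(log):
    """Compute MRR, clickthrough, and click concentration for each query."""
    by_query = defaultdict(list)
    for query, rank in log:
        by_query[query].append(rank)

    report = {}
    for query, ranks in by_query.items():
        clicks = [r for r in ranks if r is not None]
        counts = Counter(clicks)
        top_rank, top_count = counts.most_common(1)[0] if counts else (None, 0)
        report[query] = {
            "searches": len(ranks),
            "mrr": sum(1.0 / r for r in clicks) / len(ranks),
            "clickthrough": len(clicks) / len(ranks),
            "most_clicked_rank": top_rank,
            "share_on_most_clicked": top_count / len(clicks) if clicks else 0.0,
        }
    return report


# Made-up log: a known-item query and a category query.
log = (
    [("ratatouille", 1)] * 4 + [("ratatouille", 2)]
    + [("comedy", 1), ("comedy", 3), ("comedy", 5), ("comedy", None)]
)

for query, metrics in per_query_metrics(log).items():
    print(query, metrics)
```

In this made-up log the known-item query puts 80% of its clicks on one rank, while the category query spreads them out, which is the pattern the slide uses to tell the two query types apart.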
Success Looks Like … • MRR consistently over 0.5 • 85% of clicks on #1