
Recommending Similar Items at Large Scale



  1. Recommending Similar Items at Large Scale Jay Katukuri Merchandising Team - eBay 07/25/2012

  2. Similar Items Clustering Platform • Introduction • Merchandising Challenges • Similar Item Clustering (SIC) Architecture • Clustering Approach • Features • Method • Cluster Assignment Service • Applications • Replacement/Equivalent items on CVIP – Non-winner • Related/Complementary items on Checkout

  3. Introduction • Grouping of items that are similar to each other is essential for recommendation algorithms. • Two distinct items can be considered similar if important features are similar: • Titles • Attributes • Images. • Similar Item Clustering (SIC) platform creates clusters of items. • These clusters are used for various recommendation systems on the site now.

  4. Similar Recommendations: Before

  5. Similar Recommendations: Before & After

  6. Merchandising Challenges - Motivation for SIC • Non-productized inventory, long tail. • Product coverage exists only for a few categories • The majority of items are ad hoc listings not covered by the catalog taxonomy • Maintaining catalogs is a daunting task for the long tail. • One-of-a-kind inventory; items are short-lived • Unstructured data • Attribute coverage is minimal • Sparsity in the transactional data • Very few purchases for certain kinds of items

  7. Merchandising Challenges - Motivation for SIC • Item-item pairs are supported by even fewer users. • We may not see users buying both a product and its accessories on eBay. • Large data • A much bigger data set, in both users and inventory, than other e-commerce sites. • Scale • Several hundred million listings. • Several million new items every day

  8. Similar Item Recommendations

  9. Similar Item Recommendations

  10. Item Signatures: a possibility? • Example cluster signatures: “apple ipod touch 4g clear film protector screen”; “clarks women shoe pumps classics”

  11. Similar Items: Clustering Architecture
  • Off-line (slow, periodic): Hadoop Cluster Generation → Cluster Dictionary → Item-Cluster Index
  • Run-time (fast): item → Cluster Assignment Service
  • Applications: Merchandising, Navigation, etc.

  12. Cluster Generation

  13. Query-Item Set
  • Pipeline: Click-stream Log → Filter Queries by Demand/Supply → Query Normalization → Query Backend → Query-to-Items Data
  • Use 1 month of user behavior data to collect the initial query set.
  • Filter queries by length and category-specific demand/supply ratios.

  14. Query Selection • Input Data: • Click-stream logs • Method for choosing the queries: • Minimum frequency • Average supply threshold • Min and max token constraint • Morphological constraints • Queries that have only numbers are not allowed: “10 5”
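The query-selection constraints above can be sketched in a few lines. The thresholds (`MIN_FREQUENCY`, the token bounds, the supply floor) are illustrative assumptions; the talk names the constraints but not their values:

```python
import re

# Hypothetical thresholds; the talk does not give exact values.
MIN_FREQUENCY = 50
MIN_TOKENS, MAX_TOKENS = 2, 6

def keep_query(query: str, frequency: int, avg_supply: float,
               min_supply: float = 10.0) -> bool:
    """Apply the slide's query-selection constraints to one search query."""
    tokens = query.split()
    # Minimum frequency constraint.
    if frequency < MIN_FREQUENCY:
        return False
    # Average supply threshold.
    if avg_supply < min_supply:
        return False
    # Min and max token constraint.
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False
    # Morphological constraint: queries made only of numbers, e.g. "10 5".
    if all(re.fullmatch(r"\d+", t) for t in tokens):
        return False
    return True
```

A query like "10 5" fails the morphological check even when it meets the frequency and length constraints.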

  15. K-Means Clustering
  • Pipeline: Query-to-Items Data → Base Cluster Generation → Generate Item Features (Scoring Models) → K-Means Clustering of Base Clusters → Split Clusters
  • Use item title, category, and attributes as features for clustering.
  • Applying k-means to each base cluster separately produces better-quality clusters and makes the process faster.
  • Use cosine distance for item clustering.
  • Cluster size is chosen as a tuning parameter.
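A minimal sketch of k-means over sparse item features with cosine similarity, as described above. Representing each item as a term-weight dict is an assumption; the fixed iteration count and random initialization are simplifications:

```python
import math
import random

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def kmeans(items: list, k: int, iters: int = 10, seed: int = 0):
    """Plain k-means with cosine similarity over sparse feature dicts."""
    rng = random.Random(seed)
    centroids = [dict(c) for c in rng.sample(items, k)]
    assign = [0] * len(items)
    for _ in range(iters):
        # Assignment step: each item goes to its most similar centroid.
        assign = [max(range(k), key=lambda j: cosine(x, centroids[j]))
                  for x in items]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [items[i] for i in range(len(items)) if assign[i] == j]
            if not members:
                continue
            merged = {}
            for m in members:
                for t, w in m.items():
                    merged[t] = merged.get(t, 0.0) + w / len(members)
            centroids[j] = merged
    return assign, centroids
```

In the platform this would run per base cluster rather than over the whole inventory, which is what makes the process tractable.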

  16. Base Cluster Generation • Base Cluster ≡ Query • Find merge candidates based on query term overlap • E.g.: “nike airmax tennis shoes” -> “nike airmax”; “nike airmax tennis shoes” -> “nike shoes” • Score candidates using cosine similarity • Term weight: TF-IDF in the query space (document = query) • TF: query demand • IDF: number of queries • The most similar merge candidate wins • E.g.: “nike airmax tennis shoes” -> “nike airmax” • Merge the corresponding recall sets
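The merge-candidate scoring can be sketched with TF-IDF weights in the query space (TF from query demand, IDF from how many queries contain the term). The helper names and demo data are hypothetical, and the exact TF/IDF formulas are assumptions:

```python
import math
from collections import Counter

def query_vectors(queries: list, demand: dict) -> dict:
    """TF-IDF in the query space: each query is a 'document' of tokens.
    TF = query demand (search frequency); IDF = log(N / #queries with token)."""
    n = len(queries)
    df = Counter(t for q in queries for t in set(q.split()))
    return {q: {t: demand[q] * math.log(n / df[t]) for t in q.split()}
            for q in queries}

def best_merge(query: str, candidates: list, vecs: dict) -> str:
    """Among candidate queries sharing terms, return the most cosine-similar."""
    def cos(a, b):
        num = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0
    return max(candidates, key=lambda c: cos(vecs[query], vecs[c]))
```

With the slide's example, "nike airmax tennis shoes" merges into "nike airmax" rather than "nike shoes" because "airmax" is rarer (higher IDF) than "shoes".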

  17. Base Cluster Merge
  • Reduces the number of base clusters to half.
  • Example (queries merged into one base cluster):
  phrase(hand,made) phrase(king,s) queen quilt
  phrase(hand,made) phrase(pink,s) quilt
  phrase(hand,made) phrase(pre,owned) queen quilt
  phrase(hand,made) queen quilt
  phrase(hand,made) phrase(pre,owned) quilt
  phrase(hand,made) quilt size twin
  phrase(hand,made) quilt silk
  phrase(hand,made) quilt twin
  phrase(hand,made) phrase(patch,work) quilt
  phrase(hand,made) quilt white
  phrase(hand,made) phrase(king,size) quilt
  phrase(hand,made) phrase(yo,yo,s) quilt
  phrase(hand,made) quilt sale
  phrase(hand,made) quilt red
  phrase(hand,made) quilt

  18. Item Features Generation
  • Item Title: 3x clear screen protector film skin for apple ipod touch 4 4g
  • Normalization: 3-x clear screen protector film skin for apple ipod touch 4 4-g
  • Concept Extractor: 3-x color=clear type=‘screen protector’ film skin compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g
  • Expansions: PHRASE(3,x) color=clear type=‘screen protector’ OR(film,films) OR(skin,skins) compatible brand=‘for apple’ compatible product=‘ipod touch’ 4 model=4-g
  • Output: Normalized Item Features
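The normalization step shown above ("3x" -> "3-x", "4g" -> "4-g", "lp-e8" -> "lp-e-8") can be approximated by splitting letter/digit boundaries. This is a sketch of that one rule under the assumption that the hyphenation in the examples marks such boundaries, not the full normalizer:

```python
import re

def normalize_title(title: str) -> list:
    """Sketch of title normalization: lowercase, then insert '-' at every
    letter<->digit boundary, matching the slide's examples."""
    out = []
    for t in title.lower().split():
        # Zero-width lookarounds find letter->digit and digit->letter seams.
        t = re.sub(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", "-", t)
        out.append(t)
    return out
```

Normalizing variant spellings ("4g" vs "4 g") to one canonical token form is what lets items and cluster descriptions match later in the pipeline.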

  19. Item Features: Concept Extraction • Problem: extract concepts from the item title. • Purpose: • Attribute coverage is sparse in many categories. • Extracted concepts can be used as features • Approach: • Fast online service to extract entities from any eBay text (item title, product title, etc.) • Batch capability for use on Hadoop • Restricted to known and important (above a certain threshold) names/values. • Unsupervised model • Uses a statistical approach based on a large amount of data

  20. Examples: Unstructured Item Title → Extracted Structured Data
  • “Women’s black dress size 16 worn once” (Item id: 300494995198, Meta: CSA) → Size=16, Gender=Women, Color=Black, Style=Dress
  • “Gucci medium ivory leather handbag” (Item id: 300477503372, Meta: CSA) → Brand=Gucci, Size=Medium, Color=Ivory, Material=Leather, Style=Handbag
  • “Black Leather Case Cover for Reader Amazon Kindle 3 3G” (Item id: 380361729748, Meta: Computers & Networking) → Brand=Amazon Kindle 3, Model=3G, Type=Leather Case, Color=Black

  21. Dictionary Generation Method
  • Pipeline: Data warehouse → Data Cleansing → Dictionary Generation → Concept Dictionary
  • Intermediate artifacts: co-occurrence matrix of name-values; tf-idf scores of name-values in a category
  • Other dictionaries used: units dictionary, synonym names, famous persons list

  22. Item Features: Concept Extraction • Co-occurrence of concepts is used to approximate the joint probability. • Brand=apple, model=iphone 4 • Use of dictionaries at multiple levels reduces ambiguity when the same value has multiple names. • “apple” is “compatible brand” in the accessories category • “apple” is “brand” in the devices category • ‘hp pavilion’ and ‘hp’ are both valid values for brand; ambiguity is resolved using tf-idf scores of name-value pairs in the particular category. • Regexes were added to extract size patterns in CSA.

  23. Item Features: Term Scores • Problem: given an item title in a leaf category, compute the significance of the terms in the title • While assigning items to clusters, identify which terms in the item title are more important than others • Issues: • Existing scoring models are built as a service • Inefficient to use in batch mode on Hadoop • Unigram models

  24. Mutual Information • Score of a term ‘t’ for a given item ‘i’ is computed using the mutual information of term ‘t’ and category ‘c’. • ‘c’ is the l2 category of item ‘i’. • Item titles from EDW are used as input data. • Scores are computed for the normalized tokens.
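A toy version of the term-scoring idea above: estimate pointwise mutual information of (term, category) pairs from title counts. The estimator and the data layout are assumptions; the talk states only that the mutual information of term 't' and L2 category 'c' is used:

```python
import math
from collections import Counter

def term_category_mi(titles_by_category: dict):
    """Build a PMI scorer for (term, category) pairs from item titles.
    P(t), P(c), and P(t, c) are estimated from token counts per category."""
    term_cat = Counter()
    term = Counter()
    cat = Counter()
    total = 0
    for c, titles in titles_by_category.items():
        for title in titles:
            for t in set(title.split()):  # presence, not raw frequency
                term_cat[(t, c)] += 1
                term[t] += 1
                cat[c] += 1
                total += 1

    def mi(t: str, c: str) -> float:
        """PMI = log( P(t, c) / (P(t) * P(c)) ); 0 if never co-occurring."""
        p_tc = term_cat[(t, c)] / total
        if p_tc == 0:
            return 0.0
        return math.log(p_tc / ((term[t] / total) * (cat[c] / total)))

    return mi
```

A term that appears mostly inside one category (e.g. "quilt" in bedding) scores high there and low elsewhere, which is exactly the significance signal cluster assignment needs.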

  25. K-Means 1/3 • K-means is a well-known clustering algorithm. • Choose k initial cluster centroids: m1(1), …, mk(1) • Assignment Step: assign each item x to its nearest centroid, c(x) = argmin_j || x − m_j ||² • Update Step: set each m_j to the mean of the items currently assigned to cluster j • Optimize: maximize inter-cluster distortion and intra-cluster similarity

  26. K-Means 2/3 • 1. Choose random cluster centroids • 2. Update centroids based on neighborhood • 3. Final clusters • We use a version of k-means called “Bisecting K-means”, which tends to produce better-quality results than standard k-means.

  27. K-Means 3/3 • Pros • Simple to understand and implement. • Easily parallelizable • Generally produces good-quality clusters when K is small. • Cons • Slow to converge when K is large. • Cluster quality degrades with large K. • K must be decided beforehand; finding a suitable K needs domain knowledge and tuning.
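The "Bisecting K-means" variant mentioned on slide 26 can be skeletonized as repeated two-way splits until K clusters exist. Choosing the largest cluster to bisect and taking the splitter as a pluggable function (e.g. k-means with k = 2) are assumptions; the talk names the variant but not these details:

```python
def bisecting_kmeans(items: list, k: int, split):
    """Bisecting k-means skeleton: repeatedly split the largest cluster
    in two until k clusters exist. `split` is any 2-way splitter, e.g.
    standard k-means with k=2 run on just that cluster's items."""
    clusters = [list(items)]
    while len(clusters) < k:
        # Pick the largest remaining cluster to bisect.
        clusters.sort(key=len, reverse=True)
        biggest = clusters.pop(0)
        left, right = split(biggest)
        clusters += [left, right]
    return clusters
```

Because each step only clusters one subset with k = 2, the slow-convergence and quality problems of running flat k-means with a large K are avoided, which matches the pros/cons listed above.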

  28. K-Means Clustering: Cluster Description • Clusters are described using the centroids of the clusters. • Cluster 1: “L1=293 L2=56169 L3=168096 compatible brand = apple compatible product = ipod touch Phrase(4,g) clear film protector screen“ • Cluster 2: “L1=11450 L2=3034 L3=55793 brand = indigo by clarks shoe style = pumps classics” • There are about x million clusters for US. • These x million clusters cover more than 92% of the US inventory.

  29. Shingling for Cluster Merging • Problem: given a set of clusters, find a grouping of similar clusters. • Approach: • Represent each cluster as a “document” • Compute the 5 minimum 3-shingles • If the signatures match at 80% or more, the clusters belong to the same group
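One plausible reading of "5 min 3-shingles" is a min-hash-style signature: hash every 3-token shingle of the cluster's "document" and keep the 5 smallest hashes, then compare signatures at the 80% threshold. This interpretation, the hash choice, and the overlap denominator are all assumptions:

```python
import hashlib

def min_shingles(tokens: list, k: int = 3, keep: int = 5) -> set:
    """Signature of a cluster 'document': the `keep` smallest hashes of
    its k-token shingles (a min-hash-style sketch)."""
    shingles = {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    hashed = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return set(hashed[:keep])

def same_group(sig_a: set, sig_b: set, threshold: float = 0.8) -> bool:
    """Two clusters fall in the same group if >= 80% of their signature
    hashes match (overlap relative to the smaller signature)."""
    if not sig_a or not sig_b:
        return False
    return len(sig_a & sig_b) / min(len(sig_a), len(sig_b)) >= threshold
```

Comparing tiny fixed-size signatures instead of full documents is what makes pairwise cluster comparison feasible at millions of clusters.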

  30. Shingling basics 1/3

  31. Shingling basics 2/3

  32. Shingling basics 3/3

  33. Cluster Assignment

  34. Cluster Assignment
  • Input: item title, attributes, leaf category, site (from Closed View Item / meta-data files)
  • Pipeline: Pre-processing → Assignment Service (inverted index over the Cluster Dictionary, implemented using Lucene) → Rank clusters → Voyager call for top-N clusters → Rank top-N similar items → Recommended similar items

  35. Cluster Assignment: Pre-processing
  • Input title: new 2x for canon lp-e8 battery + charger + lens hood eos 550d 600d digital rebel t3i
  • RTL Normalization: new,2-x,for,canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital,rebel,t-3-i
  • Concept Extraction: new,2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
  • Stop Word Filtering: 2-x,for canon,lp-e-8,battery,charger,lens,hood,eos,550-d,600-d,digital rebel,t-3-i
  • STL Expansion: 2-x,for canon,PHRASE(lp,e,8),OR(batteries,battery,batterys),OR(charger,chargers),lens,OR(hood,PHRASE(hood,s),hoods),eos,PHRASE(550,d),PHRASE(600,d),digital rebel,t-3-i
  • Query Reduction / Unification: 2-x,for canon,phrase(lp,e,8),batteries,charger,lens,phrase(hood,s),eos,phrase(550,d),phrase(600,d),digital rebel,t-3-i

  36. Cluster Assignment: Scoring • Indexing fields: title terms and categories • Reward matching terms and penalize non-matching terms • Reward matching terms: number of terms matching from the input, importance of each term in the input, query-time boost • Penalize non-matching terms from cluster ‘c’ • Index-time boost: field-length normalization
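The reward/penalize scoring above might look like this sketch. The penalty constant, the boost handling, and the square-root normalization are illustrative choices, not Lucene's actual formula or eBay's weights:

```python
def score_cluster(item_terms: set, item_weights: dict,
                  cluster_terms: list, boost: dict = None) -> float:
    """Score one candidate cluster for an item (illustrative weighting).

    Reward: cluster terms found in the item, weighted by each term's
    importance in the input and an optional query-time boost.
    Penalty: a small constant per cluster term missing from the item.
    Normalization: divide by sqrt(cluster length), mimicking a
    field-length norm so verbose cluster descriptions don't dominate.
    """
    boost = boost or {}
    reward = sum(item_weights.get(t, 0.0) * boost.get(t, 1.0)
                 for t in cluster_terms if t in item_terms)
    penalty = 0.1 * sum(1 for t in cluster_terms if t not in item_terms)
    return (reward - penalty) / max(len(cluster_terms), 1) ** 0.5
```

A cluster whose description matches all of an item's important terms scores well above one with no overlap, and extra non-matching terms drag the score negative.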

  37. Cluster Assignment Cross Validation • Compute precision of recommending items from the “correct” cluster(s) • “Correct” clusters are those that generate purchases (BIDs and/or BINs) • Labeled data • View-Buy data generated from user-session analysis • CVIP -> Bid/BIN in the same user session • Same category

  38. Cluster Assignment Cross Validation: Method • For each (viewed item, bought item) pair in the labeled data, compute the top-k (k = 5) cluster lists for both items • Ignoring the position, compute precision in the top k • This ignores true precision, which depends on ranking; assume every item belonging to a cluster is equally likely to be recommended • Normalized precision: normalize by the smallest cluster in the top-k list
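A position-ignoring metric in the spirit of the method above. Treating a labeled pair as a hit when the bought item's cluster appears anywhere in the viewed item's top-k list is one plausible reading of the slide, not a confirmed detail:

```python
def precision_at_k(recommended_clusters: list,
                   purchased_cluster_sets: list, k: int = 5) -> float:
    """Fraction of labeled (view, buy) pairs whose bought item's
    cluster(s) appear in the top-k clusters assigned to the viewed item.
    Position within the top k is deliberately ignored."""
    hits = 0
    for top, bought in zip(recommended_clusters, purchased_cluster_sets):
        if set(top[:k]) & bought:
            hits += 1
    return hits / len(recommended_clusters)
```

Per the slide's caveat, this assumes every item in a matched cluster is equally likely to be recommended; a normalized variant would further weight each hit by cluster size.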

  39. Merchandising Applications

  40. Merchandising Applications • Two kinds of recommendation systems use SIC: • Recommending similar items on the CVIP non-winner page • Collaborative Filtering (CF) algorithms: • “Buy-Buy” – on the Checkout page • “View-Buy” – on AVIP

  41. Similar Item Recommendations • User bid on but lost an item • Show similar items as replacement items. • User was watching an item that has ended • Show similar items as replacement items • User viewed an item but did not make a purchase • Show similar items to showcase more choices. • Inject diversity in the recommendation.

  42. Similar Item Recommendations - Example

  43. Similar Item Recommendations ( contd..)

  44. Collaborative Filtering on SIC – “Buy-Buy” • Once a user has purchased an Item, what else can we recommend to the user to go with his purchase? • Drive incremental purchases • On check-out, recommend other items that “go-together” with the purchased item • E.g. for a cell-phone we may recommend a charger, case, screen protector. • For a dress shirt, we may recommend a tie, a dress shoe or a jacket.

  45. Collaborative Filtering on SIC – “Buy-Buy” • Non-productized item inventory with short lifetime makes any CF based approach difficult. • Map the items to a higher level abstraction (clusters) to handle data sparsity. • Re-use the item clusters generated for Similar Item Recommendation.

  46. Related Recommendations: Before & After Recommendations for Xbox 360 4GB on Checkout page

  47. Conclusion • The SIC platform has proven its utility and is a critical component of merchandising algorithms • Future work • Quality needs to be improved for long-tail categories like Art, Collectibles, etc. • Better distinguish between CVIP loser/browser • End-to-end cross-validation framework

  48. Cluster Assignment : Aspect Demand • Historical (6-7 months) user behavior data • Rank ordered lists of aspects used in • Search Queries • Left Navigation Filters • Combined using rank aggregation • Importance of aspect in category • Used as query time boost during cluster index lookup • Example : • Input : AIR JORDAN RETRO 4 IV MILITARY BLUE 2006 SIZE 9.5 USED • k:air jordan^2.0 k:retro^1.25 k:military^1.25 k:blue^1.2 • Also used in Concept Landing Pages (CLPs) and Popular Watches w/ Aspects
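The boosted query from the example ("k:air jordan^2.0 …") can be assembled directly from an aspect-demand table. The function and data shapes here are hypothetical; only the output format comes from the slide:

```python
def boost_query(title_aspects: list, aspect_demand: dict) -> str:
    """Build the query-time-boosted cluster-index query: each aspect
    extracted from the title gets a boost equal to its demand weight
    in the category. Aspects with no demand entry are dropped."""
    return " ".join(f"k:{aspect}^{aspect_demand[aspect]}"
                    for aspect in title_aspects if aspect in aspect_demand)
```

Reproducing the slide's example with illustrative demand weights:

```python
demand = {"air jordan": 2.0, "retro": 1.25, "military": 1.25, "blue": 1.2}
boost_query(["air jordan", "retro", "military", "blue"], demand)
# -> "k:air jordan^2.0 k:retro^1.25 k:military^1.25 k:blue^1.2"
```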

  49. Ranking • Aspect-demand data based on the input item is used in ranking • E.g.: material=‘leather’ may not be present in the cluster description • Clarks Women Shoes • Format bias based on the seed item’s format

  50. Format Affinity • X% of seed items are auctions for CVIP non-winner • High affinity towards the seed item’s format
