
Catalog Integration



Presentation Transcript


  1. Catalog Integration R. Agrawal, R. Srikant: WWW-10

  2. Catalog Integration Problem • Integrate products from new catalog into master catalog.

  3. The Problem (cont.) • After integration:

  4. Desired Solution • Automatically integrate products: • little or no effort on the part of the user. • domain independent. • Problem size: • millions of products • thousands of categories

  5. Model • Product descriptions consist of words • Products live in the leaf-level categories

  6. Basic Algorithm • Build classification model using product descriptions in master catalog. • Use classification model to predict categories for products in the new catalog.
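The two steps above can be sketched in pure Python with a toy master catalog; the category names, descriptions, function names, and smoothing constant below are illustrative, not from the paper:

```python
from collections import Counter
import math

def train(master):
    """master: dict mapping category -> list of product descriptions."""
    total = sum(len(docs) for docs in master.values())
    prior = {c: len(docs) / total for c, docs in master.items()}
    word_counts = {c: Counter(w for d in docs for w in d.split())
                   for c, docs in master.items()}
    vocab = {w for wc in word_counts.values() for w in wc}
    return prior, word_counts, vocab

def predict(desc, prior, word_counts, vocab, alpha=0.1):
    """Assign a new-catalog product description to the most probable master category."""
    def log_score(c):
        n_c = sum(word_counts[c].values())
        s = math.log(prior[c])
        for w in desc.split():
            # Lidstone-smoothed Pr(w|Ci); Counter returns 0 for unseen words
            s += math.log((word_counts[c][w] + alpha) / (n_c + alpha * len(vocab)))
        return s
    return max(prior, key=log_score)

master = {
    "transistors": ["npn power transistor", "pnp transistor amplifier"],
    "diodes": ["zener diode regulator", "schottky diode rectifier"],
}
prior, wc, vocab = train(master)
print(predict("fast switching diode", prior, wc, vocab))  # → diodes
```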

  7. National Semiconductor Files

  8. National Semiconductor Files with Categories

  9. Accuracy on Pangea Data • B2B Portal for electronic components: • 1200 categories, 40K training documents. • 500 categories with < 5 documents. • Accuracy: • 72% for top choice. • 99.7% for top 5 choices.

  10. Enhanced Algorithm: Intuition • Use affinity information in the catalog to be integrated (new catalog): • Products in same category are similar. • Bias the classifier to incorporate this information. • Accuracy boost depends on quality of new catalog: • Use tuning set to determine amount of bias.

  11. Algorithm • Extension of the Naive Bayes classifier to incorporate affinity information

  12. Naive Bayes Classifier • Pr(Ci|d) = Pr(Ci) Pr(d|Ci) / Pr(d) // Bayes' rule • Pr(d): same for all categories (ignore) • Pr(Ci) = (#docs in Ci) / (#docs total) • Pr(d|Ci) = ∏_{w ∈ d} Pr(w|Ci) • Words occur independently (unigram model) • Pr(w|Ci) = (n(Ci,w) + α) / (n(Ci) + α|V|) • Maximum likelihood estimate smoothed with Lidstone's law of succession
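The smoothed word-probability estimate on this slide can be written directly; the function name, `alpha` value, and toy documents below are illustrative:

```python
from collections import Counter

def pr_w_given_c(word, docs, vocab_size, alpha=0.1):
    """Lidstone-smoothed estimate Pr(w|Ci) = (n(Ci,w) + alpha) / (n(Ci) + alpha*|V|).
    n(Ci,w): occurrences of `word` in category Ci; n(Ci): total word occurrences in Ci."""
    counts = Counter(w for d in docs for w in d.split())
    n_ci = sum(counts.values())
    return (counts[word] + alpha) / (n_ci + alpha * vocab_size)

docs = ["zener diode regulator", "schottky diode rectifier"]
print(pr_w_given_c("diode", docs, vocab_size=10))  # (2 + 0.1) / (6 + 1) = 0.3
```

With alpha > 0 the estimate never hits zero, so unseen words don't wipe out a category's score under the product on the previous line.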

  13. Enhanced Algorithm • Pr(Ci|d,S) // d existed in category S of the new catalog • = Pr(Ci,d,S) / Pr(d,S) • = Pr(Ci) Pr(S,d|Ci) / Pr(d,S) • = Pr(Ci) Pr(S|Ci) Pr(d|Ci) / Pr(S,d) • Assuming d, S independent given Ci • = Pr(S) Pr(Ci|S) Pr(d|Ci) / Pr(S,d) • Since Pr(S|Ci) Pr(Ci) = Pr(Ci|S) Pr(S) • = Pr(Ci|S) Pr(d|Ci) / Pr(d|S) • Since Pr(S,d) = Pr(S) Pr(d|S) • Same as NB except Pr(Ci|S) instead of Pr(Ci) • Ignore Pr(d|S) as it is the same for all classes
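The derivation changes only the prior term, so a sketch of the enhanced scorer differs from basic Naive Bayes in one place; the names, the unseen-word floor of 1e-6, and the toy numbers are illustrative, not from the paper:

```python
import math

def enhanced_predict(desc, source_prior, word_probs):
    """Naive Bayes scoring with Pr(Ci|S) (source_prior) in place of Pr(Ci).
    word_probs[c][w]: smoothed Pr(w|Ci); Pr(d|S) is dropped (same for all Ci)."""
    def log_score(c):
        # small floor for words unseen in category c (illustrative choice)
        return math.log(source_prior[c]) + sum(
            math.log(word_probs[c].get(w, 1e-6)) for w in desc.split())
    return max(source_prior, key=log_score)

# Both categories explain the description equally well, so the affinity
# prior Pr(Ci|S) breaks the tie in favor of A.
word_probs = {"A": {"widget": 0.5}, "B": {"widget": 0.5}}
source_prior = {"A": 0.8, "B": 0.2}
print(enhanced_predict("widget", source_prior, word_probs))  # → A
```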

  14. Computing Pr(Ci|S) • Pr(Ci|S) = |Ci| · (#docs in S predicted to be in Ci)^w / Σ_{j ∈ [1,n]} |Cj| · (#docs in S predicted to be in Cj)^w • |Ci| = #docs in Ci in the master catalog • w determines the weight of the new catalog • Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known • Choose one weight for the entire new catalog, or different weights for different sections
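The formula above is a direct computation; a minimal sketch (names and toy counts are illustrative). Note that with w = 0 every count term becomes 1, so Pr(Ci|S) reduces to the basic algorithm's prior Pr(Ci) = |Ci| / Σ|Cj|:

```python
def pr_ci_given_s(predicted_counts, master_sizes, w):
    """predicted_counts[c]: #docs in S the basic classifier assigned to c.
    master_sizes[c] = |Ci|: #docs of category c in the master catalog.
    w: weight of the new catalog (w = 0 recovers the basic prior Pr(Ci))."""
    scores = {c: master_sizes[c] * (predicted_counts.get(c, 0) ** w)
              for c in master_sizes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

sizes = {"A": 30, "B": 70}    # master-catalog sizes: Pr(A) = 0.3
counts = {"A": 9, "B": 1}     # 9 of 10 docs in S predicted to be in A
print(pr_ci_given_s(counts, sizes, 0)["A"])  # 0.3 (basic prior)
print(pr_ci_given_s(counts, sizes, 1)["A"])  # 270/340 ≈ 0.794
```

Raising w shifts trust toward the new catalog's grouping, which is why the tune set is used to pick it.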

  15. Superiority of the Enhanced Algorithm • Theorem: The highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm. • Catch: The optimum value of the weight for which enhanced achieves highest accuracy is data dependent. • The tune set method attempts to select a good value for weight, but there is no guarantee of success.

  16. Empirical Evaluation • Start with a real catalog M • Remove n products from M to form the new catalog N • In the new catalog N • Assign f*n products to the same category as M • Assign the rest to other categories as per some distribution (but remember their true category) • Accuracy: Fraction of products in N assigned to their true categories
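The evaluation protocol above can be sketched as two helpers; the data structures and the uniform choice of "other" category are illustrative assumptions (the slide only says "as per some distribution"):

```python
import random

def make_new_catalog(products, f, categories, rng):
    """products: list of (description, true_category) removed from the master.
    A fraction f stays in its true category; the rest are placed in a random
    other category, with the true category remembered for scoring."""
    new_catalog = []  # (description, placed_category, true_category)
    for i, (desc, true_c) in enumerate(products):
        if i < f * len(products):
            placed = true_c
        else:
            placed = rng.choice([c for c in categories if c != true_c])
        new_catalog.append((desc, placed, true_c))
    return new_catalog

def accuracy(assignments, new_catalog):
    """Fraction of new-catalog products assigned to their true master category."""
    return sum(a == true_c
               for a, (_, _, true_c) in zip(assignments, new_catalog)) / len(new_catalog)

cats = ["A", "B"]
prods = [("p1", "A"), ("p2", "B"), ("p3", "A"), ("p4", "B")]
nc = make_new_catalog(prods, 0.5, cats, random.Random(0))
print(accuracy(["A", "B", "B", "B"], nc))  # 3 of 4 true categories recovered: 0.75
```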

  17. Improvement in Accuracy (Pangea)

  18. Improvement in Accuracy (Reuters)

  19. Improvement in Accuracy (Google.Outdoors)

  20. Tune Set Size (Pangea)

  21. Empirical Results

  22. Summary • Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized. • How to apply these ideas to other types of classifiers?
