Data Mining for Web Personalization



  1. Data Mining for Web Personalization Patrick Dudas

  2. Outline • Personalization • Data mining • Examples • Web mining • MapReduce • Data Preprocessing • Knowledge Discovery • Evaluation • Information High

  3. Personalization • Goal of data mining approach is for “automatic personalization” • Automatic Personalization: • Content-based • Collaborative • Rule-based

  4. Rule-based (Brief overview) • Create decision rules • Implicitly/Explicitly • Highly domain dependent • Rules nontransferable • Profiles are based on user input • Biased • Static • Degrade over time

  5. Content-based (Brief overview) • Profile based on the user's past experiences and interests (ratings) • Think Amazon, Pandora, eBay... • Vector similarities based on cosine similarity • Bayesian classification • Remember: Ratings = Profile = Recommendation
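
To make the cosine-similarity bullet concrete, here is a minimal sketch that ranks items against a user profile. The three-feature vectors and the names `profile`, `items`, `item_a`, and `item_b` are invented for illustration and are not from the talk.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)

# Hypothetical data: a rating-derived profile and two items over three features.
profile = [5.0, 0.0, 3.0]
items = {"item_a": [4.0, 0.0, 2.0], "item_b": [0.0, 5.0, 0.0]}

# Rank items from most to least similar to the profile.
ranked = sorted(
    items,
    key=lambda name: cosine_similarity(profile, items[name]),
    reverse=True,
)
```

Cosine similarity ignores vector length, so a user who rates everything highly and one who rates everything low can still match the same items if their relative preferences agree.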

  6. Collaborative (Brief overview) • Creating groups of users based on ratings • Nearest neighbor approach • Once grouped, recommendations based on the other neighbors' ratings are presented • More users or items = more dimensions of data • Dynamic or real-time not applicable
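
The nearest-neighbor idea can be sketched as follows. This is one simple variant under stated assumptions, not the method from the talk: similarity is cosine over co-rated items, and unseen items are scored by similarity-weighted neighbor ratings. The users `ann`, `bob`, and `cat` and their ratings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, k=2):
    """Rank items the target has not rated, using the k most similar users."""
    others = [
        (cosine(ratings[target], r), name)
        for name, r in ratings.items()
        if name != target
    ]
    neighbors = sorted(others, reverse=True)[:k]
    scores = {}
    for sim, name in neighbors:
        for item, rating in ratings[name].items():
            if item not in ratings[target]:
                # Weight each neighbor's rating by how similar they are.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 5, "c": 4},
    "cat": {"a": 1, "d": 5},
}
suggestions = recommend("ann", ratings, k=2)
```

Every call compares the target with all other users, which is one reason the slide notes that this approach scales poorly as users and items grow.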

  7. Data Mining • Data rich descriptions • Large volumes of data • Reliable models • Automated data collection • Evaluate results/make decisions • Integration with existing data sources

  8. Examples of Large Datasets • http://aws.amazon.com/datasets • Featured data sets: • Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Set • 350 GB • YRI Trio Dataset • 700 GB • Sloan Digital Sky Survey DR6 Subset • 160 GB • Genome, survey data, Google Books n-gram corpora, traffic statistics, OpenStreetMap dataset, Wikipedia traffic…

  9. Data Mining Web Personalization • Recommendations based on Web objects: • Items • Pages • Documents • Navigation by links • Web mining • Pros: Personalization (duh.), real-time, more enriched datasets • Cons: Privacy issues, building complex systems that misrepresent the individual

  10. Extend the Data Mining Paradigm

  11. Data Preparation and Transformation • Web logs • Date/time usage • Site information • Resource requested (image, video, etc.) • Site files/meta-data • The power of the cookie • Server-side cookies!

  12. Data Preparation and Transformation (cont.) • Pageview: • User actions (where they clicked and the path) • User events (what they are trying to accomplish) • Session: • Sequence of page views
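
A session, as defined above, is a sequence of page views. The sketch below shows one common way to build sessions from raw log records. The `(user_id, timestamp, url)` record layout and the 30-minute inactivity timeout are assumptions made for illustration; the talk does not specify either.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def sessionize(pageviews, timeout=SESSION_TIMEOUT):
    """Group (user_id, unix_timestamp, url) records into per-user sessions.

    Returns {user_id: [[url, ...], ...]}, one inner list per session.
    """
    by_user = defaultdict(list)
    for user, ts, url in pageviews:
        by_user[user].append((ts, url))

    sessions = {}
    for user, views in by_user.items():
        views.sort()  # order each user's page views by time
        user_sessions = []
        last_ts = None
        for ts, url in views:
            # Start a new session on the first view or after a long gap.
            if last_ts is None or ts - last_ts > timeout:
                user_sessions.append([])
            user_sessions[-1].append(url)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions

# Hypothetical log records.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/cart"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/about"),
]
result = sessionize(log)
```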

  13. MapReduce • Google design • Implemented by Hadoop • C++, C#, Erlang, Java, OCaml, Perl, Python, Ruby, F#, R...
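
The MapReduce programming model can be illustrated in plain Python without any cluster. The sketch below is a single-process simulation of the map, shuffle, and reduce steps that counts hits per URL. It does not use the Hadoop API, and the `"user url"` log-line format is an assumption.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (url, 1) for one log line of the assumed form 'user url'."""
    _user, url = line.split()
    yield url, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical log lines.
hits = run_job(["u1 /home", "u2 /home", "u1 /cart"])
```

In a real deployment the map and reduce calls run on many machines and the shuffle moves data across the network; the division of work into these two functions is the same.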

  14. Example

  15. Usage Data Pre-Processing

  16. Pattern Discovery • We have data! Now what? • Clustering • Classification • Association Rule Discovery • Sequential Pattern Discovery • Markov Models • Latent Variable Models

  17. Clustering • Partitioning • Split your data into groups • K-means • Hierarchical • Divisive (top-down) • Start with everything in one cluster, then split it into groups • Agglomerative (bottom-up) • Start with each item as its own cluster, then merge the closest clusters • Model-based • Building a model for the data (best fit)

  18. K-means

  19. User-Based Clustering • Start with the user profile • Partition into k groups of profiles • Based on similarity
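
Partitioning profiles into k groups by similarity can be sketched with a bare-bones k-means loop. This version makes simplifying assumptions that are not from the talk: profiles are numeric vectors, similarity is squared Euclidean distance, and the first k points seed the centroids so the result is deterministic. The sample points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Return (assignment, centroids) after a fixed number of iterations."""
    centroids = [list(p) for p in points[:k]]  # assumed seeding: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical two-feature user profiles.
profiles = [(1, 1), (9, 9), (1, 2), (8, 9)]
groups, centers = kmeans(profiles, k=2)
```

Production implementations usually pick random or spread-out seeds and stop when assignments no longer change; the fixed iteration count here keeps the sketch short.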

  20. Association Discovery • Support • min(support) • Confidence • min(confidence)
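
Support and confidence have standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A → B is support(A ∪ B) / support(A). A small sketch follows, using hypothetical sessions as transactions; the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires; avoid dividing by zero
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base

# Hypothetical sessions, each a set of visited pages.
sessions = [
    {"/home", "/cart"},
    {"/home", "/cart", "/pay"},
    {"/home"},
    {"/about"},
]
s = support({"/home", "/cart"}, sessions)       # 2 of 4 sessions
c = confidence({"/home"}, {"/cart"}, sessions)  # 0.5 / 0.75
```

The min(support) and min(confidence) thresholds on the slide would then be applied as filters: only rules whose `s` and `c` both exceed the chosen minimums are kept.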

  21. Evaluation (Personalization Model) • Challenges: • Recommendation algorithms may require a unique set of evaluation metrics • Personalization actions may be different • Domain • Intended application • Data gathered • Check for overfitting the data • Training set • ROC Curve

  22. ROC Curve • [Figure: 2×2 table with columns "No" and "Yes", row labels S and N, and cell values .95, .05, .20, .80] • TPR = TP / (TP + FN) • FPR = FP / (FP + TN)
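
The two formulas above give one point on a ROC curve; sweeping a decision threshold over the classifier's scores traces the whole curve. A minimal sketch follows, assuming binary labels (1 = positive, 0 = negative) with at least one example of each class. The scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest first."""
    positives = sum(labels)              # TP + FN
    negatives = len(labels) - positives  # FP + TN
    points = [(0.0, 0.0)]                # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
```

A curve that hugs the top-left corner (high TPR at low FPR) indicates a better ranking than one near the diagonal.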

  23. Information High • Information is addictive • Information can be misleading • Information ethics • Information is power, sometimes too powerful

  24. Personal Suggestions • Develop a hypothesis • Figure out what data is needed • Make informed decisions • Don’t trust just your judgment • Experts are experts for a reason! • Develop a way to validate based on experience • Then get more data if needed

  25. Sources • Kohavi, R. and F. Provost (2001). "Applications of data mining to electronic commerce." Data Mining and Knowledge Discovery 5(1): 5-10. • Dean, J. and S. Ghemawat (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51(1): 107-113. • Mobasher, B. (2007). "Data mining for web personalization." The adaptive web: 90-135. • http://en.wikipedia.org/wiki/File:FrequentItems.png • Zaharia, M., A. Konwinski, et al. (2008). Improving mapreduce performance in heterogeneous environments, USENIX Association. • Witten, I. H. and E. Frank (2002). "Data mining: practical machine learning tools and techniques with Java implementations." ACM SIGMOD Record 31(1): 76-77. • Senthilkumar, A. (2011). Knowledge discovery practices and emerging applications of data mining : trends and new domains. Hershey, PA, Information Science Reference.

  26. Thank you! • Questions?
