Comp 3503 Web Mining Daniel L. Silver
Introduction • Overview video: • http://www.youtube.com/watch?v=I2p3JcAdtoI • Detail video: • http://www.youtube.com/watch?v=Dy5gddfa05E
Mining the World-Wide Web • The WWW is a huge, widely distributed, global information service center for • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. • Hyperlink information • Access and usage information • The WWW provides rich sources for data mining • Challenges • Too huge for effective data warehousing and data mining • Too complex and heterogeneous: no standards and no structure BUSI6522
Mining the World-Wide Web • Growing and changing very rapidly • Broad diversity of user communities • Only a small portion of the information on the Web is truly relevant or useful • 99% of the Web information is useless to 99% of Web users • How can we find high-quality Web pages on a specified topic?
Web search engines • Index-based: search the Web, index Web pages, and build and store huge keyword-based indices • Help locate sets of Web pages containing certain keywords • Deficiencies • A topic of any breadth may easily contain hundreds of thousands of documents • Many documents that are highly relevant to a topic may not contain the keywords that define it (the synonymy problem)
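A minimal sketch of the keyword-based index described on this slide, assuming a toy in-memory collection. The page ids, page texts and the helper names `build_index` and `search` are hypothetical and chosen only for illustration; a real search engine adds crawling, ranking and far more elaborate text processing.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical toy collection, for illustration only.
pages = {
    "p1": "cheap car rental deals",
    "p2": "automobile hire at low prices",
    "p3": "car repair manual",
}
index = build_index(pages)
print(sorted(search(index, "car")))         # ['p1', 'p3']
print(sorted(search(index, "car rental")))  # ['p1']
```

Page p2 covers the same topic as the query "car rental" but shares no keyword with it, so a purely keyword-based index never returns it. This is the second deficiency listed on the slide.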
Web Mining: A more challenging task • Searches for • Web access patterns • Web structures • Regularity and dynamics of Web contents • Problems • The “abundance” problem • Limited coverage of the Web: hidden Web sources, majority of data in DBMS • Limited query interface based on keyword-oriented search • Limited customization to individual users
Web Mining Taxonomy • Web Content Mining • Web Page Content Mining • Search Result Mining • Web Structure Mining • Web Usage Mining • General Access Pattern Tracking • Customized Usage Tracking
Mining the World-Wide Web: Web Content Mining • Web Page Content Mining • Web Page Summarization • WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998) …: • Web structuring query languages • Can identify information within given web pages • Ahoy! (Etzioni et al. 1997): Uses heuristics to distinguish personal home pages from other web pages • ShopBot (Etzioni et al. 1997): Looks for product prices within web pages
Mining the World-Wide Web: Web Content Mining • Search Result Mining • Search Engine Result Summarization • Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): • Categorizes documents using phrases in titles and snippets
Mining the World-Wide Web: Web Structure Mining • Using Links • PageRank (Brin et al., 1998) • CLEVER (Chakrabarti et al., 1998) • Use interconnections between web pages to give weight to pages • Using Generalization • MLDB (1994), VWV (1998) • Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
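To illustrate how interconnections between pages can give weight to pages, here is a minimal PageRank-style power iteration. It is a sketch under stated assumptions, not the algorithm exactly as published: the four-page link graph is hypothetical, the damping factor of 0.85 is the commonly quoted value rather than something the slides specify, and the function assumes every link target also appears as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its inlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly over its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy graph C ends up with the highest weight because three pages point to it, while D, which nothing points to, keeps only the base share.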
Mining the World-Wide Web: Web Usage Mining • General Access Pattern Tracking • Web Log Mining (Zaïane, Xin and Han, 1998) • Uses KDD techniques to understand general access patterns and trends • Can shed light on better structure and grouping of resource providers
Mining the World-Wide Web: Web Usage Mining • Customized Usage Tracking • Adaptive Sites (Perkowitz and Etzioni, 1997) • Analyzes the access patterns of one user at a time • Web site restructures itself automatically by learning from user access patterns
Mining the Web's Link Structures • Finding authoritative Web pages • Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic • The notion of authority can be inferred from hyperlinks • The Web consists not only of pages, but also of hyperlinks pointing from one page to another • These hyperlinks contain an enormous amount of latent human annotation • A hyperlink pointing to another Web page can be considered the author's endorsement of that page
Mining the Web's Link Structures • Problems with the Web linkage structure • Not every hyperlink represents an endorsement • Some links exist for navigation or for paid advertisements • If the majority of hyperlinks are for endorsement, the collective opinion will still dominate • One authority will seldom have its Web page point to its rival authorities in the same field • Authoritative pages are seldom particularly descriptive • Hub • A set of Web pages that provides collections of links to authorities
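The interplay between hubs and authorities can be sketched with a HITS-style iteration, the kind of hub/authority scoring that CLEVER builds on. This is an illustrative sketch only: the link graph and page names are hypothetical, and the fixed iteration count and Euclidean normalisation are assumptions rather than details taken from the slides.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three link collections pointing at three content pages.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "hub3": ["auth1"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # auth1
print(max(hub, key=hub.get))    # hub2
```

A good hub is one that points to good authorities, and a good authority is one that is pointed to by good hubs. Here auth1 is endorsed by all three hubs, and hub2 links to every authority.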
User Profiling (Modeling) • Methods of capturing user preferences and behaviour • Handcrafted stereotypical models • Handcrafted models based on traditional questionnaires (expert systems) • Learned (statistical) models based on use of system • Can be Collaborative or Individual
Data Used to Develop User Profiles • Overt data: • Questions/answers • Preference history, options chosen • Covert data (less obtrusive): • Keystroke and mouse traces • Click stream from web browsing • Record of purchases/actions
User Profiling: The Basics • Collect overt and covert data • Develop a mathematical or logical model to predict user preferences or product interests • Use models to adapt the format or content of a system’s interface • Facilitate the user’s interaction with the system (effectiveness/efficiency) • Such systems are said to have adaptive user interfaces or intelligent user interfaces • Examples: Amazon, Lycos, Google, …
Experimenting with User Profiling. What is needed? • A host website • HTML, Java, JavaScript, C++ • Cookies, Applets • A machine learning system for user model development • an Artificial Neural Network • A specific problem and an approach
Artificial Neural Networks • Series of simple computing elements (nodes) tied together by weighted connections • Input/output mapping determined by connection weights [Figure: network of three slabs of nodes, labelled Slab 1, Slab 2 and Slab 3]
Artificial Neural Networks • Training examples contain input/output pairs • Weights of connections are adjusted to fit training examples • The back propagation of error learning algorithm is most commonly used
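As a rough illustration of how connection weights are adjusted to fit a training example, the sketch below runs back propagation on a tiny three-layer network. Everything specific here is an assumption made for the example: the 3-4-2 layer sizes, sigmoid units without bias terms, the squared-error measure, the learning rate of 0.5 and the single hypothetical input/output pair.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Return hidden activations and output activations for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output

def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One back-propagation update for a single input/output pair."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at each output node (squared error, sigmoid units).
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    # Propagate the error back through the output weights to the hidden nodes.
    delta_hidden = [
        h * (1 - h) * sum(d * w_out[k][j] for k, d in enumerate(delta_out))
        for j, h in enumerate(hidden)
    ]
    # Adjust the weights against the gradient of the error.
    for k, d in enumerate(delta_out):
        for j, h in enumerate(hidden):
            w_out[k][j] -= rate * d * h
    for j, d in enumerate(delta_hidden):
        for i, xi in enumerate(x):
            w_hidden[j][i] -= rate * d * xi

def squared_error(x, target, w_hidden, w_out):
    _, output = forward(x, w_hidden, w_out)
    return sum((o - t) ** 2 for o, t in zip(output, target))

random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
x, target = [1.0, 0.0, 0.0], [0.0, 1.0]  # hypothetical training pair

before = squared_error(x, target, w_hidden, w_out)
for _ in range(200):
    train_step(x, target, w_hidden, w_out)
after = squared_error(x, target, w_hidden, w_out)
print(before, after)  # the error after training is smaller than before
```

A real training run loops over many examples rather than one, but the weight update applied per example has the same form.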
Recurrent Networks • Based on the back propagation algorithm • Excellent at learning sequences • Used in stock prediction, voice recognition, event detection [Figure: network with additional hidden nodes and a recurrent feedback loop]
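The effect of the recurrent feedback loop can be sketched with a single Elman-style update, in which the previous hidden state is fed back in alongside the current input. The Elman-style wiring, the two-node layer size and the weight values are all assumptions made for illustration; the slides do not give these details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recurrent_step(x, context, w_in, w_context):
    """New hidden state from the current input and the previous hidden state."""
    return [
        sigmoid(
            sum(w * xi for w, xi in zip(w_in[j], x))
            + sum(w * c for w, c in zip(w_context[j], context))
        )
        for j in range(len(w_in))
    ]

# Hypothetical fixed weights for a network with 2 inputs and 2 hidden nodes.
w_in = [[0.5, -0.3], [0.2, 0.8]]
w_context = [[0.9, -0.4], [0.1, 0.6]]

def run(sequence):
    """Feed a sequence through the network and return the final hidden state."""
    state = [0.0, 0.0]
    for x in sequence:
        state = recurrent_step(x, state, w_in, w_context)
    return state

a = run([[1, 0], [0, 1]])
b = run([[0, 1], [0, 1]])
print(a)
print(b)
```

Both sequences end with the same input, yet the final hidden states differ because the feedback loop carries information about the earlier input. This memory of what came before is what makes such networks suited to learning sequences.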
The Problem • Navigate.ca has 62 shopping categories, organized into several folders and subfolders • Hard to find product categories of related interest when shopping • Facilitate the use of Navigate.ca through user profiling and an adaptive interface
The Solution • Track the categories that users visit and use the information to predict the next best category • Adapt the content of the web pages so as to recommend the next best category • Assumption: Users will follow similar trajectories when shopping for similar purposes
[Figure: system overview. Navigate.ca → click stream (5 12 27 3 61 47 23 12 13 33 1 60 3 4 6 12 14 …) → interfile (5 12, 12 27, 27 3, 3 61, …) → training examples (000001…, 000000…, 000000…, 000100…) → NeuroShell2 → ourModel.c (INPUT → math functions → OUTPUT) → applet.java → links on Navigate.ca. The figure is shown with each stage highlighted in turn: Data Preparation & Collection, Build the Model, Embed the “Intelligence” within Navigate.ca.]
Data Collection & Preparation • Collecting click-streams • via browser cookies • Preparing training examples • What will the training examples look like? • With 62 different categories, we want to predict the category a user is most likely to click next
Data Collection & Preparation • User’s click stream: • 2 45 60 23 9 7 11 2 37 … • Training examples: • (2 45) (45 60) (60 23) (23 9) … • Transform for Neural Network: • E.g. 2 = 0 0 1 0 0 0 0 … 0 • Why? The product category variable is of type “nominal”
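The preparation steps on this slide can be sketched as follows. The helper names `make_pairs` and `one_hot` are hypothetical, and zero-based category numbering is assumed because it matches the slide's example, where category 2 sets the third position of the vector.

```python
NUM_CATEGORIES = 62

def make_pairs(click_stream):
    """Pair each visited category with the category visited next."""
    return list(zip(click_stream, click_stream[1:]))

def one_hot(category, size=NUM_CATEGORIES):
    """Encode a nominal category number as a vector with a single 1."""
    vector = [0] * size
    vector[category] = 1
    return vector

clicks = [2, 45, 60, 23, 9, 7, 11, 2, 37]  # start of the click stream on the slide
pairs = make_pairs(clicks)
examples = [(one_hot(current), one_hot(following)) for current, following in pairs]

print(pairs[:4])       # [(2, 45), (45, 60), (60, 23), (23, 9)]
print(one_hot(2)[:7])  # [0, 0, 1, 0, 0, 0, 0]
```

One-hot encoding is used because the category numbers are labels, not quantities. Feeding the raw number 45 to the network would wrongly suggest that category 45 is "larger than" or "closer to" category 44 than to category 2.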
[Figure: system overview, with the “Build the Model” stage highlighted]
Building the Model • Used NeuroShell 2 • 3-layer network • 62 inputs/outputs • 84 hidden nodes and recurrent nodes • Examples split into training, tuning and production datasets • Weights are adjusted based on the training examples until the model reaches its lowest level of tuning set error • Model is tested with the production set
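The split into training, tuning and production sets, and the rule of stopping at the best tuning set result, can be sketched as below. This is not NeuroShell 2's procedure: the 60/20/20 proportions, the `patience` parameter, the helper names and the list of tuning errors are all hypothetical. The examples are kept in their original order on the assumption that a recurrent network depends on the sequence.

```python
def split_examples(examples, train=0.6, tune=0.2):
    """Split examples, in order, into training, tuning and production sets."""
    n_train = round(len(examples) * train)
    n_tune = round(len(examples) * tune)
    return (
        examples[:n_train],
        examples[n_train:n_train + n_tune],
        examples[n_train + n_tune:],
    )

def best_epoch(tuning_errors, patience=3):
    """Epoch whose weights to keep: training stops once the tuning error
    has failed to improve for `patience` consecutive epochs."""
    best, best_index = float("inf"), -1
    for epoch, error in enumerate(tuning_errors):
        if error < best:
            best, best_index = error, epoch
        elif epoch - best_index >= patience:
            break
    return best_index

training, tuning, production = split_examples(list(range(10)))
print(len(training), len(tuning), len(production))  # 6 2 2

# Hypothetical tuning-set error measured after each training epoch.
tuning_errors = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27, 0.30, 0.21]
print(best_epoch(tuning_errors))  # 3
```

The tuning set guards against fitting the training examples too closely, and the production set is held back so that the final test uses examples the model has never influenced or been influenced by.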
[Figure: system overview, with the “Embed the ‘Intelligence’ within Navigate.ca” stage highlighted]
Making Navigate.ca “Intelligent” • The network architecture and weights represent the user model • NeuroShell 2 can output C code representative of the model • Convert the C code to Java and add it to the applet code • The compiled applet class is included in the appropriate Navigate.ca web pages
[Figure: system overview, with the “Embed the ‘Intelligence’ within Navigate.ca” stage highlighted and ourModel.java now shown alongside ourModel.c]
The Intelligent Navigate.ca • User selects category 4 • Navigate.ca displays category 4 and calls Applet.java with “4” • Applet.java passes “0 0 0 0 1 0…0 0” to ourModel • ourModel returns “0.2 0.9 0.7 0.4 0.3 0.9 … 0.8 0.1” • Applet.java presents the top 3 links (1, 5, 60)
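The final step, turning the model's output vector into the top 3 links, can be sketched as follows. The helper name `top_links` is hypothetical, and the 8-value output vector is a shortened stand-in for the 62-value vector that is elided on the slide. Zero-based category numbering is assumed, as in the slide's example.

```python
def top_links(outputs, k=3):
    """Indices of the k highest-scoring categories, best first.
    Ties keep the lower category number first."""
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    return ranked[:k]

# Hypothetical, shortened output vector for illustration only.
outputs = [0.2, 0.9, 0.7, 0.4, 0.3, 0.9, 0.8, 0.1]
print(top_links(outputs))  # [1, 5, 6]
```

Each output value acts as a score for one category, so recommending links amounts to sorting the categories by score and keeping the first three.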