
Web Mining and Recommendation



  1. Web Mining and Recommendation • CENG 514

  2. Web Mining • Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents and services

  3. Examples of Discovered Patterns • Association rules • 75% of Facebook users also have Foursquare accounts • Classification • People younger than 40 with salary > 40k trade online • Outlier detection • User A spends more than twice the average amount of time surfing on the Web • Clustering • Users A and B access similar URLs

  4. Why is Web Mining Different? • The Web is a huge collection of documents, plus • Hyper-link information • Access and usage information • The Web is very dynamic • New pages are constantly being generated • Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to • Exploit hyper-links and access patterns • Be incremental

  5. Web Mining Applications • E-commerce • Generate user profiles • Targeted advertising • Fraud … • Information retrieval (search) on the Web • Automated generation of topic hierarchies • Web knowledge bases • Extraction of schema for XML documents … • Network management • Performance management • Fault management

  6. User Profiling • Important for improving customization • Provide users with pages and advertisements of interest • Generate user profiles based on their access patterns • Cluster users based on frequently accessed URLs • Use a classifier to generate a profile for each cluster • Engage Technologies: tracks Web traffic to create anonymous user profiles of Web surfers

  7. Internet Advertising • Ads are a major source of revenue for Web portals and E-commerce sites • Plenty of startups doing internet advertising • DoubleClick, AdForce, AdKnowledge

  8. Internet Advertising • Scheme 1: • Manually associate a set of ads with each user profile • For each user, display an ad from the set based on the profile • Scheme 2: • Automate the association between ads and users • Use ad-click information to cluster users (each user is associated with the set of ads that he/she clicked on) • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster

  9. Internet Advertising • Use collaborative filtering (e.g. Likeminds, Firefly) • Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.) • Rij – rating of user Ui for ad Aj • Problem: predict user Ui’s rating for an unrated ad Aj

  10. Internet Advertising • Key idea: user Ui’s rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui’s • Ui’s rating for an ad Aj that has not been previously displayed to Ui is computed as follows: • Consider the users Uk who have rated ad Aj • Compute Dik, the distance between Ui’s and Uk’s ratings on common ads • Ui’s rating for ad Aj = Rkj (Uk is the user with the smallest Dik) • Display to Ui the ad Aj with the highest computed rating
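The nearest-neighbour scheme above can be sketched as follows; the user names, ad names, and ratings are invented for illustration, and Euclidean distance over commonly rated ads stands in for Dik.

```python
def predict_rating(ratings, target_user, target_ad):
    """Predict target_user's rating for target_ad as Rkj, where Uk is
    the user with the smallest distance Dik on commonly rated ads."""
    best_user, best_dist = None, float("inf")
    for user, user_ratings in ratings.items():
        if user == target_user or target_ad not in user_ratings:
            continue
        # Dik: Euclidean distance over ads rated by both users
        common = set(user_ratings) & set(ratings[target_user])
        if not common:
            continue
        dist = sum((user_ratings[a] - ratings[target_user][a]) ** 2
                   for a in common) ** 0.5
        if dist < best_dist:
            best_user, best_dist = user, dist
    return ratings[best_user][target_ad] if best_user else None

ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 1, "A3": 4},   # identical to U1 on common ads
    "U3": {"A1": 1, "A2": 5, "A3": 2},
}
print(predict_rating(ratings, "U1", "A3"))  # U2 is nearest, so 4
```

Since U2 agrees with U1 on every commonly rated ad, U2's rating of A3 is transferred to U1.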

  11. Fraud • With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important • Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought) • If buying pattern changes significantly, then signal fraud • E.g. use of domain knowledge and neural networks for credit card fraud detection

  12. Network Management • Performance management: annual bandwidth demand is increasing ten-fold on average, while annual bandwidth supply is rising only by a factor of three. The result is frequent congestion. During a major event (e.g., the World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world • Fault management: analyze alarm and traffic data to carry out root-cause analysis of faults

  13. Web Mining • Web Content Mining • Web page content mining • Search result mining • Web Structure Mining • Search • Web Usage Mining • Access patterns • Customized Usage patterns

  14. Web Content Mining • Crawler • A program that traverses the hypertext structure in the Web • Seed URL: page/set of pages that the crawler starts with • Links from visited page saved in a queue • Build an index • Focused crawlers
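The crawl loop described above can be sketched as a breadth-first traversal; here an in-memory link graph stands in for real HTTP fetching, and the page names are made up.

```python
from collections import deque

# Toy link graph; in a real crawler, a fetch step would download each
# page and extract the URLs it links to.
LINKS = {
    "seed.html": ["a.html", "b.html"],
    "a.html": ["b.html", "c.html"],
    "b.html": [],
    "c.html": ["seed.html"],
}

def crawl(seed):
    """Traverse the hypertext structure starting from a seed URL;
    links from each visited page are saved in a queue."""
    visited, queue, order = set(), deque([seed]), []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)                  # the "build an index" step would go here
        queue.extend(LINKS.get(url, []))   # enqueue out-links of the visited page
    return order

print(crawl("seed.html"))  # ['seed.html', 'a.html', 'b.html', 'c.html']
```

A focused crawler would additionally score each queued URL for topical relevance and expand only the promising ones.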

  15. Basic Measures for Text Retrieval • [Venn diagram: the retrieved documents and the relevant documents overlap within the set of all documents] • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses) • Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
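Both measures follow directly from the retrieved and relevant sets; the document ids below are arbitrary.

```python
def precision_recall(retrieved, relevant):
    """Precision = |relevant & retrieved| / |retrieved|;
    Recall    = |relevant & retrieved| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant            # relevant AND retrieved
    return len(hits) / len(retrieved), len(hits) / len(relevant)

p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(p, r)  # precision 0.5, recall 1/3
```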

  16. Information Retrieval Techniques • Basic concepts • A document can be described by a set of representative keywords called index terms. • Different index terms have varying relevance when used to describe document contents. • This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, TF-IDF). • DBMS analogy • Index terms ↔ attributes • Weights ↔ attribute values

  17. Indexing • Inverted index • A data structure for supporting text queries, similar to the index in a book • document_table: a set of document records <doc_id, postings_list> • term_table: a set of term records <term, postings_list> • Answering a query: find all docs associated with one or a set of terms • + easy to implement • – does not handle synonymy and polysemy well, and postings lists can be very long (storage can be very large)
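A minimal sketch of the term_table and of answering a multi-term query, assuming conjunctive (AND) semantics; the toy documents are invented.

```python
from collections import defaultdict

docs = {
    1: "web mining uses data mining techniques",
    2: "data structures for text queries",
    3: "web search engines index text",
}

# term_table: term -> postings list (the set of doc ids containing it)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def query(*terms):
    """Find all docs associated with a set of terms (AND of postings)."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query("web"))          # [1, 3]
print(query("web", "text"))  # [3]
```

The synonymy/polysemy limitation shows up immediately: a query for "internet" misses doc 1 and 3 even though they are about the Web.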

  18. Inverted Index: Illustration • [Figure: disks with documents feed an indexing step that produces the inverted index] • Example postings: aalborg → 3452, 11437, …; arm → 4, 19, 29, 98, 143, …; armada → 145, 457, 789, …; armadillo → 678, 2134, 3970, …; armani → 90, 256, 372, 511, …; zz → 602, 1189, 3209, …

  19. Vector Space Model • Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection. • The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors that represent them, using measures such as the Euclidean distance or the cosine of the angle between these two vectors.

  20. Vector Space Model • Represent a doc by a term vector • Term: basic concept, e.g., word or phrase • Each term defines one dimension • N terms define an N-dimensional space • Each element of the vector corresponds to a term weight • E.g., d = (x1,…,xN), where xi is the “importance” of term i

  21. VS Model: Illustration • [Figure: documents containing terms such as “Starbucks”, “Java”, and “Microsoft” plotted in term space, grouped into categories C1, C2, C3, with a new doc falling nearest one of them] • A new document is assigned to the most likely category based on vector similarity.

  22. Issues to be handled • How to select terms to capture “basic concepts” • Word stopping • e.g. “a”, “the”, “always”, “along” • Word stemming • e.g. “computer”, “computing”, “computerize” => “compute” • Latent semantic indexing • How to assign weights • Not all words are equally important: Some are more indicative than others • e.g. “algebra” vs. “science” • How to measure the similarity

  23. Latent Semantic Indexing • Basic idea • Similar documents have similar word frequencies • Difficulty: the size of the term-frequency matrix is very large • Use singular value decomposition (SVD) to reduce the size of the frequency table • Retain the k most significant dimensions of the frequency table • Method • Create a term × document weighted frequency matrix A • SVD construction: A = U * S * V' • Choose k and obtain Uk, Sk, and Vk • Create a query vector q' • Project q' into the term-document space: Dq = q' * Uk * Sk^-1 • Calculate similarities: cos α = (Dq · D) / (||Dq|| · ||D||)
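The method above can be sketched with NumPy's SVD; the matrix A, the query, and the choice k = 2 are invented for illustration.

```python
import numpy as np

# Term x document weighted frequency matrix A (5 terms x 3 docs);
# doc 2 (middle column) shares no terms with the query below.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * S * V'
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])                  # keep the k largest singular values

def project(q):
    """Dq = q' * Uk * Sk^-1: map a term-space vector into concept space."""
    return q @ Uk @ np.linalg.inv(Sk)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

Dq = project(np.array([1.0, 1.0, 0.0, 0.0, 0.0]))  # query on the first two terms
doc_coords = A.T @ Uk @ np.linalg.inv(Sk)          # each document in concept space
sims = [cos(Dq, d) for d in doc_coords]
```

Docs 1 and 3 share the query terms and come out far more similar to the query than doc 2 does.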

  24. How to Assign Weights • Two-fold heuristics based on frequency • TF (term frequency) • More frequent within a document → more relevant to semantics • IDF (inverse document frequency) • Less frequent among documents → more discriminative

  25. TF Weighting • Weighting: • More frequent → more relevant to topic • Raw TF = f(t,d): how many times term t appears in doc d • Normalization: • Document length varies → relative frequency preferred • e.g., maximum frequency normalization

  26. IDF Weighting • Idea: • Less frequent among documents → more discriminative • Formula: IDF(t) = 1 + log(n/k) • n: total number of docs • k: number of docs in which term t appears

  27. TF-IDF Weighting • TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t) • Frequent within doc → high TF → high weight • Selective among docs → high IDF → high weight • Recall the VS model • Each selected term represents one dimension • Each doc is represented by a feature vector • The t-term coordinate of document d is the TF-IDF weight • Many complex and more effective weighting variants exist in practice

  28. How to Measure Similarity? • Given two documents • Similarity definition • dot product • normalized dot product (or cosine)
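The TF-IDF weighting of the previous slides and the normalized dot product can be sketched together; the toy corpus loosely echoes the illustrative example on the next slide.

```python
import math

docs = {
    "doc1": "text mining text search engine text",
    "doc2": "travel text travel map",
    "doc3": "government president congress",
}
n = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()})

def idf(term):
    k = sum(term in d.split() for d in docs.values())  # docs containing term
    return 1 + math.log(n / k)                         # IDF(t) = 1 + log(n/k)

def tfidf_vector(text):
    """weight(t, d) = TF(t, d) * IDF(t) over the corpus vocabulary."""
    words = text.split()
    return [words.count(t) * idf(t) if t in words else 0.0 for t in vocab]

def cosine(u, v):
    """Normalized dot product of two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf_vector("text mining")
sims = {name: cosine(q, tfidf_vector(text)) for name, text in docs.items()}
# doc1 shares both query terms, doc2 only "text", doc3 none
```

As expected, doc1 scores highest, doc2 scores lower, and doc3 (no overlap) scores zero.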

  29. Illustrative Example • Term–document matrix; entries are TF with the TF×IDF weight in parentheses:

  term     text     mining   travel   map      search   engine   govern   president  congress
  IDF      2.4      4.5      2.8      3.3      2.1      5.4      2.2      3.2        4.3
  doc1     2 (4.8)  1 (4.5)                    1 (2.1)  1 (5.4)
  doc2     1 (2.4)           2 (5.6)  1 (3.3)
  doc3                                                           1 (2.2)  1 (3.2)    1 (4.3)
  newdoc   1 (2.4)  1 (4.5)

  • Sim(newdoc, doc1) = 4.8×2.4 + 4.5×4.5 • Sim(newdoc, doc2) = 2.4×2.4 • Sim(newdoc, doc3) = 0

  30. Web Structure Mining • PageRank (Google ’00) • Clever (IBM ’99)

  31. Search Engine – Two Rank Functions • [Architecture diagram: a Web page parser feeds an indexer (inverted index, forward index, term dictionary/lexicon, URL dictionary, anchor text, meta data) and a Web graph constructor; ranking combines relevance ranking (similarity based on content or text, including anchor text) with importance ranking (rank functions from link analysis over the Web topology and backward links)]

  32. The PageRank Algorithm • Intuition: PageRank can be seen as the probability that a “random surfer” visits a page • Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proc. WWW Conference, pages 107–117 • Basic idea: the significance of a page is determined by the significance of the pages linking to it • Link i→j: i considers j important; the more important i is, the more important j becomes; if i has many out-links, each link conveys less importance • Initially, all importances pi = 1; iteratively, each pi is refined • PR(A) = p + (1−p)(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where C(Ti) = # out-links of page Ti • Parameter p is the probability that the surfer gets bored and jumps to a new random page; (1−p) is the probability that the surfer follows a link on the current page
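The iteration can be sketched as follows. The three-page graph is invented, and every page is assumed to have at least one out-link, so C(T) is never zero (dangling pages would need extra handling).

```python
def pagerank(links, p=0.15, iters=50):
    """Iterate PR(A) = p + (1-p) * sum(PR(T)/C(T)) over pages T linking to A;
    links maps each page to its list of out-links."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}      # initially all importances are 1
    for _ in range(iters):
        pr = {page: p + (1 - p) * sum(pr[t] / len(links[t])
                                      for t in pages if page in links[t])
              for page in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# C is linked from both A and B, so it ends up most important
```

Fifty iterations are more than enough here; the error shrinks geometrically with factor (1−p) per round.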

  33. The HITS Algorithm • Hyperlink-induced topic search (HITS) • Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632. • Basic idea: sufficiently broad topics contain communities consisting of two types of hyperlinked pages: • Authority: best source for the requested info; highly-referenced pages on a topic • Hub: contains links to authoritative pages

  34. The HITS Algorithm • Collect a seed set of pages S (returned by a search engine) • Expand the seed set to contain pages that point to or are pointed to by pages in the seed set (removing links inside a site) • Iteratively update the hub weight h(p) and authority weight a(p) for each page: • a(p) = ∑ h(q), over all q → p • h(p) = ∑ a(q), over all p → q • After a fixed number of iterations, pages with the highest hub/authority weights form the core of the community
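The mutual update can be sketched as below. The L2 normalization each round is standard practice to keep the weights from blowing up, though it is not spelled out on the slide; the page names are invented.

```python
def hits(links, iters=20):
    """links maps each page to the pages it points to.
    a(p) = sum of h(q) over q -> p;  h(p) = sum of a(q) over p -> q."""
    pages = list(links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5
        nh = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

links = {"H1": ["A1", "A2"], "H2": ["A1", "A2"], "A1": [], "A2": []}
hub, auth = hits(links)
# H1/H2 come out as hubs, A1/A2 as authorities
```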

  35. Problems with Web Search Today • Today’s search engines are plagued by problems: • the abundance problem (99% of info is of no interest to 99% of people) • limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all Web pages • limited query interface, based on keyword-oriented search • limited customization to individual users

  36. Problems with Web Search Today • Today’s search engines are plagued by problems: • the Web is highly dynamic • many pages are added, removed, and updated every day • very high dimensionality

  37. Web Usage Mining • Pages contain information • Links are ‘roads’ • How do people navigate the Internet? → Web Usage Mining (clickstream analysis) • Information on navigation paths is available in log files • Logs can be mined from a client or a server perspective

  38. Website Usage Analysis • Why analyze Website usage? • Knowledge about how visitors use a Website can • Provide guidelines for Website reorganization; help prevent disorientation • Help designers place important information where visitors look for it • Guide pre-fetching and caching of Web pages • Provide an adaptive Website (personalization) • Questions which could be answered • What are the differences in usage and access patterns among users? • Which user behaviors change over time? • How do usage patterns change with quality of service (slow/fast)? • What is the distribution of network traffic over time?

  39. Website Usage Analysis

  40. Website Usage Analysis • There are analysis services such as Analog (http://www.analog.cx/) and Google Analytics • These give basic statistics such as • number of hits • average hits per time period • the popular pages on your site • who is visiting your site • what keywords users search for to reach you • what is being downloaded

  41. Web Usage Mining Process

  42. Data Preparation • Data cleaning • By checking the suffix of the URL name: for example, remove all log entries with filename suffixes such as gif, jpeg, etc. • User identification • If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine • Other heuristics use a combination of IP address, machine name, browser agent, and temporal information to identify users • Transaction identification • All of the page references made by a user during a single visit to a site • The size of a transaction can range from a single page reference to all of the page references Data Mining: Concepts and Techniques

  43. Sessionizing • Main questions: • how to identify unique users • how to identify/define a user transaction • Problems: • user ids are often suppressed due to security concerns • individual IP addresses are sometimes hidden behind proxy servers • client-side & proxy caching make server log data less reliable • Standard solutions/practices: • user registration – practical? • client-side cookies – not foolproof • cache busting – increases network traffic

  44. Sessionizing • Time oriented • By total session duration: not more than 30 minutes • By page-stay time (good for short sessions): not more than 10 minutes per page • Navigation oriented (good for short sessions and when timestamps are unreliable) • Referrer is the previous page in the session, or • Referrer is undefined but the request is within 10 seconds, or • There is a link from the previous to the current page in the Web site
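The time-oriented heuristics can be sketched as follows; timestamps are in seconds and the click log is invented.

```python
def sessionize(clicks, max_session=30 * 60, max_stay=10 * 60):
    """Split one user's (timestamp, page) clicks into sessions using the
    30-minute total-duration and 10-minute page-stay cutoffs above."""
    sessions, current = [], []
    for ts, page in clicks:
        if current and (ts - current[-1][0] > max_stay        # page stay too long
                        or ts - current[0][0] > max_session): # session too long
            sessions.append(current)
            current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions

log = [(0, "/"), (120, "/a"), (300, "/b"), (2000, "/c"), (2100, "/d")]
print(len(sessionize(log)))  # gap 300 -> 2000 exceeds 10 minutes, so 2 sessions
```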

  45. Web Usage Mining: Different Types of Traversal Patterns • Association rules • Which pages are accessed together • Support(X) = freq(X) / number of transactions • Episodes • Frequent partially ordered sets of pages • Support(X) = freq(X) / number of time windows • Sequential patterns • Frequent ordered sets of pages • Support(X) = freq(X) / number of sessions/customers • Forward sequences • Remove backward traversals, reloads, refreshes • E.g. <A,B,A,C> → <A,B> and <A,C> • Support(X) = freq(X) / number of forward sequences • Maximal forward sequences • Support(X) = freq(X) / number of clicks • Clustering • User clusters (similar navigational behaviour) • Page clusters (grouping conceptually related pages)
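The forward-sequence transformation can be sketched as follows, reproducing the slide's <A,B,A,C> → <A,B>, <A,C> example.

```python
def forward_sequences(session):
    """Split a session into maximal forward references: whenever a page
    already on the current path is revisited (a backward traversal), emit
    the path so far and resume from the revisited page."""
    results, path = [], []
    for page in session:
        if page in path:                        # backward traversal detected
            results.append(list(path))
            path = path[:path.index(page) + 1]  # truncate back to that page
        else:
            path.append(page)
    if path:
        results.append(path)
    return results

print(forward_sequences(["A", "B", "A", "C"]))  # [['A', 'B'], ['A', 'C']]
```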

  46. Recommender Systems

  47. Recommender Systems • RS – problem of information filtering • RS – problem of machine learning • seeks to predict the 'rating' that a user would give to an item she/he had not yet considered. • Enhance user experience • Assist users in finding information • Reduce search and navigation time

  48. Types of RS Three broad types: • Content based RS • Collaborative RS • Hybrid RS

  49. Types of RS – Content based RS Content based RS highlights • Recommend items similar to those users preferred in the past • User profiling is the key • Items/content usually denoted by keywords • Matching “user preferences” with “item characteristics” … works for textual information • Vector Space Model widely used

  50. Types of RS – Content based RS Content based RS – Limitations • Not all content is well represented by keywords, e.g. images • Items represented by the same set of features are indistinguishable • Users with thousands of purchases are a problem • New user: no history available
