Website Clustering

Website Clustering Combining Website Lexical Data and Query Semantic Data • Nana Huang, Ray Li

Traditional Lexical Features • Traditional website clustering uses lexical data parsed from each webpage to group websites into categories. • Regular text • <TITLE> tags • <META> tags (description, keywords, author) • What if the webpage consists mainly of content generated automatically by scripts? • What if the webpage is an empty frameset page containing two or more frames?

AOL Clickthrough Data • In August 2006, AOL released 2.2 GB of search logs, which include queries, clicked websites, and the rank of each clicked result. • Each sample row below lists the query, the rank of the clicked result, and the clicked website: • brochures for business 5 http://www.hp.com • brochures for business 6 http://www.hansonmarketing.com • brochures for business 8 http://www.smallbusinessbrief.com • brochures for business 10 http://www.quickbrochures.com • brochures for business 9 http://www.smallbusinessbrief.com • brochures for business 7 http://www.printingforless.com

Query-Website Graph • We parsed a subset of this data to generate a query-document bipartite graph, where each edge is weighted by the number of times a query led to a website being clicked. • [Figure: bipartite graph connecting queries Q1-Q5 to documents D1-D5]
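The click-count graph described above can be sketched in a few lines. This is a minimal illustration, not the authors' parser: the `rows` list, the `build_click_graph` helper, and the second query string are hypothetical, and the rows are assumed to be already split into (query, rank, clicked website) tuples.

```python
from collections import defaultdict

# Hypothetical sample rows in (query, rank, clicked_url) form. The first three
# mirror the log excerpt; the last query is invented for illustration.
rows = [
    ("brochures for business", 5, "http://www.hp.com"),
    ("brochures for business", 8, "http://www.smallbusinessbrief.com"),
    ("brochures for business", 9, "http://www.smallbusinessbrief.com"),
    ("business brochure printing", 1, "http://www.printingforless.com"),
]


def build_click_graph(rows):
    """Return {query: {url: click_count}}, the weighted bipartite graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, _rank, url in rows:
        # Each log row is one click, so it adds 1 to the edge weight.
        graph[query][url] += 1
    return graph


graph = build_click_graph(rows)
```

Storing the graph as a nested dictionary keeps it sparse, which matters because most query-website pairs are never observed.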

Query-Website Graph • A graph like this is most likely too sparse to be useful. • There are many unobserved ‘clicks’ between queries and other related webpages. • We use an iterative process to ‘smooth’ the bipartite relationship between queries and websites, based on two observations: • Documents are considered ‘similar’ to some extent if they have been clicked for the same query. • Queries are considered ‘similar’ to some extent if they lead to the same document.
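The slides do not give the exact update rule, so the following is only one plausible sketch of such an iterative smoothing. It assumes cosine similarity between queries and between documents, and a hypothetical mixing weight `alpha`; the function names and the toy `clicks` matrix are illustrative, not taken from the presentation.

```python
import numpy as np


def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    unit = m / norms
    return unit @ unit.T


def smooth_clicks(clicks, alpha=0.5, iterations=3):
    """Spread click weight to similar queries and similar documents.

    clicks is a (queries x documents) count matrix. alpha (an assumption)
    controls how much propagated weight is mixed into the observed counts.
    """
    clicks = np.asarray(clicks, dtype=float)
    smoothed = clicks.copy()
    for _ in range(iterations):
        query_sim = cosine_similarity_rows(smoothed)   # queries x queries
        doc_sim = cosine_similarity_rows(smoothed.T)   # documents x documents
        # Queries inherit clicks from similar queries; documents inherit
        # clicks from similar documents.
        propagated = 0.5 * (query_sim @ smoothed + smoothed @ doc_sim)
        smoothed = (1 - alpha) * clicks + alpha * propagated
    return smoothed


# Toy data: Q1 and Q2 share document D2, while Q3 and D3 are unrelated to both.
clicks = np.array([
    [3, 1, 0],
    [0, 2, 0],
    [0, 0, 4],
])
smoothed = smooth_clicks(clicks)
```

In this toy case Q2 gains a nonzero weight toward D1 because it shares D2 with Q1, while Q3 and D3 stay disconnected from the rest of the graph.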

Query-Website Graph • This will produce a more realistic query-website bipartite relationship. • We can then use a list of queries associated with each website as a semantic feature vector. • [Figure: query-website graphs over documents D1-D3 and queries Q1-Q2]

Combined Feature Vectors • We have three sets of feature vectors for each document: • Lexical features (text and different HTML tags from the webpage itself) • Semantic features (query information related to each webpage) • Combination of both • There are 10,000 words and 2,000 queries: too many features.
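One simple way to form the combined vector is to concatenate the lexical and semantic vectors of each document. The sketch below assumes this, and additionally assumes each block is L2-normalized first so that neither feature set dominates; the slides do not say whether any weighting was applied. The helper names and toy matrices are hypothetical.

```python
import numpy as np


def l2_normalize_rows(m):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


def combine_features(lexical, semantic):
    """Concatenate per-document lexical and semantic feature vectors."""
    return np.hstack([l2_normalize_rows(lexical), l2_normalize_rows(semantic)])


# Toy data: 2 documents, 4 word features, 2 query features.
lexical = np.array([[2.0, 0.0, 1.0, 0.0],
                    [0.0, 3.0, 0.0, 1.0]])
semantic = np.array([[1.0, 0.0],
                     [0.0, 5.0]])
combined = combine_features(lexical, semantic)
```

With the real data the same call would yield 12,000 columns per document: 10,000 word features followed by 2,000 query features.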

Latent Semantic Analysis • We then apply Latent Semantic Analysis to reduce the 12,000 features to a lower-rank approximation with 30 ‘virtual concepts’. • {Chicken, Beef, Apple, Oranges} -> {Meat, Fruits} • Each website is transformed from the original vector of features into a new vector of ‘virtual concepts’.
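Latent Semantic Analysis is commonly computed with a truncated singular value decomposition, so the reduction above can be sketched as follows. The `lsa_reduce` helper and the toy counts are hypothetical; the toy matrix reuses the chicken/beef/apple/oranges example with 2 concepts instead of 30.

```python
import numpy as np


def lsa_reduce(features, n_concepts):
    """Project documents (rows) onto the top n_concepts singular directions."""
    u, s, _vt = np.linalg.svd(features, full_matrices=False)
    # Each row of the result holds one document's coordinates in concept space.
    return u[:, :n_concepts] * s[:n_concepts]


# Rows are documents; columns are the terms chicken, beef, apple, oranges.
features = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])
concepts = lsa_reduce(features, n_concepts=2)
```

In this toy case the two meat documents land on the same point in concept space, as do the two fruit documents, even though their raw term counts differ.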

K-Means + Results • We then apply K-means on this new vector space to group websites into categories. • Results show that, while the semantic query vector alone performs worse than the lexical feature vector alone, combining both features yields slightly better clustering performance. • Lexical + Semantic Query F1: 0.50 • Lexical only F1: 0.47 • Queries only F1: 0.30
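The clustering and scoring step can be sketched as below. This is a plain Lloyd-style K-means with random initialization, and the score is a pairwise F1 over document pairs; the slides do not state which F1 variant or which K-means settings were used, so both are assumptions, as are the function names and toy points.

```python
from itertools import combinations

import numpy as np


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means: alternate assignment and centroid update."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign(points, centroids)


def pairwise_f1(predicted, truth):
    """F1 over pairs: a pair is positive when both items share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy concept-space points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
truth = [0, 0, 0, 1, 1, 1]
labels = kmeans(points, k=2)
score = pairwise_f1(labels, truth)
```

Scoring over pairs avoids having to match cluster IDs to category names, since K-means labels are arbitrary.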
