Clustering of Web pages

Clustering of Web pages Najlah Gali 21.3.2017

Web page clustering
Organizing web pages into cohesive groups such that pages in the same cluster are more similar to each other than to those in other clusters.
[Figure: example clusters labeled Entertainment and Fitness.]

Motivation

Web search engines: finding similar or related web pages.

Web page classification

Queries’ similarity
Two queries resulting in two different web pages within the same cluster can be recognized as being similar.
Example: Q1 "Ravintola" (Finnish: restaurant) and Q2 "lounas" (Finnish: lunch) lead to pages in the same cluster, so Q1 ≈ Q2.

How to cluster?

Clustering components
• Web page features
  • Words
  • Phrases
  • Links
• Similarity measure
  • Semantic similarity
  • Syntactic similarity
• Clustering algorithm
  • Partitional
  • Hierarchical
  • Graph based

Approaches to cluster web pages
Three approaches exist:
• Link based: depends on the link structure between the pages
  • Common neighbor
  • Co-citation
• Text based: depends on the content of the web page
• Hyper based: depends on both text and link structure

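Link-based clustering: common neighbor
Two web pages are similar if they have neighbors in common.
Similarity(a, b) = |O(a) ∩ O(b)| = |{c, d}| = 2, where O(p) is the set of out-link neighbors of page p.
[Figure: link graph over pages a, b, c, d, e, f with in-links and out-links marked.]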
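The common-neighbor measure can be sketched in a few lines of Python. The link graph below is hypothetical (the slide's figure is not fully recoverable); it is chosen only so that a and b share the neighbors c and d, as in the slide's formula.

```python
def common_neighbor_similarity(out_links, a, b):
    """Count the pages that both a and b link to: |O(a) ∩ O(b)|."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))

# Hypothetical link graph: page -> set of pages it links to.
out_links = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
}

print(common_neighbor_similarity(out_links, "a", "b"))  # 2 (shared: c and d)
```

Pages that are missing from the graph are treated as having no out-links, so their similarity to any page is 0.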

Link-based clustering: Co-citation
Two web pages are similar if they are referenced (cited) by similar pages.
[Figure: example link graphs over pages a, b, c, d, e, f, g.]

Co-citation analysis [Larson 1996]
• Start
• Create a collection P1, P2, P3, P4…
• Construct the co-citation frequency matrix
• Convert raw frequencies into a correlation matrix
• Apply a multidimensional scaling technique
• Apply agglomerative clustering
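The two matrix steps of this pipeline can be sketched as follows. This is a minimal illustration, not Larson's implementation: the citing pages x1 to x3 and their links are made up, the diagonal is left at 0, Pearson correlation between matrix rows is assumed as the correlation measure, and the multidimensional scaling step is omitted.

```python
from itertools import combinations
from math import sqrt

def cocitation_matrix(citations, pages):
    """Entry [i][j] counts the citing pages that link to both pages[i] and pages[j]."""
    index = {p: i for i, p in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for cited in citations.values():
        for p, q in combinations(sorted(cited & set(pages)), 2):
            matrix[index[p]][index[q]] += 1
            matrix[index[q]][index[p]] += 1
    return matrix

def pearson(x, y):
    """Pearson correlation of two equal-length rows (0.0 if either row is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(matrix):
    """Correlate every row of the co-citation matrix with every other row."""
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]

# Hypothetical data: citing page -> set of pages it cites.
citations = {
    "x1": {"P1", "P2"},
    "x2": {"P1", "P2", "P3"},
    "x3": {"P3", "P4"},
}
pages = ["P1", "P2", "P3", "P4"]

freq = cocitation_matrix(citations, pages)
corr = correlation_matrix(freq)
print(freq[0])  # co-citation counts of P1 with P1..P4
```

The resulting correlation matrix is what a scaling or agglomerative clustering step would take as input.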

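Co-citation example, Part 1
[Figure: a retrieval strategy yields a collection of pages P1 to P6. Each entry of the co-citation matrix is the number of pages that cite both pages of a pair, e.g. |pages citing P1 and P2|, |pages citing P1 and P3|. The co-citation matrix is then converted into a correlation matrix.]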

Co-citation example, Part 2
[Figure: correlation matrix with low- and high-correlation entries marked, and the resulting cluster.]

Issues (link-based clustering)
Link-based clustering is useful when a web page lacks text content. However:
• Web pages with insufficient in-links or out-links cannot be clustered;
• Two web pages might be linked because they share only a minor topic;
• Links can be noisy (adverts);
• No common links → similarity = 0!

Text-based clustering
• Content source
  • Entire text
  • Main content
  • Snippet
  • Keywords
• Feature extraction
  • Binary
  • Term frequency (TF)
  • Term frequency-Inverse document frequency (TF-IDF)
• Similarity measure
  • Character-based
  • Token-based
• Clustering algorithm
  • Partitional (K-means)
  • Hierarchical (Agglomerative and divisive)

Content source
[Figure: example web page annotated with its content sources: keywords (Office, Equipment, Supplies, Shredder, laminators), main content, snippet, and entire text.]

Feature extraction: Tokenization and stemming
“Keep your office running smoothly with our wide…”
• Tokenize into words: Keep, your, office, running, smoothly, with, our, wide
• Stem: keep, your, offic, run, smoothli, with, our, wide

Feature extraction: Stop words removal
“Keep your office running smoothly with our wide…”
• Remove stop words (e.g. in, on, your, with, our, at): keep, offic, run, smoothli, wide
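Tokenization and stop-word removal can be sketched with the standard library alone. The stop-word list here is a small illustrative set, not a complete one, and stemming is deliberately left out because a faithful stemmer (such as Porter's) is too long for a short example; the tokens therefore stay unstemmed.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"in", "on", "your", "with", "our", "at"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

text = "Keep your office running smoothly with our wide"
tokens = tokenize(text)
content_tokens = remove_stop_words(tokens)
print(tokens)
print(content_tokens)  # ['keep', 'office', 'running', 'smoothly', 'wide']
```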

Feature extraction: creation of feature vector
Page 1: “Keep your office running smoothly with our wide…”
Page 2: “..staffed office, keeping your office clean and staffed”
Bag-of-words: [keep, offic, run, smoothli, wide, staf, clean]
• Binary vector: 1 if the word occurs; 0 otherwise
  P1 [1, 1, 1, 1, 1, 0, 0]
  P2 [1, 1, 0, 0, 0, 1, 1]
• TF vector: counts the number of occurrences of a word w in page p
  P1 [1, 1, 1, 1, 1, 0, 0]
  P2 [1, 2, 0, 0, 0, 2, 1]
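The binary and TF vectors above can be reproduced with a short sketch. It assumes the pages have already been tokenized, stemmed, and stripped of stop words, so the token lists are typed in by hand; the helper names are illustrative.

```python
from collections import Counter

def build_vocabulary(pages):
    """Collect the distinct tokens of all pages in first-seen order."""
    vocab = []
    for tokens in pages:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)
    return vocab

def binary_vector(tokens, vocab):
    """1 if the vocabulary word occurs in the page, 0 otherwise."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

def tf_vector(tokens, vocab):
    """Number of occurrences of each vocabulary word in the page."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Stemmed, stop-word-free tokens of the two example pages.
p1 = ["keep", "offic", "run", "smoothli", "wide"]
p2 = ["staf", "offic", "keep", "offic", "clean", "staf"]

vocab = build_vocabulary([p1, p2])
print(vocab)                     # ['keep', 'offic', 'run', 'smoothli', 'wide', 'staf', 'clean']
print(binary_vector(p2, vocab))  # [1, 1, 0, 0, 0, 1, 1]
print(tf_vector(p2, vocab))      # [1, 2, 0, 0, 0, 2, 1]
```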

Term frequency-Inverse document frequency
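One common TF-IDF formulation, assumed here because this slide does not state one, weights a word w in page p as tf(w, p) · log(N / df(w)), where N is the number of pages and df(w) is the number of pages containing w. A minimal sketch using the two example pages:

```python
from collections import Counter
from math import log

def tf_idf_vectors(pages, vocab):
    """TF-IDF weights using tf * log(N / df); this variant is an assumption."""
    n = len(pages)
    # Document frequency: in how many pages each word occurs.
    df = {w: sum(1 for tokens in pages if w in tokens) for w in vocab}
    vectors = []
    for tokens in pages:
        counts = Counter(tokens)
        vectors.append(
            [counts[w] * log(n / df[w]) if df[w] else 0.0 for w in vocab]
        )
    return vectors

vocab = ["keep", "offic", "run", "smoothli", "wide", "staf", "clean"]
p1 = ["keep", "offic", "run", "smoothli", "wide"]
p2 = ["staf", "offic", "keep", "offic", "clean", "staf"]

weights = tf_idf_vectors([p1, p2], vocab)
print(weights[0])
print(weights[1])
```

With only two pages, words that occur in both (keep, offic) get weight 0, while words unique to one page keep a positive weight.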

Similarity measures
• Character-based: treats strings as sequences of characters
  • Edit distance: a single edit (insertion, deletion, substitution) is performed at a time to transform one string into another
  • Q-gram: divides strings into substrings of length q
    Example (q = 3): "Machine Learning" → mac, ach, chi, hin, ine, nel, ele, lea, ear, arn ...
• Token-based: treats strings as sequences of tokens
  Example: "Machine Learning" vs. "Machine Learned"; each token scores 1 if it matches, 0 otherwise
• Hybrid: combines character- and token-based measures
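The three kinds of comparison can be sketched in plain Python. The q-gram helper drops spaces and case to match the slide's trigram example; the edit distance is the standard Levenshtein dynamic program; the token matcher is one possible reading of "1 if match, 0 otherwise" and is an assumption.

```python
def qgrams(text, q=3):
    """Split a string into overlapping substrings of length q (spaces and case ignored)."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def token_matches(a, b):
    """For each token of a: 1 if it also occurs in b, 0 otherwise (exact match)."""
    b_tokens = set(b.lower().split())
    return [1 if tok in b_tokens else 0 for tok in a.lower().split()]

print(qgrams("Machine Learning")[:4])                       # ['mac', 'ach', 'chi', 'hin']
print(edit_distance("learning", "learned"))                 # 3
print(token_matches("Machine Learning", "Machine Learned")) # [1, 0]
```

A hybrid measure would combine the two ideas, for instance by comparing tokens with a character-based measure instead of exact matching.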

Token-based measures

Results
[Figure: results rated on a scale of excellent, good, poor.]

K-means
• Start
• Select K random pages as centroids
• Assign the other pages to the nearest centroid
• Calculate new centroids
• Converged? If no, repeat the assignment step; if yes, stop
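The loop above can be sketched as follows, assuming pages are already represented as numeric feature vectors and that Euclidean distance is used. The `initial` and `seed` parameters and the 2-D sample points are illustrative additions, not part of the slide.

```python
import random
from math import dist

def k_means(vectors, k, initial=None, seed=0, max_iter=100):
    """Cluster vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Select K random pages as centroids unless explicit ones are given.
    centroids = [list(c) for c in (initial or rng.sample(vectors, k))]
    labels = None
    for _ in range(max_iter):
        # Assign every page to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D feature vectors forming two obvious groups.
pages = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(pages, k=2)
print(labels)
```

Because the starting centroids are random, different seeds can produce different label numbering, and on less clearly separated data different clusterings.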

Clustering algorithms: Hierarchical
[Figure: nested clusters over points a, b, c, d, e and the corresponding dendrogram, with the merges numbered 1 to 4.]
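A bottom-up (agglomerative) hierarchy like the one in the figure can be sketched with single linkage, where the distance between two clusters is the distance between their closest members. The linkage choice and the 1-D coordinates of a to e are assumptions made for illustration.

```python
def agglomerative(points, distance):
    """Single-linkage agglomerative clustering; returns the merges in order."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 1-D positions for the five points.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 6.5, "e": 20.0}
merges = agglomerative(points, lambda x, y: abs(x - y))
for left, right, d in merges:
    print(sorted(left), sorted(right), d)
```

Reading the merges from first to last gives the dendrogram from the leaves up; cutting the list early yields a flat clustering with more clusters.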

Issues (text-based clustering)
• Developed for use on small, static, and homogeneous pages;
• Web pages that lack text cannot be clustered.

Hyper-based clustering [Modha and Spangler 2000]
Represent the page as a triple of unit vectors (D, F, B)
• D: word frequencies in a page
• F: Out-links
• B: In-links
[Figure: link graph of the page set Q and the surrounding pages a, c, e, g, h, i, j, k, l, m, n.]

Out-links vector
Bag-of-nodes: pages that are pointed to by at least two pages in Q
[g, i, j, m]
[Figure: link graph of the page set Q and the surrounding pages a, c, e, g, h, i, j, k, l, m, n.]

In-links vector
Bag-of-nodes: pages that point to at least two pages in Q
[e, h, k, c]
[Figure: link graph of the page set Q and the surrounding pages a, c, e, g, h, i, j, k, l, m, n.]
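Both bags of nodes can be derived from an edge list, as in the sketch below. The page names q1 to q3 and the edges are hypothetical, since the slide's graph cannot be recovered from the figure; only the "at least two pages in Q" rule comes from the slides.

```python
from collections import Counter

def out_link_bag(edges, q):
    """Pages pointed to by at least two pages in Q."""
    counts = Counter(target for source, target in set(edges) if source in q)
    return sorted(node for node, c in counts.items() if c >= 2)

def in_link_bag(edges, q):
    """Pages that point to at least two pages in Q."""
    counts = Counter(source for source, target in set(edges) if target in q)
    return sorted(node for node, c in counts.items() if c >= 2)

# Hypothetical page set and (source, target) links.
q = {"q1", "q2", "q3"}
edges = [
    ("q1", "g"), ("q2", "g"),   # g is linked from two pages in Q
    ("q1", "i"), ("q3", "i"),   # i is linked from two pages in Q
    ("q2", "m"),                # m is linked from only one page in Q
    ("e", "q1"), ("e", "q2"),   # e links to two pages in Q
    ("h", "q2"), ("h", "q3"),   # h links to two pages in Q
    ("k", "q1"),                # k links to only one page in Q
]

print(out_link_bag(edges, q))  # ['g', 'i']
print(in_link_bag(edges, q))   # ['e', 'h']
```

Each page's F and B vectors would then record which of these bag nodes it links to or is linked from.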

Similarity between two pages Cosine similarity
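Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, applied for illustration to the TF vectors of the two example pages from the feature-vector slide:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

p1 = [1, 1, 1, 1, 1, 0, 0]
p2 = [1, 2, 0, 0, 0, 2, 1]
print(round(cosine_similarity(p1, p2), 3))  # 0.424
```

For unit vectors, such as the (D, F, B) components, the denominator is 1 and the cosine reduces to the dot product.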

References
• Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data mining and knowledge discovery handbook (pp. 931-948). Springer US.
• Larson, R. R. (1996, October). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting-American Society for Information Science (Vol. 33, pp. 71-78).
• McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433.
• Modha, D. S., & Spangler, W. S. (2000, May). Clustering hypertext with applications to web searching. In Proceedings of the eleventh ACM on Hypertext and hypermedia (pp. 143-152). ACM.

Thank you!
