
Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information

Carlos Cobos-Lozada, MSc., Ph.D. (c). ccobos@unicauca.edu.co / coboscarlos@gmail.com. Advisor: Elizabeth León, Ph.D. (eleonguz@unal.edu.co). Visiting scholar of the Modern Heuristic Research Group. LISI-MIDAS: Universidad Nacional de Colombia, Sede Bogotá. GTI: Universidad del Cauca.


Presentation Transcript


  1. Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information. Carlos Cobos-Lozada, MSc., Ph.D. (c), ccobos@unicauca.edu.co / coboscarlos@gmail.com. Advisor: Elizabeth León, Ph.D., eleonguz@unal.edu.co. Visiting scholar of the Modern Heuristic Research Group. LISI-MIDAS: Universidad Nacional de Colombia, Sede Bogotá. GTI: Universidad del Cauca. Idaho Falls, October 5, 2011

  2. Agenda • Preliminaries • Latent Semantic Indexing • Web Clustering Engines • Proposed Model

  3. Preliminaries: Information Retrieval System. [Slide diagram: the user issues a query (with auto-complete support and an extended query); the retrieval process matches it against indexes built from the documents by the indexing process; results are returned through visualization and browsing, and user feedback closes the loop]

  4. Preliminaries: Information Retrieval Models. [Slide diagram of the classic taxonomy of retrieval models:] • Classic models: boolean, vector space, probabilistic • Set-theoretic: fuzzy, extended boolean • Algebraic: generalized vector, latent semantic indexing, neural networks • Probabilistic: inference network, belief network • Structured models: non-overlapping lists, proximal nodes

  5. Preliminaries: Classic Models – Basic Concepts • Each document is represented by a set of representative keywords or index terms • An index term is a document word useful for recalling the document's main themes • Usually, index terms are nouns because nouns have meaning by themselves • However, some search engines assume that all words are index terms (full-text representation) • Not all terms are equally useful for representing the document's contents, e.g. less frequent terms allow identifying a narrower set of documents • The importance of the index terms is represented by weights associated with them

  6. Preliminaries: Indexing Process. [Slide diagram of the pipeline: document structure recognition → full text representation → tokenization → filters → stop-word removal → noun-group removal → vocabulary restriction → stemming → keywords]

  7. Preliminaries: Indexing Process, Sample
  Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.
  Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
  Filters (lower case): washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again
  Stop-word removal: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare
  Stemming: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare

  8. Preliminaries: Indexing Process, Sample
  Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.
  Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry
  Filters (lower case): trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry
  Stop-word removal: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry
  Stemming: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
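The pipeline in the two samples above can be sketched in a few lines of Python. The stop list and the suffix rules below are toy stand-ins (the slides use a full stop list and a Porter-style stemmer), so the output only approximates the slides':

```python
# Illustrative preprocessing pipeline: tokenize -> lowercase filter ->
# stop-word removal -> crude suffix stripping. STOP_WORDS and the
# suffix list are hypothetical, simplified stand-ins.
import re

STOP_WORDS = {"the", "of", "on", "a", "that", "for", "but", "does",
              "to", "is", "over", "in", "and", "he", "she", "it"}

def tokenize(text):
    # Keep alphanumeric runs only (drops punctuation such as '-').
    return re.findall(r"[A-Za-z0-9]+", text)

def preprocess(text):
    tokens = [t.lower() for t in tokenize(text)]          # filters
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    stems = []
    for t in tokens:                                      # naive "stemming"
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("The House of Representatives on Tuesday passed a bill"))
```

A real system would swap in a language-specific stop list and the Porter stemmer to reproduce stems such as "hous" and "repres".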

  9. Preliminaries: TF-IDF and the Term-Document Matrix. [Slide: the observed frequencies form the Term-Document Matrix (TDM), which is weighted with TF-IDF and stored in an inverted index]
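A minimal sketch of how the observed-frequency TDM is turned into TF-IDF weights, using the common formulation w(t, d) = tf(t, d) · log(N / df(t)); the tiny document collection below is hypothetical:

```python
# Build a TF-IDF weighted term-document matrix from already-stemmed
# token lists. tf = raw count in the document, df = number of
# documents containing the term, N = number of documents.
import math
from collections import Counter

def tfidf_matrix(docs):
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    matrix = []
    for d in docs:
        tf = Counter(d)
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix  # rows = documents, columns = terms

docs = [["battle", "spending", "bill"],
        ["battle", "romney", "perry"],
        ["bill", "spending", "spending"]]
vocab, m = tfidf_matrix(docs)
```

Terms that appear in every document get weight log(N/N) = 0, which captures the slide's point that less frequent terms identify a narrower set of documents.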

  10. Preliminaries Cosine Similarity
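The cosine similarity used to rank documents against a query reduces to a few lines:

```python
# Cosine similarity between two weight vectors: the dot product
# divided by the product of the Euclidean norms. Returns 0.0 for a
# zero vector to avoid division by zero.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))  # ~0.5: the vectors share one of two terms each
```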

  11. Preliminaries Sample 1: Vector Space Model. [Slide figure: documents d1–d7 and a query q plotted as vectors in the space of terms t1, t2, t3]

  12. Preliminaries Sample 2: Vector Space Model. [Slide figure: documents d1–d7 and a query q plotted as vectors in the space of terms t1, t2, t3]

  13. Preliminaries Vector Space Model • Advantages: • Simple model based on linear algebra • Term weights • Allows computing a continuous degree of similarity between queries and documents • Allows ranking documents according to their possible relevance • Allows partial matching • Limitations: • Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality) • Word substrings might result in a "false-positive match" • Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false-negative match" • The order in which the terms appear in the document is lost in the vector space representation • Assumes terms are statistically independent

  14. Latent Semantic Indexing • An indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text • SVD can also be used to reduce noise in the data (it maps the data to a reduced dimensionality)

  15. Latent Semantic Indexing SVD • Let A denote an m × n matrix of real-valued data and rank r, where without loss of generality m ≥ n, and therefore r ≤ n. Then A = UƩVT, where: • The columns of U are the left singular vectors and form an orthonormal basis for the column space of A • U contains the eigenvectors of AAT (U is orthogonal) • The rows of VT contain the elements of the right singular vectors and form an orthonormal basis for the row space of A • V contains the eigenvectors of ATA (V is orthogonal) • Ʃ is a diagonal matrix whose entries are the square roots of the shared eigenvalues of AAT and ATA, sorted in decreasing order: Ʃi,i > Ʃj,j for i < j, and Ʃi,i = 0 for i > r … r ≤ n
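The SVD properties listed above can be checked numerically; a sketch with NumPy (the small matrix is illustrative, not the slides' example):

```python
# Verify A = U diag(s) Vt with orthonormal columns in U, orthonormal
# rows in Vt, and singular values sorted in decreasing order.
# full_matrices=False gives the thin SVD the slides use (U is m x n).
import numpy as np

A = np.array([[3., 5., 1.],
              [5., 4., 2.],
              [4., 3., 3.],
              [2., 5., 4.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vt, A)   # exact reconstruction
assert np.allclose(U.T @ U, np.eye(3))       # orthonormal columns of U
assert np.allclose(Vt @ Vt.T, np.eye(3))     # orthonormal rows of Vt
assert all(s[i] >= s[i + 1] for i in range(len(s) - 1))
```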

  16. Latent Semantic Indexing. [Slide: worked example of the thin SVD A = UƩVT of a 10 × 8 term-document matrix (10 terms, 8 documents): U is m × n, Ʃ is the n × n diagonal of singular values 34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35, and VT is n × n]

  17. Latent Semantic Indexing • Using SVD to reduce noise: • Keep only the first r singular values of Ʃ instead of all n • What value of r? e.g. the value covering 90% of the Frobenius norm • In this case r = 5, where r < n (n = 8)

  18. Latent Semantic Indexing. [Slide: the same worked example truncated to rank r: Ur is m × r, Ʃr is r × r (the r largest singular values), and VrT is r × n]

  19. Latent Semantic Indexing • What value of r?
    Sum ← 0
    For i ← 1 to n do
        Sum ← Sum + Ʃ(i, i)
    End for
    Percentage ← Sum * 0.9   // 90% of the sum of the singular values
    r ← 0
    Temp ← 0
    For i ← 1 to n do
        Temp ← Temp + Ʃ(i, i)
        r ← r + 1
        If Temp ≥ Percentage then break End if
    End for
    Return r
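The rank-selection routine translates directly to Python. Note that summing the singular values gives their total (strictly the nuclear norm rather than the Frobenius norm, which sums their squares), but it reproduces the slides' r = 5:

```python
# Keep the smallest r whose leading singular values account for the
# given fraction of the total. `sigma` must be sorted in decreasing
# order, as returned by any SVD routine.
def choose_rank(sigma, fraction=0.9):
    target = sum(sigma) * fraction
    running = 0.0
    for r, value in enumerate(sigma, start=1):
        running += value
        if running >= target:
            return r
    return len(sigma)

# Singular values from the slides' worked example:
sigma = [34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35]
print(choose_rank(sigma))  # 5, matching the slides (r = 5 < n = 8)
```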

  20. Latent Semantic Indexing • Retrieved documents are compared in the latent space • Documents in the latent space: each document is represented by the corresponding row of VrƩr • Terms in the latent space: each term is represented by the corresponding row of UrƩr

  21. Latent Semantic Indexing • Query in the latent space: the query vector q is folded in as q' = qT · Ur · Ʃr^(-1) • Documents are then ranked by the cosine similarity between q' and each document's latent vector
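The standard LSI query folding and cosine ranking can be sketched with NumPy (the term-document matrix and the query below are illustrative, not the slides' example):

```python
# Project a query into the r-dimensional latent space with
# q_hat = q . U_r . inv(S_r), then score each document (rows of V_r)
# by cosine similarity.
import numpy as np

A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [1., 1., 1.]])          # 4 terms x 3 documents (hypothetical)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

q = np.array([1., 0., 0., 1.])        # query weights over the 4 terms
q_hat = q @ U_r @ np.linalg.inv(S_r)  # query folded into latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = [cos(q_hat, d) for d in V_r]  # one similarity score per document
```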

  22. Web Clustering Engines

  23. Web Clustering Engines • The search aspects where WCEs can be most useful in complementing the output of plain search engines are: • Fast subtopic retrieval: documents in a subtopic can be accessed in logarithmic rather than linear time • Topic exploration: clustering provides a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains) • Alleviating information overload: users may review hundreds of potentially relevant results without needing to download and scroll through subsequent pages

  24. Web Clustering Engines • Web document clustering (WDC) poses new requirements and challenges for clustering technology: • Meaningful labels • Computational efficiency (response time) • Short input data descriptions (snippets) • Unknown number of clusters • Robustness to noisy data • Overlapping clusters

  25. General Model. [Slide diagram of the pipeline: Query → Search results acquisition → (Snippets) → Preprocessing → (Features) → Cluster construction and labeling → (Clusters) → Visualization]

  26. Proposed Model. [Slide diagram: the general pipeline enriched with query expansion, concepts instead of terms, an evolutionary (online and offline) clustering approach, user feedback, and a taxonomy, ontologies and user information as supporting knowledge: Query → Query Expansion → Search results acquisition → (Snippets) → Preprocessing → (Features) → Cluster construction and labeling → (Clusters) → Visualization → Feedback]

  27. Query Expansion Process • A registered user issues a keyword query (through a common graphical interface like Google's) and receives online help (an auto-complete dropdown list) based on his/her user profile • [Slide diagram of the expansion workflow:] • 1. Pre-processing and semantic relationship resolution against the General Taxonomy of Knowledge • 2. Concepts related to the user profile, drawn from the Inverted Index of Concepts (concepts, relations such as is-a and is-part-of, and instances from the specific ontologies) • 3. An external service returns the extended query

  28. Query Expansion Process (B) • The GTK and the specific ontologies are multilingual (collaborative editing process) • The user profile contains: • The nodes of the GTK used by the user • A relation with the Inverted Index of Concepts (ontologies), to support the rating process: • It manages the concepts that have been previously evaluated (good/bad) for a specific ontology

  29. Term-Document Matrix (Observed Frequency) – TDM-OF Building Process • Extended query: original keywords + other concepts + selected nodes from the GTK (ontologies) • The query is sent to the Google, Yahoo! and Bing APIs; in parallel, each web search result is processed in independent threads: • Pre-processing • Tokenization • Filters (special characters and lower case) • Stop-word removal • Language detection • Stemming (English/Spanish) • For each document, accumulate the observed frequency of each term • Mark the document as processed • Output: the Term-Document Matrix of observed frequencies (TDM-OF)

  30. Concept-Document Matrix (Observed Frequency) – CDM-OF Building Process • In parallel, for each document marked as processed: • Join terms belonging to the same concept in the selected specific ontologies (from the extended query) • Accumulate the observed frequency of the terms joined into the same concept • End this process, with thread synchronization, when all web search results have been processed • Output: the Concept-Document Matrix of observed frequencies (CDM-OF)

  31. Concept-Document Matrix (CDM) Building Process • Calculate the weight (TF-IDF) of each concept in each document • Output: the Concept-Document Matrix (CDM)

  32. Clustering Process • Three algorithms of our own: • A hybridization of Global-Best Harmony Search with the K-means algorithm • A memetic algorithm with niching techniques (restricted competition replacement and restrictive mating) • A memetic algorithm (roulette wheel, K-means, and replace-the-worst) • All algorithms: • Define the number of clusters automatically (BIC) • Can use a standard Term-Document Matrix (TDM), Frequent Term-Document Matrix (FTDM), Concept-Document Matrix (CDM) or Frequent Concept-Document Matrix (FCDM) • Tested with data sets based on Reuters-21578 and DMOZ • Tested by users • Output: clustered documents
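As a building block, the K-means refinement step that the hybrid algorithms rely on can be sketched in pure Python (the harmony-search and memetic components, and the BIC-based choice of k, are beyond this sketch; the points below are illustrative):

```python
# Minimal K-means: pick k initial centers, then alternate between
# assigning each point to its nearest center (squared Euclidean
# distance) and moving each center to the mean of its members.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return centers, clusters

points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
centers, clusters = kmeans(points, 2)
```

On these well-separated points the two near-origin vectors end up in one cluster and the two far vectors in the other, regardless of which points are sampled as initial centers.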

  33. Labeling Process • Statistically representative terms: • Initialize algorithm parameters • Build the "Others" label and cluster • Candidate label induction • Eliminate repeated terms • Visual improvement of labels • Frequent phrases: • Conversion of the representation • Document concatenation • Complete phrase discovery • Final selection • Build the "Others" label and cluster • Cluster label induction • Overlapping clusters • Output: clustered and labeled documents

  34. Visualization and Rating Process • During experimentation, for each cluster the user answered: • (Q1) whether the cluster label is in general representative of the cluster (much, little, or nothing) • (Q2) whether the cluster is useful, moderately useful or useless • Then, for each document in each cluster, the user answered: • (Q3) how well the document matches the cluster (very well matching, moderately matching, or not matching) • (Q4) whether the document's relevance (location) in the cluster was adequate (adequate, moderately suitable, or inadequate) • The answers feed the user profile

  35. Visualization and Rating Process • In production, the user can indicate whether each document is useful (relevant) or not • [Slide diagram: the ratings update the User Profile, which is linked to the General Taxonomy of Knowledge, the Inverted Index of Concepts and the specific ontologies]

  36. Proposed model

  37. Collaborative Editing Process of Ontologies • [Slide diagram of the editing workflow:] • 1. Select a node (with its associated ontology) from the General Taxonomy of Knowledge • 2. Edit the ontology in a WordNet-style editor (concepts, synonyms in different languages, relations, instances) • 3. Supported by general ontologies • 4. Supported by the concepts used by the user (user profile); this step can be automatic • 5. The Inverted Index of Concepts is updated automatically on save

  38. Questions? Carlos Cobos-Lozada, MSc., Ph.D. (c), ccobos@unicauca.edu.co / coboscarlos@gmail.com. Model of Web Clustering Engine Enrichment with a Taxonomy, Ontologies and User Information
