
Creating Concept Hierarchies in a Customer Self-Help System

Bob Wall, CS 535, 04/29/05



  1. Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS 535 04/29/05

  2. Outline • Introduction / motivation • Background • Algorithm • Feature selection / feature vector generation • Hierarchical agglomerative clustering (HAC) • Tree partitioning • Results / conclusions

  3. Introduction • Application – customer self-help (FAQ) system • RightNow Technologies’ Customer Service module • Need ways to organize Knowledge Base (KB) • System already organizes documents (answers) using clustering • Desirable to also organize user queries

  4. Goals • Create concept hierarchy from user queries • Domain-specific • Self-guided (no human intervention / guidance required) • Present hierarchy to help guide users in navigating KB • Demonstrate the types of queries that can be answered by system • Automatically augment searches with related terms

  5. Background • Problem – cluster short text segments • Inadequate information in queries to provide context for clustering • Need some source of context • Possible solution – use Web as source of info • Cilibrasi and Vitanyi proposed mechanism to extract meaning of words using Google searches • Chuang and Chien presented more detailed algorithm for clustering short segments by using text snippets returned by search engine

  6. Algorithm • Use each text segment as input query to search engine • Process resulting text snippets using stemming, stop word lists to extract related terms (keywords) • Select set of keywords, build feature vectors • Cluster using Hierarchical Agglomerative Clustering (HAC) • Compact tree using min-max partitioning
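The snippet-processing step on slide 6 can be sketched roughly as follows. This is only an illustration: the stop word list is a tiny made-up sample, and `crude_stem` is a hypothetical stand-in for a real stemmer (for example Porter's), which the slides do not name.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for",
              "my", "how", "do", "i"}

def crude_stem(word):
    """Strip a few common English suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def extract_keywords(snippets):
    """Count stemmed, non-stop-word terms across a list of text snippets."""
    counts = Counter()
    for snippet in snippets:
        for token in re.findall(r"[a-z]+", snippet.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts

if __name__ == "__main__":
    print(extract_keywords(["Installing the printer",
                            "How do I install printers"]))
```

The resulting counts are the raw material for the keyword selection and feature vectors described on the later slides.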

  7. KB-Specific Version – HAC-KB • Choose set of user queries, corresponding answers • Find list of keywords corresponding to those answers • Trim down list to reasonable length • Generate feature vectors • HAC clustering • Min-max partitioning

  8. Available Data • Answers • Documents forming the KB – actually question and answer, plus keywords and other information like product and category associations • Ans_phrases • Extracted from answers, using stop word lists and stemming • One-, two-, and three-word phrases • Counts of occurrences in different parts of answer • Keyword_searches • List of user queries – also filtered by stop word lists and stemmed • List of answers matching query

  9. Feature Selection • Select N most frequent user queries • Select set of all answers matching those queries • Select set of all keywords found in those answers • Reduce to list of K keywords • Avoid removing all keywords associated with a query (would generate empty feature vector) • Try to eliminate keywords that provide little discrimination (ones associated with many queries) • Also eliminate keywords that only map to a single query
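A minimal sketch of the keyword-reduction rules on slide 9, under stated assumptions: the function name, the `max_query_fraction` threshold, and the ranking of surviving keywords by how many queries share them are all hypothetical, since the slides do not say how the list is trimmed to K.

```python
from collections import defaultdict

def select_keywords(query_keywords, k, max_query_fraction=0.5):
    """query_keywords: dict mapping each query to the set of keywords
    found in its matching answers. Returns a sorted keyword list."""
    # Invert the map: keyword -> set of queries it is associated with.
    keyword_queries = defaultdict(set)
    for query, keywords in query_keywords.items():
        for kw in keywords:
            keyword_queries[kw].add(query)

    max_df = max_query_fraction * len(query_keywords)
    # Drop keywords tied to a single query or to too many queries.
    candidates = [kw for kw, qs in keyword_queries.items()
                  if 1 < len(qs) <= max_df]
    # Assumed ranking: most widely shared first, ties broken alphabetically.
    candidates.sort(key=lambda kw: (-len(keyword_queries[kw]), kw))
    selected = set(candidates[:k])

    # Guard against empty feature vectors: if a query lost every keyword,
    # restore its rarest one (this can push the result slightly above k).
    for query, keywords in query_keywords.items():
        if keywords and not (keywords & selected):
            selected.add(min(keywords,
                             key=lambda kw: (len(keyword_queries[kw]), kw)))
    return sorted(selected)
```

The guard loop trades a strict limit of K for the guarantee that every query keeps at least one keyword, which is the constraint the slide calls out explicitly.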

  10. Feature Vector Generation • Generate map from queries to keywords, and inverse map from keywords to queries • Use the TF-IDF (term frequency / inverse document frequency) metric for weighting • $v_{i,j}$ is the weight of the jth keyword for the ith query • $tf_{i,j}$ is the number of times that keyword j occurred in the list of answers associated with query i • $n_j$ is the number of queries associated with keyword j • Now have an N × K feature matrix
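The weighting formula itself did not survive in the transcript of slide 10. The sketch below assumes the textbook TF-IDF form, weight = tf(i, j) × log(N / n(j)), which may differ from the variant actually used in the talk. The function name and argument layout are illustrative only.

```python
import math

def tfidf_matrix(queries, keywords, tf, keyword_queries):
    """Build an N x K matrix of TF-IDF weights.

    tf: dict mapping (query, keyword) -> occurrence count in the answers
        associated with that query.
    keyword_queries: dict mapping keyword -> set of queries it is tied to.
    """
    n = len(queries)
    matrix = []
    for q in queries:
        row = []
        for kw in keywords:
            n_j = len(keyword_queries.get(kw, ()))
            count = tf.get((q, kw), 0)
            # Assumed textbook form: tf * log(N / n_j); zero if unseen.
            weight = count * math.log(n / n_j) if n_j and count else 0.0
            row.append(weight)
        matrix.append(row)
    return matrix
```

With this form, a keyword associated with every query gets weight zero, which is consistent with the feature-selection goal of discarding keywords that discriminate poorly.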

  11. Standard HAC Algorithm • Initialize clusters – one cluster per query • Initialize similarity matrix • Using the average linkage similarity metric and cosine distance measure • Matrix is upper-triangular

  12. HAC (cont.) • For N – 1 iterations • Pick two root-node clusters with largest similarity • Combine into new root-node cluster • Add new cluster to similarity matrix – compute similarity with all other root-level clusters • Generates tall binary tree of clusters • 2N – 1 nodes • Not particularly usable by humans
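The HAC loop from slides 11 and 12 can be sketched in plain Python as below. This is a deliberately naive version: instead of adding a row for each new cluster to the similarity matrix, it recomputes average-linkage similarity from the pairwise leaf similarities each time. The node numbering (leaves 0 to N-1, merges N to 2N-2) is an assumption of this sketch.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hac(vectors):
    """Average-linkage HAC. Returns (children, members): children maps each
    merged node to its two sub-clusters, members maps each node to its leaves."""
    n = len(vectors)
    # Upper-triangular similarity table over the original queries.
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i, j in combinations(range(n), 2)}
    members = {i: [i] for i in range(n)}
    children = {}
    roots = set(range(n))
    next_id = n

    def avg_sim(a, b):
        total = sum(sim[min(i, j), max(i, j)]
                    for i in members[a] for j in members[b])
        return total / (len(members[a]) * len(members[b]))

    # N - 1 iterations: merge the two most similar root-level clusters.
    while len(roots) > 1:
        a, b = max(combinations(sorted(roots), 2), key=lambda p: avg_sim(*p))
        children[next_id] = (a, b)
        members[next_id] = members[a] + members[b]
        roots -= {a, b}
        roots.add(next_id)
        next_id += 1
    return children, members
```

For N input vectors the result contains N - 1 merge nodes, giving the 2N - 1 node binary tree mentioned on the slide.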

  13. Min-Max Partitioning • Need to combine nodes in cluster tree, produce a shallow, bushy multi-way tree • Recursive partitioning algorithm • MinMaxPartition(Cluster sub-tree) • For each possible cut level in tree, compute quality of cut • Choose best-quality cut level • For each subtree cut off, recursively process • Stop at max depth or max cluster size
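A rough sketch of the recursive partitioning on slide 13, with several assumptions: the binary tree is a `children` dict whose node ids increase with merge order, a "cut at level l" means undoing the l most recent merges under a subtree, and the cut-quality measure is passed in as a `quality` callable where smaller is better. None of these representation details come from the slides.

```python
def leaves(children, node):
    """Return the leaf ids under `node` in a binary cluster tree."""
    if node not in children:
        return [node]
    left, right = children[node]
    return leaves(children, left) + leaves(children, right)

def cut(children, root, level):
    """Undo the `level` most recent merges under `root` and return the
    resulting subtree roots (assumes a larger id means a later merge)."""
    parts = [root]
    for _ in range(level):
        splittable = [p for p in parts if p in children]
        if not splittable:
            break
        latest = max(splittable)
        parts.remove(latest)
        parts.extend(children[latest])
    return parts

def min_max_partition(children, root, quality,
                      max_depth=3, max_size=2, depth=0):
    """Return a nested list: a leaf id, a flat list of leaves for a small or
    too-deep cluster, or a list of recursively partitioned subtrees."""
    if root not in children:
        return root
    members = leaves(children, root)
    if depth >= max_depth or len(members) <= max_size:
        return members
    # Try every cut level and keep the one with the best (smallest) quality.
    candidates = [cut(children, root, lvl) for lvl in range(1, len(members))]
    best = min(candidates,
               key=lambda parts: quality([leaves(children, p) for p in parts]))
    return [min_max_partition(children, p, quality,
                              max_depth, max_size, depth + 1)
            for p in best]
```

For example, with `children = {4: (0, 1), 5: (2, 3), 6: (4, 5)}` and a toy quality function that prefers two clusters, `min_max_partition(children, 6, lambda cs: abs(len(cs) - 2))` groups the leaves as two pairs.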

  14. Cut Levels in Tree

  15. Choosing Best Cut • Goal is to maximize intra-cluster similarity, minimize inter-cluster similarity • Quality = Q(C) / N(C) • Q(C): cluster set quality (smaller is better) • N(C): cluster size preference (gamma distribution)
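The exact definitions of Q(C) and N(C) are not preserved in the transcript, so the following is only one plausible reading. It assumes Q(C) is the mean ratio of inter-cluster to intra-cluster similarity over the clusters in the cut, and N(C) is a gamma density evaluated at the number of clusters. The `shape` and `scale` values, the function names, and the treatment of singleton clusters are all hypothetical.

```python
import math

def gamma_pdf(x, shape=3.0, scale=2.0):
    """Gamma density; peaks at (shape - 1) * scale clusters."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def mean_sim(sim, a, b):
    """Mean pairwise similarity between item lists a and b (self-pairs
    skipped; a cluster with no pairs is treated as similarity 1.0)."""
    pairs = [(i, j) for i in a for j in b if i != j]
    return sum(sim(i, j) for i, j in pairs) / len(pairs) if pairs else 1.0

def cut_quality(clusters, sim, shape=3.0, scale=2.0):
    """Assumed form of Quality = Q(C) / N(C); smaller is better."""
    ratios = []
    for c in clusters:
        rest = [i for other in clusters if other is not c for i in other]
        intra = mean_sim(sim, c, c)
        inter = mean_sim(sim, c, rest) if rest else 0.0
        ratios.append(inter / intra if intra else float("inf"))
    q = sum(ratios) / len(ratios)                # cluster set quality
    n = gamma_pdf(len(clusters), shape, scale)   # cluster size preference
    return q / n
```

Dividing by the gamma density penalizes cuts that produce far fewer or far more clusters than the preferred number, while the ratio term rewards tight, well-separated clusters.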

  16. Issues / Further Work • Resolve issues with data / implementation • Outstanding problem – generating meaningful labels for clusters in hierarchy • Means of measuring performance • Incorporate other KB data, like relevance scores of search results, products/categories • Better feature selection • Fuzzy clustering – query can belong to multiple clusters (Frigui & Nasraoui)

  17. References • S.-L. Chuang and L.-F. Chien, “Towards Automatic Generation of Query Taxonomy: A Hierarchical Query Clustering Approach,” Proceedings of ICDM’02, Maebashi City, Japan, Dec. 9-12, 2002, pp. 75-82. • S.-L. Chuang and L.-F. Chien, “A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments,” Proceedings of CIKM’04, Washington, DC, Nov. 2004, pp. 127-136. • R. Cilibrasi and P. Vitanyi, “Automatic Meaning Discovery Using Google,” published on Web, available at http://arxiv.org/abs/cs/0412098. • H. Frigui and O. Nasraoui, “Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents,” in Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry, ed., Springer-Verlag, New York, 2004, pp. 45-72.
