
Data Mining: Concepts and Techniques — Chapter 10 — 10.3.2 Mining Text and Web Data (II)


Presentation Transcript


  1. Data Mining: Concepts and Techniques — Chapter 10 — 10.3.2 Mining Text and Web Data (II). Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. Acknowledgements: slides by students at CS512 (Spring 2009).

  2. Outline • Probabilistic Topic Models (Yue Lu) • Opinion Mining (Hyun Duk Kim) • Mining Query Logs for Personalized Search (Yuanhua Lv) • Online Analytical Processing on Multidimensional Text Database (Duo Zhang)

  3. Probabilistic Topic Models. Yue Lu, Department of Computer Science, University of Illinois, Urbana-Champaign. Many slides are adapted/taken from different sources, including presentations by ChengXiang Zhai, Qiaozhu Mei, and Tom Griffiths.

  4. Intuition • Documents exhibit multiple topics. For example, a single article may mix the topics "social network website," "education," and "criticism."

  5. What is a Topic? • Topic: a broad concept/theme, semantically coherent, which is hidden in documents • Representation: a multinomial distribution over words, i.e., a unigram language model. Example: retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …

  6. Organize Information with Topics • Example topic (a word distribution): price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, … • Resolution vs. how many appear in a document: Categories: 1 (e.g., natural hazards); Topics: several (e.g., government response, oil price, loss statistics, …); Entities: 50~100 (e.g., new orleans, president bush, …); Phrases: hundreds (e.g., new orleans, put together, …); Words: thousands (e.g., oil, new, put, orleans, is, …); Patterns: many

  7. The Usage of Topic Models • Summarize themes/aspects • Navigate documents • Retrieve documents • Segment documents • Document classification • Document clustering • Example document: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]." • Underlying topics: Topic 1 (government 0.3, response 0.2, …), Topic 2 (donate 0.1, relief 0.05, help 0.02, …), …, Topic k (city 0.2, new 0.1, orleans 0.05, …), Background B (is 0.05, the 0.04, a 0.03, …)

  8. General Idea of Probabilistic Topic Models • Cast the intuition into a generative probabilistic process (generation) • Each document is a mixture of corpus-wide topics (a multinomial distribution / unigram LM) • Each word is drawn from one of those topics • Since we only observe the documents, we need to figure out (estimation/inference): • What are the topics? • How are the documents divided according to those topics? • Two basic models: PLSA and LDA
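As a concrete illustration, here is a minimal sketch of this generative story (the toy topics, vocabulary, and mixture weights are illustrative assumptions, not from the slides):

```python
# A minimal sketch of the generative process shared by PLSA/LDA-style
# topic models: pick a topic per word, then a word from that topic.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["government", "response", "donate", "relief", "city", "orleans"]
# Each topic is a multinomial (unigram LM) over the vocabulary.
topics = np.array([
    [0.45, 0.35, 0.05, 0.05, 0.05, 0.05],   # "government response" topic
    [0.05, 0.05, 0.45, 0.35, 0.05, 0.05],   # "donation/relief" topic
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],   # "New Orleans" topic
])

def generate_document(topic_mixture, length=20):
    """Generate one document: choose a topic per word, then a word per topic."""
    words = []
    for _ in range(length):
        z = rng.choice(len(topics), p=topic_mixture)   # choose a topic
        w = rng.choice(len(vocab), p=topics[z])        # draw a word from it
        words.append(vocab[w])
    return words

print(" ".join(generate_document(topic_mixture=[0.6, 0.3, 0.1])))
```

Estimation/inference runs this story in reverse: given only the generated words, recover the topics and per-document mixtures.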

  9. Probabilistic Latent Semantic Analysis/Indexing [Hofmann 99]

  10. Topics B PLSA: Generation Process Document [Hofmann 99], [Zhai et al. 04] Generate a word in a document battery 0.3 life 0.2.. 1 d1 2 design 0.1screen 0.05 d2 w … dk k price 0.2purchase 0.15 B Collection background Is 0.05the 0.04a 0.03 .. Parameters: B=noise-level (manually set) ’s and ’s need to be estimated

  11. Topics ? ? B ? PLSA: Estimation [Hofmann 99], [Zhai et al. 04] Document Generate a word in a document Estimated with Maximum Likelihood Estimator (MLE) through an EM algorithm battery ? life ? 1 d1 2 design ?screen ? d2 w … Log-likelihood of the collection dk k price ?purchase ? Collection background B Is ?the ?a ?

  12. Problems with PLSA • "Documents have no generative probabilistic semantics" • i.e., a document is just a symbol • The model has many parameters • linear in the number of documents • needs heuristic methods to prevent overfitting • Cannot generalize to new documents

  13. Latent Dirichlet Allocation [Blei et al. 03]

  14. Basic Idea of LDA [Blei et al. 03], [Griffiths & Steyvers 02, 03, 04] • Add a Dirichlet prior α on the topic distribution of each document • Add a Dirichlet prior β on the word distribution of each topic • α, β can be vectors, but for convenience symmetric priors are used: α = α1 = α2 = …; β = β1 = β2 = … (smoothed LDA)
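Written out (the standard smoothed-LDA generative story from Blei et al. 03, in place of the slide's lost plate diagram):

```latex
\phi_j \sim \mathrm{Dirichlet}(\beta) \quad \text{for each topic } j = 1,\dots,k \\
\theta_d \sim \mathrm{Dirichlet}(\alpha) \quad \text{for each document } d \\
z_{d,i} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,i} \sim \mathrm{Multinomial}(\phi_{z_{d,i}}) \quad \text{for each word position } i
```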

  15. Dirichlet Hyperparameters α, β • Generally have a smoothing effect on multinomial parameters • Large α, β: more smoothed topic/word distributions • Small α, β: more skewed topic/word distributions (e.g., biased towards a few words per topic) • Common settings: α = 50/K, β = 0.01 • PLSA is maximum a posteriori estimated LDA under a uniform prior: α = 1, β = 1
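To see the smoothing effect concretely, here is a tiny sketch (mine, not from the slides) that draws topic mixtures from symmetric Dirichlet priors with different α:

```python
# Illustrating how alpha controls the skew of topic mixtures
# drawn from a symmetric Dirichlet prior.
import numpy as np

rng = np.random.default_rng(1)
K = 5  # number of topics

for alpha in (10.0, 1.0, 0.1):
    theta = rng.dirichlet([alpha] * K)
    print(f"alpha={alpha:>4}: {np.round(theta, 3)}")
# Large alpha -> near-uniform (smoothed) mixtures;
# small alpha -> skewed mixtures concentrated on a few topics.
```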

  16. Inference • Exact inference is intractable • Approximation techniques: • Mean field variational methods (Blei et al., 2001, 2003) • Expectation propagation (Minka and Lafferty, 2002) • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002) • Collapsed variational inference (Teh et al., 2006)
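Of these, collapsed Gibbs sampling is compact enough to sketch in full. The following is a minimal illustrative implementation of the standard sampler (the variable names and toy corpus are mine, not from the slides or the cited papers):

```python
# A compact collapsed Gibbs sampler for smoothed LDA.
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns count matrices."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))        # topic counts per document
    nkw = np.zeros((K, V))                # word counts per topic
    nk = np.zeros(K)                      # total words per topic
    z = [[0] * len(d) for d in docs]      # current topic assignment per token

    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = rng.integers(K)
            z[d][i] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token's current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # Full conditional: p(z=t | rest) ∝ (ndk+α)(nkw+β)/(nk+Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Toy usage: 4 tiny "documents" over a 6-word vocabulary, 2 topics.
docs = [[0, 1, 0, 2], [1, 0, 0], [3, 4, 5, 4], [4, 3, 5]]
ndk, nkw = lda_gibbs(docs, K=2, V=6)
print(np.round(nkw / nkw.sum(axis=1, keepdims=True), 2))  # topic-word estimates
```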

  17. Would like to know more? • “Parameter estimation for text analysis” by Gregor Heinrich • “Probabilistic topic models” by Mark Steyvers

  18. Opinion Mining. Hyun Duk Kim

  19. Agenda • Overview • Opinion finding & sentiment classification • Opinion Summarization • Other works • Discussion & Conclusion

  20. Web 2.0 • "Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform." [Wikipedia] • Users participate in content creation • ex. blogs, reviews, Q&A forums

  21. Opinion Mining • Huge volume of opinions on the Web • Ex. product reviews, blog posts about political issues • Need good techniques to summarize them • Example of a commercial system (MS Live Search)

  22. Usefulness of opinion mining • Individuals • Purchasing a product/service • Tracking political topics • Other decision-making tasks • Businesses and organizations • Product and service benchmarking • Surveys on a topic • Ad placement • Place an ad when someone praises a product • Place a competitor's ad when someone criticizes a product [Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

  23. Subtasks • Opinion finding & sentiment classification • Opinion finding: is the target text opinion or fact? • Sentiment classification: is the opinion positive or negative? (in detail: positive/negative/mixed) • Methods • Lexicon-based method • Machine learning • Opinion Summarization: how to show opinion finding/classification results effectively • Methods • Showing basic statistics • Feature-level summary [Hu & Liu, KDD '04 / Hu & Liu, AAAI '04] • Summary paragraph generation [Kim et al., TAC '08] • Probabilistic analysis [Mei et al., WWW '07] • Other works

  24. Opinion Finding • Lexicon-based method • Prepare an opinion word list • Ex. words: 'good', 'bad' / phrases: 'I think', 'In my opinion' • Check special parts of speech expressing opinions • Ex. adjectives: 'excellent', 'horrible' / verbs: 'like', 'hate' • Decide based on occurrences of those words • Lexicon sources • Manually classified word lists • WordNet • External sources: Wikipedia (objective), review data (subjective) • Machine learning • Train with tagged examples • Main features: opinion lexicons; part-of-speech tags; punctuation (ex. !); modifiers (ex. not, very); word tokens; dependency relations

  25. Opinion Sentiment Classification • Method: similar to opinion finding • Lexicon-based method • Machine learning • Instead of using 'opinionated' words/examples, use 'positive' and 'negative' words/examples • If positive or negative words dominate -> positive or negative; if both positive and negative words are dominantly present -> mixed
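As a concrete illustration of the lexicon-based approach on these two slides, here is a minimal sketch that first decides opinion vs. fact, then polarity (the tiny word lists and thresholds are illustrative stand-ins, not from any published lexicon):

```python
# Lexicon-based opinion finding + sentiment classification, sketched.
OPINION_WORDS = {"good", "bad", "excellent", "horrible", "like", "hate", "think"}
POSITIVE = {"good", "excellent", "like"}
NEGATIVE = {"bad", "horrible", "hate"}

def classify(text, min_hits=1, mixed_ratio=0.5):
    tokens = text.lower().split()
    hits = [t for t in tokens if t in OPINION_WORDS]
    if len(hits) < min_hits:
        return "fact"                      # opinion finding: no lexicon hits
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos and neg and min(pos, neg) / max(pos, neg) >= mixed_ratio:
        return "mixed"                     # both polarities dominantly present
    return "positive" if pos >= neg else "negative"

print(classify("I think the battery life is excellent"))     # -> positive
print(classify("The screen is horrible but sound is good"))  # -> mixed
```

A real system would also handle negation and modifiers ('not good', 'very bad'), which this sketch ignores.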

  26. Opinion Sentiment Classification • Query-dependent sentiment classification [Lee et al., TREC '08 / Jia et al., TREC '08] • Motivation: sentiments are expressed differently for different queries • Ex. 'small' can be good for ipod size, but bad for LCD monitor size • Use external web sources to obtain positive and negative opinionated lexicons • Key ideas • Objective words: Wikipedia, the product specification section of Amazon.com • Subjective words: reviews from Amazon.com, Rateitall.com, and Epinions.com • Reviews rated 4 or 5 out of 5: positive words • Reviews rated 1 or 2 out of 5: negative words • Top-ranked at the Text Retrieval Conference (TREC) [Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

  27. Agenda • Overview • Opinion finding & sentiment classification • Opinion Summarization • Other works • Discussion & Conclusion

  28. Opinion Summarization • Basic statistics • Show how many opinions there are of each kind • Ex. opinions about ipod

  29. Opinion Summarization (cont.) • Feature-based summary [Hu & Liu, KDD '04 / Hu & Liu, AAAI '04] • Find lower-level features and analyze them • Ex. opinions about ipod • Feature extraction: usually nouns / noun phrases • Frequent feature identification: association mining • Feature pruning and infrequent feature identification based on heuristic rules • Sentiment summary for each feature
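The frequent-feature step can be sketched as simple support counting (real systems run association mining over POS-tagged sentences; here the candidate nouns are assumed to be extracted already, and the threshold is an illustrative choice):

```python
# Frequent feature identification, sketched: keep candidate features
# that appear in at least min_support of the review sentences.
from collections import Counter

# Each sentence is pre-reduced to its candidate nouns / noun phrases.
sentences = [
    ["battery", "life"], ["screen"], ["battery"], ["price"],
    ["battery", "screen"], ["sound"], ["battery", "price"],
]

def frequent_features(sents, min_support=0.25):
    support = Counter(f for s in sents for f in set(s))
    n = len(sents)
    return {f for f, c in support.items() if c / n >= min_support}

print(frequent_features(sentences))  # e.g. {'battery', 'screen', 'price'}
```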

  30. Opinion Summarization (cont.) • Summary paragraph generation [Kim et al., TAC '08] • General NLP summarization techniques • Sentence-extraction-based summary • Opinion filtering • Show opinionated sentences • Show sentences with the same polarity as the goal of the summary • Opinion ordering • Paragraph division by opinion polarity • [Paragraph 1] … Following are positive opinions … Following are negative opinions … [Paragraph 2] … Following are mixed opinions …

  31. Opinion Summarization (cont.) • Probabilistic analysis • Topic sentiment mixture model [Mei et al., WWW '07] • Topic modeling with opinion priors • [Figure: the generation process of the topic-sentiment mixture model]

  32. Agenda • Overview • Opinion finding & sentiment classification • Opinion Summarization • Other works • Discussion & Conclusion

  33. Other works • Comparative analysis: focus on texts containing contradiction or comparison • Finding comparative sentences [Jindal & Liu, SIGIR '06] • Comparison indicators such as 'than' or 'as well as' • Ex. 'Ipod' is better than 'Zune'. • Sequential patterns indicating comparative sentences, ex. ⟨{NN}{VBZ}{RB}{more JJR}{NN}{IN}{NN}⟩ -> comparative • Finding the preferred entity [Ganapathibhotla & Liu, COLING '08] • Rule-based approach • Context-dependent orientation finding using Pros and Cons reviews
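A crude version of the indicator-based step might look like the following (the indicator list is a toy stand-in; the cited work learns much richer sequential POS patterns than plain keyword matching):

```python
# Indicator-based comparative sentence finding, sketched.
import re

COMPARATIVE_INDICATORS = re.compile(
    r"\b(than|as well as|more \w+ than|better|worse)\b", re.IGNORECASE)

def is_comparative(sentence):
    """True if the sentence contains a comparison indicator."""
    return bool(COMPARATIVE_INDICATORS.search(sentence))

print(is_comparative("Ipod is better than Zune"))  # True
print(is_comparative("The battery life is long"))  # False
```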

  34. Other works • Opinion Integration [Lu & Zhai, WWW '08] • Integrate expert reviews with an arbitrary text collection • Expert reviews: well structured, easy to find features, not often updated • Arbitrary text: unstructured, varied, and frequently updated • Semi-supervised topic model • Extract structured aspects (features) from the expert review to cluster general documents • Add supplementary opinions from general documents

  35. Agenda • Overview • Opinion finding & sentiment classification • Opinion Summarization • Other works • Discussion & Conclusion

  36. Challenges in opinion mining • Polarity terms are context sensitive • Ex. 'small' can be good for ipod size, but bad for LCD monitor size • Even in the same domain, different words are used depending on the target feature • Ex. long 'ipod' battery life vs. long 'ipod' loading time • Partially solved (query-dependent sentiment classification) • Implicit and complex opinion expressions • Rhetorical expressions, metaphor, double negation • Ex. 'The food was like a stone' • Need both good IR and NLP techniques for opinion mining • Opinions cannot always be divided cleanly into pos/neg • Not all opinions can be classified into two categories • Interpretation can change based on conditions • Ex. 1) 'The battery life is long if you do not use the LCD a lot' (pos) 2) 'The battery life is short if you use the LCD a lot' (neg). Current systems classify the first as positive and the second as negative; however, both are actually stating the same fact. [Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

  37. Discussion • A difficult task • Essential for many blog or review mining techniques • Current stage of opinion finding • Good performance at the sentence level, in specific domains, and on sub-problems • Still low accuracy in the general case • MAP scores of the top-performing TREC '08 systems • Opinion finding: 0.4569 • Polarity finding: 0.2297~0.2723 • A lot of margin to improve!

  38. References • I. Ounis, C. Macdonald, and I. Soboroff. Overview of the TREC 2008 Blog Track. TREC, 2008. • Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April 21, 2008, Beijing, China. • Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of the 16th International World Wide Web Conference (WWW '07), pages 171–180, 2007. • Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), Seattle, Washington, USA, Aug 22–25, 2004. • Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), San Jose, USA, July 2004. • Yue Lu and ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling. In Proceedings of the 17th International World Wide Web Conference (WWW '08). • Kavita Ganesan and Hyun Duk Kim. Opinion Mining: A Short Tutorial, 2008. • Hyun Duk Kim, Dae Hoon Park, V.G.Vinod Vydiswaran, and ChengXiang Zhai. Opinion Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot. Text Analysis Conference (TAC), Maryland, USA.

  39. References • Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung, and J.-H. Lee. KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval. TREC, 2008. • L. Jia, C. Yu, and W. Zhang. UIC at TREC 2008 Blog Track. TREC, 2008. • Nitin Jindal and Bing Liu. Identifying Comparative Sentences in Text Documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle, 2006. • Opinion Mining and Summarization (including review spam detection). Tutorial given at WWW-2008, April 21, 2008, Beijing, China. • Murthy Ganapathibhotla and Bing Liu. Mining Opinions in Comparative Sentences. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 241–248, Manchester, August 2008.

  40. Thank you

  41. Mining User Query Logs for Personalized Search Yuanhua Lv (Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)

  42. Problem of Current Search Engines • The query 'Jaguar' is ambiguous: it may refer to the animal, the car, Apple software (Mac OS X), or chemistry software • Suppose we know: • Short-term query logs: previous query = 'racing cars' [Shen et al. 05] • Long-term query logs: 'car' occurs far more frequently than 'Apple' in the user's query logs of the recent 2 months [Tan et al. 06]

  43. Problem Definition • A user has a hidden information need (e.g., Apple software) and issues a sequence of queries Q1, Q2, …, Qk • Each query has associated clickthrough records: C1 = {C1,1, C1,2, C1,3, …}, C2 = {C2,1, C2,2, C2,3, …}, … • Ex. clicking 'Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of …' • How to model and mine these user query logs?

  44. Mining query logs to update the query model • Retrieval model basis: unigram language models + KL divergence as the similarity measure • The current query Qk is represented by a query model θQk and each candidate document D by a document model θD • Query logs are mined to update θQk
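Concretely, in this framework a document D is scored by the negative KL divergence between the query model and the document model:

```latex
\mathrm{score}(Q_k, D)
  = -\,D\!\left(\theta_{Q_k} \,\middle\|\, \theta_D\right)
  = -\sum_{w \in V} p(w \mid \theta_{Q_k})
      \log \frac{p(w \mid \theta_{Q_k})}{p(w \mid \theta_D)}
```

This is rank-equivalent to \(\sum_{w} p(w \mid \theta_{Q_k}) \log p(w \mid \theta_D)\); mining the query logs then amounts to replacing the raw query model \(\theta_{Q_k}\) with an updated one.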

  45. Mining Short-term User Query Logs [Shen et al. 05] • Average the user's previous clickthrough C1 … Ck-1 • Average the user's previous queries Q1 … Qk-1 • Combine previous clickthrough and previous queries into a history model • Linearly interpolate the current query Qk and the history model

  46. Four Heuristic Variants • FixInt: fixed coefficient interpolation

  47. Mining Short-term User Query Logs [Shen et al. 05] (cont.) • Average the previous clickthrough and previous queries, combine them, and linearly interpolate with the current query Qk • Open question: should the interpolation coefficient α be fixed?

  48. Four Heuristic Variants • FixInt: fixed coefficient interpolation • BayesInt: adapt the interpolation coefficient to the query length • Intuition: if the current query Qk is longer, we should trust Qk more

  49. Mining Short-term User Query Logs [Shen et al. 05] (cont.) • The same picture raises two questions: should the interpolation coefficient α be fixed, and should the history simply be averaged?

  50. Four Heuristic Variants • FixInt: fixed coefficient interpolation • BayesInt: adapt the interpolation coefficient to the query length • Intuition: if the current query Qk is longer, we should trust Qk more • OnlineUp: assign more weight to more recent records • BatchUp: the user becomes better and better at query formulation as time goes on, but we do not need to 'decay' the clickthrough
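A small sketch of the FixInt-style updating (the interpolation structure follows Shen et al. 05, but the toy language models and coefficient values here are illustrative assumptions):

```python
# FixInt-style query model updating, sketched: mix the history sources,
# then interpolate with the current query using a fixed coefficient.
def interpolate(models_and_weights):
    """Linearly interpolate unigram LMs given as {word: prob} dicts."""
    combined = {}
    for model, weight in models_and_weights:
        for w, p in model.items():
            combined[w] = combined.get(w, 0.0) + weight * p
    return combined

current_query = {"jaguar": 1.0}                   # theta_Qk
history_queries = {"racing": 0.5, "cars": 0.5}    # average of Q1..Qk-1
history_clicks = {"car": 0.6, "speed": 0.4}       # average of C1..Ck-1

beta, alpha = 0.5, 0.7
history = interpolate([(history_clicks, beta), (history_queries, 1 - beta)])
theta_q = interpolate([(current_query, alpha), (history, 1 - alpha)])
print(theta_q)  # updated query model used for KL-divergence retrieval
```

BayesInt, OnlineUp, and BatchUp vary how the coefficients are set (by query length, by recency, or in one batch) rather than this basic interpolation structure.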
