A Survey of Opinion Mining


  1. A Survey of Opinion Mining • Dongjoo Lee • Intelligent Database Systems Lab., Dept. of Computer Science and Engineering, Seoul National University

  2. Introduction • The Web contains a wealth of opinions about products, politics, and more in newsgroup posts, review sites, and other web sites • A few problems • What is the general opinion on the proposed tax reform? • How is popular opinion on the presidential candidates evolving? • Which of our customers are unsatisfied? Why? • Opinion Mining (OM) • a recent discipline at the crossroads of information retrieval and computational linguistics that is concerned not with the subject of a document, but with the opinion it expresses • Related Areas • Data Mining (DM), Information Retrieval (IR), Text Classification (TC), Text Summarization (TS) Center for E-Business Technology

  3. Agenda • Introduction • Development of Linguistic Resource • Conjunction Method • PMI Method • WordNet Expansion Method • Gloss Use Method • Sentiment Classification • PMI Method • Machine Learning Method • NLP Combined Method • Extracting and Summarizing Opinion Expression • Statistical Approach • NLP Based Approach • Discussion

  4. Development of Linguistic Resource (1) • Linguistic resources can be used to extract opinions and to classify the sentiment of text • Appraisal Theory • Sentiment-related properties are well-defined • A framework of linguistic resources that describes how writers and speakers express inter-subjective and ideological positions • the underlying linguistic foundation of OM • Tasks • Determining the subjectivity of a term • Determining term orientation • Determining the strength of term attitude • Example • Objective: vertical, yellow, liquid • Subjective • Positive: good < excellent • Negative: bad < terrible

  5. Development of Linguistic Resource (2) • Conjunction Method • PMI Method • Orientation • Subjectivity • WordNet Expansion Method • Gloss Use Method • Orientation • Subjectivity • SentiWordNet

  6. Conjunction Method - overview • Hatzivassiloglou and McKeown, 1997 • Hypothesis • Adjectives joined by ‘and’ usually have the same orientation, while adjectives joined by ‘but’ usually have opposite orientations. • Process • Randomly selected seed adjectives with known positive or negative orientation are used to predict orientation. • All conjunctions of adjectives are extracted from the corpus. • A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation. • A clustering algorithm separates the adjectives into two subsets of different orientation, placing as many words of the same orientation as possible into the same subset. • The average frequencies in each group are compared, and the group with the higher frequency is labeled as positive.

  7. Conjunction Method - objective function and constraints • Select the partition p_min that minimizes Φ(p), where |C_i| is the cardinality of cluster i and d(x, y) is the dissimilarity between adjectives x and y • the dissimilarity between adjectives in the same cluster is minimized and the dissimilarity between adjectives in different clusters is maximized • Experiments • HM term set: 1,336 adjectives • 657 positive, 679 negative terms • Methods to improve the performance of orientation prediction • ‘But’ rule: most conjunctions link adjectives of the same orientation, while conjunctions with ‘but’ almost always link adjectives of opposite orientation • log-linear regression model • morphological relationships • adequate-inadequate or thoughtful-thoughtless • log-linear model with morphological relationships: 82.5% accuracy
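The objective function itself did not survive in the transcript. The sketch below assumes the form usually given for this method, Φ(p) = Σ_i (1/|C_i|) Σ_{x,y ∈ C_i} d(x, y), counting each unordered pair once. The adjectives and dissimilarity values are invented for illustration; in the method described above, d(x, y) would come from the log-linear model.

```python
from itertools import combinations

def objective(partition, d):
    """Phi(p): for each cluster, sum the pairwise dissimilarities inside it
    and divide by the cluster size; then add the clusters together."""
    total = 0.0
    for cluster in partition:
        if not cluster:
            continue
        within = sum(d[frozenset(pair)] for pair in combinations(sorted(cluster), 2))
        total += within / len(cluster)
    return total

# Hypothetical dissimilarities between four adjectives (toy values).
d = {
    frozenset(("good", "nice")): 0.1,
    frozenset(("bad", "awful")): 0.2,
    frozenset(("good", "bad")): 0.9,
    frozenset(("good", "awful")): 0.8,
    frozenset(("nice", "bad")): 0.9,
    frozenset(("nice", "awful")): 0.7,
}

by_orientation = [{"good", "nice"}, {"bad", "awful"}]
mixed = [{"good", "bad"}, {"nice", "awful"}]

phi_good = objective(by_orientation, d)
phi_mixed = objective(mixed, d)
```

With these toy values the partition that groups adjectives by orientation scores lower than the mixed one, which is the behaviour the clustering step relies on.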

  8. PMI Method - overview • Pointwise Mutual Information (PMI) • a measure of association used in information theory and statistics • Orientation • Turney and Littman, 2003 • terms with similar orientation tend to co-occur in documents • Subjectivity • Baroni and Vegnaduzzo, 2004 • subjective adjectives tend to occur near other subjective adjectives

  9. PMI Method - predicting semantic orientation • A modified PMI was measured using the number of results returned by the AltaVista search engine with the NEAR operator • Predicting the semantic orientation SO(t) of a term • t: target term • t_i: paradigmatic term • Experiments • With the HM term set and three corpora • With a small corpus, accuracy is no higher than that of the conjunction method. • With a large corpus, accuracy is higher than that of the conjunction method.
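The SO(t) formula is missing from the transcript. A minimal sketch, assuming SO(t) is the sum of PMI(t, t_i) over positive paradigmatic terms minus the same sum over negative ones, with PMI estimated from hit counts. All counts, the seed words, and the smoothing constant below are hypothetical stand-ins for search-engine results.

```python
import math

def pmi(hits_joint, hits_a, hits_b, total, smoothing=0.01):
    """PMI from hit counts; smoothing avoids log(0) when a pair never co-occurs."""
    return math.log2(((hits_joint + smoothing) * total) / (hits_a * hits_b))

def semantic_orientation(term, pos_seeds, neg_seeds, hits, near, total):
    """SO(t): association with positive seeds minus association with negative seeds."""
    def assoc(seed):
        return pmi(near.get((term, seed), 0), hits[term], hits[seed], total)
    return sum(assoc(s) for s in pos_seeds) - sum(assoc(s) for s in neg_seeds)

# Toy hit counts (hypothetical, not real search results).
TOTAL = 1_000_000
hits = {"superb": 1_000, "dreadful": 800,
        "good": 50_000, "excellent": 20_000, "bad": 40_000, "poor": 30_000}
near = {("superb", "good"): 400, ("superb", "excellent"): 300,
        ("superb", "bad"): 20, ("superb", "poor"): 10,
        ("dreadful", "good"): 10, ("dreadful", "excellent"): 5,
        ("dreadful", "bad"): 300, ("dreadful", "poor"): 250}

pos_seeds, neg_seeds = ["good", "excellent"], ["bad", "poor"]
so_superb = semantic_orientation("superb", pos_seeds, neg_seeds, hits, near, TOTAL)
so_dreadful = semantic_orientation("dreadful", pos_seeds, neg_seeds, hits, near, TOTAL)
```

A positive score is read as positive orientation and a negative score as negative orientation; the magnitude can serve as a rough strength estimate.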

  10. WordNet Expansion Method • Hu et al., 2004 • uses synonym and antonym relationships between words • Hypothesis • adjectives usually share the same orientation as their synonyms and the opposite orientation to their antonyms • Starting from a set of seed adjectives, the orientation of all adjectives in WordNet can be assigned through a procedure that explores the cluster graphs.
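The propagation idea can be sketched as a breadth-first walk over synonym and antonym links. This is only an illustration under simplifying assumptions: the tiny hand-written dictionaries below stand in for WordNet, and the function name is hypothetical rather than taken from the cited work.

```python
from collections import deque

def propagate_orientation(seeds, synonyms, antonyms):
    """Spread +1/-1 labels from seed adjectives: synonyms inherit the label,
    antonyms receive the opposite label. The first label assigned wins."""
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        links = [(w, 1) for w in synonyms.get(word, [])]
        links += [(w, -1) for w in antonyms.get(word, [])]
        for neighbour, sign in links:
            if neighbour not in orientation:
                orientation[neighbour] = sign * orientation[word]
                queue.append(neighbour)
    return orientation

# Toy lexical relations (hypothetical stand-in for WordNet).
synonyms = {"good": ["fine"], "fine": ["good", "decent"], "bad": ["poor"], "poor": ["bad"]}
antonyms = {"good": ["bad"], "bad": ["good"]}

labels = propagate_orientation({"good": 1}, synonyms, antonyms)
```

Starting from the single seed "good", the walk labels "fine" and "decent" as positive and "bad" and "poor" as negative.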

  11. Gloss Use Method - overview • Esuli et al., 2005, 2006 • Hypothesis • Orientation • terms with similar orientation have similar glosses • Subjectivity • terms with similar orientation have similar glosses • terms without orientation have non-oriented glosses • SentiWordNet • All words in WordNet have three scores • positivity, negativity, and objectivity • Each term sense is positioned in an inverted triangle

  12. Gloss Use Method - classification process • Process • (1) A seed set (Lp, Ln) is provided as input • (2) Lexical relations (e.g. synonymy) from a thesaurus or online dictionary are used to extend the seed set. Once added to the original terms, the new terms yield two richer sets, Trp and Trn; together they form the training set for the learning phase of Step 4. • (3) For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti found in a machine-readable dictionary. Each such representation is converted into vector form by standard text indexing techniques. • (4) A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set. • Experiments • Classifiers: NB, SVM, PrTFIDF • 87.38% accuracy
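Steps 3 and 4 can be illustrated with a small multinomial Naive Bayes classifier over bag-of-words gloss vectors. This is a sketch under stated assumptions: the glosses are invented for illustration (not real dictionary entries), whitespace tokenization replaces proper text indexing, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_nb(labelled_glosses):
    """Count gloss words per class. Input: list of (label, gloss_text) pairs."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for label, gloss in labelled_glosses:
        tokens = gloss.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(gloss, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for tok in gloss.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical glosses for the expanded training sets Trp and Trn.
training = [
    ("positive", "having desirable or pleasing qualities"),
    ("positive", "very pleasing and of high quality"),
    ("negative", "having undesirable or harmful qualities"),
    ("negative", "very unpleasant and of low quality"),
]
model = train_nb(training)
label_a = classify("pleasing and desirable", model)
label_b = classify("harmful and unpleasant", model)
```

The classifier never sees the terms themselves, only the words of their glosses, which is what lets the approach label terms outside the seed sets.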

  13. Development of Linguistic Resource - Summary

  14. Sentiment Classification • The process of identifying the sentiment, or polarity, of a piece of text or a document. • Document-level • Sentence-level, phrase-level • Feature-level • Define the target of the opinion and assign a sentiment to that target • Document-level Sentiment Classification Method • PMI method • Machine Learning Method • Default Classifiers • Enhanced Classifier • NLP Combined Method • A Two-Step Classification • Combining Appraisal Theory

  15. PMI Method • Turney, 2002 • Process • Only two-word phrases containing adjectives or adverbs are extracted • Semantic orientation of a phrase • SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”) • The semantic orientation of a review is the average semantic orientation of its phrases • Experiments • 410 reviews from Epinions (epinion.com): 170 positive, 240 negative • calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours
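Because the phrase's own frequency cancels in the difference of the two PMI terms, SO(phrase) reduces to a single log-ratio of hit counts. The sketch below uses that reduction and then averages over a review's phrases. The hit counts, the example phrases, the smoothing constant, and the zero threshold are assumptions made for illustration.

```python
import math

def phrase_so(near_excellent, near_poor, hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
    written as one log-ratio; smoothing guards against zero co-occurrence."""
    return math.log2(((near_excellent + smoothing) * hits_poor)
                     / ((near_poor + smoothing) * hits_excellent))

def classify_review(phrase_hits, hits_excellent, hits_poor):
    """Average SO over the review's phrases; a positive average means 'positive'."""
    scores = [phrase_so(e, p, hits_excellent, hits_poor) for e, p in phrase_hits.values()]
    average = sum(scores) / len(scores)
    return ("positive" if average > 0 else "negative"), average

# Toy counts: phrase -> (hits NEAR "excellent", hits NEAR "poor"). Hypothetical.
HITS_EXCELLENT, HITS_POOR = 20_000, 30_000
review_a = {"low fees": (300, 50), "online service": (200, 60)}
review_b = {"unethical practices": (5, 80), "inconveniently located": (8, 90)}

label_a, avg_a = classify_review(review_a, HITS_EXCELLENT, HITS_POOR)
label_b, avg_b = classify_review(review_b, HITS_EXCELLENT, HITS_POOR)
```

Each phrase needs two proximity queries, which is consistent with the long running time reported on the slide when the counts come from a live search engine.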

  16. ML - Default Classifier • Pang and Lee, 2002 • A special case of text categorization with sentiment-based rather than topic-based categories • Document modeling • standard bag-of-features framework • Experiments • Data: movie reviews (Internet Movie Database), with ratings mapped to negative, neutral, or positive • Naïve Bayes, Maximum Entropy, Support Vector Machine • In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to do the best, although the differences aren’t very large.

  17. ML - Using Only Subjective Sentences • Pang and Lee, 2004 • improved polarity classification by removing objective sentences • A subjectivity detector determines whether each sentence is subjective or not • Standard subjectivity classifier • Subjectivity classifier using proximity relationships • Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.

  18. NLP Combined Method - A Two-Step Classification • Wilson et al., 2005 • A Two-Step Contextual Polarity Classification • employs machine learning and 28 linguistic features • document polarity: the average polarity of phrases • Step 1. A neutral-polar classifier classifies each phrase containing a clue as neutral or polar • Step 2. A polarity classifier takes all phrases marked as polar in Step 1 and disambiguates their contextual polarity (positive, negative, both, or neutral) • 28 features: extracted using NLP techniques with a dependency parser • 4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features, 1 Document Feature • Experiments • Data: Multi-perspective Question Answering (MPQA) Opinion Corpus • [Results table: neutral-polar classification (%) and polarity classification (%); values not preserved in the transcript]
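The control flow of the two steps can be sketched as follows. The two rule-based functions are deliberately simple, hypothetical stand-ins for the trained classifiers and their 28 features; only the wiring (filter neutral phrases first, then disambiguate the rest, then average) reflects the description above.

```python
# Hypothetical prior-polarity clue lexicon and helper lists (toy values).
PRIOR = {"great": 1, "trust": 1, "terrible": -1}
NEUTRAL_USES = {"trust fund"}          # clue used without expressing sentiment
NEGATORS = {"not", "never", "no"}

def is_polar(phrase):
    """Step 1 stand-in: a phrase is polar if it has a clue and is not a known neutral use."""
    has_clue = any(tok in PRIOR for tok in phrase.split())
    return has_clue and phrase not in NEUTRAL_USES

def contextual_polarity(phrase):
    """Step 2 stand-in: sum the prior polarities and flip the sign under negation."""
    tokens = phrase.split()
    score = sum(PRIOR.get(tok, 0) for tok in tokens)
    if any(tok in NEGATORS for tok in tokens):
        score = -score
    return 1 if score > 0 else -1 if score < 0 else 0

def two_step(phrases):
    """Run step 1, then step 2 only on phrases that step 1 marked as polar."""
    return {p: (contextual_polarity(p) if is_polar(p) else 0) for p in phrases}

def document_polarity(phrase_polarities):
    """Document polarity as the average polarity of its phrases."""
    values = list(phrase_polarities.values())
    return sum(values) / len(values)

result = two_step(["trust fund", "not great", "great", "terrible"])
doc_score = document_polarity(result)
```

The point of the split is that a prior-polarity clue such as "trust" can be discarded in step 1 before any polarity decision is attempted.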

  19. NLP Combined Method - Combining Appraisal Theory • Whitelaw et al., 2005 • applied appraisal theory to the machine learning methods of Pang and Lee • Structure of an appraisal • An example: “not very happy” • Experiments • a lexicon of 1,329 appraisal entities was produced semi-automatically from 400 seed terms in around twenty man-hours • combining attitude type and orientation: 90.2% accuracy

  20. Sentiment Classification - Summary

  21. Extracting and Summarizing Opinion Expression • Goal • Extract opinion expressions from large sets of reviews and present them in an effective way • Tasks • Feature Extraction • Sentiment classification at the feature level requires extracting the features that are the targets of opinion words • Sentiment Assignment • Each feature is usually classified as being either favorable or unfavorable. • Visualization • Extracted opinion expressions are summarized and visualized. • Methods • Statistical Approaches • ReviewSeer (2003) • Opinion Observer (2004) • Red Opal (2007) • NLP-Based Approaches • Kanayama System (2004) • WebFountain (2005) • OPINE (2005) • [Figure: pipeline from product reviews through Extract Features, Assign Sentiment, and Summarize]

  22. Opinion Observer - Overview • Hu and Liu, 2005 • Extracts and summarizes opinion expressions from customer reviews on the Web. • Mines only the product features on which customers have expressed opinions and whether those opinions are positive or negative • Overall process • Review crawling • Feature extraction • Sentiment assignment • Opinion word extraction • Opinion orientation identification • Summary generation

  23. Opinion Observer - Tasks • Feature Extraction • Product features are extracted from nouns and noun phrases by the association miner CBA • Compactness pruning, redundancy pruning • Sentiment Assignment • Opinion sentence: a sentence that contains one or more product features and one or more opinion words • Adjectives are the only opinion words • The prior polarity of adjectives was identified by the WordNet expansion method with seed terms • Infrequent features are extracted by using frequent opinion words • The polarity of a sentence is the dominant orientation of its opinion words • Extracted form: (product feature, # of positive sentences, # of negative sentences) • Experiments • Large collection of reviews of 15 electronic products • 86.3% recall, 84.0% precision

  24. Opinion Observer - Visualization • Features of products are compared in a bar graph • The numbers of positive and negative sentences for each feature are normalized • [Figure: bar graph showing the positive portion and negative portion for each feature]
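The sentence-level assignment and the normalized counts behind the bar graph can be sketched in a few lines. This is an illustration only: the sentences, feature names, and adjective polarities are invented, "dominant orientation" is taken here to mean the sign of the summed adjective polarities, and normalization is assumed to mean each feature's share of positive and negative sentences.

```python
from collections import defaultdict

def summarize(opinion_sentences, adjective_polarity):
    """Build (positive, negative) sentence counts per product feature.
    Each sentence is given as (features mentioned, adjectives used)."""
    counts = defaultdict(lambda: [0, 0])
    for features, adjectives in opinion_sentences:
        score = sum(adjective_polarity.get(a, 0) for a in adjectives)
        if score == 0:
            continue  # no dominant orientation, so the sentence is skipped
        for feature in features:
            counts[feature][0 if score > 0 else 1] += 1
    return {f: tuple(c) for f, c in counts.items()}

def normalize(summary):
    """Turn raw counts into positive/negative portions for plotting."""
    return {f: (pos / (pos + neg), neg / (pos + neg)) for f, (pos, neg) in summary.items()}

# Toy data (hypothetical): prior polarities and already-parsed opinion sentences.
polarity = {"sharp": 1, "great": 1, "good": 1, "blurry": -1, "poor": -1, "short": -1}
sentences = [
    (["picture"], ["sharp", "great"]),
    (["picture"], ["blurry"]),
    (["battery"], ["poor", "short"]),
    (["battery", "picture"], ["good"]),
]

summary = summarize(sentences, polarity)
portions = normalize(summary)
```

Normalizing by the per-feature total makes features with very different numbers of mentions comparable on one chart.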

  25. WebFountain - Overview • Yi et al., 2005 • Extracts the target features of sentiment from various resources and assigns polarity to those features • System Architecture • Sentiment Miner • Analyzes grammatical sentence structures and phrases by using NLP techniques

  26. WebFountain - Tasks • Feature Extraction • Candidate features • a part-of relationship with the given topic • an attribute-of relationship with the given topic • an attribute-of relationship with a known feature of the given topic • The bBNP (Beginning definite Base Noun Phrase) heuristic is used • Select the bnp (base noun phrase) candidates that have a high likelihood ratio • Experiments • Precision - digital camera: 97%, music reviews: 100% • Sentiment Assignment • Parse and traverse with two linguistic resources • Sentiment lexicon: defines the sentiment polarity of terms • Sentiment pattern database: contains the sentiment assignment patterns of predicates • Experiments • Product reviews • Recall 56%, Precision 87%

  27. WebFountain - Visualization • Web interface listing sentiment-bearing sentences about a given product

  28. Extracting and Summarizing Opinion Expression - Summary

  29. Discussion • OM is a growing research discipline related to various research areas, such as IR, computational linguistics, TC, TS, and DM. • Three topics were surveyed and summarized. • For Korean OM? • There is no published research on Korean OM. • Language differences may impose some limits on the methods used in the OM subtasks. • Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text • The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior polarity of words for the PMI or conjunction methods. • Research into Korean OM must be conducted in conjunction with other related areas.

  30. Discussion - Research Map of OM

  31. Thank you
