Presentation Transcript


    1. WEB MINING AND APPLICATIONS
        Pallavi Tripathi (105956127), Vaishali Kshatriya (105951122), Mehru Anand (106113525), Minnie Virk (106113516)

    2. CSE:634 Web Mining 2

    3. CITATIONS
        - Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert. Visual Web Mining. 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004.
        - Amir H. Youssefi, David Duke, Ephraim P. Glinert, Mohammed J. Zaki. Toward Visual Web Mining. 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.

    4. With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to rely on automated tools to find the information resources they want, and to track and analyze their usage patterns. These factors create the need for server-side and client-side intelligent systems that can effectively mine for knowledge.

    5. WHAT IS WEB MINING?
        Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.

    6. AREAS OF CLASSIFICATION
        - WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions.
        - WEB STRUCTURE MINING is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web.
        - WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns from web access logs.
        In addition to these three types of web mining, other approaches help with web knowledge discovery, such as information visualization, which helps us understand the complex relationships and structures of many search results.

    7. TOPICS COVERED
        This presentation covers the following algorithms, each related to a different aspect of web mining:
        - The SPADE algorithm and its applications in Visual Web Mining
        - Sentiment Classification
        - The Community Trawling algorithm

    8. VISUAL WEB MINING
        - The application of information visualization techniques to the results of web mining, in order to amplify the perception of extracted patterns and to visually explore new ones in the web domain.
        - The application domain is Web Usage Mining and Web Content Mining.

    9. APPROACH USED
        - Make personalized results for targeted web surfers.
        - Use data mining algorithms to extract new insights and measures.
        - Employ a database server and a relational query language to submit specific queries against the data.
        - Use visualization to obtain an overall picture.

    10. SPADE OVERVIEW
        - Proposed by Mohammed J. Zaki.
        - SPADE stands for Sequential PAttern Discovery using Equivalence classes.
        - An algorithm based on Apriori for fast discovery of frequent sequences.
        - Needs three database scans to extract sequential patterns.
        - Given: a database of customer transactions, each with a sequence-id or customer-id, a transaction-time, and the items involved in the transaction.
        - The aim is to obtain typical behaviors according to the user's viewpoint.

    11. DEFINITIONS
        - Item: the object bought by a customer, the page requested by the user of a website, etc.
        - Itemset: a set of items that are grouped by timestamp.
        - Data Sequence: the sequence of itemsets associated with a customer.
        - Sequential Mining: discovering frequent sequences of attribute sets over time in large databases.
        - Frequent Sequential Pattern: a sequence whose statistical significance in the database is above a user-specified threshold.

    12. SPADE ALGORITHM
        - The first scan finds the frequent items.
        - The second scan finds the frequent sequences of length 2.
        - The last scan associates with each frequent sequence of length 2 a table of the corresponding sequence ids and itemset ids in the database.
        - With this representation held in main memory, the support of a candidate sequence of length k is obtained by joining the tables of the frequent sequences of length (k-1) that generate the candidate.
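
The table-join idea on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Zaki's implementation: it joins the id-list of a prefix pattern with the id-list of a single item rather than joining two (k-1)-sequences, and the names `build_id_lists`, `temporal_join`, and `support` are hypothetical.

```python
from collections import defaultdict


def build_id_lists(db):
    """Build a vertical layout: item -> set of (sequence id, event id).

    db is an iterable of (sid, eid, items) rows, where eid orders the
    events (for example, a timestamp) inside one sequence.
    """
    id_lists = defaultdict(set)
    for sid, eid, items in db:
        for item in items:
            id_lists[item].add((sid, eid))
    return id_lists


def temporal_join(prefix_ids, item_ids):
    """Id-list of the pattern 'prefix, then later item'.

    A pair (sid, eid) of the item is kept when the prefix occurs in the
    same sequence at a strictly earlier event id.
    """
    earliest = {}
    for sid, eid in prefix_ids:
        if sid not in earliest or eid < earliest[sid]:
            earliest[sid] = eid
    return {
        (sid, eid)
        for sid, eid in item_ids
        if sid in earliest and eid > earliest[sid]
    }


def support(id_list):
    """Support = number of distinct sequences that contain the pattern."""
    return len({sid for sid, _ in id_list})
```

Because the join result is itself an id-list, it can be joined again with another item to count the support of a longer candidate without rescanning the database.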

    17. The Visual Web Mining framework provides a prototype implementation for applying information visualization techniques to these results.

    20. User sessions are extracted from the web logs; this yields groups of requests that are each roughly related to a specific user.
        The user sessions are converted into a format suitable for sequence mining.
        The outputs are frequent contiguous sequences with a given minimum support.
        These are imported into a database.
        Different queries are executed against this data.
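
The slide does not say how sessions are cut out of the log. A common heuristic, assumed here purely for illustration, is to group requests by a user key (such as an IP address) and start a new session after 30 minutes of inactivity. The function name `sessionize`, the tuple layout, and the timeout are all assumptions.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Split requests into sessions per user.

    requests: iterable of (user_key, unix_time, url).
    timeout:  assumed inactivity gap in seconds that ends a session.
    Returns a list of (user_key, [url, ...]) in time order per user.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # gap too long: close the running session
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions
```

Each resulting URL list is one input sequence for the sequence-mining step.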

    21. APPLICATIONS
        - Designing different visualization diagrams and exploring frequent patterns of user access on a website.
        - Classifying web pages into two classes, hot and cold, that attract high and low numbers of visitors respectively.
        - A webmaster can make exploratory changes to the website structure and analyze the resulting change in user access patterns in the real world.

    22. Sentiment Classification Vaishali Kshatriya 105951122

    23. References
        - Philip Beineke, Shivakumar Vaithyanathan, Trevor Hastie. The Sentimental Factor: Improving Review Classification via Human-Provided Information.
        - Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. July 2002.
        - http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
        - http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances
        - Bing Liu, Minqing Hu, Junsheng Cheng. Opinion Observer: Analyzing and Comparing Opinions on the Web. Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

    24. Sentiment Classification
        - The task of labeling a review document according to the polarity of its prevailing opinion.
        - Motivation:

    25. CSE:634 Web Mining 25 Online Shopping

    26. Topical vs. Sentimental Classification
        Topical Classification
        - Classifies documents into subjects, for example Mathematics, Sports, etc.
        - Compares individual words (unigrams) across subject areas (Bag-of-Words approach).
        - Example: “score”, “referee”, “football” => Sports
        Sentiment Classification
        - Classifies documents according to their overall sentiment: positive vs. negative.
        - E.g. like vs. dislike; recommended vs. not recommended.
        - More difficult than traditional topical classification; may need more linguistic processing.
        - E.g. “you will be disappointed” and “it is not satisfactory”

    27. Challenges
        - The sentiment of a word depends on its context: an “unpredictable” plot vs. an “unpredictable” performance.
        - Negations have to be captured: “The movie was not that bad.” “The pictures taken by the cell is not of best quality.”
        - Subtle expressions: “How can someone sit through the entire movie?”

    28. Unsupervised review classification (Turney, ACL-02)
        Input: a written review
        Output: a classification (i.e. positive or negative)
        Step 1: Use a part-of-speech tagger to identify phrases.
        Step 2: Estimate the semantic orientation of each extracted phrase.
        Step 3: Assign the given review to a class (either recommended or not recommended).

    29. Step 1: Extract the phrases
        - A part-of-speech tagger is applied to the review.
        - Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table.
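
The pattern table itself is not part of this transcript. The sketch below uses the five two-word tag patterns recalled from Turney (2002), with Penn Treebank tags; treat the `PATTERNS` list as an assumption to be checked against the paper. The input is assumed to be already tagged, so no tagger library is called, and `extract_phrases` is a hypothetical name.

```python
NOUN = {"NN", "NNS"}
ADJ = {"JJ"}
ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

# (tags allowed for word 1, tags allowed for word 2,
#  True if the third word must NOT be a noun)
# Assumed from Turney (2002); verify against the paper's table.
PATTERNS = [
    (ADJ, NOUN, False),
    (ADV, ADJ, True),
    (ADJ, ADJ, True),
    (NOUN, ADJ, True),
    (ADV, VERB, False),
]


def extract_phrases(tagged):
    """Return two-word phrases whose tags match one of PATTERNS.

    tagged: list of (word, tag) pairs for one review, in order.
    """
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # tag of the following word, or None at the end of the review
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, third_not_noun in PATTERNS:
            blocked = third_not_noun and t3 in NOUN
            if t1 in first and t2 in second and not blocked:
                phrases.append(f"{w1} {w2}")
                break
    return phrases
```

The third-word condition keeps the extractor from splitting a longer noun phrase: in "very nice camera", the pair "nice camera" is taken rather than "very nice".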

    30. Step 2: Estimate the semantic orientation
        - Uses PMI-IR (Pointwise Mutual Information and Information Retrieval).
        - The PMI between two words, word1 and word2, is defined as: PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )
        - The Semantic Orientation (SO) of a phrase is calculated as: SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)
        - SO is positive when the phrase is more strongly associated with “excellent” and negative when it is more strongly associated with “poor”.

    31. Step 2 (cont’d)
        - PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents).
        - The experiment uses AltaVista.

    32. Step 3: Assign a Class
        Calculate the average SO of the phrases in the review, and classify the review as recommended if the average is positive and as not recommended if the average is negative.
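
Steps 2 and 3 can be sketched as follows, assuming the hit counts have already been collected. Substituting hit counts for probabilities in SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor") makes the total document count and hits(phrase) cancel, leaving log2 of (hits(phrase NEAR "excellent") x hits("poor")) / (hits(phrase NEAR "poor") x hits("excellent")). The function names, the `smoothing` parameter (a small constant to avoid division by zero), and the treatment of an average of exactly zero are assumptions of this sketch.

```python
import math


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Estimate SO(phrase) from search-engine hit counts.

    near_excellent: hits for the phrase near "excellent"
    near_poor:      hits for the phrase near "poor"
    hits_excellent: hits for "excellent" alone
    hits_poor:      hits for "poor" alone
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


def classify_review(phrase_orientations):
    """Average the SO values of a review's phrases and pick a class.

    An average of exactly zero is not covered by the slide; this
    sketch treats it as "not recommended".
    """
    if not phrase_orientations:
        raise ValueError("review has no extracted phrases")
    average = sum(phrase_orientations) / len(phrase_orientations)
    return "recommended" if average > 0 else "not recommended"
```

The hit counts are plain arguments because the search engine used in the original experiment is not something the sketch can query.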

    33. Drawbacks
        - Sentiment classification is useful, but it does not find what the reviewer liked or disliked.
        - A negative sentiment on an object does not imply that the user did not like anything about the product.
        - Similarly, a positive sentiment does not imply that the user liked everything about the product.
        - The solution is to go to the sentence and feature level.

    34. Feature-based opinion mining and summarization (Hu and Liu ‘04)
        - Interested in what reviewers liked and disliked.
        - Since the number of reviews of an object can be large, the goal was to produce a simple summary of the reviews.
        - The summary can be easily visualized and compared.

    35. Three main tasks:
        Step 1: Identify and extract the object features that have been commented on in each review.
        Step 2: Determine whether the opinion on the review is positive, negative or neutral.
        Step 3: Group synonyms of features.
        Produce a feature-based summary!

    36. CSE:634 Web Mining 36 Online Shopping

    37. Summary
        - Classifying reviews as good or bad is sentiment classification.
        - Unsupervised review classification extracts phrases from the review, estimates their semantic orientation, and assigns a class to the review.
        - The remedy for the shortcomings of sentiment classification is feature-based opinion extraction.

    38. Discovering Web communities on the web

    39. Web Mining: A Bird Eye View WISE-30 by Sanjay Kumar Madria
        References
        - David Gibson, Jon Kleinberg, Prabhakar Raghavan. Inferring Web Communities from Link Topology. UK Conference on Hypertext, 1998.
        - Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins. Trawling the web for emerging cyber-communities. WWW8 / Computer Networks, 1999.
        - Jeffrey Dean, Monika R. Henzinger. Finding Related Pages in the World Wide Web. WWW8 / Computer Networks, 1999.
        - Maxim Lifantsev. A System for Collaborative Web Resource Categorization and Ranking.
        - Sanjay Kumar Madria. Web Mining: A Bird’s Eye View. Department of Computer Science, University of Missouri-Rolla, MO, madrias@umr.edu

    48. Trawling the Web for emerging cyber-communities
        Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins
        Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada
        Pages: 1481-1493, Year of Publication: 1999, ISSN: 1389-1286

    54. Main idea: pruning

    62. Mining Topic-Specific Concepts and Definitions on the Web
        Minnie Virk
        Paper: Proceedings of the 12th International Conference on World Wide Web, ACM Press, May 2003
        Authors: Bing Liu (University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053); Chee Wee Chin and Hwee Tou Ng (National University of Singapore, 3 Science Drive 2, Singapore)

    63. References
        - Agrawal, R. and Srikant, R. “Fast Algorithm for Mining Association Rules”, VLDB-94, 1994.
        - Anderson, C. and Horvitz, E. “Web Montage: A Dynamic Personalized Start Page”, WWW-02, 2002.
        - Brin, S. and Page, L. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, WWW7, 1998.
        - Sanjay Kumar Madria. Web Mining: A Bird’s Eye View. Department of Computer Science, University of Missouri-Rolla, MO, madrias@umr.edu

    64. Introduction
        - When one wants to learn about a topic, one reads a book or a survey paper.
        - One can also read the research papers about the topic.
        - None of these is very practical.
        - Learning from the web is convenient, intuitive, and diverse.

    65. Purpose of the Paper
        - This paper’s task is “mining topic-specific knowledge on the Web”.
        - The goal is to help people learn in-depth knowledge of a topic systematically on the Web.

    66. Learning about a New Topic
        - One needs to find definitions and descriptions of the topic.
        - One also needs to know the sub-topics and salient concepts of the topic.
        - Thus, one wants the knowledge as presented in a traditional book.
        - The task of this paper can be summarized as “compiling a book on the Web”.

    67. Proposed Technique
        - First, identify the sub-topics or salient concepts of the specific topic.
        - Then, find and organize the informative pages containing definitions and descriptions of the topic and sub-topics.

    68. Why are current search techniques not sufficient?
        - For definitions and descriptions of the topic: existing search engines rank web pages based on keyword matching and hyperlink structure. This is NOT very useful for measuring the informative value of a page.
        - For sub-topics and salient concepts of the topic: a single web page is unlikely to contain information about all the key concepts or sub-topics of the topic, so sub-topics need to be discovered from multiple web pages. Current search engines do not perform this task.

    69. Related Work
        - Web information extraction wrappers
        - Web query languages
        - User preference approach
        - Question answering in information retrieval
        Question answering is closely related to this paper’s task. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper’s task, many of the questions are about definitions of terms.

    70. The Algorithm
        WebLearn(T)
        1) Submit T to a search engine, which returns a set of relevant pages.
        2) The system mines the sub-topics or salient concepts of T using a set S of top-ranking pages from the search engine.
        3) The system then discovers the informative pages containing definitions of the topic and sub-topics (salient concepts) from S.
        4) The user views the concepts and informative pages. If s/he still wants to know more about the sub-topics, then for each user-interested sub-topic Ti of T do WebLearn(Ti).
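
The recursive shape of WebLearn can be sketched as below. The search engine, the sub-topic miner, the definition finder, and the user's choice are passed in as callables because the slides do not define their interfaces; all parameter names, the result dictionary, and the `max_depth` guard are assumptions of this sketch.

```python
def web_learn(topic, search, mine_subtopics, find_definitions,
              wants_more, depth=0, max_depth=2):
    """Recursive skeleton of WebLearn(T).

    search(topic)                     -> list of top-ranking pages (S)
    mine_subtopics(topic, pages)      -> list of sub-topic strings
    find_definitions(concepts, pages) -> informative pages
    wants_more(subtopic)              -> True if the user wants to drill down
    max_depth is an added guard against unbounded recursion.
    """
    pages = search(topic)                                          # step 1
    subtopics = mine_subtopics(topic, pages)                       # step 2
    informative = find_definitions([topic] + subtopics, pages)     # step 3
    result = {"topic": topic,
              "informative_pages": informative,
              "subtopics": []}
    if depth < max_depth:                                          # step 4
        for sub in subtopics:
            if wants_more(sub):
                result["subtopics"].append(
                    web_learn(sub, search, mine_subtopics,
                              find_definitions, wants_more,
                              depth + 1, max_depth))
    return result
```

The nested dictionaries mirror the book-like organization the paper aims for: a topic, its informative pages, and a chapter per sub-topic the user chose to expand.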

    71. Sub-Topic or Salient Concept Discovery
        - Observation: sub-topics or salient concepts of a topic are important word phrases, usually emphasized using some HTML tags (e.g., <h1>,...,<h4>,<b>).
        - However, this is not sufficient. Data mining techniques can help to find the frequently occurring word phrases.

    72. Sub-Topic Discovery
        After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following 5 steps.
        1) Filter out the “noisy” documents that rarely contain sub-topics or salient concepts. The resulting set of documents is the source for sub-topic discovery.

    73. Sub-Topic Discovery
        2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags).
        Rules to determine whether a markup tag can safely be ignored:
        - It contains a salutation title (Mr, Dr, Professor).
        - It contains a URL or an email address.
        - It contains terms related to a publication (conference, proceedings, journal).
        - It contains an image between the markup tags.
        - It is too lengthy (the paper uses 15 words as the upper limit).
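
The five rules above translate naturally into a small filter. The regular expressions below are rough approximations chosen for this sketch, not the paper's exact tests, and `can_ignore` is a hypothetical name; it takes the HTML found between an emphasizing tag pair.

```python
import re

SALUTATION = re.compile(r"\b(Mr|Dr|Professor)\b", re.IGNORECASE)
URL_OR_EMAIL = re.compile(r"(https?://|www\.)\S+|\b\S+@\S+\.\S+", re.IGNORECASE)
PUBLICATION = re.compile(r"\b(conference|proceedings|journal)\b", re.IGNORECASE)
IMAGE = re.compile(r"<img\b", re.IGNORECASE)
MAX_WORDS = 15  # upper limit stated on the slide


def can_ignore(inner_html):
    """Return True if an emphasized segment should be skipped.

    inner_html: the markup between an emphasizing tag pair,
    for example the content of <b>...</b>.
    """
    # drop nested tags to get the visible text
    text = re.sub(r"<[^>]+>", " ", inner_html)
    return bool(
        IMAGE.search(inner_html)
        or SALUTATION.search(text)
        or URL_OR_EMAIL.search(text)
        or PUBLICATION.search(text)
        or len(text.split()) > MAX_WORDS
    )
```

Segments that pass the filter go on to stopword removal and stemming before they are mined.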

    74. Sub-Topic Discovery
        Also in this step, preprocessing techniques such as stopword removal and word stemming are applied in order to extract quality text segments.
        - Stopword removal: eliminating the words that occur too frequently and carry little informational meaning.
        - Word stemming: finding the root form of a word by removing its suffix.

    75. Sub-Topic Discovery
        3) Mine frequently occurring phrases:
        - Each piece of text extracted in step 2 is stored in a dataset called a transaction set.
        - Then, an association rule miner based on the Apriori algorithm is executed to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.
        - Only the first step of the Apriori algorithm (finding frequent itemsets, without generating rules) is needed, and only frequent itemsets with three words or fewer need to be found (this restriction can be relaxed).
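
A minimal level-wise sketch of this step follows. It assumes each transaction carries the id of the document it came from, counts support as the number of distinct documents, and reads "more than two documents" as a minimum of three. The function name and parameter names are illustrative, and this is a plain Apriori frequent-itemset pass rather than the paper's own miner.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_docs=3, max_size=3):
    """Find word sets that occur together in at least min_docs documents.

    transactions: list of (doc_id, iterable of words), one entry per
                  extracted text segment.
    Returns a dict mapping frozenset of words -> document support.
    """
    txns = [(doc, frozenset(words)) for doc, words in transactions]

    def doc_support(itemset):
        return len({doc for doc, words in txns if itemset <= words})

    # level 1 candidates: every single word
    candidates = {frozenset([w]) for _, words in txns for w in words}
    result = {}
    for size in range(1, max_size + 1):
        frequent = {}
        for cand in candidates:
            count = doc_support(cand)
            if count >= min_docs:
                frequent[cand] = count
        result.update(frequent)

        # join frequent sets of this size, then prune any candidate
        # that has an infrequent subset (the Apriori property)
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in frequent
                for sub in combinations(union, size)
            ):
                candidates.add(union)
    return result
```

Itemsets are unordered, which is why the next step has to recover the word order of each sub-topic from the original phrases.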

    76. Sub-Topic Discovery
        4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in each sub-topic (postprocessing).
        Heuristic: if an itemset does not appear alone as an important phrase in any page, it is unlikely to be a main sub-topic and it is removed.

    77. Sub-Topic Discovery
        5) Rank the remaining itemsets. The remaining itemsets are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.
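
Steps 4 and 5 can be sketched together. The sketch assumes the important phrases of each page are available as word lists, keeps an itemset only if some phrase consists of exactly its words, takes the word order from the first such phrase, and counts a page whenever any of its phrases contains the itemset. These choices, and the name `rank_subtopics`, are assumptions; the slides do not fix them.

```python
def rank_subtopics(itemsets, phrases_by_page):
    """Filter itemsets by the 'appears alone' heuristic and rank them.

    itemsets:        iterable of frozensets of words
    phrases_by_page: dict page -> list of phrases, each a list of words
    Returns a list of (sub-topic string, page count), best first.
    """
    ranked = []
    for itemset in itemsets:
        pages = set()
        ordering = None
        for page, phrases in phrases_by_page.items():
            for phrase in phrases:
                words = frozenset(phrase)
                if itemset <= words:
                    pages.add(page)
                # the itemset appears alone as an important phrase
                if words == itemset and ordering is None:
                    ordering = tuple(phrase)
        if ordering is not None:
            ranked.append((" ".join(ordering), len(pages)))
    # most pages first; ties broken alphabetically for stable output
    ranked.sort(key=lambda entry: (-entry[1], entry[0]))
    return ranked
```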

    78. Definition Finding
        This step tries to identify the pages that include definitions of the search topic and of its sub-topics discovered in the previous step.
        Preprocessing steps:
        - Text that will not be displayed by browsers (e.g., <script>...</script>, <!-- comments -->) is ignored.
        - Word stemming is applied.
        - Stopwords and punctuation are kept, as they serve as clues for identifying definitions.
        - HTML tags within a paragraph are removed.

    79. Definition Finding
        After that, the following patterns are applied to identify definitions:

    80. Definition Finding
        Besides using the above patterns, the paper also relies on HTML structuring and hyperlink structures.
        1) If a page contains only one header, or one big emphasized text segment at the beginning of the document, then the document is taken to contain a definition of the concept in the header.
        2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second-level documents.

    81. Definition Finding
        Observation: sometimes no informative page is found for a particular sub-topic, because the pages for the main topic are very general and do not contain detailed information on sub-topics. In such cases, the sub-topic can be submitted to the search engine, and sub-subtopics may be found recursively.

    82. Conclusions
        - The proposed techniques aim to help Web users learn an unfamiliar topic in depth and systematically.
        - This is an efficient system for discovering and organizing knowledge on the web, in a way similar to a traditional book, to assist learning.

    84. CSE:634 Web Mining 84 Thank You!
