
Mining Topic-Specific Concepts and Definitions on the Web

CS591CXZ Web mining: Lexical relationship mining. Mining Topic-Specific Concepts and Definitions on the Web. Bing Liu et al., KDD03.



Presentation Transcript


  1. CS591CXZ Web mining: Lexical relationship mining. Mining Topic-Specific Concepts and Definitions on the Web. Bing Liu et al., KDD03

  2. Lexical relationship mining • A lexical relationship is a relationship between words, such as synonymy, antonymy, hypernymy (“dog” is a hypernym of “poodle”), and hyponymy (“poodle” is a hyponym of “dog”) • A lexical relationship is a connection between the meanings of two words in a text that helps the text hold together. Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections within a field of meaning (e.g. plane - pilot). Thus, subtopic mining falls into this category, but definition mining does not.

  3. Information Extraction • MUC: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ • Information extraction: the extraction, or pulling out, of pertinent information from large volumes of text • Percentile reliability by item of information: Entities 90; Attributes 80 (definitions fall here); Facts 70; Events 60 • Attribute: a property of an entity, such as its name, alias, descriptor, or type

  4. Mining Topic-Specific Concepts and Definitions on the Web • Goal: systematically learn an unfamiliar topic from the Web • Definitions • Topic hierarchy • Input: a term, e.g. “data mining” or “Web mining” • Tasks • Identify sub-topics or salient concepts • Like building an ontology, but with no clear hierarchy. E.g.: Genetic Algorithm • Algorithms • Find and organize definition pages • Definition question answering • Concept disambiguation

  5. Techniques • A lot of heuristics • Simple linguistic patterns: {concept} {-|:} {definition}; {concept} {refer(s) to | satisfy(ies)} …; … • Web page tags: <h1>,…,<h4>, <b>, <em>, <li>, … • Frequent pattern mining • A classic data mining technique

  6. Algorithm WebLearn(T) • Submit T to a search engine and get the relevant pages • Mine subtopics or salient concepts of T • Find definition pages • Output the concepts and definition pages to the user. If the user wants to know more about a subtopic T’, run WebLearn(T’)
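The control flow of WebLearn can be sketched in a few lines. This is only an illustration of the loop on the slide, not the authors' implementation: the four callables (`search`, `mine_subtopics`, `find_definition_pages`, `wants_more`) and the `max_depth` guard are hypothetical placeholders for the search engine, the two mining steps, and the user's decision to drill down.

```python
from typing import Any, Callable, Dict, List


def web_learn(
    topic: str,
    search: Callable[[str], List[str]],
    mine_subtopics: Callable[[str, List[str]], List[str]],
    find_definition_pages: Callable[[str, List[str]], List[str]],
    wants_more: Callable[[str], bool],
    max_depth: int = 2,
) -> Dict[str, Any]:
    """Run one WebLearn round for `topic`, then recurse into chosen subtopics."""
    # Step 1: submit the topic to a search engine (hypothetical callable).
    pages = search(topic)
    # Step 2: mine subtopics or salient concepts from the returned pages.
    subtopics = mine_subtopics(topic, pages)
    # Steps 3 and 4: find definition pages and collect everything for output.
    report: Dict[str, Any] = {
        "topic": topic,
        "subtopics": subtopics,
        "definition_pages": find_definition_pages(topic, pages),
        "children": [],
    }
    # Recurse only where the user asks for more; max_depth is an added
    # safeguard (an assumption, not on the slide) so the sketch terminates.
    if max_depth > 0:
        for subtopic in subtopics:
            if wants_more(subtopic):
                report["children"].append(
                    web_learn(subtopic, search, mine_subtopics,
                              find_definition_pages, wants_more, max_depth - 1)
                )
    return report
```

Passing the components in as callables keeps the sketch independent of any particular search engine or mining heuristic.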

  7. Mining subtopic/salient concept (1) Input: a set of top-ranked relevant documents. Steps: 1. Filter out “noisy” documents • Publication listing pages: “in proceeding”, “journal” • Forum discussion pages: “previous message”, “reply to” • Pages that do not contain all query terms
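The noise filter on this slide amounts to a few string checks. The sketch below assumes each page is already available as plain text; the cue phrases are the ones listed on the slide, while the function names and the whole-word matching of query terms are illustrative choices.

```python
import re
from typing import List

# Cue phrases taken from the slide.
PUBLICATION_CUES = ("in proceeding", "journal")
FORUM_CUES = ("previous message", "reply to")


def is_noisy(page_text: str, query: str) -> bool:
    """Return True if the page looks like a publication list, a forum
    thread, or does not contain every query term."""
    lowered = page_text.lower()
    if any(cue in lowered for cue in PUBLICATION_CUES):
        return True
    if any(cue in lowered for cue in FORUM_CUES):
        return True
    # Whole-word check for query terms (an assumption of this sketch).
    words = set(re.findall(r"[a-z0-9]+", lowered))
    return not all(term in words for term in query.lower().split())


def filter_noisy(pages: List[str], query: str) -> List[str]:
    """Keep only the pages that pass all three checks."""
    return [page for page in pages if not is_noisy(page, query)]
```

Note that the substring "in proceeding" also matches "in proceedings", which is presumably why the slide lists the shorter form.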

  8. Mining subtopic/salient concept (2) 2. Identify important phrases in each page • Extract text segments in HTML emphasizing tags: <h1>,…,<h4>, <b>, <em>, <li>, … • Except those containing: • Salutation titles (Mr., Dr., Professor) • URLs or email addresses • “conference”, “journal”, … • Digits (e.g. KDD2004) • Images • Too many words (limit: 15 words)
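A rough sketch of this extraction step, using Python's standard `html.parser`. The tag set covers only the tags named on the slide (the slide's "…" implies there are more), the rejection regex is one possible reading of the exclusion list, and the parser assumes emphasizing tags are properly closed and nested.

```python
import re
from html.parser import HTMLParser
from typing import List

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "em", "li"}
MAX_WORDS = 15
# Salutations, URLs, email addresses, "conference"/"journal", or any digit.
REJECT = re.compile(
    r"\b(mr|dr|professor)\b|https?://|www\.|\S+@\S+|conference|journal|\d",
    re.IGNORECASE,
)


class EmphasisExtractor(HTMLParser):
    """Collects text inside emphasizing tags, applying the slide's filters."""

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[str] = []
        self._open = []  # one [tag, text_parts, has_image] entry per open tag

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for frame in self._open:  # mark every enclosing segment
                frame[2] = True
        elif tag in EMPHASIS_TAGS:
            self._open.append([tag, [], False])

    def handle_data(self, data):
        for frame in self._open:
            frame[1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, parts, has_image = self._open.pop()
            text = " ".join("".join(parts).split())
            too_long = len(text.split()) > MAX_WORDS
            if text and not has_image and not too_long and not REJECT.search(text):
                self.segments.append(text)


def extract_emphasized(html: str) -> List[str]:
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments
```

Real Web pages often leave `<li>` unclosed, so a production version would need a more forgiving parser than this one.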

  9. Mining subtopic/salient concept (3) 3. Mine frequent phrases • Input: emphasized text segments • Mine frequent word sets using association rule mining 4. Eliminate word sets unlikely to be subtopics • Heuristic: drop those that do not appear alone in an emphasizing tag in any page, e.g. “process” • Remove generic words from the result set: “abstract”, “introduction”, “conclusion”, “research”, … 5. Rank the result sets according to the number of pages in which they occur
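Steps 3 to 5 can be approximated as follows. This sketch swaps the association rule miner for brute-force enumeration of small word sets, which is adequate only for short segments; `min_pages`, `max_size`, and the reading of "generic" as "made up entirely of generic words" are assumptions, and the generic list holds only the four examples from the slide.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

GENERIC = {"abstract", "introduction", "conclusion", "research"}


def mine_subtopics(
    pages: List[List[str]], min_pages: int = 2, max_size: int = 3
) -> List[Tuple[str, int]]:
    """pages[i] is the list of emphasized segments found on page i.
    Returns (word set, page count) pairs, most frequent first."""
    support: Dict[FrozenSet[str], Set[int]] = defaultdict(set)
    standalone: Set[FrozenSet[str]] = set()
    for page_id, segments in enumerate(pages):
        for segment in segments:
            words = frozenset(segment.lower().split())
            standalone.add(words)  # this word set fills a whole tag by itself
            # Step 3: count every small word set once per page.
            for size in range(1, min(max_size, len(words)) + 1):
                for combo in combinations(sorted(words), size):
                    support[frozenset(combo)].add(page_id)
    results = []
    for word_set, page_ids in support.items():
        if len(page_ids) < min_pages:
            continue  # not frequent
        if word_set not in standalone:
            continue  # step 4: never appears alone in an emphasizing tag
        if word_set <= GENERIC:
            continue  # step 4: consists only of generic words
        results.append((" ".join(sorted(word_set)), len(page_ids)))
    # Step 5: rank by number of pages, ties broken alphabetically.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results
```

Because word sets are unordered, the words in each result are printed alphabetically rather than in their original phrase order.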

  10. Definition Finding • Definition identification patterns suitable for Web pages: {concept} {-|:} {definition}; {concept} {refer(s) to | satisfy(ies)} … • HTML structuring clues and hyperlinks • If a page has only one header (<h1>, <h2>, …) or one big emphasized segment at the beginning => definition page • Follow hyperlinks up to the second level to look for definition pages, and only hyperlinks whose anchor text matches the concept
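The two patterns spelled out on the slide translate naturally into regular expressions. The sketch below covers only those two (the slide's "…" indicates the full list is longer), and the function names and the sentence-level input are assumptions.

```python
import re
from typing import List, Pattern


def definition_patterns(concept: str) -> List[Pattern[str]]:
    """Build regexes for the two definition patterns named on the slide."""
    c = re.escape(concept)
    return [
        # {concept} {-|:} {definition}
        re.compile(rf"\b{c}\s*[-:]\s+(?P<definition>.+)", re.IGNORECASE),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(
            rf"\b{c}\s+(?:refers?\s+to|satisf(?:y|ies))\s+(?P<definition>.+)",
            re.IGNORECASE,
        ),
    ]


def find_definitions(concept: str, sentences: List[str]) -> List[str]:
    """Return the sentences that match at least one definition pattern."""
    patterns = definition_patterns(concept)
    return [s for s in sentences if any(p.search(s) for p in patterns)]
```

Requiring whitespace after the dash or colon keeps hyphenated compounds such as "data mining-based" from being mistaken for definitions.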

  11. Subtopic disambiguation • By adding context terms • usually the parent topic or subtopics • context terms tend to dominate the results • cannot work for the first (root) topic • Heuristics to combat the domination of context terms • only consider text segments containing the topic or subtopic • identify pages with a topic hierarchy in HTML list tags (<li>); the hierarchy should also contain other subtopics of the parent topic • shallow linguistic phenomena: Topic + “approaches” / “techniques” + ( + “e.g.” / “such as” / “including” + subtopics ). Then, how does this help disambiguate?
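The shallow linguistic cue in the last bullet can be illustrated with a regular expression that pulls the listed subtopics out of a sentence. This is one possible encoding of the pattern; the function name and the splitting on commas, "and", and "or" are assumptions, and the sketch does not address the open question of how the extracted list is then used for disambiguation.

```python
import re
from typing import List


def listed_subtopics(topic: str, sentence: str) -> List[str]:
    """Extract items listed after 'topic approaches/techniques' followed by
    'e.g.', 'such as', or 'including'. Returns [] if the cue is absent."""
    pattern = re.compile(
        rf"\b{re.escape(topic)}\s+(?:approaches|techniques)\s*,?\s*\(?\s*"
        rf"(?:e\.g\.?,?|such\s+as|including)\s+(?P<items>[^.()]+)",
        re.IGNORECASE,
    )
    match = pattern.search(sentence)
    if match is None:
        return []
    # Split the enumeration on commas and on the words "and" / "or".
    parts = re.split(r",|\band\b|\bor\b", match.group("items"))
    return [part.strip() for part in parts if part.strip()]
```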

  12. Evaluation • Use Google to get the initial set of relevant pages • Result 1: subtopics / salient concepts. Looks pretty good; the terms are closely relevant. More salient concepts than subtopics • Result 2: definition discovery comparison. Precision: WebLearn vs Google vs AskJeeves • Result 3: disambiguation. Seems to be useful

  13. Analysis • Interesting topic, with potential to be used in practice • A complete system • Techniques • Avoid NLP and machine learning • Apply heuristics based on shallow text structures

  14. Limitations • Research topics, not much ambiguity • Techniques: • Heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains

  15. How to improve? -- discussion • Better research: • do you think it is a good research topic? • Better techniques: • what techniques would you like to try to solve the problem?

  16. Thank you!
