
ICS 278: Data Mining Lecture 14: Text Mining and Information Retrieval



  1. ICS 278: Data Mining
  Lecture 14: Text Mining and Information Retrieval
  Padhraic Smyth
  Department of Information and Computer Science
  University of California, Irvine

  2. Lecture Topics in Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

  3. Text Mining Applications
  • Information Retrieval
    • Query-based search of large text archives, e.g., the Web
  • Text Classification
    • Automated assignment of topics to Web pages, e.g., Yahoo, Google
    • Automated classification of email into spam and non-spam
  • Text Clustering
    • Automated organization of search results in real time into categories
    • Discovering clusters and trends in technical literature (e.g., CiteSeer)
  • Information Extraction
    • Extracting standard fields from free text, e.g.:
      • extracting names and places from reports and newspapers (e.g., military applications)
      • extracting structured information automatically from resumes
      • extracting protein-interaction information from biology papers

  4. Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

  5. General concepts in Information Retrieval
  • Representation language
    • typically a vector of d attribute values, e.g.,
      • a set of color, intensity, and texture features characterizing images
      • word counts for text documents
  • Data set D of N objects
    • typically represented as an N x d matrix
  • Query Q:
    • user poses a query to search D
    • the query is typically expressed in the same representation language as the data, e.g.,
      • each text document is the set of words that occur in the document
      • query Q is also expressed as a set of words, e.g., "data" and "mining"

  6. Query by Content
  • Traditional DB query: exact matches
    • e.g., query Q = [level = MANAGER] AND [age < 30]
    • or Boolean match on text
      • query = "Irvine" AND "fun": return all docs containing both "Irvine" and "fun"
    • not useful when there are many matches
      • e.g., "data mining" in Google returns 60 million documents
  • Query by content: more general / less precise
    • e.g., which record is most similar to a query Q?
    • for text data, often called "information retrieval" (IR)
    • can also be used for images, sequences, video, etc.
    • Q can itself be an object (e.g., a document) or a shorter version (e.g., 1 word)
  • Goal
    • match query Q to the N objects in the database
    • return a ranked list (typically) of the most similar/relevant objects in the data set D given Q

  7. Issues in Query by Content
  • What representation language to use
  • How to measure similarity between Q and each object in D
  • How to compute the results in real time (for interactive querying)
  • How to rank the results for the user
  • Allowing user feedback (query modification)
  • How to evaluate and compare different IR algorithms/systems

  8. The Standard Approach
  • Fixed-length (d-dimensional) vector representation
    • for query (1-by-d Q) and database (n-by-d X) objects
    • use domain-specific higher-level features (vs. raw data)
      • image: "bag of features": color (e.g., RGB), texture (e.g., Gabor, Fourier coeffs), ...
      • text: "bag of words": frequency count for each word in each document, ...
    • also known as the "vector-space" model
  • Compute distances between the vectorized representations
    • use k-NN to find the k vectors in X closest to Q (see the sketch below)
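A minimal sketch of the bag-of-words, vector-space representation, here using scikit-learn's CountVectorizer; the toy corpus and query are invented for illustration, and the slides do not prescribe any particular library:

```python
# Bag-of-words sketch: documents and query in the same vector space.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "text mining applies data mining to documents",
        "the fishing trip was fun"]

vectorizer = CountVectorizer()              # tokenize, lowercase, count words
X = vectorizer.fit_transform(docs)          # n-by-d sparse document-term matrix
Q = vectorizer.transform(["data mining"])   # 1-by-d query in the same space

print(vectorizer.get_feature_names_out())
print(X.toarray())                          # each row: one document's term counts
```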

  9. Text Retrieval
  • document: book, paper, WWW page, ...
  • term: word, word pair, phrase, ... (often 50,000+ terms)
  • query Q = set of terms, e.g., "data" + "mining"
  • full NLP (natural language processing) is too hard, so ...
  • want a (vector) representation for text that
    • retains maximum useful semantics
    • supports efficient distance computations between docs and Q
  • term weights
    • Boolean (e.g., term in document or not): "bag of words"
    • real-valued (e.g., frequency of term in doc, possibly relative to all docs) ...
  • notice: this loses word order, sentence structure, etc.

  10. Practical Issues
  • Tokenization
    • convert document to word counts
    • word token = "any nonempty sequence of characters"
    • for HTML (etc.) need to remove formatting
  • Canonical forms, stopwords, stemming
    • remove capitalization
    • stopwords: remove very frequent words (a, the, and, ...); can use a standard list
    • can also remove very rare words
    • stemming (next slide)
  • Data representation
    • e.g., 3-column format: <docid termid position>
    • inverted index (faster): list of sorted <termid docid> pairs, useful for finding docs containing certain terms (see the sketch below)
    • equivalent to a sparse representation of the term x doc matrix
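A minimal sketch of an inverted index, assuming a crude whitespace tokenizer and hypothetical document texts:

```python
# Inverted index sketch: map each term to the set of docids containing it.
from collections import defaultdict

docs = {0: "data mining finds patterns",
        1: "text mining of web data",
        2: "fun in irvine"}

index = defaultdict(set)
for docid, text in docs.items():
    for term in text.lower().split():    # crude tokenizer: split on whitespace
        index[term].add(docid)

# Docs containing both "data" and "mining": intersect the posting lists.
hits = index["data"] & index["mining"]
print(sorted(hits))                      # -> [0, 1]
```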

  11. Stemming
  • Want to reduce all morphological variants of a word to a single index term
    • e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (fishing does not explicitly occur in the document)
  • Stemming: reduce words to their root form
    • e.g., fish becomes the new index term
  • Porter stemming algorithm (1980)
    • relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      • BINARIZATION => BINARIZE
    • not always desirable: e.g., {university, universal} -> univers (in Porter's); see the sketch below
  • WordNet: dictionary-based approach
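A quick illustration using the Porter stemmer as implemented in NLTK (assuming the nltk package is installed; outputs can differ slightly between Porter variants):

```python
# Porter stemming via NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fish", "fishing", "binarization", "university", "universal"]:
    print(word, "->", stemmer.stem(word))

# The slide's IZATION -> IZE rule fires in one pass; later passes may strip
# further suffixes. "university" and "universal" both collapse to "univers",
# the over-stemming example from the slide.
```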

  12. Toy example of a document-term matrix

  13. Document Similarity
  • Measuring similarity between two documents x and y
  • Wide variety of distance metrics:
    • Euclidean (L2): d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    • L1: d(x, y) = sum_i |x_i - y_i|
    • ...
    • weighted L2: d(x, y) = sqrt( sum_i (w_i x_i - w_i y_i)^2 )
  • Cosine distance between docs: cos(x, y) = sum_i x_i y_i / (||x|| ||y||)
    • often gives better results than Euclidean
    • normalizes relative to document length (see the sketch below)
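A small numpy sketch comparing Euclidean distance and cosine similarity on two rows (d1 and d3) of the toy TF matrix shown on the next slide:

```python
# Euclidean vs. cosine on two toy count vectors from the doc-term matrix.
import numpy as np

x = np.array([24.0, 21.0, 9.0, 0.0, 0.0, 3.0])   # row d1 of the toy TF matrix
y = np.array([12.0, 16.0, 5.0, 0.0, 0.0, 0.0])   # row d3

euclidean = np.sqrt(np.sum((x - y) ** 2))
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"Euclidean distance: {euclidean:.2f}")    # large: d1 is a longer document
print(f"Cosine similarity:  {cosine_sim:.3f}")   # ~0.98: nearly the same direction
```

The two documents are far apart in Euclidean terms simply because d1 has more words, yet their cosine similarity is near 1: this is the length normalization the slide refers to.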

  14. Distance matrices for toy document-term data
  TF doc-term matrix:
        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23
  [Figures: Euclidean and cosine distance matrices computed from this TF matrix]

  15. TF-IDF Term Weighting Schemes
  • Not all terms in a query or document may be equally important ...
  • TF (term frequency): term weight = number of times the term occurs in that document
    • problem: a term common to many docs => low discrimination
  • IDF (inverse document frequency of a term)
    • n_j documents contain term j, N documents in total
    • IDF = log(N / n_j)
    • favors terms that occur in relatively few documents
  • TF-IDF: TF(term) * IDF(term)
  • No real theoretical basis, but works well empirically and is widely used

  16. TF-IDF Example
  (TF doc-term matrix as on slide 14)

  TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) = 2.5

  TF-IDF doc-term matrix (first five rows):
        t1    t2    t3    t4   t5   t6
  d1    2.5  14.6   4.6   0    0    2.1
  d2    3.4   6.9   2.6   0    1.1  0
  d3    1.3  11.1   2.6   0    0    0
  d4    0.6   4.9   1.0   0    0    0
  d5    4.5  21.5  10.2   0    1.1  0
  ...
  IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
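A sketch that recomputes the toy TF-IDF matrix from the slides; natural log is assumed, since that is what matches the printed values (e.g., 24 * log(10/9) ≈ 2.5):

```python
# Recompute TF-IDF for the toy document-term matrix.
import numpy as np

tf = np.array([[24, 21,  9,  0,  0,  3],
               [32, 10,  5,  0,  3,  0],
               [12, 16,  5,  0,  0,  0],
               [ 6,  7,  2,  0,  0,  0],
               [43, 31, 20,  0,  3,  0],
               [ 2,  0,  0, 18,  7, 16],
               [ 0,  0,  1, 32, 12,  0],
               [ 3,  0,  0, 22,  4,  2],
               [ 1,  0,  0, 34, 27, 25],
               [ 6,  0,  0, 17,  4, 23]], dtype=float)

N = tf.shape[0]
n_j = (tf > 0).sum(axis=0)       # number of docs containing each term
idf = np.log(N / n_j)            # -> approx (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
tfidf = tf * idf                 # broadcast IDF across the rows

print(np.round(idf, 1))
print(np.round(tfidf[:5], 1))    # first five rows match the slide
```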

  17. Baseline Document Querying System
  • Queries Q = binary term vectors
  • Documents represented by TF-IDF weights
  • Cosine distance used for retrieval and ranking

  18. Baseline Document Querying System: Example
  (TF and TF-IDF doc-term matrices as on slides 14 and 16)

  Query Q = (1, 0, 1, 0, 0, 0), i.e., terms t1 and t3

  Cosine similarity of Q to each document, using TF vs. TF-IDF weights:
        TF    TF-IDF
  d1    0.70  0.32
  d2    0.77  0.51
  d3    0.58  0.24
  d4    0.60  0.23
  d5    0.79  0.43
  ...
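A sketch of the full baseline pipeline: TF-IDF weighting plus cosine ranking for the binary query Q = (1, 0, 1, 0, 0, 0); the scores reproduce the TF-IDF column above:

```python
# Baseline querying system: binary query, TF-IDF docs, cosine ranking.
import numpy as np

tf = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],
               [6,7,2,0,0,0],[43,31,20,0,3,0],[2,0,0,18,7,16],
               [0,0,1,32,12,0],[3,0,0,22,4,2],[1,0,0,34,27,25],
               [6,0,0,17,4,23]], dtype=float)
tfidf = tf * np.log(tf.shape[0] / (tf > 0).sum(axis=0))

def cosine_rank(query, docs):
    """Rank documents by cosine similarity to the query vector."""
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims), sims

Q = np.array([1, 0, 1, 0, 0, 0], dtype=float)   # binary query: terms t1 and t3
order, sims = cosine_rank(Q, tfidf)
for d in order[:5]:
    print(f"d{d + 1}: {sims[d]:.2f}")            # d2 highest (0.51), as on the slide
```

Note how the ranking changes with the weighting: under raw TF, d5 scores highest; under TF-IDF, the low-IDF term t1 is downweighted and d2 comes out on top.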

  19. Synonymy and Polysemy
  • Synonymy
    • the same concept can be expressed using different sets of terms
      • e.g., bandit, brigand, thief
    • negatively affects recall
  • Polysemy
    • identical terms can be used in very different semantic contexts
      • e.g., bank
        • a repository where important material is saved
        • the slope beside a body of water
    • negatively affects precision

  20. Latent Semantic Indexing
  • Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
  • Find the k linear projections of the data that contain the most variance
    • principal components analysis or SVD
    • also known as "latent semantic indexing" (LSI) when applied to text
    • captures dependencies among terms
    • in effect, replaces the original d-dimensional basis with a k-dimensional basis
      • e.g., terms like SQL, indexing, query could be approximated as coming from a single "hidden" term
  • Why is this useful?
    • query contains "automobile", document contains "vehicle"
    • can still match Q to the document since the two terms will be close in k-space (but not in the original space), i.e., addresses the synonymy problem

  21. Toy example of a document-term matrix

  22. SVD
  • M = U S V^T
    • M = n x d: the original document-term matrix (the data)
    • U = n x d: each row is a vector of weights for one document
    • S = d x d: diagonal matrix of singular values
    • columns of V (rows of V^T) = new orthogonal basis for the data
  • Each singular value indicates how much of the data's variation lies along the corresponding new basis vector
  • Typically select just the first k basis vectors, k << d (see the sketch below)
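A sketch of the rank-k SVD on the toy TF matrix using numpy; k = 2 is an illustrative choice, and the signs of singular vectors returned by a given SVD routine are arbitrary:

```python
# LSI sketch: rank-k SVD of the toy document-term matrix.
import numpy as np

tf = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],
               [6,7,2,0,0,0],[43,31,20,0,3,0],[2,0,0,18,7,16],
               [0,0,1,32,12,0],[3,0,0,22,4,2],[1,0,0,34,27,25],
               [6,0,0,17,4,23]], dtype=float)

U, s, Vt = np.linalg.svd(tf, full_matrices=False)   # M = U diag(s) V^T

k = 2
docs_k = U[:, :k] * s[:k]       # documents in the k-dimensional latent space
print(np.round(Vt[:k], 2))      # first two basis vectors (cf. v1, v2 on slide 24)

# A query is folded into the same latent space before cosine matching:
Q = np.array([1, 0, 1, 0, 0, 0], dtype=float)
q_k = Q @ Vt[:k].T              # project the query onto the k basis vectors
print(np.round(q_k, 2))
```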

  23. Example of SVD

  24. First two singular vectors of the toy matrix:
  v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
  v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]
  [Figure: documents projected onto v1 and v2; D1 = "database" x 50, D2 = "SQL" x 50]

  25. Probabilistic Approaches to Retrieval
  • Compute P(q | d) for each document d
  • Intuition: relevance of d to q is related to how likely it is that q was generated by d, i.e., "how likely is q under a model for d?"
  • Simple model for P(q | d)
    • P_e(q | d) = empirical frequency of the query words in document d
    • "tuned" to d, but likely to be sparse (will contain many zeros)
  • Two-stage probabilistic model (or linear interpolation model; see the sketch below)
    • P(q | d) = λ P_e(q | d) + (1 - λ) P_e(q | corpus)
    • λ can be fixed, e.g., tuned to a particular data set
    • or it can depend on d, e.g., λ = n_d / (n_d + m), where n_d = number of words in doc d and m = a constant (e.g., 1000)
  • Can also use more sophisticated models for P(q | d), e.g., topic-based models
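A sketch of the interpolated query-likelihood model above, assuming a unigram independence model (so log P(q|d) sums over query words) and hypothetical toy documents:

```python
# Query likelihood with linear interpolation (fixed lambda) smoothing.
import math
from collections import Counter

docs = ["data mining finds patterns in data",
        "text mining of web documents",
        "fun fishing in irvine"]
tokens = [d.split() for d in docs]
corpus = Counter(w for t in tokens for w in t)
corpus_len = sum(corpus.values())

def log_p_query(query, doc_tokens, lam=0.5):
    """log P(q|d) = sum over query words of the smoothed log word probability."""
    counts, n_d = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for w in query.split():
        p_doc = counts[w] / n_d               # empirical P_e(w | d)
        p_corpus = corpus[w] / corpus_len     # background P_e(w | corpus)
        score += math.log(lam * p_doc + (1 - lam) * p_corpus)
    return score

for i, t in enumerate(tokens):
    print(f"doc {i}: {log_p_query('data mining', t):.2f}")  # higher = more relevant
```

The corpus term keeps the probability nonzero for query words that happen to be missing from a document, which is exactly the sparsity problem the slide notes for the pure empirical model.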

  26. Evaluating Retrieval Methods
  • For predictive models (classification/regression) the objective is clear:
    • score = accuracy on unseen test data
  • Evaluation is more complex for query by content
    • the real score = how "useful" the retrieved information is (subjective)
    • e.g., how would you define a real score for Google's top 10 hits?
  • Towards objectivity, assume:
    • 1) each object is "relevant" or "irrelevant"
      • simplification: binary and the same for all users (e.g., committee vote)
    • 2) each object is labelled by an objective/consistent oracle
  • These assumptions suggest a classifier approach is possible
    • but the goals are rather different: we want the objects nearest to Q, not separability per se
    • and it would require learning a classifier at query time (with Q as the positive class)
    • which is why a k-NN type approach seems so appropriate ...

  27. Precision versus Recall
  • Rank documents (numerically) with respect to the query
  • Compute precision and recall by thresholding the rankings (see the sketch below)
    • precision: fraction of retrieved objects that are relevant
    • recall: fraction of all relevant objects that are retrieved
  • Tradeoff: high precision -> low recall, and vice versa
  • Very similar to ROC in concept
  • For multiple queries, precision at specific ranges of recall can be averaged (so-called "interpolated precision")
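A sketch computing precision and recall at each ranking cutoff k; the binary relevance labels are hypothetical:

```python
# Precision and recall at each cutoff k of a ranked list.
# 1 = relevant, 0 = irrelevant, ordered by system rank (best first).
relevance = [1, 1, 0, 1, 0, 0, 1, 0]
total_relevant = sum(relevance)

retrieved_relevant = 0
for k, rel in enumerate(relevance, start=1):
    retrieved_relevant += rel
    precision = retrieved_relevant / k               # relevant among retrieved
    recall = retrieved_relevant / total_relevant     # relevant that were retrieved
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

Sweeping k from 1 to the full list traces out the precision-recall curve on the next slide: recall can only grow as k increases, while precision typically falls.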

  28. Precision-Recall Curve (a form of ROC)
  [Figure: precision-recall curves for three systems A, B, C; C is universally worse than A & B]
  • Alternative (single-point) summaries:
    • precision at the point where recall = precision
    • precision for a fixed number of retrievals
    • average precision over multiple recall levels

  29. TREC evaluations
  • Text REtrieval Conference (TREC)
    • Web site: trec.nist.gov
  • Annual impartial evaluation of IR systems
    • e.g., D = 1 million documents
    • TREC organizers supply contestants with several hundred queries Q
    • each competing system provides its ranked list of documents
    • the union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query Q
    • precision, recall, etc., are then calculated and the systems compared

  30. Other Examples of Evaluation Data Sets
  • Cranfield data
    • number of documents = 1400
    • 225 queries: "medium length", manually constructed "test questions"
    • relevance determined by an expert committee (from 1968)
  • Newsgroups
    • articles from 20 Usenet newsgroups
    • queries = randomly selected documents
    • relevance: is the document d in the same newsgroup category as the query document?

  31. Performance on Cranfield Document Set

  32. Performance on Newsgroups Data

  33. Related Types of Data
  • Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g.:
    • "transaction data"
      • rows = customers, columns = products
    • Web log data (ignoring sequence)
      • rows = Web surfers, columns = Web pages
  • Recommender systems
    • given some products purchased by user i, suggest other products to the user
      • e.g., Amazon.com's book recommender
    • collaborative filtering: use the k nearest individuals as the basis for predictions (see the sketch below)
  • Many similarities with querying and information retrieval
    • e.g., use of cosine distance to normalize vectors
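A sketch of k-nearest-individuals collaborative filtering using cosine similarity, in the spirit of the IR ranking above; the small rating matrix is hypothetical:

```python
# Collaborative filtering sketch: recommend items liked by the k users
# most similar (by cosine) to a target user.
import numpy as np

# rows = users, columns = products; counts/ratings (0 = not purchased)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

def recommend(user, R, k=2):
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user])   # cosine similarity to each user
    sims[user] = -np.inf                         # exclude the user themself
    neighbors = np.argsort(-sims)[:k]            # the k nearest individuals
    scores = R[neighbors].mean(axis=0)           # average the neighbors' ratings
    scores[R[user] > 0] = -np.inf                # only suggest unseen products
    return np.argsort(-scores)

print(recommend(0, R))   # product indices ranked for user 0; owned items last
```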

  34. Web-based Retrieval
  • Additional information in Web documents
    • link structure (e.g., PageRank: to be discussed later)
    • HTML structure
    • link/anchor text
    • title text
    • etc.
    • can be leveraged for better retrieval
  • Additional issues in Web retrieval
    • scalability: the size of the "corpus" is huge (10 to 100 billion docs)
    • constantly changing:
      • crawlers to update document-term information
      • need schemes for efficiently updating the indices
    • evaluation is more difficult: how is relevance measured? how many documents in total are relevant?

  35. Further Reading
  • Text: Chapter 14
  • General reference on text and language modeling:
    • Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999.
  • Very useful reference on indexing and searching text:
    • Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, I. H. Witten, A. Moffat, and T. C. Bell, Morgan Kaufmann, 1999.
  • Web-related document search:
    • An excellent resource is Chapter 3, "Web Search and Information Retrieval", in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
  • Information on how real Web search engines work:
    • http://searchenginewatch.com/
  • Latent semantic analysis applied to the grading of essays:
    • "The Debate on Automated Essay Grading", IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf
