INFO 624 -- Week 5 Text Properties and Operations. Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University. Objectives of Assignment 1. Practice basic Web skills Get familiar with a few search engines Learn to describe features of search engines
INFO 624 -- Week 5 Text Properties and Operations Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University
Objectives of Assignment 1 • Practice basic Web skills • Get familiar with a few search engines • Learn to describe features of search engines • Learn to compare search engines
Grading Sheet for Assignment 1 • Memo • Selection of search engines • Is it downloadable? • Can it be under the control of the small business? • Quality of reviews • Formats of review pages, including metadata. • Appropriate links in reviews and in the registered page
What’s missing? • Who are the current users of the selected search engine? • Hands-on experience with the selected search engines • Personal observation or experience • Some testing on demos or customer sites. • Convincing statements on the differences between the search engines.
Properties of Text • Classic theories • Zipf’s Law • Information Theory • Benford's Law • Bradford's Law • Heaps’ Law • English letter/word frequencies
Zipf’s Law (1945) • In a large, well-written English document, r * f = c, where r is the rank of a word when words are sorted by decreasing frequency, f is the number of times the word is used in the document, and c is a constant. • Different collections may have different c. English text tends to have c = N/10, where N is the number of words in the collection.
Zipf’s Law is an empirical observation that holds only approximately. • Examples: • Word frequencies in Alice in Wonderland • Time magazine collection • Zipf’s Law has been verified over many years on many different collections. • There are also many revised versions of Zipf’s Law.
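One way to check r * f = c on a text is to build the rank-frequency table directly. The following is a minimal sketch, not part of the original slides; the function name `rank_frequency` and the simple letters-and-apostrophes word pattern are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    # Assumption: a "word" is a run of letters or apostrophes, compared case-insensitively.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]
```

If Zipf's Law holds for a collection, the last column should stay roughly constant over the top-ranked words.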
Example: • The word "the" is the most frequently occurring word in the novel "Moby Dick," occurring 1450 times. • The word "with" is the second-most frequently occurring word in that novel. • How many times would we expect "with" to occur? • How many times would we expect the third most frequently occurring word to appear?
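A worked answer, taking the slide's count of 1450 at face value and assuming r * f = c holds exactly (in practice it is only approximate):

```latex
c = r \cdot f = 1 \times 1450 = 1450, \qquad
f_{2} = \frac{c}{2} = \frac{1450}{2} = 725, \qquad
f_{3} = \frac{c}{3} = \frac{1450}{3} \approx 483
```

So "with" would be expected about 725 times, and the third most frequent word about 483 times.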
Information Theory • Entropy (1948) • Use the distribution of symbols to predict the amount of information in a text. • Quantified measure for information • Useful for (physical) data transfer • And compression • Not directly applicable to IR • Example: • Which letter is likely to appear after a letter “c” is received?
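The entropy idea can be sketched in a few lines: estimate symbol probabilities from the text itself and sum -p * log2(p). This is an illustrative sketch, not code from the course; `entropy_bits` and `following_letters` are hypothetical names.

```python
import math
from collections import Counter


def entropy_bits(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def following_letters(text, letter):
    """Count which characters appear immediately after `letter` in `text`."""
    return Counter(b for a, b in zip(text, text[1:]) if a == letter)
```

Calling `following_letters(sample, "c").most_common(3)` on a large English sample gives an empirical answer to the question on the slide.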
English Letter Usage Statistics • Letter use frequencies: • E: 72881 12.4% • T: 52397 8.9% • A: 47072 8.0% • O: 45116 7.6% • N: 41316 7.0% • I: 39710 6.7% • H: 38334 6.5%
Doubled letter frequencies: • LL: 2979 20.6% • EE: 2146 14.8% • SS: 2128 14.7% • OO: 2064 14.3% • TT: 1169 8.1% • RR: 1068 7.4% • --: 701 4.8% • PP: 628 4.3% • FF: 430 2.9%
Initial letter frequencies: • T: 20665 15.2% • A: 15564 11.4% • H: 11623 8.5% • W: 9597 7.0% • I: 9468 6.9% • S: 9376 6.9% • O: 8205 6.0% • M: 6293 4.6% • B: 5831 4.2%
Ending letter frequencies: • E: 26439 19.4% • D: 17313 12.7% • S: 14737 10.8% • T: 13685 10.0% • N: 10525 7.7% • R: 9491 6.9% • Y: 7915 5.8% • O: 6226 4.5%
Benford's Law • If we randomly select a number from a table of statistical data, the probability that the first digit will be a "1" is about 0.301, rather than 0.1 as we might expect if all digits were equally likely.
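The 0.301 figure comes from the usual statement of Benford's Law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. A short sketch (the function name is an assumption):

```python
import math


def benford_probability(digit):
    """Probability that the leading digit equals `digit` under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)
```

For example, `benford_probability(1)` is about 0.301, and the nine probabilities sum to 1.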
Bradford's Law • On a given subject, a few core journals will provide 1/3 of the articles on that subject, a medium number of secondary journals will provide another 1/3 of the articles on that subject, and a large number of peripheral journals will provide the final 1/3 of the articles on that subject.
For example • If you found 300 citations for IR, • 100 of those citations likely came from a core group of 5 journals, • another 100 citations came from a group of 25 journals, • and the final 100 citations came from 125 peripheral journals. • Bradford expressed his law with this formula: 1 : n : n^2
Heaps’ Law • The relationship between the size of the vocabulary and the size of the collection is: V = K * n^b, where V is the number of unique words, n is the text size, and K and b are constants that depend on the collection.
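The 1 : n : n^2 pattern in the example above (5, 25, 125 journals, so n = 5) can be sketched as follows. The function name and parameters are illustrative assumptions.

```python
def bradford_zones(core_journals, multiplier, zones=3):
    """Journals per zone when each zone yields the same number of articles."""
    return [core_journals * multiplier ** i for i in range(zones)]
```

With the slide's numbers, `bradford_zones(5, 5)` gives the three zone sizes 5, 25, and 125.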
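A small sketch of both sides of Heaps' Law: the predicted vocabulary size for given constants, and the observed vocabulary growth of a token stream to compare against it. The constants passed in are whatever the caller chooses; no particular values are claimed for any real collection, and the function names are assumptions.

```python
def heaps_vocabulary(n, K, b):
    """Predicted number of unique words in a text of n words: V = K * n^b."""
    return K * n ** b


def vocabulary_growth(tokens):
    """Observed number of distinct words after each token is read."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth
```

Fitting K and b to the output of `vocabulary_growth` on a real collection is how the law is usually checked.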
Computerized Text Analysis • Word (token) extraction • Stop words • Stemming • Frequency counts • Clustering
Word Extraction • Basic problems • Digits • Hyphens • Punctuation • Case • Lexical analyzer • Map all possible characters into the states of a finite state machine • Specify which states should end a token. • Example: • Parser.c
Stop words • Many of the most frequently used words in English are of little value for indexing; these words are called stop words. • the, of, and, to, …. • Typically about 400 to 500 such words • Why do we need to remove stop words? • Reduce indexing file size • stop words account for 20-30% of total word counts. • Improve efficiency • stop words are not useful for searching • stop words always have a large number of hits
Stop words • Potential problems of removing stop words • a small stop list does not improve indexing much • a large stop list may eliminate some words that might be useful for someone or for some purposes • stop words might be part of phrases • stop word removal must be applied to both indexing and queries. • Examples: • Lommoncommon.c • commonwords
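The lexical-analyzer idea can be sketched as a two-state scanner: either inside a token or between tokens. This is not the course's Parser.c; the policy chosen here (letters and digits extend a token, everything else including hyphens ends it, case is folded to lower) is one assumption among several reasonable ones.

```python
def tokenize(text):
    """Split text into lowercase tokens with a two-state scanner."""
    tokens = []
    current = []  # characters of the token being built; empty means "between tokens"
    for ch in text:
        if ch.isalnum():
            # State: inside a token. Fold case so "IR" and "ir" match.
            current.append(ch.lower())
        elif current:
            # Any other character (space, hyphen, punctuation) ends the token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

Under this policy `tokenize("State-of-the-art IR, 2nd ed.")` yields `state`, `of`, `the`, `art`, `ir`, `2nd`, `ed`; a different hyphen or digit policy would change only the two branches of the loop.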
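A minimal sketch of stop word removal, which also shows the phrase problem listed above. The eight-word stop list is a tiny illustrative sample, not the list used in the course files.

```python
# Assumption: a deliberately tiny sample list; real lists hold a few hundred words.
STOP_WORDS = frozenset({"the", "of", "and", "to", "a", "in", "is", "it"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list, keeping the original order."""
    return [t for t in tokens if t not in stop_words]
```

Applied to the tokens of "to be or not to be", this keeps only `be`, `or`, `not`, `be`, so the phrase can no longer be matched exactly. The same function should be applied to queries as well as to documents.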
Stemming • Techniques used to find out the root/stem of a word: • lookup “user engineering” • user 15 engineering 12 • users 4 engineered 23 • used 5 engineer 12 • using 5 • stem: use engineer
Advantages of stemming • improving effectiveness • matching similar words • reducing index size • combining words with the same root may reduce index size by as much as 40-50%. • Criteria for stemming • correctness • retrieval effectiveness • compression performance
Basic stemming methods • Use tables and rules • remove ending • if a word ends with a consonant other than s, followed by an s, then delete s. • if a word ends in es, drop the s. • if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. • If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter. • …...
transform the remaining word • if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”
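The ending rules listed above can be sketched as a small function. The order in which the rules are tried, and treating every letter other than a, e, i, o, u as a consonant, are assumptions; the slides do not specify either.

```python
VOWELS = set("aeiou")


def simple_stem(word):
    """Apply the first matching ending rule from the slides and return the result."""
    w = word.lower()
    # "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Ends in "es": drop the "s".
    if w.endswith("es"):
        return w[:-1]
    # Consonant other than "s", followed by "s": delete the "s".
    if w.endswith("s") and len(w) >= 2 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # Ends in "ing": delete it unless only one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Ends in "ed" preceded by a consonant: delete it unless one letter would remain.
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in VOWELS:
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w
```

The rules are crude: `users` becomes `user` and `engineering` becomes `engineer`, but `using` becomes `us` rather than `use`, which is why rule tables are usually combined with exception lists.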
Example 1: Porter Stemming Algorithm • A set of condition/action rules • condition on the stem • condition on the suffix • condition on the rules • different combinations of conditions will activate different rules. • Implementation: • stem.c • Stem(word) • …….. • ReplaceEnd(word, step1a_rule); • rule=ReplaceEnd(word, step1b_rule); • if ((rule==106) || (rule==107)) • ReplaceEnd(word, 1b1_rule); • … …
Example 2: Sound-based stemming • Soundex rules: • Letter: numeric equivalent • B, F, P, V: 1 • C, G, J, K, Q, S, X, Z: 2 • D, T: 3 • L: 4 • M, N: 5 • R: 6 • A, E, I, O, U, W, Y: not coded • Words that sound similar often have the same code • The code is not unique • High compression rate
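A simplified Soundex sketch built from the table above. Several details are assumptions rather than something the slide states: the first letter is kept as-is, adjacent letters with the same code are coded once, the result is padded or cut to one letter plus three digits, and H (which the table does not list) is treated as not coded.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a letter-plus-three-digits code; similar-sounding words often collide."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")  # uncoded letters map to the empty string
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]
```

Both `soundex("Robert")` and `soundex("Rupert")` give `R163`, which illustrates why the code is not unique.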
Frequency counts • The idea • What a computer does best is count • count the number of times a word occurs in a document • count the number of documents in a collection that contain a word • Using occurrence frequencies to indicate the relative importance of a word in a document • if a word appears often in a document, the document likely “deals with” subjects related to the word.
Using occurrence frequencies to select the most useful words to index a document collection • if a word appears in every document, it is not a good indexing word • if a word appears in only one or two documents, it may not be a good indexing word • If a word appears in titles, each occurrence should be counted 5 (or 10) times.
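The two counts described above, and the idea of dropping words that are too common or too rare, can be sketched as follows. The function names and the explicit `min_df`/`max_df` thresholds are illustrative assumptions.

```python
from collections import Counter


def term_frequencies(tokens):
    """Number of times each word occurs in one document."""
    return Counter(tokens)


def document_frequencies(docs):
    """Number of documents in the collection that contain each word."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    return df


def select_index_terms(docs, min_df, max_df):
    """Keep words that are neither too rare nor too common across documents."""
    df = document_frequencies(docs)
    return sorted(term for term, n in df.items() if min_df <= n <= max_df)
```

Title weighting could be layered on top by adding a word's title occurrences to its count several times over.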
Automatic indexing 1. Parse individual words (tokens) 2. Remove stop words. 3. Stemming 4. Use frequency data • decide heading threshold • decide tail threshold • decide variance of counting
5. Create indexing structure • inverted index • other structures
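An inverted index maps each term to the documents that contain it. A minimal sketch, assuming documents are already tokenized, stopped, and stemmed, and are given as a dict from document id to token list:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

A real index would usually also store term frequencies or positions in each posting; this sketch stores document ids only.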
Term Associations • Counting word pairs • If two words appear together very often, they are likely to be a phrase • Counting document pairs • if two documents have many common words, they are likely related
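Both kinds of counting can be sketched briefly. Here "appear together" is taken to mean adjacent in the token stream, which is only one possible definition, and the function names are assumptions.

```python
from collections import Counter


def adjacent_pair_counts(tokens):
    """Count how often each ordered pair of neighboring words occurs."""
    return Counter(zip(tokens, tokens[1:]))


def shared_terms(doc_a, doc_b):
    """Number of distinct words two documents have in common."""
    return len(set(doc_a) & set(doc_b))
```

Pairs with high counts from `adjacent_pair_counts` are phrase candidates; document pairs with high `shared_terms` are candidates for being related.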
More Counting • Counting citation pairs • If documents A and B both cite documents C and D, then A and B might be related. • If documents C and D are often cited together, they are likely related. • Counting link patterns • Get all pages that have links to my pages. • Get all pages that contain similar links to my pages
Google Search Engine • Link analysis • PageRank -- The ranking of a web page is based on the number of links that refer to that page • If page A has a link to B, page A gives one vote to B. • The more votes a page gets, the more useful the page is. • If page A itself receives many votes, its vote for B will count more heavily • Combining link analysis with word matching.
ConceptLink • Use terms’ co-occurrence frequencies • to predict semantic relationships • to build concept clusters • to suggest search terms • Visualization of term relationships • Link displays • Map displays • Drag-and-drop interface for searching
Document clustering • Grouping similar documents into sets • Create similarity matrix • Apply a hierarchical clustering algorithm: 1 Identify the two closest documents and combine them into a cluster 2 Identify the next two closest documents or clusters and combine them into a cluster 3 If more than one cluster remains, return to step 2
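The two citation counts can be sketched with a dict that maps each document to the list of documents it cites. The names below are illustrative assumptions.

```python
def bibliographic_coupling(citations, a, b):
    """Number of documents cited by both a and b."""
    return len(set(citations[a]) & set(citations[b]))


def co_citation_count(citations, c, d):
    """Number of documents that cite both c and d."""
    return sum(1 for refs in citations.values() if c in refs and d in refs)
```

The same two functions apply to link patterns if "cites" is read as "links to".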
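The voting idea can be sketched as an iterative computation. This is a simplified illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, and spreading the vote of a page with no outgoing links evenly over all pages are assumptions that the slide does not mention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to; returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its current score equally among the pages it votes for.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links votes for every page equally.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank
```

With `links = {"A": ["B"], "C": ["B"], "B": ["A"]}`, page B ends up ranked highest because it receives two votes, and A outranks C because its single vote comes from the highly ranked B.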
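The clustering steps above can be sketched as a loop over a similarity matrix. The slides do not say how the similarity between two clusters is measured; this sketch assumes single link (the similarity of their closest pair of members), and all names are hypothetical.

```python
def agglomerative_clusters(similarity, n_items, target_clusters=1):
    """Repeatedly merge the two most similar clusters; returns (clusters, merges)."""
    clusters = [[i] for i in range(n_items)]
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster similarity is that of the closest member pair.
                s = max(similarity[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges
```

The `merges` list records the order of combination, which is what a dendrogram would display.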
Application of Document Clustering • Vivisimo • Cluster search results on the fly • Hierarchical categories for drill-down capability • AltaVista • Refine search: • Cluster related words into different groups based on their co-occurrence rates in documents.
Document Similarity • Documents • D1={t11, t12, t13, …, t1n} • D2={t21, t22, t23, …, t2n} tik is either 0 or 1. • Simple measurement of difference/ similarity: • w=the number of times t1k=1, t2k=1. • x=the number of times t1k=1, t2k=0. • y=the number of times t1k=0, t2k=1. • z=the number of times t1k=0, t2k=0.
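Similarity Measure • Cosine Coefficient: $\mathrm{Cosine}(D_1, D_2) = \dfrac{\sum_k t_{1k}\, t_{2k}}{\sqrt{\sum_k t_{1k}^2}\,\sqrt{\sum_k t_{2k}^2}}$ • The same as, for binary term vectors: $\mathrm{Cosine}(D_1, D_2) = \dfrac{w}{\sqrt{n_1 n_2}}$, where n1 and n2 are the numbers of terms in D1 and D2.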
D1’s terms only: n1=w+x (the number of times t1k=1) • D2’s terms only: n2=w+y (the number of times t2k=1) • Sameness count: sc =(w+z)/(n1+n2) • Difference count: dc =(x+y)/(n1+n2) • Rectangular Distance: rd = MAX(n1, n2) • Conditional probability: cp=min(n1, n2) • mean: mean =(n1+n2)/2
Similarity Measure • Dice’s Coefficient: • Dice(D1, D2) = 2w/(n1+n2) • where w is the number of terms that D1 and D2 have in common; n1, n2 are the numbers of terms in D1 and D2. • Jaccard Coefficient: • Jaccard(D1, D2) = w/(N-z) = w/(n1+n2-w), where N = w+x+y+z is the total number of index terms.
Similarity Metric • A metric has three defining properties • Its values are non-negative • It is symmetric • It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|
Similarity Matrix • Pairwise coupling of similarities among a group of documents
S11 S12 S13 S14 S15 S16 S17 S18
S21 S22 S23 S24 S25 S26 S27 S28
S31 S32 S33 S34 S35 S36 S37 S38
S41 S42 S43 S44 S45 S46 S47 S48
S51 S52 S53 S54 S55 S56 S57 S58
S61 S62 S63 S64 S65 S66 S67 S68
S71 S72 S73 S74 S75 S76 S77 S78
S81 S82 S83 S84 S85 S86 S87 S88
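The coefficients can be computed directly from the counts w, x, y, z defined earlier. A minimal sketch for two binary term vectors of equal length; the function names are assumptions, and the cosine uses the standard binary form w / sqrt(n1 * n2).

```python
import math


def match_counts(d1, d2):
    """Return (w, x, y, z) for two equal-length 0/1 term vectors."""
    w = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 1)
    x = sum(1 for a, b in zip(d1, d2) if a == 1 and b == 0)
    y = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 1)
    z = sum(1 for a, b in zip(d1, d2) if a == 0 and b == 0)
    return w, x, y, z


def dice(d1, d2):
    w, x, y, _ = match_counts(d1, d2)
    n1, n2 = w + x, w + y
    return 2 * w / (n1 + n2) if n1 + n2 else 0.0


def jaccard(d1, d2):
    w, x, y, _ = match_counts(d1, d2)
    return w / (w + x + y) if w + x + y else 0.0  # w + x + y equals n1 + n2 - w


def cosine(d1, d2):
    w, x, y, _ = match_counts(d1, d2)
    n1, n2 = w + x, w + y
    return w / math.sqrt(n1 * n2) if n1 and n2 else 0.0
```

For D1 = [1, 1, 0, 1, 0] and D2 = [1, 0, 0, 1, 1], the counts are w = 2, x = 1, y = 1, z = 1, giving Dice = 2/3, Jaccard = 1/2, and cosine = 2/3.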
MetaData • Data about data • Descriptive Data • External to the meaning of the document • Dublin Core Metadata Element Set • Author, title, publisher, etc. • Semantic Metadata • Subject indexing • Challenge: automatic generation of metadata for documents
Markup Languages • [Diagram relating metalanguages, languages, and stylesheets; labels: SGML, XML, HyTime, HTML, XSL, CSS, RDF, MathML, SMIL, Semantic Web?]
Midterm • Concepts • What is information retrieval? • Data, information, text, and documents • Two abstraction principles • User’s information needs • Queries and query formats • Precision and Recall • Relevance • Zipf’s Law, Benford's Law
Midterm • Procedures & problem solving • How to translate a request into a query? • How to expand queries • for better recall or better precision? • How to create an inverted index? • How to create a vector space? • How to calculate similarities of documents? • How to match a query to documents in a vector space?
Discussions • Challenges of IR • Advantages and disadvantages of Boolean search (vector space, automatic indexing, association-based queries, etc.) • Evaluation of IR systems • With or without using precision/recall. • Difference between data retrieval and information retrieval