
  1. CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting

  2. Course Administration • Laptop computers - Everybody who has (a) signed up and (b) submitted an assignment should receive an email about collecting a laptop - Laptop surveys will be handed out in class

  3. Building an Index (flowchart, from Frakes, page 7): documents → assign document IDs → text (with document numbers and field numbers*) → break into words → words → stoplist → non-stoplist words → stemming* → stemmed words → term weighting* → terms with weights → index database. *Indicates an optional operation.

  4. What is a token? • A token is a group of characters that is treated as a single term for indexing and in queries • The exact definition of a token affects performance: - digits (parts of names, data in tables) - hyphens (line breaks, compound words and phrases) - punctuation characters (parts of names) - upper or lower case (proper nouns, acronyms) • Impact on retrieval: - a broad definition of a token enhances recall - a narrow definition of a token enhances precision
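The recall/precision trade-off is easy to see in code. A minimal sketch, assuming two illustrative regular expressions (neither is prescribed by the lecture):

```python
import re

text = "The F-16's serial number is AF-82-0901 (see Table 3)."

# Broad definition: every maximal run of letters or digits is a token;
# hyphens and apostrophes act as separators. More tokens match more
# queries, which favors recall.
broad = re.findall(r"[a-z0-9]+", text.lower())

# Narrow definition: a token starts with a letter and may keep internal
# hyphens and apostrophes, so "af-82-0901" stays one precise term.
# Fewer, more specific tokens favor precision.
narrow = re.findall(r"[a-z]+(?:['-][a-z0-9]+)*", text.lower())

print(broad)   # ['the', 'f', '16', 's', 'serial', 'number', 'is', 'af', '82', '0901', ...]
print(narrow)  # ['the', "f-16's", 'serial', 'number', 'is', 'af-82-0901', ...]
```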

  5. Lexical analysis • Performance -- must look at every character • Conveniently implemented as a finite state machine • Definition of token in queries should be consistent with definition used in indexing • Lexical analysis may be combined with: - removal of stop words - stemming

  6. Transition Diagram [Figure: finite state machine for query tokenization; from the start state, runs of letters and digits form a term token, and the characters (, ), &, |, ^, end-of-string (eos), and any other character each lead to their own final state.]
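The diagram itself does not survive transcription, but a lexical analyzer of this shape is simple to sketch. A minimal version, assuming the token classes suggested by the edge labels (terms of letters and digits; the operators (, ), &, |, ^; end-of-string; and a catch-all "other"), with the states folded into one scanning loop:

```python
def tokenize_query(s):
    """Single pass over the input: every character is examined once."""
    tokens = []
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c.isspace():              # whitespace: stay in the start state
            i += 1
        elif c.isalpha():            # a letter enters the term state...
            j = i + 1
            while j < n and s[j].isalnum():   # ...letters/digits loop there
                j += 1
            tokens.append(("TERM", s[i:j].lower()))
            i = j
        elif c in "()&|^":           # each operator has its own final state
            tokens.append(("OP", c))
            i += 1
        else:                        # anything else: the 'other' state
            tokens.append(("OTHER", c))
            i += 1
    tokens.append(("EOS", ""))       # end-of-string
    return tokens

print(tokenize_query("(cat & dog) | fish"))
# [('OP', '('), ('TERM', 'cat'), ('OP', '&'), ('TERM', 'dog'),
#  ('OP', ')'), ('OP', '|'), ('TERM', 'fish'), ('EOS', '')]
```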

  7. Stop list Stop list: A list of common words that are ignored in building an index and when they occur in a query • Reduces the length of inverted lists • Saves storage space (10 most common words are 20% to 30% of tokens) • The aim is to have no impact on retrieval effectiveness Typical stop lists have between 10 and 500 words.

  8. Example: the WAIS stop list (first 84 of 363 multi-letter words) about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything

  9. Stop list policies How many words should be in the stop list? • Long list lowers recall Which words should be in list? • Some common words may have retrieval importance: -- war, home, life, water, world • In certain domains, some words are very common: -- computer, program, source, machine, language There is very little systematic evidence to use in selecting a stop list.

  10. Choice of stop words

  11. Implementation of stop lists Approach 1: filter stop words from the output of the lexical analyzer • Create a perfect hash table of stop words (no collisions) • Calculate the hash value of each token from the lexical analyzer • Reject the token if its value is found in the hash table Approach 2: remove stop words as part of the lexical analysis [See Frakes, Section 7.3.]
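A minimal sketch of the first approach. Python's built-in frozenset is an ordinary hash table rather than a perfect (collision-free) hash, but the filtering logic is the same; the tiny stop list is illustrative:

```python
# A few entries from a typical stop list (illustrative, not the WAIS list).
STOP_WORDS = frozenset(
    "a about above after again all an and any are at be but by".split()
)

def filter_stop_words(tokens):
    # Hash each token and reject it if the lookup finds it in the table.
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_stop_words(["an", "index", "of", "all", "terms"]))
# ['index', 'of', 'terms']
```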

  12. A generated finite state machine [Figure: a finite state machine generated from the stop words a, an, and, in, into, to, which recognizes and removes those words during lexical analysis; states q0-q6 trace the shared prefixes (a → an → and, i → in → into, t → to), and recognized stop words are emitted as {}.]
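A minimal sketch of the idea behind such generated machines, using a trie over the six stop words; the node representation is an assumption of this sketch:

```python
def build_trie(words):
    # Each state is a dict of outgoing transitions; "$" marks acceptance.
    root = {}
    for w in words:
        state = root
        for ch in w:
            state = state.setdefault(ch, {})
        state["$"] = True
    return root

def is_stop_word(trie, token):
    # Follow one transition per character; accept only in a final state.
    state = trie
    for ch in token:
        if ch not in state:
            return False
        state = state[ch]
    return "$" in state

machine = build_trie(["a", "an", "and", "in", "into", "to"])
print(is_stop_word(machine, "into"))  # True  -> emit {} (drop the token)
print(is_stop_word(machine, "ant"))   # False -> keep the token
```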

  13. Term weighting Zipf's Law: if the words w in a collection are ranked by frequency, with rank r(w) and frequency f(w), they roughly fit the relation: r(w) × f(w) = c This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents but with relatively low overall collection frequency. Term weights are functions used to quantify these concepts.
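The relation is easy to inspect on any text. A small sketch with an illustrative toy corpus (a real collection shows the pattern far more clearly):

```python
from collections import Counter

words = ("to be or not to be that is the question whether tis nobler "
         "in the mind to suffer the slings and arrows").split()

freq = Counter(words)
for r, (w, f) in enumerate(freq.most_common(), start=1):
    # Zipf's Law: rank * frequency stays roughly constant across words
    print(f"{r:>2}  {w:<10} f={f}  r*f={r * f}")
```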

  14. Inverse document frequency weight Notation: for term k, number of documents in the collection = n, document frequency (number of documents in which term k occurs) = dk. One possible measure of rarity is n/dk. Inverse document frequency: ik = log2(n/dk) + 1

  15. Inverse document frequency weight Example: n = 1,000 documents

  term    dk       ik
  A       100      4.32
  B       500      2.00
  C       900      1.15
  D       1,000    1.00

  From: Salton and McGill
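The table follows directly from the formula. A minimal check in Python:

```python
import math

def idf(d_k, n=1000):
    # i_k = log2(n / d_k) + 1
    return math.log2(n / d_k) + 1

for term, d_k in [("A", 100), ("B", 500), ("C", 900), ("D", 1000)]:
    print(term, d_k, round(idf(d_k), 2))
# A 100 4.32   B 500 2.0   C 900 1.15   D 1000 1.0
```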

  16. Information in a term The search for more precise methods of weighting: the higher the probability of occurrence of a word, the less information it provides for retrieval. probability of occurrence of word k = pk information provided by word k: ik = -log2 pk The average information that each word provides is H = -Σk pk log2 pk

  17. Average information

  term    pk      -pk log2 pk
  A       0.5     0.50
  B       0.2     0.46
  C       0.2     0.46
  D       0.1     0.33
  average information     1.75

  term    pk      -pk log2 pk
  A       0.25    0.50
  B       0.25    0.50
  C       0.25    0.50
  D       0.25    0.50
  average information     2.00

  Average information is maximized when each term occurs with the same probability.
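Both columns can be checked directly. A minimal sketch (the exact total for the first distribution is 1.76; the table's 1.75 is the sum of the rounded per-term entries):

```python
import math

def avg_information(probs):
    # H = -sum over k of p_k * log2(p_k)
    return -sum(p * math.log2(p) for p in probs)

print(avg_information([0.5, 0.2, 0.2, 0.1]))      # 1.7609... (~1.76)
print(avg_information([0.25, 0.25, 0.25, 0.25]))  # 2.0, the maximum for 4 terms
```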

  18. Noise Notation: total frequency of term k in the collection = tk, frequency of term k in document i = fik. The noise of term k is defined as: nk = Σi (fik/tk) log2(tk/fik), where the sum runs over the documents with fik ≠ 0. Noise measures the spread of a term across the collection; a low-noise term is concentrated in a few documents: (a) if term k occurs once in each of the n documents (all fik = 1), nk = Σi (1/n) log2(n/1) = log2 n (b) if term k occurs in only one document, with frequency tk, nk = (tk/tk) log2(tk/tk) = 0

  19. Signal Signal is the inverse of noise: sk = log2(tk) - nk Considerable research has been carried out using weights based on signal, ranking words in decreasing order of signal value. This favors terms that distinguish a few specific documents from the remainder of the collection. However, practical experience has been mixed. After: Shannon
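A minimal sketch of both measures, assuming a term's occurrences are given as a plain list of per-document frequencies; it reproduces the two extreme cases from the noise slide:

```python
import math

def noise(freqs):
    # n_k = sum over documents i with f_ik > 0 of (f_ik/t_k) * log2(t_k/f_ik)
    t_k = sum(freqs)
    return sum((f / t_k) * math.log2(t_k / f) for f in freqs if f > 0)

def signal(freqs):
    # s_k = log2(t_k) - n_k
    return math.log2(sum(freqs)) - noise(freqs)

n = 8
spread = [1] * n                        # once in every document
concentrated = [16] + [0] * (n - 1)     # all 16 occurrences in one document

print(noise(spread), math.log2(n))      # 3.0 3.0 -> maximum noise, zero signal
print(noise(concentrated))              # 0.0     -> minimum noise
print(signal(concentrated))             # 4.0     -> log2(16), maximum signal
```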
