
Term Selection

Term Selection. LBSC 708A/CMSC 838L Session 8, October 30, 2001 Philip Resnik. Agenda. Questions Character sets Terms as units of meaning Strings and segments Tokens and words Phrases and entities Senses and concepts Two-minute paper. IR System. Query Formulation. Query. Search.


Presentation Transcript


  1. Term Selection LBSC 708A/CMSC 838L Session 8, October 30, 2001 Philip Resnik

  2. Agenda • Questions • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • Two-minute paper

  3. Supporting the Search Process • [Diagram labels: Source Selection, Query Formulation, Query, Search (IR System), Ranked List, Selection, Document, Examination, Delivery; Acquisition, Collection, Indexing, Index]

  4. Representing Electronic Texts • A character set specifies semantic units • Characters are the smallest units of meaning • Abstract entities, separate from their representation • A font specifies the printed representation • What each character will look like on the page • Different characters might be depicted identically • An encoding is the electronic representation • What each character will look like in a file • One character may have several representations • An input method is a keyboard representation

  5. The character ‘A’ • ASCII encoding: 7 bits used per character • 0 1 0 0 0 0 0 1 = 65 (decimal) • Number of representable characters: 2^7 = 128 distinct characters, including 0 (NUL) • Some character codes are used for non-visible characters, e.g., 7 = control-G = BEL
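
The arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `ascii_bits` is made up for this example and is not part of the lecture.

```python
def ascii_bits(ch):
    """Return the 7-bit binary string for an ASCII character."""
    code = ord(ch)
    if code >= 2 ** 7:
        # 7 bits can only represent codes 0..127
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code, "07b")

print(ord("A"), ascii_bits("A"))  # 65 1000001
print(2 ** 7)                     # 128 distinct codes
print(ord("\a"))                  # 7, the BEL control character
```

Padding the seven bits with a leading zero gives the eight-bit pattern 0 1 0 0 0 0 0 1 shown on the slide.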

  6. ASCII • Widely used in the U.S. • American Standard Code for Information Interchange • ANSI X3.4-1968

     | 0 NUL | 32 SPACE | 64 @ | 96 ` |
     | 1 SOH | 33 ! | 65 A | 97 a |
     | 2 STX | 34 " | 66 B | 98 b |
     | 3 ETX | 35 # | 67 C | 99 c |
     | 4 EOT | 36 $ | 68 D | 100 d |
     | 5 ENQ | 37 % | 69 E | 101 e |
     | 6 ACK | 38 & | 70 F | 102 f |
     | 7 BEL | 39 ' | 71 G | 103 g |
     | 8 BS | 40 ( | 72 H | 104 h |
     | 9 HT | 41 ) | 73 I | 105 i |
     | 10 LF | 42 * | 74 J | 106 j |
     | 11 VT | 43 + | 75 K | 107 k |
     | 12 FF | 44 , | 76 L | 108 l |
     | 13 CR | 45 - | 77 M | 109 m |
     | 14 SO | 46 . | 78 N | 110 n |
     | 15 SI | 47 / | 79 O | 111 o |
     | 16 DLE | 48 0 | 80 P | 112 p |
     | 17 DC1 | 49 1 | 81 Q | 113 q |
     | 18 DC2 | 50 2 | 82 R | 114 r |
     | 19 DC3 | 51 3 | 83 S | 115 s |
     | 20 DC4 | 52 4 | 84 T | 116 t |
     | 21 NAK | 53 5 | 85 U | 117 u |
     | 22 SYN | 54 6 | 86 V | 118 v |
     | 23 ETB | 55 7 | 87 W | 119 w |
     | 24 CAN | 56 8 | 88 X | 120 x |
     | 25 EM | 57 9 | 89 Y | 121 y |
     | 26 SUB | 58 : | 90 Z | 122 z |
     | 27 ESC | 59 ; | 91 [ | 123 { |
     | 28 FS | 60 < | 92 \ | 124 | |
     | 29 GS | 61 = | 93 ] | 125 } |
     | 30 RS | 62 > | 94 ^ | 126 ~ |
     | 31 US | 63 ? | 95 _ | 127 DEL |

  7. Geeky Joke for the Day • Why do computer geeks confuse Halloween and Christmas? • Because 31 OCT = 25 DEC! • 031 (octal) = 0*8^2 + 3*8^1 + 1*8^0 = 25 = 0*10^2 + 2*10^1 + 5*10^0 (decimal) • P.S. Happy Halloween!

  8. The Latin-1 Character Set • ISO 8859-1 8-bit characters for Western Europe • French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English • [Code chart regions: Printable Characters, 7-bit ASCII; Additional Defined Characters, ISO 8859-1]

  9. Other ISO-8859 Character Sets • [Figure: code charts labeled -2, -3, -4, -5, -6, -7, -8, -9]

  10. East Asian Character Sets • More than 256 characters are needed • Two-byte encoding schemes (e.g., EUC) are used • Several countries have unique character sets • GB in People’s Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam • Many characters appear in several languages • Research Libraries Group developed EACC • Unified “CJK” character set for USMARC records

  11. Unicode • Goal is to unify the world’s character sets • ISO Standard 10646 • Character set and encoding scheme separated • Full “code space” is used by character codes • Extends Latin-1 • UTF-7 encoding will pass through email • Originally designed for 64 printable ASCII characters • UTF-8 encoding works with disk file systems

  12. Limitations of Unicode • Produces much larger files than Latin-1 • Fonts are hard to obtain for many characters • Some characters have multiple representations • e.g., accents can be part of a character or separate • Some characters look identical when printed • But they come from unrelated languages • The sort order may not be appropriate
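
Two of the points above (multiple representations of one character, and file size relative to Latin-1) can be seen directly with Python's standard `unicodedata` module. This is a small sketch under the assumption that normalizing to NFC is an acceptable way to compare strings; the helper name `same_after_nfc` and the sample word are invented for illustration.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

def same_after_nfc(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                # False: different code points
print(same_after_nfc(composed, decomposed))  # True: same character

# Bytes needed to store the same four-character word in three encodings.
word = "caf\u00e9"
sizes = {enc: len(word.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")}
print(sizes)  # {'latin-1': 4, 'utf-8': 5, 'utf-16-le': 8}
```

An indexer that skips normalization would treat the two spellings of "é" as different terms, so a query typed one way could miss documents stored the other way.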

  13. Agenda • Questions • Character sets • Terms as units of meaning • Strings and segments • Tokens and words • Phrases and entities • Senses and concepts • Two-minute paper

  14. Strings and Segments • Retrieval is (often) a search for concepts • But what we index are character strings • What strings best represent concepts? • In English, words are often a good choice • But well chosen phrases can be even better • In German, compounds may need to be split • Otherwise queries using constituent words would fail • In Chinese, word boundaries are not marked • Thissegmentationproblemissimilartothatofspeech • This segmentation problem is similar to that of speech

  15. Longest Substring Segmentation • A greedy segmentation algorithm • Based solely on lexical information • Start with a list of every possible term • Dictionaries are a handy source for term lists • For each unsegmented string • Remove the longest single substring in the list • Repeat until no substrings are found in the list • Can be extended to explore alternatives

  16. Longest Substring Example • Possible German compound term: • washington • List of German words: • ach, hin, hing, sei, ton, was, wasch • Longest substring segmentation • was-hing-ton • A language model might see this as bad • Roughly translates to “What tone is attached?”
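
The greedy procedure from the previous two slides can be sketched as follows. This is one possible reading of the steps, not necessarily the exact variant used in the lecture: ties between equally long terms are broken by leftmost position, and leftover pieces that match nothing are kept as they are. The function name is hypothetical.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split text around the longest lexicon term it contains."""
    best_term, best_pos = None, -1
    for term in lexicon:
        pos = text.find(term)
        if not term or pos < 0:
            continue
        longer = best_term is None or len(term) > len(best_term)
        tie_but_earlier = (best_term is not None
                           and len(term) == len(best_term)
                           and pos < best_pos)
        if longer or tie_but_earlier:
            best_term, best_pos = term, pos
    if best_term is None:
        # Nothing in the list matches: keep the remainder unsegmented.
        return [text] if text else []
    left = text[:best_pos]
    right = text[best_pos + len(best_term):]
    return (longest_substring_segment(left, lexicon)
            + [best_term]
            + longest_substring_segment(right, lexicon))

GERMAN_WORDS = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", GERMAN_WORDS)))
# was-hing-ton
```

The longest listed term inside "washington" is "hing" ("wasch" does not occur in it), and the remaining pieces "was" and "ton" are both in the list, which reproduces the segmentation on the slide.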

  17. Probabilistic Segmentation • For an input word c1 c2 c3 … cn • Try all possible partitions into w1 w2 w3 … (| marks a candidate boundary) • c1 | c2 c3 … cn • c1 c2 | c3 … cn • c1 | c2 | c3 … cn, etc. • Choose the highest probability partition • E.g., compute Pr(w1 w2 w3) using a language model • Challenges: search, probability estimation
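
A minimal sketch of this idea, assuming the simplest possible language model: a unigram model in which Pr(w1 w2 w3) is the product of individual word probabilities. Dynamic programming avoids enumerating every partition, which addresses the search challenge for this model only. The function name and the probabilities in `WORD_PROB` are invented for illustration and are not estimates from any corpus.

```python
import math

def best_segmentation(text, word_prob, max_len=20):
    """Return the highest-probability split of text into known words, or None."""
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_prob[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no partition uses only known words
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical probabilities, chosen only to make the example concrete.
WORD_PROB = {"was": 0.02, "hing": 0.001, "ton": 0.005, "washington": 0.01}
print(best_segmentation("washington", WORD_PROB))  # ['washington']
```

With these made-up numbers the unsplit word scores 0.01 against 0.02 * 0.001 * 0.005 for was-hing-ton, so the model prefers leaving "washington" whole. Probability estimation for real text remains the hard part.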

  18. Non-Segmentation: N-gram Indexing • Consider a Chinese document c1 c2 c3 … cn • Don’t segment (you could be wrong!) • Instead, treat every character bigram as a term • c1 c2, c2 c3, c3 c4, …, cn-1 cn • Break up queries the same way
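
Overlapping character bigrams are simple to generate. The sketch below assumes whitespace should be ignored and applies the same function to documents and queries; the name `char_bigrams` is hypothetical.

```python
def char_bigrams(text):
    """Return every overlapping two-character term, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

print(char_bigrams("中国人民"))  # ['中国', '国人', '人民']
```

A string of n characters yields n-1 bigram terms, and a query is matched by breaking it up in exactly the same way.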

  19. Tokens and Words • What is a word? • Kindergarten • Aux armes! • Doug’s running • Realistic review resubmit • Morphology: • How morphemes combine to make words • Morphemes are units of meaning • Remember antidisestablishmentarianism? • Anti (disestablishmentarian) ism

  20. Morphemes and Roots • Inflectional morphology • Preserves part of speech • Destructions = Destruction+PLURAL • Destroyed = Destroy+PAST • Derivational morphology • Relates parts of speech • Destructor = AGENTIVE(destroy) • Can help IR performance, but expensive • Getting derivational morphology right is hard • {peninsula,insulate}:insula (Lat. “island”) ???

  21. Stemming • Stem: in IR, a word equivalence class that preserves the main concept. • Often obtained by affix-stripping (Porter, 1980) • {destroy, destroyed, destruction}: destr • Inexpensive to compute • Usually helps IR performance • Can make mistakes! (over-/understemming) • {centennial,century,center}: cent • {acquire,acquiring,acquired}: acquir {acquisition}: acquis
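
To make affix-stripping concrete, here is a deliberately tiny suffix stripper. It is NOT the Porter (1980) algorithm, whose rules are considerably more elaborate and whose output can differ; the suffix list and the `toy_stem` name are assumptions made for this illustration.

```python
# Suffixes are tried in this order, longest first.
SUFFIXES = ("ation", "ition", "ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("acquire", "acquiring", "acquired", "acquisition"):
    print(w, "->", toy_stem(w))
```

The three verb forms all reduce to "acquir", while "acquisition" ends up as "acquis". The two stems do not match, which is the understemming mistake listed on the slide.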

  22. Roots and Stems: beyond English • Arabic: alselam • Stem: selam • Root: SLM (peace) • Semantic families: altaliban • Stem: taliban (student) • Root: TLB (question) • Current research on best level of analysis

  23. Phrases and Entities • Multi-word combinations identify entities • The president, Dubya, George W. Bush • Can also identify relationships of interest • Derek Jones, CEO of SadAndBankrupt.com,… • Entity roles, filling slots in templates

  24. Named Entity Identification • Major categories of named entities • Influenced by text genres of interest… mostly news • Person, organization, location, date, money, … • Decent algorithms based on finite automata • Best algorithms based on supervised learning • Annotate a corpus identifying entities and types • Train a probabilistic model • Apply the model to new text

  25. Example: Predictive Annotation for Question Answering • In reality, at the time of Edison’s [PERSON] 1879 [TIME] patent, the light bulb had been in existence for some five decades …. • Who patented the light bulb? patent light bulb PERSON • When was the light bulb patented? patent light bulb TIME • In what year was the light bulb patented? ??? • What did Thomas Edison patent?

  26. General Phrase Identification • Two types of phrases • Compositional: meaning derived from parts • Noncompositional: idiomatic expressions • e.g., “kick the bucket” or “buy the farm” • Three sources of evidence • Dictionary lookup • Parsing • Co-occurrence

  27. Known Phrases • Same idea as longest substring match • But look for word (not character) sequences • Compile a term list that includes phrases • Technical terminology can be very helpful • Index any phrase that occurs in the list • Most effective in a limited domain • Otherwise hard to capture most useful phrases
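
One way to apply a phrase list to tokenized text is a left-to-right longest match over word sequences. This is a sketch of a common variant rather than the specific method from the lecture; the function name, the `max_len` cutoff, and the sample phrase list are assumptions.

```python
def find_known_phrases(tokens, phrase_list, max_len=4):
    """Scan left to right, taking the longest listed phrase at each position."""
    phrases = {tuple(p.lower().split()) for p in phrase_list}
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first; only multi-word phrases count.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrases:
                match = candidate
                break
        if match:
            found.append(" ".join(match))
            i += len(match)
        else:
            i += 1
    return found

PHRASES = ["lung cancer", "inflammatory sinonasal disease", "sinonasal disease"]
tokens = "the patient has inflammatory sinonasal disease and lung cancer".split()
print(find_known_phrases(tokens, PHRASES))
# ['inflammatory sinonasal disease', 'lung cancer']
```

Because the longest match wins, "sinonasal disease" is not reported separately here. An indexer that also wants the shorter phrase, or the individual words, would need to emit those as additional terms.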

  28. Syntactic Phrases • Automatically construct sentence diagrams • Fairly good parsers are available • Index the noun phrases • Assumes that queries will focus on objects • [Parse tree: Sentence = Noun Phrase (The/Det quick/Adj brown/Adj fox/Noun) + jumped/Verb + Prepositional Phrase (over/Prep + Noun Phrase (the/Det lazy/Adj dog’s/Adj back/Noun))]

  29. Syntactic Variations • The “paraphrase problem” • Prof. Douglas Oard studies information access patterns. • Doug studies patterns of user access to different kinds of information. • Transformational variants (Jacquemin) • Coordinations • lung and breast cancer → lung cancer • Substitutions • inflammatory sinonasal disease → inflammatory disease • Permutations • addition of calcium → calcium addition

  30. Phrase Discovery: Collocations • Compute observed occurrence probability • For each single word and each word n-gram • “buy” 10 times in 1000 words yields 0.01 • “the” 100 times in 1000 words yields 0.10 • “farm” 5 times in 1000 words yields 0.005 • “buy the farm” 4 times in 1000 words yields 0.004 • Compute n-gram probability if truly independent • 0.01*0.10*0.005=0.000005 • Compare with observed probability • Record phrases that occur more often than expected
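
The comparison on this slide takes only a few lines to compute. The sketch below uses the slide's own counts; the function name and the choice to report a simple observed/expected ratio (rather than another association measure) are assumptions.

```python
def independence_ratio(counts, ngram, total):
    """Return (observed, expected, ratio) for an n-gram under word independence."""
    observed = counts[ngram] / total
    expected = 1.0
    for word in ngram.split():
        expected *= counts[word] / total
    return observed, expected, observed / expected

# Counts per 1000 words, taken from the slide.
counts = {"buy": 10, "the": 100, "farm": 5, "buy the farm": 4}
observed, expected, ratio = independence_ratio(counts, "buy the farm", 1000)
print(f"observed={observed:.4f} expected={expected:.6f} ratio={ratio:.0f}")
```

Here the phrase occurs roughly 800 times more often than independence would predict (0.004 against 0.000005), so it would be recorded as a candidate phrase. A threshold on the ratio is still needed in practice, and very rare n-grams can produce large ratios by chance.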

  31. Phrase Indexing Lessons • Poorly chosen phrases hurt effectiveness • And some techniques can be slow (e.g., parsing) • Better to index phrases and words • Want to find constituents of compositional phrases • Better weighting schemes → less benefit • Negligible improvement in some TREC systems • Very helpful for cross-language retrieval • Noncompositional translation, reduced ambiguity

  32. Cross-Language IR and Phrases • Poser: quite ambiguous (Langenscheidt) • Place, put (a question, a motion) • Lay down (a principle) • Hang (curtains) • Set (a problem) • Poser une question: meaning is clear! • Ask a question • In this case, better to use the phrase • But is this really about phrases?

  33. Senses and Concepts • What is a word sense? • Entry in a dictionary or thesaurus • Position or cluster in a semantic space • What is word sense disambiguation? • Identifying intended sense(s) from context • Goal for IR • Match on the intended concept, not just the words

  34. Problems With Word Matching • Word matching suffers from two problems • Synonymy: paper vs. article • Homonymy: bank (river) vs. bank (financial) • Disambiguation in IR: seek to resolve homonymy • Index word senses rather than words • Synonymy usually addressed by • Thesaurus-based query expansion • Latent semantic indexing

  35. Word Sense Disambiguation • Context provides clues to word meaning • “The doctor removed the appendix.” • For each occurrence, note surrounding words • Typically +/- 5 non-stopwords • Group similar contexts into clusters • Based on overlaps in the words that they contain • Separate clusters represent different senses

  36. Disambiguation Example • Consider four example sentences • The doctor removed the appendix • The appendix was incomprehensible • The doctor examined the appendix • The appendix was removed • What clusters can you find? • Can you find enough word senses this way? • Might you find too many word senses?
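
One rough way to try the clustering procedure on these four sentences is to merge any contexts that share a non-stopword. This is a sketch under simplifying assumptions: a tiny hand-made stopword list, whole sentences used as context instead of a +/- 5 word window, and single-link merging. All names are hypothetical.

```python
STOPWORDS = {"the", "was", "a", "an", "of"}

def context_words(sentence, target):
    """Non-stopword context of the target word within one sentence."""
    words = sentence.lower().replace(".", "").split()
    return {w for w in words if w != target and w not in STOPWORDS}

def cluster_contexts(sentences, target):
    """Group sentence indices whose contexts overlap in at least one word."""
    clusters = []  # list of (context word set, sentence indices)
    for i, sentence in enumerate(sentences):
        words, ids = context_words(sentence, target), [i]
        remaining = []
        for other_words, other_ids in clusters:
            if other_words & context_words(sentence, target):
                words |= other_words  # merge overlapping cluster
                ids += other_ids
            else:
                remaining.append((other_words, other_ids))
        clusters = remaining + [(words, ids)]
    return sorted(sorted(ids) for _, ids in clusters)

SENTENCES = [
    "The doctor removed the appendix",
    "The appendix was incomprehensible",
    "The doctor examined the appendix",
    "The appendix was removed",
]
print(cluster_contexts(SENTENCES, "appendix"))  # [[0, 2, 3], [1]]
```

With this crude overlap rule the three sentences mentioning "doctor" or "removed" fall into one cluster and the "incomprehensible" sentence stands alone. Whether that yields enough senses, or too many on larger data, is exactly the question the slide raises.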

  37. Why Disambiguation Hurts • Bag-of-words techniques already disambiguate • When more words are present, documents rank higher • So a context for each term is established in the query • Formal disambiguation tries to improve precision • But incorrect sense assignments would hurt recall • Hard to distinguish homonymy from fine-grained polysemy • Average precision balances recall and precision • But the possible precision gains are small • And current techniques substantially hurt recall

  38. Where Could Disambiguation Help? • Categorization of whole documents • Identifying location(s) in a topic hierarchy • Visualization • People are good at seeing signal amidst noise • Probabilistic models • Combining different sources of evidence • (Requires n-best rather than 1-best responses)

  39. Summary • The goal is to index the right meaning units • Start by finding fundamental features • Characters or shape codes (for OCR), etc. • Combine them into easily recognized units • Words where possible, character n-grams otherwise • Consider alternatives to splitting or forming phrases • But stemming is generally a good idea • Usually best to match those units directly • Disambiguation strategies hurt more than they help

  40. Two Minute Papers • If you were indexing a collection of Arabic documents that concern biological terrorism, what term extraction strategy or strategies would you use? • What was the muddiest point in today’s class?
