
Digital Libraries: Steps toward information finding


Presentation Transcript


  1. Digital Libraries: Steps toward information finding Dr. Lillian N. Cassel Villanova University

  2. But first,

  3. and

  4. A brief introduction to Information Retrieval • Resource: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. • The entire book is available online, free, at http://nlp.stanford.edu/IR-book/information-retrieval-book.html • I will use some of the slides that they provide to go with the book.

  5. Author’s definition • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). • Note the use of the word “usually.” We will see examples where the material is not documents, and not text. • We have already seen that the collection may be distributed over several computers

  6. Examples and Scaling • IR is about finding a needle in a haystack – finding some particular thing in a very large collection of similar things. • Our examples are necessarily small, so that we can comprehend them. Do remember that everything we say must scale to very large quantities.

  7. Just how much information? • Libraries are about access to information. • What sense do you have about information quantity? • How fast is it growing? • Are there implications for the quantity and rate of increase?

  8. How much information is there? • The scale of data sizes: kilo, mega, giga, tera, peta, exa, zetta, yotta (and on the small side: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto) • Rough landmarks on that scale: a book, a photo, a movie, all books (words), all multimedia, everything recorded • Soon most everything will be recorded and indexed – most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies; these require algorithms, data and knowledge representation, and knowledge of the domain • See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ • Slide source: Jim Gray, Microsoft Research (modified)

  9. Where does the information come from? • Many sources • Corporations • Individuals • Interest groups • News organizations • Accumulated through crawling

  10. Basic Crawl Architecture [Diagram: the URL frontier feeds the fetch module (which consults DNS and the WWW); fetched pages go to parse, then the content-seen? test (using doc fingerprints), then the URL filter (using robots filters), then duplicate URL elimination (using the URL set), and surviving URLs return to the URL frontier] Ref: Manning Introduction to Information Retrieval

  11. Crawler Architecture • Modules: • The URL frontier (the queue of URLs still to be fetched, or fetched again) • A DNS resolution module (The translation from a URL to a web server to talk to) • A fetch module (use http to retrieve the page) • A parsing module to extract text and links from the page • A duplicate elimination module to recognize links already seen Ref: Manning Introduction to Information Retrieval
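To make the module list concrete, here is a minimal single-threaded sketch in Python using only the standard library. The helper names, the naive regex link extraction, and the example URL filter are illustrative assumptions rather than the book's implementation; politeness and robots.txt handling (covered on the next slides) are omitted.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"]+)"')      # naive link extraction (parse module)

def fetch(url):
    # Fetch module: retrieve the page over HTTP.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def passes_filters(url):
    # URL filter: an assumed example policy (only crawl http(s) .edu pages).
    return url.startswith("http") and ".edu" in url

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # URL frontier: URLs still to fetch
    seen_urls = set(seed_urls)               # duplicate URL elimination
    seen_fingerprints = set()                # "content seen?" test (exact hash)
    while frontier and len(seen_fingerprints) < max_pages:
        url = frontier.popleft()
        try:
            page = fetch(url)
        except OSError:
            continue                         # fetch failed; move on
        fingerprint = hash(page)
        if fingerprint in seen_fingerprints:
            continue                         # duplicate content; skip
        seen_fingerprints.add(fingerprint)
        for raw_link in LINK_RE.findall(page):   # parsing module: extract links
            link = urljoin(url, raw_link)        # normalize relative links
            if link not in seen_urls and passes_filters(link):
                seen_urls.add(link)
                frontier.append(link)
```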

  12. Crawling threads • With so much space to explore, so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor or threads may be distributed over many nodes in a distributed system.

  13. Politeness • Not optional. • Explicit • Specified by the web site owner • What portions of the site may be crawled and what portions may not be crawled • robots.txt file • Implicit • If no restrictions are specified, still restrict how often you hit a single site. • You may have many URLs from the same site. Too much traffic can interfere with the site’s operation. Crawler hits are much faster than ordinary traffic – could overtax the server. (Constitutes a denial of service attack) Good web crawlers do not fetch multiple pages from the same server at one time.

  14. Robots.txt • Protocol nearly as old as the web • See www.robotstxt.org/robotstxt.html • File: URL/robots.txt • Contains the access restrictions • Example: User-agent: * Disallow: /yoursite/temp/ (applies to all robots/spiders/crawlers) User-agent: searchengine Disallow: (the robot named searchengine has nothing disallowed) • Source: www.robotstxt.org/wc/norobots.html

  15. Another example User-agent: * Disallow: /cgi-bin/ Disallow: /tmp/ Disallow: /~joe/

  16. Processing robots.txt • First line: • User-agent – identifies to whom the instruction applies. * = everyone; otherwise, specific crawler name • Disallow: or Allow: provides path to exclude or include in robot access. • Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site. • Just takes time, and uses up hits on the server • Cache the robots.txt file for repeated reference
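As a concrete illustration, the slide-14 robots.txt can be processed with Python's standard urllib.robotparser. Parsing the text directly (rather than fetching URL/robots.txt and caching it, as a real crawler would) just keeps the sketch self-contained; the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# All robots are excluded from /yoursite/temp/ ...
print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))            # False
# ... but the robot named "searchengine" has nothing disallowed.
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html")) # True
```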

  17. Robots <META> tag • robots.txt provides information about access to a directory. • A given file may have an html meta tag that directs robot behavior • A responsible crawler will check for that tag and obey its direction. • Ex: • <META NAME=“ROBOTS” CONTENT = “INDEX, NOFOLLOW”> • OPTIONS: INDEX, NOINDEX, FOLLOW, NOFOLLOW See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
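A responsible crawler can check for that tag before indexing a page or following its links. The sketch below uses Python's standard html.parser; the class name and the HTML snippet are assumed examples.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <META NAME="ROBOTS" ...> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "") or ""
            self.directives |= {d.strip().upper() for d in content.split(",")}

page = '<html><head><META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
may_index = "NOINDEX" not in parser.directives     # True: indexing is allowed
may_follow = "NOFOLLOW" not in parser.directives   # False: do not follow its links
```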

  18. Crawling • Pick a URL from the frontier (which one? – a policy decision) • Fetch the document at the URL • Parse the fetched document • Extract links from it to other docs (URLs) • Check if the document's content has already been seen • If not, add to indices • For each extracted URL • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.) • Check if it is already in the frontier (duplicate URL elimination) Ref: Manning Introduction to Information Retrieval

  19. Basic Crawl Architecture (repeated) [Diagram: same architecture as slide 10 – URL frontier → fetch (DNS, WWW) → parse → content seen? (doc fingerprints) → URL filter (robots filters) → dup URL elim (URL set) → back to the URL frontier] Ref: Manning Introduction to Information Retrieval

  20. DNS – Domain Name System • Internet service to resolve host names (from URLs) into IP addresses • Distributed servers, some significant latency possible • OS implementations – DNS lookup is blocking – only one outstanding request at a time • Solutions • DNS caching • Batch DNS resolver – collects requests and sends them out together Ref: Manning Introduction to Information Retrieval
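A small sketch of the DNS caching idea, assuming Python's standard socket module; a real resolver cache would also respect DNS time-to-live values, which this ignores.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(hostname):
    # Blocking lookup the first time; later calls for the same host hit the cache.
    return socket.gethostbyname(hostname)

print(resolve("www.villanova.edu"))   # network lookup
print(resolve("www.villanova.edu"))   # answered from the cache, no blocking
```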

  21. Parsing • Fetched page contains • Embedded links to more pages • Actual content for use in the application • Extract the links • Relative link? Expand (normalize) • Seen before? Discard • New? • Meet criteria? Append to URL frontier • Does not meet criteria? Discard • Examine content
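Expanding (normalizing) relative links is a one-liner with the standard library; the base URL and links below are illustrative.

```python
from urllib.parse import urljoin, urldefrag

base = "http://www.example.edu/dept/index.html"
for raw in ["courses.html", "/people/", "../about.html#staff"]:
    absolute, _fragment = urldefrag(urljoin(base, raw))  # expand, drop any #fragment
    print(absolute)
# http://www.example.edu/dept/courses.html
# http://www.example.edu/people/
# http://www.example.edu/about.html
```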

  22. Content • Seen before? • How to tell? • Fingerprints, shingles • Documents may be identical or merely similar • If already in the index, do not process it again Ref: Manning Introduction to Information Retrieval
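One way to implement the "content seen?" test is to compare sets of k-word shingles: if two pages share most of their shingles they are treated as near-duplicates. The k value, the hash choice, and the threshold below are illustrative assumptions.

```python
import hashlib

def shingles(text, k=4):
    # Hash every run of k consecutive words into a fingerprint.
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(len(words) - k + 1)}

def near_duplicate(doc_a, doc_b, threshold=0.9):
    # Jaccard overlap of the two shingle sets.
    a, b = shingles(doc_a), shingles(doc_b)
    jaccard = len(a & b) / len(a | b) if (a | b) else 1.0
    return jaccard >= threshold
```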

  23. Distributed crawler • For big crawls, • Many processes, each doing part of the job • Possibly on different nodes • Geographically distributed • How to distribute • Give each node a set of hosts to crawl • Use a hashing function to partition the set of hosts • How do these nodes communicate? • Need to have a common index Ref: Manning Introduction to Information Retrieval
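Partitioning the hosts with a hash function, as the slide suggests, can be as simple as the sketch below; the node count and host names are illustrative.

```python
import hashlib

NUM_NODES = 4

def node_for_host(hostname):
    # Hash the host name and map it to one of the crawler nodes.
    digest = hashlib.sha1(hostname.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

for host in ["www.villanova.edu", "nlp.stanford.edu", "www.robotstxt.org"]:
    print(host, "-> node", node_for_host(host))
```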

  24. Communication between nodes • The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes [Diagram: the single-node architecture extended with a host splitter between the URL filter and dup URL elim, sending URLs to other hosts and receiving URLs from other hosts before they reach the URL frontier] Ref: Manning Introduction to Information Retrieval

  25. URL Frontier • Two requirements • Politeness: do not go too often to the same site • Freshness: keep pages up to date • News sites, for example, change frequently • Conflicts – The two requirements may be directly in conflict with each other. • Complication • Fetching URLs embedded in a page will yield many URLs located on the same server. Delay fetching those. Ref: Manning Introduction to Information Retrieval

  26. Now that we have a collection • How will we ever find the needle in the haystack? The one bit of information needed? • After crawling, or other resource acquisition step, we need to create a way to query the information we have • Next step: Index • Example content: Shakespeare’s plays

  27. Searching Shakespeare • Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? • See http://www.rhymezone.com/shakespeare/ • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia • Why is that not the answer? • Slow (for large corpora) • NOT Calpurnia is non-trivial • Other operations (e.g., find the word Romans near countrymen) not feasible • Ranked retrieval (best documents to return)

  28. Term-document incidence • Query: Brutus AND Caesar BUT NOT Calpurnia • First approach – make a matrix with terms on one axis and plays on the other: all the terms × all the plays, with a 1 if the play contains the word and 0 otherwise

  29. Incidence Vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND. • 110100 AND 110111 AND 101111 = 100100.
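The same bitwise AND written out in Python, with integers standing in for the 0/1 incidence vectors (101111 is the complement of Calpurnia's vector 010000):

```python
brutus        = 0b110100
caesar        = 0b110111
not_calpurnia = 0b101111   # complement of Calpurnia's vector 010000

answer = brutus & caesar & not_calpurnia
print(format(answer, "06b"))   # 100100 -> Antony and Cleopatra, Hamlet
```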

  30. Answer to query • Antony and Cleopatra • Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. • Hamlet • Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

  31. Try another one • What is the vector for the query • Antony and mercy • What would we do to find Antony OR mercy?

  32. Basic assumptions about information retrieval • Collection: Fixed set of documents • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

  33. The classic search model • TASK: ultimately, some task to perform • Info need: some information is required in order to perform the task • Verbal form: the information need must be expressed in words (usually) • Query: the information need must be expressed as a query that can be processed • The query goes to the SEARCH ENGINE, which searches the corpus and returns results • It may be necessary to rephrase the query and try again (query refinement)

  34. The classic search model: potential pitfalls between task and query results • Misconception? Mistranslation? Misformulation? • TASK: get rid of mice in a politically correct way • Info need: info about removing mice without killing them • Verbal form: how do I trap mice alive? • Query: mouse trap • The search engine returns results from the corpus; a poor query may require query refinement

  35. How good are the results? • Precision: How well do the results match the information need? • Recall: What fraction of the available correct results were retrieved? • These are the basic concepts of information retrieval evaluation.

  36. Size considerations • Consider N = 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation • 6GB of data in the documents. • Say there are M = 500K distinct terms among these.

  37. The matrix does not work • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s. • matrix is extremely sparse. • What’s a better representation? • We only record the 1 positions. • i.e. We don’t need to know which documents do not have a term, only those that do. Why?

  38. Inverted index • For each term t, we must store a list of all documents that contain t • Identify each document by a docID, a document serial number • Can we use fixed-size arrays for this? • Example postings: Brutus → 1 2 4 11 31 45 173 174; Caesar → 1 2 4 5 6 16 57 132; Calpurnia → 2 31 54 101 • What happens if we add document 14, which contains “Caesar”?

  39. Inverted index • We need variable-size postings lists • On disk, a continuous run of postings is normal and best • In memory, can use linked lists or variable length arrays • Some tradeoffs in size/ease of insertion • The index has two parts: the dictionary of terms and, for each term, its postings list; each entry in a postings list is a posting (a docID) • Brutus → 1 2 4 11 31 45 173 174; Caesar → 1 2 4 5 6 16 57 132; Calpurnia → 2 31 54 101 • Postings are sorted by docID (more later on why)
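With variable-size postings lists, adding document 14 (which contains Caesar) is just a sorted insert. A tiny Python sketch, using plain lists as the in-memory representation:

```python
import bisect

postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

bisect.insort(postings["Caesar"], 14)   # keep the list sorted by docID
print(postings["Caesar"])               # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```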

  40. Inverted index construction • Documents to be indexed: “Friends, Romans, countrymen.” • Tokenizer → token stream: Friends Romans Countrymen • Linguistic modules (stop words, stemming, capitalization, cases, etc.) → modified tokens: friend roman countryman • Indexer → inverted index, e.g. friend → 2 4, roman → 1 2, countryman → 13 16

  41. Indexer steps: Token sequence • Sequence of (modified token, document ID) pairs • Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. • Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  42. Indexer steps: Sort • Sort by terms, and then by docID • This is the core indexing step

  43. Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added.
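A compact sketch of these indexer steps on the two Julius Caesar snippets from slide 41. The naive lowercasing and punctuation stripping stand in for the linguistic modules; a real tokenizer would do more.

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(token.strip(".;,'").lower(), doc_id)
         for doc_id, text in docs.items()
         for token in text.split()]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge duplicates within a document, split into dictionary and
# postings, and record document frequency.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)
dictionary = {term: len(doc_ids) for term, doc_ids in postings.items()}

print(postings["brutus"])     # [1, 2]
print(dictionary["caesar"])   # 2  (Caesar appears in both documents)
```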

  44. Where do we pay in storage? • Lists of docIDs (the postings) • Terms and counts (the dictionary) • Pointers from dictionary entries to their postings lists

  45. Storage • A small diversion • Computer storage • Processor caches • Main memory • External storage (hard disk, other devices) • Very substantial differences in access speeds • Processor caches – small, very fast memory holding data that will be needed again soon • Main memory – limited quantities, high speed access • Hard disk – much larger quantities, speed restricted, access in fixed units (blocks)

  46. Some size examples • From the MacBook Pro • Memory • 8GB (two 4GB SO-DIMMs) of 1333MHz DDR3 SDRAM • Hard drive • 500GB 7200-rpm Serial ATA hard drive • The Cloud • 2 TB PogoPlug • Dropbox, iCloud, etc. These are big numbers, but the potential size of a significant collection is larger still. The steps taken to optimize use of storage are critical to satisfactory response time.

  47. Implications of size limits • Virtual memory (on disk) vs. real memory (RAM) • A reference from virtual page 2 to virtual page 71, which is not in real memory, generates a “page fault” • Virtual page 71 is then brought from disk into real memory

  48. How do we process a query? • Using the index we just built, examine the terms in some order, looking for the terms in the query.

  49. Query processing: AND • Consider processing the query: Brutus AND Caesar • Locate Brutus in the dictionary; retrieve its postings: 2 4 8 16 32 64 128 • Locate Caesar in the dictionary; retrieve its postings: 1 2 3 5 8 13 21 34 • “Merge” (intersect) the two postings lists

  50. The merge • Walk through the two postings lists simultaneously, in time linear in the total number of postings entries • Brutus → 2 4 8 16 32 64 128; Caesar → 1 2 3 5 8 13 21 34; merged result → 2 8 • If the list lengths are x and y, the merge takes O(x+y) operations. What does that mean? • Crucial: postings sorted by docID.
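The merge written out as a small Python function; the two lists are the Brutus and Caesar postings from the example, and the loop makes a single linear pass over both.

```python
def intersect(p1, p2):
    # Walk both sorted postings lists once; O(x + y) comparisons.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```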
