
Information retrieval



Presentation Transcript


  1. Information retrieval 2019/2020

  2. crawler • web crawler, Web spider, Web robot • Starts from one or several seed sources (URLs) • Stores documents in a cache / retrieved data • Looks for new URLs within downloaded documents • Stores new URLs on the stack/queue • Visits the next URL (recursively / from the stack)

  3. example Hyperlinks are underlined Depth-first: 1,3,2,4,5,6 Breadth-first: 1,3,6,4,2,5
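The two traversal orders can be sketched with an explicit frontier — a queue for breadth-first, a stack for depth-first. The link graph below is a made-up example, not the graph from the slide's figure:

```python
from collections import deque

def crawl_order(graph, start, breadth_first=True):
    """Return the visit order over a link graph (dict: page -> outlinks)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Queue behaviour gives breadth-first, stack behaviour depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical link graph (illustration only).
graph = {1: [2, 3], 2: [4], 3: [5, 6]}
```

With this graph, breadth-first visits level by level, while depth-first follows the most recently discovered link first.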

  4. architecture

  5. strategies • Breadth-first • Depth-first • Partial PageRank • Restrictions: • Max number of downloaded pages • Max depth • Max time • Document types • Selected domains • URL restrictions based on regular expressions • Download only static documents

  6. crawling policies • selection policy • Which pages should be downloaded • re-visit policy • When to visit a page again • politeness policy • Do not irritate your colleagues • parallelization policy • How to perform a parallel crawl

  7. selection policy • breadth-first • Most used? • Pages with high PageRank will be visited first • Can be improved by partial PageRank • backlink count • Number of links pointing to the page • partial PageRank • Computed based on already collected URLs • OPIC (On-line Page Importance Computation) • each page is given an initial sum of "cash" which is distributed equally among the pages it points to
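The OPIC cash-distribution step described above can be sketched as follows (a minimal illustration; the function and variable names are my own, not from the original OPIC paper):

```python
def opic_step(cash, history, graph, page):
    """Visit `page`: bank its cash into history, split it among outlinks."""
    amount = cash[page]
    history[page] = history.get(page, 0.0) + amount  # accumulated importance
    cash[page] = 0.0
    outlinks = graph.get(page, [])
    if outlinks:
        share = amount / len(outlinks)  # equal split among pointed-to pages
        for target in outlinks:
            cash[target] = cash.get(target, 0.0) + share
    return cash, history

# Toy graph; every page starts with the same amount of cash.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {p: 1.0 for p in graph}
history = {}
opic_step(cash, history, graph, "a")
```

The crawler would then pick the uncrawled page with the most cash as the next candidate.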

  8. deep web • Sometimes dynamic pages (?&…) • Sometimes reachable only through search: • No links pointing to the site • Sitemaps • …

  9. re-visit policy • uniform • we synchronize all elements at the same rate, regardless of how often they change; that is, all elements are synchronized at the same frequency • proportional • we synchronize element e with a frequency f that is proportional to its change frequency λ • freshness of copy • freshness is the fraction of the local database that is up-to-date • “Best strategy” – based on the domain • (weighted) proportional + ignoring highly dynamic pages

  10. re-visit policy • Junghoo Cho and Hector Garcia-Molina. 2003. Effective page refresh policies for Web crawlers. ACM Trans. Database Syst. 28, 4 (December 2003), 390-426. • “we prove that the uniform policy is better than the proportional policy under any distribution of λ values” • more than 20% of pages had changed whenever we visited them • more than 40% of pages in the com domain changed every day • pages in the edu and gov domains are very static

  11. politeness policy • Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time. • Server overload, especially if the frequency of accesses to a given server is too high. • Poorly written crawlers, which can crash servers or routers. • Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

  12. politeness policy • Time interval between requests • Identification – User-agent HTTP request header • Crawler trap • “is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash.” https://fleiner.com/bots/
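A minimal sketch of the time-interval part of a politeness policy — tracking, per host, when the next request is allowed. The class and method names are illustrative, not from any crawler framework:

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Tracks, per host, the earliest time a next request is allowed."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url, now=None):
        """Seconds to wait before fetching `url` (0.0 if allowed now)."""
        now = time.monotonic() if now is None else now
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url, now=None):
        """Remember that we just hit this host; block it for `delay` seconds."""
        now = time.monotonic() if now is None else now
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.delay
```

Different hosts are independent, so a parallel crawler can stay busy on other sites while one host cools down.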

  13. crawler trap • http://example.com/bar/foo/bar/foo/bar/foo/bar/... • dynamic pages with an infinite number of pages (e.g., calendar) • http://www.example.org/calendar/events?&page=1&mini=2015-09&mode=week&date=2021-12-04 • extremely long pages (large amounts of text that can crash a lexical analyzer) • …
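Traps like the repeated /bar/foo/... path can often be caught with cheap URL heuristics before fetching. A sketch (the thresholds are arbitrary illustrations):

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_depth=10, max_repeats=2):
    """Heuristic trap check: too many path segments, or any segment
    repeated more than `max_repeats` times (e.g. /bar/foo/bar/foo/...)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    for seg in set(segments):
        if segments.count(seg) > max_repeats:
            return True
    return False
```

Calendar-style traps (ever-growing query parameters) need additional rules, e.g. capping the number of pages fetched per URL pattern.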

  14. parallelization policy • Dynamic assignment • Central server is balancing load and URLs • A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders. • A large crawler configuration, in which the DNS resolver and the queues are also distributed. • Static assignment • Nodes inform others which pages are downloaded • Websites are assigned to nodes by hashing the URL/host
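Static assignment is commonly implemented by hashing the host name, so every URL of a site maps to the same crawler node (which also keeps politeness state local to one node). A sketch:

```python
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Static assignment: hash the host so all URLs from one site
    land on the same crawler node."""
    host = urlparse(url).netloc
    # A stable hash (not Python's randomized hash()) so all nodes agree.
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes
```

Because the mapping is deterministic, any node that discovers a URL can route it to its owner without consulting a central server.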

  15. problem of similar sources • URL normalization, hash, page fingerprint • Exactly identical content is rare • The crawler tries to detect differences between sites and decides whether they are duplicates
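URL normalization and page fingerprinting can be sketched as follows. The normalization rules shown are a small illustrative subset — real crawlers apply many more:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL: lowercase scheme/host, drop the fragment,
    strip the default HTTP port and any trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.endswith(":80") and parts.scheme == "http":
        host = host[:-3]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def fingerprint(text):
    """Cheap page fingerprint: hash of whitespace-normalized content.
    Catches trivially identical pages; near-duplicates need shingling."""
    canonical = " ".join(text.split()).lower()
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Two URLs that normalize to the same string, or two pages with the same fingerprint, need not be fetched and indexed twice.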

  16. crawler vs. scraper https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping

  17. text operations

  18. ir process

  19. parsing complications • What format is it in? • pdf/word/excel/html? • What language is it in? • What character set encoding is in use? • Each of these is a classification problem, which we will study later in the course • But these tasks are often done heuristically: • The classification is predicted with simple rules • Example: if a document contains many occurrences of “the”, then it is English.
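The "many occurrences of “the” → English" rule generalizes to counting stopwords from several languages; a toy sketch with deliberately tiny word lists (real systems use much larger ones):

```python
# Tiny, illustrative stopword lists -- far too small for production use.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}

def guess_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

This simple-rule approach is fast and surprisingly accurate for longer texts, but fails on short strings, which is why classifier-based detection is studied later.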

  20. parsing complications • Documents being indexed can include docs from many different languages • A single index may have to contain terms of several languages • Sometimes a document or its components can contain multiple languages/formats • French email with a German pdf attachment

  21. segmentation • Header, • Footer, • Menu and navigation, • Main content. • Sentences, • Paragraphs, • Bullets, • Chapters with headline.

  22. emails segmentation • Header, • Email text, • Replied or forwarded text, • Attachments, • Signature.

  23. segmentation approaches • Statistical approaches • Number of words or links compared to other segments • Machine learning • Supervised learning • Feature engineering • Patterns • Regexp, trees, graphs… • Visual approaches

  24. segmentation approaches https://www.ics.uci.edu/~lopes/teaching/cs221W15/slides/WebCrawling.pdf

  25. to text conversion • HTML: NekoHTML • http://nekohtml.sourceforge.net/ • DOC: MS Word – Apache POI • http://poi.apache.org/ • PDF: OS Linux – pdftotext; Java – PDFBox • http://pdfbox.apache.org/ • Emails: .eml format, mail server, Thunderbird (not MS Outlook), JavaMail library • http://www.oracle.com/technetwork/java/javamail/index.html • Apache Tika • Unified API

  26. tokenization • (Garabík et al., 2004): A token is an arbitrary unit of text that extends the linguistic notion of a word. In automatic text segmentation, a token is any string of characters between two whitespace characters, as well as individual punctuation marks, which need not be separated by whitespace from the preceding or following token. From a formal point of view, text thus consists of tokens and whitespace.

  27. tokenization • Input: “Friends, Romans and Countrymen” • Output: Tokens • Friends • Romans • and • Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit?

  28. tokenization • Issues in tokenization: • Finland’s capital → • Finland? Finlands? Finland’s? • Hewlett-Packard → Hewlett and Packard • as two tokens? • state-of-the-art: break up hyphenated sequence • co-education • lowercase, lower-case, lowercase ? • San Francisco: one token or two? • How do you decide it is one token?

  29. general idea • If you consider 2 tokens (e.g. splitting words with hyphens) then queries containing only one of the two tokens will match • Ex1. Hewlett-Packard – a query for “packard” will retrieve documents about “Hewlett-Packard” OK? • Ex2. San Francisco – a query for “francisco” will match docs about “San Francisco” OK? • If you consider 1 token then a query containing only one of the two possible tokens will not match • Ex3. co-education – a query for “education” will not match docs about “co-education”.
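Both policies can be seen in a toy tokenizer where hyphen splitting is a switch (the `split_hyphens` parameter is a hypothetical illustration, not a standard API):

```python
import re

def tokenize(text, split_hyphens=True):
    """Lowercase word tokenizer. With split_hyphens=True, hyphenated
    compounds become two tokens, so a query for one part still matches;
    with False, the compound stays a single token."""
    pattern = r"[a-z0-9']+" if split_hyphens else r"[a-z0-9'-]+"
    return re.findall(pattern, text.lower())
```

The key requirement is consistency: whichever policy is chosen, documents and queries must be tokenized the same way.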

  30. numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces (but we should not split the token) • Older IR systems may not index numbers • But often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately • Creation date, format, etc.

  31. Lucene Analysis tokenization http://lucene.apache.org/ • XY&Z Corporation – xyz@example.com • WhitespaceAnalyzer • [XY&Z] [Corporation] [–] [xyz@example.com] • SimpleAnalyzer – kills numbers • [XY] [Z] [corporation] [xyz] [example] [com] • StopAnalyzer • [XY] [Z] [corporation] [xyz] [example] [com] • StandardAnalyzer • [XY&Z] [corporation] [xyz@example.com]

  32. Elastic Analysis tokenization https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html • “The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.” • Standard analyzer • the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone • Simple analyzer • the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone • Stop analyzer • quick, brown, foxes, jumped, over, lazy, dog, s, bone • Pattern analyzer • regexp

  33. lexical analysis • [cesta ~ WORD]; • [9 ~ NUMBER]; • [, ~ COMMA]; • [1.2.2005 ~ DATE]; • [www.fiit.stuba.sk ~ LINK] • CIT je ... pracovisko ... zriadené k 1.2.2005 (Slovak: “CIT is ... a department ... established as of 1.2.2005”) • [cit ~ WORD]; [je ~ WORD]; [pracovisko ~ WORD]; [zriadené ~ WORD]; [k ~ WORD]; [1.2.2005 ~ DATE]
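Tagging like the above is typically implemented as an ordered list of regular expressions where the first match wins (DATE before NUMBER, LINK before WORD). The patterns below are illustrative simplifications, and the comma is tagged COMMA here:

```python
import re

# Ordered patterns: first match wins.
TOKEN_PATTERNS = [
    ("DATE",   re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$")),   # 1.2.2005
    ("LINK",   re.compile(r"^(www\.)?\w+(\.\w+){2,}$")),    # www.fiit.stuba.sk
    ("NUMBER", re.compile(r"^\d+$")),
    ("COMMA",  re.compile(r"^,$")),
    ("WORD",   re.compile(r"^\w+$")),
]

def tag_token(token):
    """Return the tag of the first pattern that matches the token."""
    for tag, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return tag
    return "OTHER"
```

Ordering matters: "1.2.2005" would also match the LINK pattern, so DATE must be tried first.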

  34. lexical tags to terms • compound words (written as one word or several) • inserting words (notebook, laptop) • spell correction • not applied in documents • applied when users interact • necessary when queries are free text • documents without punctuation (SMS, chat, emails)

  35. language issues • French • L'ensemble – one token or two? • L ? L’ ? Le ? • Want l’ensemble to match with un ensemble • Until at least 2003, it didn’t on Google • Internationalization!

  36. language issues • German noun compounds are not segmented • Lebensversicherungsgesellschaftsangestellter • ‘life insurance company employee’ • German retrieval systems benefit greatly from a compound-splitter module • Can give a 15% performance boost for German

  37. language issues • Chinese and Japanese have no spaces between words: • 莎拉波娃现在居住在美国东南部的佛罗里达。 • Not always guaranteed a unique tokenization • Further complicated in Japanese, with multiple alphabets intermingled (Katakana, Hiragana, Kanji, Romaji) • Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

  38. language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right • Words are separated, but letter forms within a word form complex ligatures • “Algeria achieved its independence in 1962 after 132 years of French occupation.” – reading order alternates: ← → ← → ←, starting from the right • With Unicode, the surface presentation is complex, but the stored form is straightforward

  39. language detection • statistics approaches • N-grams
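A common n-gram approach ranks the most frequent character n-grams per language and compares rankings with an "out-of-place" measure (in the style of Cavnar & Trenkle). A toy sketch — the reference texts below are tiny stand-ins; real profiles are built from large corpora:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile_a, profile_b):
    """Sum of rank differences; missing n-grams get the maximum penalty."""
    max_rank = len(profile_b)
    return sum(abs(i - profile_b.index(g)) if g in profile_b else max_rank
               for i, g in enumerate(profile_a))

def detect(text, references):
    """Pick the reference language whose profile is closest to the text."""
    profiles = {lang: ngram_profile(t) for lang, t in references.items()}
    target = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(target, profiles[lang]))

# Tiny stand-in reference corpora (illustration only).
REFERENCES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
```

Character n-grams work even for short inputs and languages without clear word boundaries, which is why they dominate over word-based statistics here.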

  40. terms

  41. A giant aircraft heads to Slovakia (sample document – Slovak news article, translated) When we ordered it, the world was in a completely different situation. “We assume we will need to move units over greater distances. (...) You know that Slovakia is already active in several missions and operations. These need to be supplied continuously; we need to transport people and rotate them.” The statement was made by former defence minister Martin Fedor at the end of May 2006. At that time, foreign manufacturers were demonstrating large transport aircraft at the Kuchyňa military airfield in Záhorie, from which Slovakia was to choose a replacement for its ageing Antonov aircraft. In the end we chose two Spartan C-27J aircraft. The first of them should arrive next month – more than 11 years after the promotional event in Kuchyňa. In the meantime the situation has changed in distant Iraq and in even more distant Afghanistan. Will the ordered aircraft be used at all? The ministry has a clear answer. Millions and billions: under the agreement we were to pay 34.5 million euros for the first Spartan from the Italian company Alenia Aermacchi, and support and training were expected to cost another 25 million euros. Since the manufacturer is late with the delivery, we can claim compensation. The aircraft should arrive at a time when much larger procurement plans for the army are under discussion here. The ministry would like billions of euros to renew its military equipment. How realistic these plans are should become clear soon, when the public budget for the coming years is presented. Chief of the General Staff of the Armed Forces Milan Maxim recently commented on the Spartan delivery: “It certainly will not remain unused,” he assured journalists at a meeting.

  42. normalization to terms • We need to “normalize” words in indexed text as well as query words into the same form • We want to match U.S.A. and USA • Result is a term: a term is a (normalized) word type, which is an entry in our IR system dictionary • We define equivalence classes of terms by, e.g., • deleting periods to form a term • U.S.A., USA ∈ [USA] • deleting hyphens to form a term • anti-discriminatory, antidiscriminatory ∈[antidiscriminatory]
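The two equivalence-class rules above (delete periods, delete hyphens) fit in one small normalizer:

```python
def normalize_term(word):
    """Map token variants onto one dictionary entry:
    U.S.A. / USA -> usa, anti-discriminatory -> antidiscriminatory."""
    return word.lower().replace(".", "").replace("-", "")
```

Applied identically to documents and queries, this makes "U.S.A." and "USA" index and retrieve under the same term.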

  43. other languages • Accents: e.g., French résumé vs. resume • Umlauts: e.g., German Tuebingen vs. Tübingen • Should be equivalent • Most important criterion: • How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them • Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen ∈ [Tubingen]
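De-accenting can be done by Unicode decomposition: split each character into its base character plus combining marks, then drop the marks. Note this maps Tübingen → Tubingen but does not handle the spelled-out variant Tuebingen, which needs an extra language-specific rule:

```python
import unicodedata

def deaccent(term):
    """Strip diacritics: decompose to NFD, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

As with all normalization, the same function must be applied to indexed text and to queries.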

  44. other languages • Tokenization and normalization may depend on the language and so is intertwined with language detection • Crucial: need to “normalize” indexed text as well as query terms identically • Example: Morgen will ich in MIT … – is this the German word “mit”?

  45. case folding • Reduce all letters to lower case • exception: upper case in mid-sentence? • e.g., General Motors • Fed vs. fed • SAIL vs. sail • Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… • Longstanding Google example: [fixed in 2011…] • Query C.A.T. • #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.

  46. normalization to terms • How do we handle synonyms and homonyms? • E.g., by hand-constructed equivalence classes • car = automobile, color = colour • We can rewrite to form equivalence-class terms • When the document contains automobile, index it under car-automobile (and vice versa) • Or we can expand a query • When the query contains automobile, look under car as well • What about spelling mistakes? • One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
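Soundex is small enough to show in full; this sketch follows the common American Soundex rules (keep the first letter, map consonants to digits, collapse adjacent equal codes, treat h/w as transparent, pad to four characters):

```python
def soundex(word):
    """American Soundex code: first letter + three digits."""
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # collapse runs of the same code
            result += code
        if ch not in "hw":             # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]        # pad with zeros, truncate to 4 chars
```

Words with the same code land in one equivalence class, so a query misspelled as "Rupert" can still retrieve documents containing "Robert".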
