
Collaborative Search

This article explores the use of collaborative search and intelligent crawlers in traditional information retrieval systems. It discusses the challenges and solutions in representing information needs, indexing documents, and formulating queries. The importance of ranking and factors influencing it are also analyzed. Additionally, the indexing of web pages and the role of crawlers in discovering and downloading documents from the web are examined.


Presentation Transcript


  1. Collaborative Search Zheng Zhen

  2. Traditional IR • Web search • Crawlers: parallel crawler, intelligent crawler • Collaborative Search • References

  3. Traditional IR System (diagram) • User side: User → Problem (information need) → Representation (question) → Query (search formulation) • Document side: Acquisition (documents, objects) → Representation (indexing, ...) → Database of indexed documents • Matching (searching) of the query against the database → Retrieved objects → Feedback

  4. Classic Information Retrieval • Homogeneous documents • Well categorized • ‘Small’ well-controlled collection • Closed, static environment • Controlled collection growth

  5. Web Search • Web: - open, dynamic environment - vast uncontrolled collection of PAGES • Web page: - heterogeneous: various formats, languages … - content may change over time! • Importance of LINKS • Existing Search Facilities: • Generic: Yahoo, AskJeeves, Google, etc. • Specialized: Pluribus, Collaborative Spider

  6. Common operations • Indexing - identifies potential index terms in documents • Query processing - form keywords • Search - access indexed file • Ranking

  7. Ranking • Ranking is important • Factors which influence rank • Term location or frequency • Proximity to query terms • Date of publication • Length • Popularity • Heuristics: proper nouns may have higher weights • WWW: link analysis, popularity (e.g. Google)

  8. The Web: indexing • Web pages are heterogeneous documents • Contain both text information and meta information • External meta information can be inferred • Must be processed before the pertinence can be established

  9. Indexing WWW documents • Web pages require preprocessing to get a uniform data structure - Normalizes the document stream to a predefined format - Breaks the document stream into desired retrievable units - Isolates and metatags subdocument pieces (Diagram: pages 1 … n from Web sites 1 … n → preprocessing → uniform format)

  10. Computing weights • Assign weight to each descriptor for document & add to index • Weights are based on: • term frequency within the document (tf) • Global term frequency within the corpus • This will be a problem when using parallel independent agents to do indexing
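The two statistics on this slide can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and the common tf × log(N/df) weighting; the slide does not say how the two frequencies are combined, and the names `local_counts`, `merge_stats` and `weight` are hypothetical. It also hints at why independent agents are a problem: the global frequency is only known after every agent's partial counts have been merged.

```python
import math
from collections import Counter

def local_counts(docs):
    """Per-agent pass: term frequency per document, plus document frequency."""
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, df

def merge_stats(partials):
    """Combine document frequencies and corpus sizes from independent agents."""
    global_df, total_docs = Counter(), 0
    for tf, df in partials:
        global_df.update(df)
        total_docs += len(tf)
    return global_df, total_docs

def weight(term, doc_counts, global_df, total_docs):
    """tf * log(N / df): an assumed weighting, not prescribed by the slide."""
    if term not in doc_counts or global_df[term] == 0:
        return 0.0
    return doc_counts[term] * math.log(total_docs / global_df[term])

# Two hypothetical agents index disjoint parts of the corpus.
agent_a = local_counts({"d1": "web crawler web", "d2": "search engine"})
agent_b = local_counts({"d3": "web search"})
global_df, total_docs = merge_stats([agent_a, agent_b])
w = weight("web", agent_a[0]["d1"], global_df, total_docs)
```

Agent A alone would see "web" in one of its two documents, while the merged statistics show it in two of three, so the weight it computes locally differs from the global one.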

  11. IR on Web (diagram) • Web → Web crawlers → Web pages → Document Processor → Indexed files • Query → Query Processor → Search & match (against the indexed files) → Page ranking → Responses → Browse

  12. Web: Document discovery • Corpus is very large • Dynamic • Open • Documents must be discovered • …. use Web crawler

  13. Web Crawler • What is a Crawler? (Diagram: init → get next URL → get page from the Web → extract URLs, looping back to get next URL; data stores: initial URLs, scheduled URLs, visited URLs, web pages)
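The crawler loop on this slide can be sketched as below. To stay runnable without network access, the "Web" is a hypothetical in-memory dictionary (`FAKE_WEB`) mapping each URL to its outlinks, and the breadth-first scheduling order is an assumption; the slide does not fix one.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

def crawl(initial_urls, web, max_pages=100):
    """Basic loop: init, get next URL, get page, extract URLs."""
    scheduled = deque(initial_urls)   # URLs waiting to be fetched
    visited = []                      # URLs already fetched, in fetch order
    seen = set(initial_urls)          # scheduled or visited, to avoid repeats
    while scheduled and len(visited) < max_pages:
        url = scheduled.popleft()         # get next URL
        outlinks = web.get(url, [])       # get page (stand-in for a download)
        visited.append(url)
        for link in outlinks:             # extract URLs
            if link not in seen:
                seen.add(link)
                scheduled.append(link)
    return visited

order = crawl(["a"], FAKE_WEB)
```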

  14. Parallel Crawler Advantages: • Faster … • Imperative for large-scale crawling • Can be run on cheaper machines • Network load dispersion • Network load reduction (Diagram: Crawler1, Crawler2, … CrawlerN download from the Web into a store of downloaded Web pages) *Parallel Crawlers, Cho, Junghoo et al., University of California, WWW2002, Honolulu, Hawaii, USA

  15. Evaluation Metrics • Overlap = 1 - (# of unique pages downloaded / # of pages downloaded by the team of crawlers) • Coverage = # of pages downloaded by the parallel crawler / total # of reachable pages • Communication overhead = # of exchanged messages / # of page downloads
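The three ratios can be computed directly from per-crawler download logs. The sketch below follows the definitions as written on this slide; the function names and the sample data are illustrative, and coverage is assumed to count each page once even if several crawlers fetched it.

```python
def overlap(downloads_per_crawler):
    """1 - (unique pages / all pages downloaded by the team)."""
    all_downloads = [url for urls in downloads_per_crawler for url in urls]
    return 1 - len(set(all_downloads)) / len(all_downloads)

def coverage(downloads_per_crawler, reachable):
    """Distinct reachable pages downloaded / total reachable pages."""
    unique = {url for urls in downloads_per_crawler for url in urls}
    return len(unique & reachable) / len(reachable)

def communication_overhead(messages_exchanged, downloads_per_crawler):
    """Exchanged messages per page download."""
    total = sum(len(urls) for urls in downloads_per_crawler)
    return messages_exchanged / total

# Hypothetical logs: both crawlers fetched page "g".
logs = [["a", "b", "c", "g"], ["f", "g", "h", "i"]]
reachable = set("abcdefghi")
team_overlap = overlap(logs)                      # 1 - 7/8
team_coverage = coverage(logs, reachable)         # 7/9
team_overhead = communication_overhead(2, logs)   # 2/8
```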

  16. Assignment of search areas • Partitioning the Web • Address division: .net, .ca, UdeM.ca • Topic • Static assignment (see next slide) • Dynamic assignment (see multi-agent collaborative search)

  17. Partition function Multitude of ways to partition the web • Site-hashing: based on the hash value of the site name of a URL • URL-hashing: based on the hash value of the entire URL • Hierarchical: partition the web hierarchically based on the URLs of the pages Partitioning will come up again with agents!
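A minimal sketch of the three partition functions, assuming SHA-256 as the hash and a domain-suffix table for the hierarchical case (the slide names neither). The function names and example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, n_crawlers):
    """Map a string to a crawler index in a stable, process-independent way."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers

def site_hash(url, n_crawlers):
    """Site-hashing: all pages of one site go to the same crawler."""
    return _bucket(urlsplit(url).netloc.lower(), n_crawlers)

def url_hash(url, n_crawlers):
    """URL-hashing: pages of one site may be spread over several crawlers."""
    return _bucket(url, n_crawlers)

def hierarchical(url, suffix_to_crawler, default=0):
    """Hierarchical: assign by domain suffix (an assumed reading of the slide)."""
    host = urlsplit(url).netloc.lower()
    for suffix, crawler in suffix_to_crawler.items():
        if host.endswith(suffix):
            return crawler
    return default

same_site = site_hash("http://example.net/a", 4) == site_hash("http://example.net/b/c", 4)
by_domain = hierarchical("http://www.example.ca/x", {".net": 0, ".ca": 1})
```

Site-hashing keeps intra-site links inside one partition, which matters because most links on a page point to the same site.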

  18. Crawling modes (Examples)* • Firewall mode • Cross-over mode • Exchange mode (Diagram: Site1, crawled by Crawler1, holds pages a, b, c, d, e; Site2, crawled by Crawler2, holds pages f, g, h, i) *Parallel Crawlers, Cho, Junghoo et al., University of California, Los Angeles, WWW2002, Honolulu, Hawaii, USA

  19. Firewall mode: download within partitions • Crawler1: a→b, a→c • Crawler2: f→g, g→h, g→i (Diagram: Site1 (Crawler1): a, b, c, d, e; Site2 (Crawler2): f, g, h, i) d and e are overlooked!

  20. Cross-over mode: download between partitions • Crawler1: a→b, a→c; then a→g, g→h, h→d, d→e, g→i • Crawler2: f→g, g→h, g→i; then h→d, d→e (Diagram: Site1 (Crawler1): a, b, c, d, e; Site2 (Crawler2): f, g, h, i) Duplication of work!

  21. Exchange mode: download within partitions, exchange info • Crawler1: a→b, a→c; then g → Crawler2 • Crawler2: f→g, g→h, g→i; then d → Crawler1 (Diagram: Site1 (Crawler1): a, b, c, d, e; Site2 (Crawler2): f, g, h, i) Requires communication
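The three modes can be compared with a small simulation. The link graph below is inferred from the crawl traces on slides 19 to 21 (a→b, a→c, a→g, f→g, g→h, g→i, h→d, d→e) and is an assumption, as are the seeds a and f. One simplification: in cross-over mode this sketch follows foreign links immediately instead of waiting until the crawler has exhausted its own partition, which changes the order but not the set of downloaded pages.

```python
from collections import deque

LINKS = {"a": ["b", "c", "g"], "b": [], "c": [], "d": ["e"], "e": [],
         "f": ["g"], "g": ["h", "i"], "h": ["d"], "i": []}
OWNER = {**dict.fromkeys("abcde", 1), **dict.fromkeys("fghi", 2)}
SEEDS = {1: "a", 2: "f"}

def run(mode):
    """Simulate two crawlers; mode is 'firewall', 'cross-over' or 'exchange'."""
    downloaded = {1: set(), 2: set()}
    queues = {1: deque([SEEDS[1]]), 2: deque([SEEDS[2]])}
    messages = 0
    while queues[1] or queues[2]:
        for crawler in (1, 2):
            if not queues[crawler]:
                continue
            page = queues[crawler].popleft()
            if page in downloaded[crawler]:
                continue                      # already fetched by this crawler
            downloaded[crawler].add(page)
            for link in LINKS[page]:
                owner = OWNER[link]
                if owner == crawler or mode == "cross-over":
                    queues[crawler].append(link)
                elif mode == "exchange":
                    queues[owner].append(link)   # hand the URL to its owner
                    messages += 1
                # firewall mode: foreign links are simply dropped
    return downloaded, messages

firewall, _ = run("firewall")
crossover, _ = run("cross-over")
exchange, exchange_messages = run("exchange")
```

In this toy graph, firewall mode misses d and e, cross-over mode downloads g, h, i, d and e twice, and exchange mode downloads every page once at the cost of two messages.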

  22. Minimizing communication in Exchange Mode • Batch communication • Allow replication 1) Because the number of links to a page follows a Zipf distribution (... 20-80 factor) 2) Replicate some popular URLs at each crawler (Diagram: Zipf distribution of incoming links per page)

  23. Evaluating quality • We want important pages • Quality measure: |Pages ∩ Top_k| / |Top_k| • Pages: the k downloaded pages • Top_k: the top k most important pages* *Indication of importance: backlink count
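A short sketch of the quality measure, using backlink count as the importance indicator named on the slide. The tie-break by page name and the sample link graph are assumptions made only to keep the example deterministic.

```python
def top_k_by_backlinks(links, k):
    """Return the k pages with the most incoming links (ties broken by name)."""
    backlinks = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            backlinks[target] = backlinks.get(target, 0) + 1
    ranked = sorted(backlinks, key=lambda p: (-backlinks[p], p))
    return set(ranked[:k])

def quality(downloaded, top_k):
    """|Pages ∩ Top_k| / |Top_k|"""
    return len(set(downloaded) & top_k) / len(top_k)

# Hypothetical link graph: c has 3 backlinks, b has 2, a and d have none.
links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c", "b"]}
top2 = top_k_by_backlinks(links, 2)
q = quality(["a", "c"], top2)
```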

  24. Comparison[2] From experiments[2]: 1) firewall mode: number of parallel crawlers < 4; lower quality 2) exchange mode: small network traffic; maximizes quality 3) replicating between 10,000 – 100,000 (sic) popular URLs reduces communication overhead by 40%

  25. Intelligent crawling* • Indiscriminate crawlers (e.g. for Google) • Any new page is good • Topic-oriented crawlers • e.g. calls for tenders • We just want new pages on a topic of interest • Intelligent crawler *Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al., IBM T. J. Watson Research Center, WWW10, Hong Kong, 2001

  26. Focused Crawling • Which node to explore next? • Depth-first? Breadth-first? • Best-first! But what is best? • Focused crawling is best; how to establish focus? -- Linkage locality -- Sibling locality (Diagram: pages labelled with topics X and Y illustrating linkage locality and sibling locality)

  27. Focused Crawling • Objective: given a specific query, find: -- Good sources of content (authorities) ... many links TO -- Good sources of links (hubs) ... many links FROM (Diagram: authorities, hubs) • Given an arbitrary query, can we auto-focus? -- learning capability -- learning model

  28. Learning Model • Analyze links from pages on the search periphery • Learning how to pick good links to follow (Diagram legend: visited web page, page to visit, hyperlink; nodes 1, 2, C, 3, 4)

  29. Learning Model • Clues based on - content - URL tokens - linkage info - sibling structure • Different needs require different learning - the crawler needs to learn during the crawl - reuse learning information • The crawler should be intelligent

  30. Intelligent Crawling • Priority list of URLs to be explored (Plist) • User defined predicate to compute interest of page (= processed query) • KB: knowledge base

  31. Intelligent Crawling • Algorithm Intelligent-Crawler();
      Begin
        Priority-List (PList) = { Starting Seeds };
        While not (termination) do
        begin
          Reorder URLs on PList using KB;
          Drop unimportant items from PList;
          W <= pop the first element on PList;
          Fetch the Web page W;
          Parse W and add all the outlinks in W to PList;
          If W satisfies the user-defined predicate, then store W;
          Update KB using content and link information for W;
        end
      End
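The algorithm above can be sketched in Python as follows. Several parts are assumptions rather than the authors' method: the Web is a hypothetical in-memory dictionary (`WEB`), the knowledge base only learns from URL tokens, a candidate URL is scored by the best hit rate seen among its tokens, and "drop unimportant items" is implemented as truncating the priority list to `max_plist` entries.

```python
import re

# Hypothetical in-memory Web: URL -> (page text, outlinks).
WEB = {
    "http://seed.example/": ("portal", ["http://eshop.example/a", "http://news.example/x"]),
    "http://eshop.example/a": ("online malls here", ["http://eshop.example/b"]),
    "http://eshop.example/b": ("more online malls", []),
    "http://news.example/x": ("daily news", ["http://news.example/y"]),
    "http://news.example/y": ("weather", []),
}

def url_tokens(url):
    return set(re.findall(r"[a-z0-9]+", url.lower()))

class KnowledgeBase:
    """Per URL token: pages crawled and pages that satisfied the predicate."""
    def __init__(self):
        self.token_crawled = {}
        self.token_satisfied = {}

    def update(self, url, ok):
        for tok in url_tokens(url):
            self.token_crawled[tok] = self.token_crawled.get(tok, 0) + 1
            self.token_satisfied[tok] = self.token_satisfied.get(tok, 0) + int(ok)

    def score(self, url):
        """Best observed hit rate among the URL's tokens; 0.5 if none seen yet."""
        rates = [self.token_satisfied[t] / self.token_crawled[t]
                 for t in url_tokens(url) if t in self.token_crawled]
        return max(rates) if rates else 0.5

def intelligent_crawl(seeds, predicate, web, max_fetches=10, max_plist=50):
    kb = KnowledgeBase()
    plist, seen = list(seeds), set(seeds)
    fetched, stored = [], []
    while plist and len(fetched) < max_fetches:        # termination test
        plist.sort(key=kb.score, reverse=True)         # reorder PList using KB
        del plist[max_plist:]                          # drop unimportant items
        url = plist.pop(0)                             # pop the first element
        text, outlinks = web.get(url, ("", []))        # fetch (stand-in)
        fetched.append(url)
        for link in outlinks:                          # add outlinks to PList
            if link not in seen:
                seen.add(link)
                plist.append(link)
        ok = predicate(text)
        if ok:
            stored.append(url)                         # store satisfying pages
        kb.update(url, ok)                             # update KB
    return fetched, stored

fetched, stored = intelligent_crawl(
    ["http://seed.example/"], lambda text: "online malls" in text, WEB, max_fetches=3)
```

With a budget of three fetches, the crawler visits the second "eshop" page before the news page that was discovered earlier, because the token "eshop" has already been associated with a page that satisfied the predicate.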

  32. Intelligent Crawler During the crawling process, we can accumulate some information, such as: • number of URLs crawled, N1 • number of URLs crawled which satisfy the predicate, N2 • # of pages satisfying the predicate in which word i occurs, N3 • # of pages with a keyword in the URL which satisfy (or not) the predicate … • How to create a KB? A later example will illustrate URL-based learning

  33. Intelligent Crawler Example: the user is interested in ‘online malls’ BUT only 0.1% of web pages contain ‘online malls’ HOWEVER if the word ‘eshop’ is in the URL then the probability of the page containing ‘online malls’ = 5% Thus we should add to the KB the fact that ‘eshop’ in the URL is a useful criterion in choosing pages to explore.

  34. Formal view* • C: the event that a crawled web page satisfies the given predicate • P(C): probability of event C, P(C) = N2 / N1 • E: a fact that we know about a candidate URL • Knowledge of the event E may increase the probability P(C): P(C|E) = P(C ∩ E) / P(E) • Interest ratio for the event C given event E: IR(C,E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) * P(E)) • The values of P(C ∩ E) and P(E) can be calculated during the crawl *from: Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al.

  35. Mall example • 0.1% of web pages contain ‘online malls’ and so satisfy the predicate: P(C) = 0.1% • if the word ‘eshop’ occurs in the URL (event E), the probability of satisfying the predicate increases to P(C|E) = 5% • So the interest ratio is IR(C,E) = P(C|E) / P(C) = 5% / 0.1% = 50
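The interest ratio can be computed from counts gathered during the crawl. In the sketch below the function name and the raw counts are hypothetical; the counts are chosen only so that P(C) = 0.1% and P(C|E) = 5%, as in the mall example.

```python
def interest_ratio(n_crawled, n_satisfying, n_with_e, n_satisfying_with_e):
    """IR(C,E) = P(C ∩ E) / (P(C) * P(E)), estimated from crawl counts."""
    p_c = n_satisfying / n_crawled
    p_e = n_with_e / n_crawled
    p_c_and_e = n_satisfying_with_e / n_crawled
    return p_c_and_e / (p_c * p_e)

# 100,000 pages crawled, 100 satisfy the predicate (0.1%);
# 2,000 have 'eshop' in the URL, and 100 of those satisfy it (5%).
ir = interest_ratio(100_000, 100, 2_000, 100)
```

A ratio above 1 means that knowing E makes the predicate more likely, so URLs exhibiting E deserve a higher priority; a ratio below 1 means the opposite.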

  36. Collaborative Search • 3 ways to search for information: browsing, querying and filtering • Collaboration types [10]: • Collaborative browsing • Mediated searching • Collaborative information filtering • Collaborative agents • Collaborative reuse of results

  37. Collaborative Search • What do we mean by collaboration? • Human ↔ Computer ↔ Human • Human ↔ Computer • Computer agent ↔ Computer agent

  38. Collaborative Search • Man - machine: - Collaborative browsing: Ariadne system [23] - Collaborative reuse of results: Pluribus [21] (2000) - Collaborative information filtering: collaborative filtering [25] - Mediated searching: DIAMS [22] (2000) • Machine - machine (… collaborative agents): - meta-search engines: Meta Crawler, Mamma, Metagopher, Copernic - topic-oriented collaborative crawler [11] (2002) - Collaborative spider [16] (2002) - UbiCrawler [5] (2003) - Collaborator [19] (under development)

  39. Existing systems: meta-search engines • Meta Crawler, Mamma, Metagopher, Copernic • pass the query to other search engines • collect the results from those search engines • combine the results for the user
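The pass, collect and combine steps can be sketched as below. The slide does not say how results are combined, so the rank-by-rank interleaving with duplicate removal is an assumption, and the two engines are hypothetical stand-ins that return fixed result lists.

```python
from itertools import zip_longest

def meta_search(query, engines):
    """Pass the query to each engine, collect the ranked lists, merge them."""
    result_lists = [engine(query) for engine in engines]   # pass and collect
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):                # combine rank by rank
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical engines returning ranked URL lists.
def engine_a(query):
    return ["u1", "u2", "u3"]

def engine_b(query):
    return ["u2", "u4"]

results = meta_search("online malls", [engine_a, engine_b])
```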

  40. Topic-oriented collaborative crawlers [11] (2002) • Each crawler is given a specific topic • It knows the topics of its colleagues • It sends URLs of pages it doesn’t care about to the one responsible for the topic Problems: • static predefined topic categories • static assignment by a partition function • a controller assigns sites to each crawler

  41. Collaborative spiders [16] (2002) • JATLite (Java Agent Template Lite), uses KQML • User agents + ONE scheduler agent, collaborator agent (as a mediator) • search, content mining, post-retrieval analysis • system, group, user sharing information

  42. UbiCrawler [5] (2003) • consistent hashing partition function: buckets are agents, keys are hosts • failure detector: the only synchronous component • each agent keeps track of the visited URLs in a hash table • pure Java application, RMI based, multi-threaded agents
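A minimal sketch of a consistent-hashing partition function with agents as buckets and hosts as keys. It illustrates the general technique, not UbiCrawler's actual implementation; the class name, the number of replicas per agent and the use of SHA-256 are assumptions.

```python
import bisect
import hashlib

def _h(key):
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHash:
    """Hash ring: each agent owns several points; a host maps to the next point."""
    def __init__(self, agents, replicas=100):
        self.replicas = replicas
        self._ring = []                      # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        if not self._ring:
            raise ValueError("no agents on the ring")
        i = bisect.bisect_right(self._ring, (_h(host), "")) % len(self._ring)
        return self._ring[i][1]

hosts = [f"host{i}.example" for i in range(200)]
ring = ConsistentHash(["agent1", "agent2", "agent3"])
before = {h: ring.agent_for(h) for h in hosts}
ring.remove("agent2")                        # simulate an agent failure
after = {h: ring.agent_for(h) for h in hosts}
moved = [h for h in hosts if before[h] != after[h]]
```

When an agent disappears, only the hosts it was responsible for are reassigned; every other host keeps its agent, which is what makes the scheme attractive for a decentralized crawler.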

  43. Collaborator [19] (under development) • a shared workspace framework for virtual teams • 3-tier architecture, J2EE + agents (BlueJADE): client tier, middle tier, enterprise information systems tier • personal agents, session management agents • desktop or wireless device • JADE, FIPA

  44. Conclusion Current collaborative search: - collaborative - dynamic - adaptive exploring - intelligent - decentralized Trend: agents

  45. Multi-agent collaborative search: challenges? (Diagram: a query, agent_1, agent_2, … agent_n, the Web, and one DataStore per agent)

  46. Challenges • Dynamic partition? - dynamically assigning web domains to agents • Load balancing? - each cache stores roughly the same # of pages • Content lookup? - an agent can easily locate the storage holding particular content Solution: Web cache & consistent hashing

  47. Web Caching • Content (URL -> content) • For download efficiency • Indexing information (Keyword -> URL) • Search efficiency
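The two kinds of cached information on this slide can be sketched as two dictionaries. This is an illustrative structure only: the class name, the whitespace tokenization and the `download` callback are assumptions, and re-storing a changed page would leave stale keywords in the index.

```python
from collections import defaultdict

class WebCache:
    """Content cache (URL -> content) plus keyword index (keyword -> URLs)."""
    def __init__(self):
        self.content = {}
        self.index = defaultdict(set)

    def store(self, url, text):
        self.content[url] = text
        for word in set(text.lower().split()):
            self.index[word].add(url)

    def get(self, url, download):
        """Return cached content; call download(url) only on a cache miss."""
        if url not in self.content:
            self.store(url, download(url))
        return self.content[url]

    def search(self, keyword):
        """Look up URLs by keyword without touching the Web."""
        return set(self.index.get(keyword.lower(), set()))

calls = []

def fake_download(url):          # hypothetical stand-in for an HTTP fetch
    calls.append(url)
    return "Online malls directory"

cache = WebCache()
first = cache.get("http://eshop.example/", fake_download)
second = cache.get("http://eshop.example/", fake_download)
hits = cache.search("malls")
```

The second `get` is served from the cache, so the page is downloaded only once, and the keyword lookup needs no download at all.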

  48. Browser caching 1. For efficiency 2. Each client has its own cache (Diagram: www.abc.com → caches → clients)

  49. Proxy caches 1. Each cache stores a subset of all pages 2. Each client knows several caches (Diagram: www.abc.com (domain) → caches → clients)

  50. Agent’s web cache communication (Diagram: users, the Web, three agents, and one Web cache per agent)
