
WebGather Design and Implementation

Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000. Email: yhf@net.cs.pku.edu.cn http://net.cs.pku.edu.cn/~yhf


Presentation Transcript


  1. WebGather Design and Implementation • Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000 • Email: yhf@net.cs.pku.edu.cn • http://net.cs.pku.edu.cn/~yhf

  2. Outline • Introduction to search engines • WebGather • Conclusion

  3. Introduction: http://www.yahoo.com/

  4. Introduction: http://sohu.com/

  5. Introduction: http://sina.com.cn/

  6. Introduction: http://www.google.com/

  7. Introduction: http://e.pku.edu.cn/

  8. Introduction: Search engine sizes (Search Engine Watch, Nov. 8, 2000) • GG = Google • WT = WebTop.com • AV = AltaVista • FAST = FAST • NL = Northern Light • EX = Excite • INK = Inktomi • Go = Go (Infoseek)

  9. Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000 • Number of indexable pages on the web: over 1 billion • Number of servers discovered: 6,409,521 • Number of mirrors in servers discovered: 1,457,946 • Number of sites (total servers minus mirrors): 4,951,247 • Number of good sites (reachable over a 10-day period): 4,217,324 • Number of bad sites (unreachable): 733,923 • Web pages per good site: 1,000,000,000 / 4,217,324 ≈ 237.1
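
The pages-per-site figure on this slide can be re-derived from the quoted counts. The following is a small check, assuming exactly 1,000,000,000 pages as the slide does (the study says "over 1 billion", so the result is a lower bound). The variable names are illustrative.

```python
# Figures quoted on the slide (Inktomi / NEC Research Institute study).
indexable_pages = 1_000_000_000  # "over 1 billion": treated as exactly 1e9 here
sites = 4_951_247
good_sites = 4_217_324           # reachable over a 10-day period
bad_sites = 733_923              # unreachable

# The good and bad site counts add up to the quoted number of sites.
assert good_sites + bad_sites == sites

# Average number of indexable pages per reachable site.
pages_per_site = indexable_pages / good_sites
print(f"{pages_per_site:.1f} pages per good site")  # prints "237.1 pages per good site"
```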

  10. Introduction: Inktomi search engine cluster • In the picture: 9 × 8 × 2 = 144

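  11. WebGather: Introduction • The "Tianwang" (天网, WebGather) Chinese-English search engine system, developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University, is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Encoding and Distributed Chinese-English Information Discovery". It began serving Internet users on CERNET as a web information navigation service on October 29, 1997. While in service, the system has taken up many user comments and suggestions and steadily improved its service quality; to date it has received more than 8 million visits. The new Tianwang search engine research group, formed in early 2000 and funded by the national 973 Key Basic Research Development Program, carries on the tradition of the original development team and works on the key technologies of Chinese-English search engine systems, in order to give users faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. Comments and suggestions from users are welcome. • http://e.pku.edu.cn/ • "Though we lack the paired wings of the bright phoenix to fly together, our hearts are linked as one." (身无彩凤双飞翼,心有灵犀一点通)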

  12. WebGather: status as of Dec. 1, 2000 • 2.5 million scale • Indexes 2.5 million web pages • More than 200,000 web pages every day • Ten days to update all data • Three PCs

  13. WebGather: Design goals for a distributed web-crawling system • Collect all the web pages in China • Keep pace with the rapid growth of Chinese web information • 238 × 40,000 = 9,520,000
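
The product on this slide can be set against the throughput quoted on the previous slide. This is a back-of-the-envelope sketch, not a calculation from the talk. The slide does not label the two factors, so reading 238 as pages per site (237.1 rounded up) and 40,000 as a number of sites is an assumption. The rate of 200,000 pages per day and the ten-day update cycle come from the Dec. 1, 2000 status slide.

```python
# Factors from the slide; their interpretation is an assumption (see lead-in).
pages_per_site = 238        # assumed: average pages per site, rounded up from 237.1
site_count = 40_000         # assumed: number of sites to cover
target_pages = pages_per_site * site_count

# Figures from the Dec. 1, 2000 status slide.
current_rate = 200_000      # pages per day ("more than 200,000")
update_cycle_days = 10      # "ten days to update all data"

# Days needed for one full pass at the current rate.
days_at_current_rate = target_pages / current_rate

# Daily rate needed to keep a ten-day update cycle at the target size.
required_rate = target_pages / update_cycle_days

print(target_pages)          # 9520000
print(days_at_current_rate)  # about 47.6 days
print(required_rate)         # 952000.0 pages per day
```

Under these assumptions the gathering rate would have to grow by a factor of roughly five, which is consistent with the stated goal of a distributed crawling system.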

  14. WebGather 2.0: architecture [Diagram. Components: WWW, Gatherer, Indexer, Retriever, User behavior, Client, Client. Data stores: Gather Database, Retrieve Database, log database]

  15. WebGather 1.2: architecture of gather subsystem 1/4 [Diagram: Main Control; Gather1, Gather2, …, GatherN]
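
The diagram shows a main control node and N gatherers, but not how URLs are divided among them. One common way to do this is to hash each URL's host name, so that every page of a site goes to the same gatherer. The sketch below illustrates that idea only; the hashing scheme and the names `assign_gatherer` and `dispatch` are assumptions, not WebGather's actual design.

```python
import hashlib
from urllib.parse import urlsplit


def assign_gatherer(url: str, num_gatherers: int) -> int:
    """Map a URL to a gatherer index by hashing its host name (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_gatherers


def dispatch(urls, num_gatherers):
    """Play the role of the main control: build one URL queue per gatherer."""
    queues = [[] for _ in range(num_gatherers)]
    for url in urls:
        queues[assign_gatherer(url, num_gatherers)].append(url)
    return queues


if __name__ == "__main__":
    seeds = [
        "http://e.pku.edu.cn/",
        "http://e.pku.edu.cn/help.html",
        "http://sohu.com/",
        "http://sina.com.cn/",
    ]
    for index, queue in enumerate(dispatch(seeds, 3), start=1):
        print(f"Gather{index}: {queue}")
```

Keeping a whole host on one gatherer makes per-site politeness limits easy to enforce, at the cost of uneven load when a few sites are very large.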

  16. WebGather 2.0: architecture of gather subsystem 1/4

  17. WebGather: technologies in gather subsystem 1/4 • Distributed system architecture • High availability • …… • Load balancing • Low bandwidth • Scalability • Re-configurability • …… • Word segmentation ("cut words") • Position relativity • Anchor text, link popularity
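
"Cut words" refers to splitting Chinese text, which has no spaces between words, into terms that can be indexed. The slide does not say which method WebGather uses. The sketch below shows forward maximum matching, a simple dictionary-based method, purely as an illustration; the function name, the toy dictionary, and the maximum word length are assumptions.

```python
def cut_words(text, dictionary, max_len=4):
    """Segment text by forward maximum matching against a word dictionary.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character and move on.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary for illustration only.
toy_dictionary = {"北京", "大学", "北京大学", "搜索", "引擎", "搜索引擎"}
print(cut_words("北京大学搜索引擎", toy_dictionary))  # ['北京大学', '搜索引擎']
```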

  18. WebGather: architecture of indexer subsystem 2/4 [Diagram with two panels, A and B, relating webpage1, webpage2, …, webpageN to feature1, feature2, …, featureN]
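
The diagram relates web pages to features in two layouts. One plausible reading, which is an assumption since the figure itself is not preserved, is that one panel is a forward index (page to features) and the other an inverted index (feature to pages). The sketch below shows that conversion; `invert` and the sample data are hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn a page -> features mapping into a feature -> pages mapping."""
    inverted = defaultdict(list)
    for page, features in forward_index.items():
        for feature in features:
            # Record each page at most once per feature.
            if page not in inverted[feature]:
                inverted[feature].append(page)
    return dict(inverted)


forward = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature2", "feature3"],
    "webpage3": ["feature1"],
}
for feature, pages in sorted(invert(forward).items()):
    print(feature, pages)
```

A retriever can then answer a one-term query with a single lookup in the inverted mapping instead of scanning every page.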

  19. WebGather: technologies in retriever subsystem 3/4 • Traditional IR (VSM) • Query cache, hot clicks • Word segmentation ("cut words") • Anchor text, link popularity
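
Two of the items above, the vector space model (VSM) and the query cache, can be illustrated together. The sketch below ranks a toy collection by cosine similarity over raw term counts and caches results per query string with `functools.lru_cache`. The sample pages, the weighting (plain counts rather than TF-IDF), and the cache size are assumptions made for illustration, not details from the talk.

```python
import math
from collections import Counter
from functools import lru_cache

# Toy collection for illustration only.
DOCS = {
    "page1": "web search engine design",
    "page2": "distributed web crawler",
    "page3": "campus news",
}


def vectorize(text):
    """Represent text as a term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOC_VECTORS = {name: vectorize(text) for name, text in DOCS.items()}


@lru_cache(maxsize=1024)
def search(query):
    """Return matching page names, best first; repeated queries hit the cache."""
    q = vectorize(query)
    scored = [(cosine(q, vec), name) for name, vec in DOC_VECTORS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return tuple(name for score, name in ranked if score > 0)
```

Calling `search("web search")` twice computes the ranking once; the second call is answered from the cache, which is the effect a query cache aims for with popular queries.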

  20. WebGather: technologies in user behavior subsystem 4/4 • Link popularity • Replica popularity • User popularity
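
The slide names link popularity without defining it. A simple interpretation, assumed here, is the number of distinct pages that link to a given page. The sketch below counts in-links from a list of (source, target) pairs; the function name and sample links are hypothetical.

```python
from collections import Counter


def link_popularity(links):
    """Count distinct in-links per target page, ignoring self-links."""
    seen = set()
    counts = Counter()
    for source, target in links:
        if source != target and (source, target) not in seen:
            seen.add((source, target))
            counts[target] += 1
    return counts


sample_links = [
    ("a", "c"),
    ("b", "c"),
    ("a", "c"),  # duplicate link, counted once
    ("c", "c"),  # self-link, ignored
    ("c", "a"),
]
print(link_popularity(sample_links).most_common())
```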

  21. Conclusion • Search engines are becoming more and more important. • The Web is a good experimental subject; there is a lot of R&D we can do on it.
