
The Invisible Web

The Invisible Web. INLS 200-001, Week 5, Session 9. Instructor: Sanghee Oh. Today: Logistics (Group Presentation; BSIS major/minor orientation, Stephanie at SILS). Topics: Visible vs. Invisible Web. Activities: Invisible Web Search Exercise.


Presentation Transcript


  1. The Invisible Web INLS 200-001, Week 5, Session 9 Instructor: Sanghee Oh

  2. Today • Logistics • Group Presentation • BSIS major/minor orientation (Stephanie at SILS) • Topics • Visible vs. Invisible Web • Activities • Invisible Web Search Exercise

  3. The Invisible Web

  4. What makes sites/pages “invisible”? • Different search engine algorithms order results differently. • No search engine indexes everything on the web. • Some sites deliberately exclude bots. • Sites change and move frequently, so every search engine database is out of date. • Some file formats can’t be read by spiders. • Some sites and parts of sites require a login. • Some sites and parts of sites are created dynamically.

  5. No search engine indexes everything on the web.

  6. Search Engine Comparison • 4 searches conducted in 10 search engines. • 50% of sites were found by only 1 engine. • Another 30% of sites were found by only 2 engines.
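The kind of overlap figure quoted on slide 6 can be computed by counting, for each distinct site, how many engines returned it. The sketch below is only an illustration: the engine names and result sets are made up, and it does not reproduce the study behind the slide.

```python
from collections import Counter


def overlap_distribution(results_by_engine):
    """Map k -> fraction of distinct sites returned by exactly k engines."""
    engines_per_site = Counter()
    for sites in results_by_engine.values():
        # set() guards against an engine listing the same site twice
        engines_per_site.update(set(sites))
    total = len(engines_per_site)
    sites_per_count = Counter(engines_per_site.values())
    return {k: n / total for k, n in sorted(sites_per_count.items())}


# Hypothetical result sets for three hypothetical engines
results = {
    "engine_a": {"site1", "site2", "site3"},
    "engine_b": {"site2", "site4"},
    "engine_c": {"site2", "site3", "site5"},
}
dist = overlap_distribution(results)
for k, share in dist.items():
    print(f"found by exactly {k} engine(s): {share:.0%}")
```

With real data, a large share at k = 1 would indicate, as the slide argues, that relying on a single engine misses much of what is indexed elsewhere.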

  7. Search Engine Comparison • Different search engines have different-sized indexes • Database Total Size Estimates • Sometimes different search engines are actually using the same index • Search Engine Relationship Chart

  8. Search Engine Comparison

  9. Some sites deliberately exclude bots. • Robots Exclusion Protocol • There are a lot of robots out there. • They generate a lot of traffic. • Can be excluded from: • Individual pages using a META tag • Sections of a site using a file named robots.txt • Example: http://sils.unc.edu/robots.txt

     User-agent: Googlebot
     Disallow: /~shoh/

     User-agent: Wild Ferret Web Hopper
     Disallow: /~shoh
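To see how a well-behaved robot would interpret the rules quoted on slide 9, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are copied from the slide; the page path `index.html` and the agent name `ExampleBot` are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted on the slide (the live file may differ today)
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /~shoh/

User-agent: Wild Ferret Web Hopper
Disallow: /~shoh
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page under the excluded directory
page = "http://sils.unc.edu/~shoh/index.html"

googlebot_allowed = parser.can_fetch("Googlebot", page)
ferret_allowed = parser.can_fetch("Wild Ferret Web Hopper", page)
other_allowed = parser.can_fetch("ExampleBot", page)  # hypothetical agent with no rule

print("Googlebot:", googlebot_allowed)
print("Wild Ferret Web Hopper:", ferret_allowed)
print("ExampleBot:", other_allowed)
```

Both named robots are refused, while an agent with no matching record is allowed because the file has no `User-agent: *` record. Compliance is voluntary: robots.txt only asks crawlers to stay out; it does not enforce anything.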

  10. Bots can’t cope with these:

  11. Some sites and parts of sites are created dynamically.

  12. File formats • What search engines can read: • HTML, SGML, XML • Text • Word • Excel • PowerPoint • RDF • PDF • PostScript • What search engines can’t read: Everything else.

  13. Dynamically-created Pages • The page does not exist until it is created as the result of a specific query. • The content on the page is pulled from a database. • For example: Amazon.com • Bots can only follow links; they can’t conduct searches. • Some indication in the filename: • ?, .jsp, .asp, cgi, .php
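The filename hints on slide 13 can be turned into a rough check. The function below is a hypothetical heuristic, not how any real crawler decides; the name `looks_dynamic` and the example URLs are assumptions, and the substring test for `cgi` is deliberately naive.

```python
from urllib.parse import urlparse

# Extensions listed on the slide as hints of server-side generation
DYNAMIC_EXTENSIONS = (".jsp", ".asp", ".php")


def looks_dynamic(url):
    """Guess whether a URL points to a dynamically generated page."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    if parsed.query:  # text after "?" means parameters are passed to a script
        return True
    if path.endswith(DYNAMIC_EXTENSIONS):
        return True
    return "cgi" in path  # e.g. a /cgi-bin/ directory; may give false positives


for url in (
    "http://example.com/about.html",
    "http://example.com/item?id=42",
    "http://example.com/search.php",
    "http://example.com/cgi-bin/lookup",
):
    print(url, "->", looks_dynamic(url))
```

A URL is only a hint: many dynamically generated pages use clean, static-looking addresses, so a check like this will miss them.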

  14. Why does the Invisible Web Exist? • Most of it is made up of the contents of thousands of specialized searchable databases made available via the web. • Rarely are such pages stored anywhere: it is easier and cheaper to dynamically generate the answer page for each query than to store all the possible pages containing all the possible answers to all the possible queries people could make to the database. • In-depth information available • Up-to-date information available
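The point on slide 14, that answer pages are generated per query rather than stored, can be sketched with a tiny in-memory database. Everything here is made up for illustration: the `films` table, its rows, and the `answer_page` function do not describe any real site.

```python
import sqlite3
from html import escape


def build_catalog():
    """Create a small in-memory database standing in for a site's back end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE films (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO films VALUES (?, ?)",
        [("Example Film A", 1999), ("Example Film B", 2004), ("Another Title", 2004)],
    )
    return conn


def answer_page(conn, year):
    """Build the HTML answer page for one query; nothing is stored on disk."""
    rows = conn.execute(
        "SELECT title FROM films WHERE year = ? ORDER BY title", (year,)
    ).fetchall()
    items = "".join(f"<li>{escape(title)}</li>" for (title,) in rows)
    return f"<html><body><h1>Films from {year}</h1><ul>{items}</ul></body></html>"


conn = build_catalog()
page = answer_page(conn, 2004)
print(page)
```

The page for 2004 exists only after someone asks for it. A crawler that merely follows links never submits that query, so it never sees the page.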

  15. Examples from our everyday life • Google Search • Internet Movie Database • U.S. Census Bureau • CNN.com • Mapquest

  16. Finding Information in the Invisible Web • Find someone who has already collected pointers to databases and provided a list of resources • e.g. librarians, domain experts, etc. • Search for topic and “database” • e.g., research databases • Know your subject matter and start from a well-known piece of literature • e.g., citation search

  17. Sites you should know about • The Invisible Web Directory: invisible-web.net • UNC library’s database page: www.lib.unc.edu • Google Scholar: scholar.google.com • Scirus: www.scirus.com • NCLive: www.nclive.org • Internet Public Library: www.ipl.org • Librarians’ Index to the Internet: lii.org • refdesk.com • FirstGov: www.firstgov.gov

  18. Exercise • The purpose of this exercise is to practice your online search engine skills. For this exercise we will explore some documents produced from the tobacco litigation settlement. (1) tobaccodocuments.org (2) legacy.library.ucsf.edu • Both index the same original content, but were constructed independently

  19. Next • Email me the names of the databases that your group chooses and your preferred presentation date by Sep 22. • Readings: • UNC Libraries Catalog Search Tips • WorldCat: Read Introduction • Antelman, K., Lynema, E. & Pace, A. K. (2006). Toward a 21st Century Library Catalog. Information Technology and Libraries, 25(3), 128-139. Available at: http://www.lib.ncsu.edu/staff/kaantelm/antelman_lynema_pace.pdf
