Search Engines INLS 200-001, Week 4, Session 7 Instructor: Sanghee Oh
Today • Logistics • 2nd draft of the research prospectus: due next Monday • Topics • How do search engines work? • How does Google work? • Activities • Google Advanced Search
How do search engines work? Web Crawler / Spiders Databases & Indexes (Inverted Index) Search Results Ranking
How Search Engines Work • Three main parts: • Gather the contents of all web pages (using a program called a crawler or spider) • Organize the contents of pages in a way that allows efficient retrieval (indexing) • Take in a query, determine which pages match, and show the results (ranking and display of results)
Standard Web Search Engine Architecture (diagram) • 1. Crawler machines: crawl the web • 2. Databases: check for duplicates, store the documents, create an inverted index • Inverted index → Search engine servers
Standard Web Search Engine Architecture (diagram) • 1. Crawler machines: crawl the web • 2. Databases: check for duplicates, store the documents, create an inverted index • Inverted index → Search engine servers • 3. Results ranking: user query → search engine servers → show results to user
1. Web Crawlers / Spiders • Crawlers gather pages/sites • Programs that move from site to site on the web and gather information about the pages found • Start with a list of domain names (homepages), and follow the hyperlinks on those homepages • Keep a list of URLs visited and those still to be visited • At each site, the crawler may focus on breadth or depth • Breadth – gathers top pages and moves on to another site • Allows it to find more sites • Depth – gathers all pages at a site • Allows it to index more pages in each site • How frequently a site gets crawled varies • From engine to engine • From site to site
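The crawl loop described on this slide can be sketched in a few lines of Python. This is only an illustration, not how any real engine is built: the `LINKS` dictionary is a hypothetical in-memory stand-in for the web (a real crawler would fetch pages over HTTP and parse their hyperlinks), and all URLs are invented.

```python
from collections import deque

# Hypothetical link graph: page URL -> hyperlinks found on that page.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/news", "c.example/"],
    "b.example/news": [],
    "c.example/": ["a.example/"],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl; returns pages in the order they were visited."""
    to_visit = deque(seeds)   # URLs still to be visited
    visited = set()           # URLs already visited
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # "Fetch" the page and queue every link not seen yet.
        for link in LINKS.get(url, []):
            if link not in visited:
                to_visit.append(link)
    return order

print(crawl(["a.example/"]))
```

Replacing `popleft()` with `pop()` would make the traversal depth-first, which loosely mirrors the breadth versus depth choice on the slide.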
Web crawlers do collect… • Mostly HTML pages • PDF • Word • PPT, etc.
Web crawlers do not collect… • Documents they are told not to collect (for example, by a site’s robots.txt file) • Documents behind dead links • Documents in the deep Web (the Invisible Web)
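One common way a site tells crawlers what not to collect is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to check two URLs against a small rule set. The robots.txt content, the site name, and the `ExampleBot` agent name are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an invented site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the rules above permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("http://site.example/index.html"))           # permitted
print(allowed("http://site.example/private/report.html"))  # blocked by Disallow
```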
2. Databases & Indexing • Databases • Input from crawlers, from submissions by authors, from related directories • Cached pages • Descriptions of pages (indexes) • The size of the database is an important issue • Even the largest does not cover the entire Web
2. Databases & Indexing • Indexing • Each page that is included in the database is indexed (automatically) • “All” the words on the page (for full-text search) • Stop words • Metatags: title, others • URL • Hypertext anchors and links • Spamming • Load words into metatags • Load invisible words (e.g., white text on white background)
Inverted Index • How to store the words for fast lookup • Basic steps: • Make a “dictionary” of all the words in all of the web pages • For each word, list all the documents it occurs in. • Often omit very common words • “stop words” • Sometimes stem the words • (also called morphological analysis) • cats -> cat • running -> run • In reality, this index is huge. • Need to store the content across many machines • Need to do optimization tricks to make lookup fast
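The basic steps on this slide can be sketched as follows. This is a toy version under stated assumptions: the stop-word list is a tiny invented sample, and `stem` only strips a plural "s" (so it handles cats -> cat but not running -> run, which needs a real stemmer). The three documents are made up.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "and", "of", "in", "is", "are"}

def stem(word):
    """Crude plural stripping only; real stemmers are far more careful."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_index(docs):
    """Map each (stemmed) word to the sorted list of documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if not word or word in STOP_WORDS:
                continue
            index[stem(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

DOCS = {
    1: "The cat sat in the sun.",
    2: "Cats and dogs are running.",
    3: "A dog is a friend of the cat.",
}
index = build_index(DOCS)
print(index["cat"])  # documents containing "cat" or "cats"
print(index["dog"])  # documents containing "dog" or "dogs"
```

Looking up a word is then a single dictionary access, which is why this structure is fast at query time even though building it requires reading every page.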
3. Results ranking • Search engine receives a query, then • Looks up the words in the index, retrieves many documents, then • Rank orders the pages and extracts “snippets” or summaries containing query words. • Most web search engines assume the user wants all of the words (Boolean AND, not OR). • These are complex and highly guarded algorithms unique to each search engine.
Some ranking criteria • For a given candidate result page, use: • Number of matching query words in the page • Frequency of terms on the page and in general • Proximity of matching words to one another • Location of terms within the page • Location of terms within tags e.g. <title>, <h1>, link text, body text • Anchor text on pages pointing to this one • Link analysis of which pages point to this one • (Sometimes) Click-through analysis: how often the page is clicked on • How “fresh” is the page
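The criterion "frequency of terms on the page and in general" is often illustrated with TF-IDF weighting: a word counts for more when it is frequent on the page but rare across the collection. The sketch below is one textbook formulation, not a claim about what any particular engine uses. The documents and the helper names `tf_idf` and `score` are invented.

```python
import math
from collections import Counter

# Invented mini-collection.
DOCS = {
    "doc1": "online shopping deals and online coupons",
    "doc2": "shopping list for the week",
    "doc3": "weather report for the week",
}

def tf_idf(term, doc_id, docs):
    """Term frequency on the page times (log) rarity across all pages."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)
    containing = sum(1 for text in docs.values() if term in text.split())
    if containing == 0:
        return 0.0
    idf = math.log(len(docs) / containing)
    return tf * idf

def score(query, doc_id, docs):
    """Sum the TF-IDF weights of all query words for one page."""
    return sum(tf_idf(term, doc_id, docs) for term in query.split())

ranked = sorted(DOCS, key=lambda d: score("online shopping", d, DOCS), reverse=True)
print(ranked)
```

Here doc1 ranks first because it contains both query words and repeats the rarer one, while doc3 contains neither and scores zero.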
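Number of matching query words in the page • Example query: US Education System • (Figure: Doc 1 and Doc 2 compared)
Frequency of terms on the page and in general • Example query: Online Shopping • (Figure: Doc 2 and Doc 1 compared)
Proximity of matching words to one another • Example query: Sleep Paralysis • (Figure: Doc 1 and Doc 2 compared)
Location of terms within the page • Example query: Transportation Fuel • (Figure: Doc 1 and Doc 2 compared)
Anchor text on pages pointing to this one • Example query: tobacco advertising • (Figure: Doc 2 and Doc 1 compared)
Link analysis of which pages point to this one • Example query: US Prison Population • (Figure: Doc 1 and Doc 2 compared)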
Search Engine Rankings? • Complex formulae combine different ranking criteria together.
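As a rough illustration of combining criteria, the simplest possible "formula" is a weighted sum of per-page signals. Everything in this sketch is an assumption made for the example: the feature names, the weights, and the two pages are invented, and real engines use far more signals and more elaborate models.

```python
# Hypothetical weights for a handful of the criteria listed earlier.
WEIGHTS = {
    "matching_words": 2.0,
    "in_title": 3.0,
    "inbound_links": 0.5,
    "freshness": 1.0,
}

def combined_score(features):
    """Weighted sum of feature values; missing features count as 0."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

# Invented feature values for two candidate pages.
pages = {
    "page_a": {"matching_words": 2, "in_title": 1, "inbound_links": 4, "freshness": 0.2},
    "page_b": {"matching_words": 2, "in_title": 0, "inbound_links": 10, "freshness": 0.9},
}
ranking = sorted(pages, key=lambda p: combined_score(pages[p]), reverse=True)
print(ranking)
```

With these weights page_b outranks page_a on the strength of its inbound links, even though page_a has the query words in its title. Changing the weights changes the order, which is why the real formulae are closely guarded.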
Collecting Results • Googlebot, Google’s “spider”, crawls billions of pages on the web. • The spider asks web servers to send web pages and scans the pages for links, which lead Google to other pages. Each page is assigned a number. • Google builds an index that maps words to these page numbers.
Presenting Results When someone puts a query into Google… • Google uses the index to find the pages that include the words in the query. • Google ranks the pages in order of relevance.
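An Example • Query: civil war • “civil” is in documents: 3, 8, 22, 56, 68, 92 • “war” is in documents: 2, 8, 15, 22, 68, 77 • Which documents have both words? Documents 8, 22, and 68. • Each word’s list of documents is called a “posting list”; the answer to the query is the intersection of the two lists.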
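The "which documents have both words?" step can be sketched as a merge of the two sorted lists, walking through both at once. The function name `intersect` is chosen for this example; the document numbers are the ones from the slide.

```python
def intersect(a, b):
    """Return the values present in both sorted lists, in order."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a[i] cannot appear later in b, skip it
        else:
            j += 1   # b[j] cannot appear later in a, skip it
    return result

civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
print(intersect(civil, war))  # [8, 22, 68]
```

Because both lists are kept sorted, each is read only once, which matters when the lists contain millions of document numbers.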
Ranking Results • Google’s PageRank evaluates: • How many links there are to a web page from other pages. • The quality of the linking sites. • “A link from Page A to Page B is like a vote from Page A to Page B.”
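The "links as votes" idea can be sketched with the commonly published iterative formulation of PageRank. This is a simplified teaching version, not Google's production system: the damping factor of 0.85 and the four-page link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's rank along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)
print(best, {p: round(r, 3) for p, r in ranks.items()})
```

Page C comes out on top because three pages link to it, and page A benefits in turn because it receives C's single, undivided vote. That second effect is the "quality of the linking sites" point from the slide.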
Google PageRank • (Figure: link graph connecting Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 7, and Doc 8)
PageRank Value • Values range from 0 to 10 • The PageRank value of a website can be checked in the Google Toolbar
Ranking Results: What else matters? • Proximity of words. • Preference given to pages with your words: • In the order you typed • Close together • In phrases • Word location (like in titles or headings). • Frequency of words. • …And about 97 other factors.
Boolean Terms (AND, OR, NOT) • crime AND music • alcoholism OR binge drinking • china NOT dishware
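On top of an inverted index, the three Boolean operators correspond to set operations on the lists of matching documents. The postings below are invented for the example, and the helper names are hypothetical.

```python
# Hypothetical postings: term -> set of document IDs containing it.
POSTINGS = {
    "crime": {1, 2, 5},
    "music": {2, 3, 5},
    "china": {4, 6, 7},
    "dishware": {6},
}

def docs_for(term):
    """Documents containing `term` (empty set if the term is unknown)."""
    return POSTINGS.get(term, set())

def docs_and(a, b):
    return docs_for(a) & docs_for(b)   # intersection: both terms

def docs_or(a, b):
    return docs_for(a) | docs_for(b)   # union: either term

def docs_not(a, b):
    return docs_for(a) - docs_for(b)   # difference: a without b

print(sorted(docs_and("crime", "music")))     # [2, 5]
print(sorted(docs_or("crime", "music")))      # [1, 2, 3, 5]
print(sorted(docs_not("china", "dishware")))  # [4, 7]
```

AND narrows a search, OR widens it, and NOT removes an unwanted sense of a word.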
Google Search Tip 1 • AND is the default connector • All your terms should be somewhere: • In the text of result pages • In pages that link to a result page • In other pages on the same site as the result page • To force a term to be included, use “+” (plus sign) • OR must be capitalized: california OR oregon • Use - (minus sign) for NOT: china -porcelain
Google Search Tip 2 • Phrase searching • Phrase searches use “ ” for the specific phrase rather than separate words: “moon landing”
Google Search Tip 3 • Field searching • intitle: Words must occur in the official title of the page. Try: intitle:mileage “hybrid cars” or allintitle:mileage “hybrid cars”
Google Search Tip 4 • inurl: Words must occur in the URL. Try: inurl:ils library science or allinurl:ils library science • filetype: Results must be files of the given type. Try: filetype:ppt china one child policy
Google Search Tip 5 • Synonym searches (~) • ~food also matches recipes, nutrition, cooking • ~help also matches guide, tutorial, FAQ, manual • Similar pages searches • Try: related:www.consumerwebwatch.org
Google Search Tip 6 • Find definitions of words • Try: define:internet • Calculator functions… • Try: 2+2 • Try: What is the mass of an electron?
Google Advanced Search • Find results with • all these words (AND) • this exact wording or phrase (“ ”) • one or more of these words (OR) • unwanted words (- minus sign) • Domain search • Page-specific search • Google Advanced Search Exercise
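The Advanced Search form essentially assembles the operators from the tips above into one query string. The sketch below does the same by hand. The function and parameter names are invented for this example, the `site:` operator is assumed here as the usual way to express a domain search (the slides do not name it), and operator support has changed over the years, so treat this as an illustration rather than a reference.

```python
from urllib.parse import urlencode

def build_query(all_words=(), phrase=None, any_words=(), exclude=(),
                intitle=None, filetype=None, site=None):
    """Assemble a query string from Advanced Search style fields."""
    parts = list(all_words)                       # all these words (AND)
    if phrase:
        parts.append(f'"{phrase}"')               # exact phrase
    if any_words:
        parts.append(" OR ".join(any_words))      # one or more of these words
    parts += [f"-{word}" for word in exclude]     # unwanted words
    if intitle:
        parts.append(f"intitle:{intitle}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")              # domain search (assumed operator)
    return " ".join(parts)

def search_url(query):
    """URL-encode the query for the standard Google search address."""
    return "https://www.google.com/search?" + urlencode({"q": query})

print(build_query(all_words=["china"], exclude=["porcelain"]))
print(build_query(phrase="hybrid cars", intitle="mileage"))
print(search_url(build_query(any_words=["california", "oregon"])))
```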
Other Search Engines • Ask.com • Kartoo.com • Others people use?
Next • Due: 2nd draft of research prospectus • Email me by 5:00 pm • Readings for next class • Sullivan, D. (2005). Web Searching Tips. Search Engine Watch. • Search Engine Math, Power Searching for Anyone, and Search Assistance Features. For a summary, see the Search features chart. • Recommended: • Ackermann, E., & Hartman, K. (2000). The Information Specialist's Guide to Searching and Researching on the Internet and the World Wide Web. Chicago: Fitzroy Dearborn. [Available from Blackboard] • Chapter 6, Search strategies for search engines, pp. 144-153 only