CSIT225 Internet Programming. CHAPTER 2 PART A THE MECHANICS OF SEARCH ENGINES. Introduction. Search engines keep track of links to Internet resources in their databases. They collect information about World Wide Web sites and allow seekers to search for specific data.
CSIT225 Internet Programming CHAPTER 2 PART A THE MECHANICS OF SEARCH ENGINES
Introduction • Search engines keep track of links to Internet resources in their databases. • They collect information about World Wide Web sites and allow seekers to search for specific data. • Searchers can find anything that is in the database of the search engine they use, but nothing outside it.
Introduction • In this part of the chapter, you will get an idea of the mechanics of a search engine. • The second part of the chapter covers how to use search engines. • For this purpose, we will use the AltaVista search engine as the example.
The Mechanics of Search Engines • In general, what matters most to people who provide content on the Internet is not finding, but being found, and being found in the best way. • They want to make sure that their potential readers will easily find the relevant or appropriate information they have created. • In this part you will see how you can arrange your Web site to be more attractive to both the search engines and the searchers.
How AltaVista Search Finds Sites • The Web, comprising tens of millions of separate documents, is dispersed over more than a hundred thousand computers around the world. • Capturing information from these millions of documents requires a computer program to visit each site, and each page, just as a regular human user would, and retrieve all the text that is found there.
How AltaVista Search Finds Sites • These programs that automatically visit Web sites and gather information are known as “spiders” or “robots”. • “Scooter” is the spider of AltaVista. • Scooter sends out a thousand simultaneous connections to a thousand different Web addresses. • It retrieves the complete text of each Web page.
How AltaVista Search Finds Sites • It then parses the pages and identifies other URLs (Uniform Resource Locators) mentioned in them. • It checks whether they have already been retrieved, then goes and gets the new ones. • That process continues until Scooter has reached all new pages that are linked from older pages on the Internet.
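The fetch, parse, check, and repeat loop described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the "web" here is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP fetches, the `crawl` function and the example.com-style addresses are invented for the example, and nothing here reflects how Scooter is actually implemented.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page mentions.
# A real spider would fetch and parse pages over HTTP instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
    "http://island.example/": ["http://a.example/"],  # no page links here
}


def crawl(start_urls, web):
    """Visit every page reachable from start_urls, each page only once."""
    retrieved = set()
    order = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in retrieved or url not in web:
            continue  # already retrieved, or not a page we can reach
        retrieved.add(url)
        order.append(url)
        for link in web[url]:  # URLs mentioned on the retrieved page
            if link not in retrieved:
                queue.append(link)
    return order


visited = crawl(["http://a.example/"], TOY_WEB)
print(visited)
```

The page at `http://island.example/` is never visited because no retrieved page links to it, which is the situation manual URL submission is meant to address.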
How AltaVista Search Finds Sites • Scooter does not retrieve everything: • Some pages link only to one another and have no connection with the rest of the Web. • There are entire private networks behind firewalls that Scooter cannot access. • There are some pages whose Webmasters have indicated that they do not want them visited by robot programs (using the Robots Exclusion Standard).
Submitting Your Site • You should submit your URL manually to AltaVista in these situations: • Your site is relatively new and very few other sites have pointers to it. • Your site is already in the AltaVista Search Index, but you recently added new pages or made significant changes to the existing pages. • You can submit your URL by clicking the “Add URL” link at the bottom of any AltaVista Search page.
Submitting Your Site • Scooter also visits existing sites and notices when pages no longer exist. • Such pages are automatically purged from the index.
How AltaVista Search Indexes Sites • After Scooter retrieves a site, AltaVista builds an index of every word in every document using custom-designed index software.
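An index of every word in every document is commonly built as an inverted index: a map from each word to the documents, and positions, where it occurs. The sketch below is a minimal illustration, not AltaVista's index software; the `build_index` function and the sample documents are hypothetical, and folding everything to lowercase is a simplification.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    return index


docs = {
    "page1": "Collie dogs are herding dogs",
    "page2": "The capital of Alaska is Juneau",
}
index = build_index(docs)
print(index["dogs"])            # {'page1': [1, 4]}
print(sorted(index["alaska"]))  # ['page2']
```

Storing positions as well as document names is what makes it possible to later check whether a word is near the top of a page or close to another query word.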
How AltaVista Search Ranks Sites • When you submit a simple search query, the results you get are ranked documents, not a random listing. • AltaVista Search ranks the information you receive using two elements: • The rarity of the words you provide in the search query. • The location of the searched words in the documents.
How AltaVista Search Ranks Sites • Rarity is the most important ranking factor. • Extremely common words (e.g., a, and, the, Internet, computer) are ignored for purposes of ranking because they appear in so many documents that they provide no help in establishing how relevant a given document is to your query.
How AltaVista Search Ranks Sites • AltaVista Search prioritizes positions according to the following rules: • Pages that have the target words in the HTML title and near the top of the text are ranked higher than others. This rule includes any content in <META> tags in the head of the document. • Pages in which the words appear near one another in the text are ranked higher than those in which the words are farther apart.
How AltaVista Search Ranks Sites • AltaVista Search prioritizes positions according to the following rules: (Continued) • Pages in which the word appears twice are ranked higher than those with only one occurrence (more than two occurrences make no difference). • The indexing software looks through the documents in the index, considers every document that includes at least one of the words in the query, calculates a score for each, ranks the matches, and generates a prioritized list of Web pages.
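To make the ranking rules concrete, here is a toy scoring function. It is a sketch only: the weights, the logarithmic rarity formula, the ten-word "near the top" window, and the names `score`, `STOP_WORDS`, and `doc_freq` are all assumptions made for illustration, and the proximity rule is left out to keep it short. AltaVista's real formula is not given in this material.

```python
import math

STOP_WORDS = {"a", "and", "the", "internet", "computer"}  # examples from the text


def score(query_words, doc, doc_freq, total_docs):
    """Toy relevance score for one document; the weights are arbitrary."""
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    total = 0.0
    for word in query_words:
        word = word.lower()
        if word in STOP_WORDS or word not in doc_freq:
            continue  # very common or unknown words do not affect ranking
        occurrences = title.count(word) + body.count(word)
        if occurrences == 0:
            continue
        rarity = math.log(1 + total_docs / doc_freq[word])  # rarer word, bigger weight
        points = min(occurrences, 2)  # a second occurrence helps, further ones do not
        if word in title:
            points += 2               # word appears in the HTML title
        elif word in body[:10]:
            points += 1               # word appears near the top of the text
        total += rarity * points
    return total


docs = [
    {"title": "Pets", "body": "a collie is a dog and a dog is a pet"},
    {"title": "Cooking", "body": "recipes for dinner"},
    {"title": "Collie care", "body": "how to groom a collie at home"},
]
doc_freq = {"collie": 2, "dog": 1}  # number of documents containing each word
ranked = sorted(docs, key=lambda d: score(["collie"], d, doc_freq, len(docs)), reverse=True)
print([d["title"] for d in ranked])  # ['Collie care', 'Pets', 'Cooking']
```

The page with the word in its title comes first, the page that only mentions it in the body comes second, and the page without the word scores zero.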
What AltaVista Does Not Index • AltaVista does not index pages that: • Require registration or password-based access. • Rely heavily on images rather than text. • Rely on specialized formats (such as Acrobat and PostScript). • Consist primarily of multimedia files. • Include extremely long (over 4 MB) text files. • Provide dynamic or very frequently and automatically updated content.
What AltaVista Does Not Index • While there is no need to remove or reduce the number of graphics, audio, video, and other multimedia files at your site, you should: • Give them clear and descriptive names. • Provide detailed descriptive text. • Provide alternative forms of the same information, if possible.
Controlling Scooter On Your Site • There are several areas you can control that have a significant effect on how AltaVista Search indexes your site: • HTML document title • Abstract or description • Date • Keywords • Content
Controlling Scooter On Your Site • HTML document title • <TITLE> …… </TITLE> • Abstract or description • <META name="description" content=" …"> • Date • No matter how good your content may be, people who see a date like May 15, 1994 are likely to assume it is out of date.
Controlling Scooter On Your Site • Keywords • <META name="keywords" content=" …"> • When deciding which words are “key” for you, take advantage of unique terms associated with your company and its products. • Content • Comment lines (<!-- …. --> and <COMMENT>…</COMMENT>) are ignored by AltaVista Search when your pages are added to the search index.
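As an illustration of what an indexer can read from the head of a page, the sketch below pulls out the title and the description and keywords META tags using Python's standard `html.parser` module. The `HeadInfo` class and the sample page are hypothetical; this shows the general idea, not how AltaVista processes pages.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collect the <TITLE> text and named <META> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


PAGE = """<HTML><HEAD>
<TITLE>Collie Rescue Club</TITLE>
<META name="description" content="Adoption listings for rescued collies.">
<META name="keywords" content="collie, rescue, adoption">
</HEAD><BODY><!-- internal note --><P>Welcome.</P></BODY></HTML>"""

info = HeadInfo()
info.feed(PAGE)
print(info.title)             # Collie Rescue Club
print(info.meta["keywords"])  # collie, rescue, adoption
```

HTML comments are delivered to a separate callback (`handle_comment`) that this class does not override, so the comment text never reaches the collected data.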
Excluding Pages From AltaVista Search • Why Exclude? • You may want to exclude your site while it’s still being constructed. • You may want to exclude some pages from AltaVista to try to maintain some control over the context of the user’s experience (To ensure that users are accessing your pages one by one). • You may want to exclude pages that only specific people should access.
How To Exclude Pages or Sites • If you have a reason to exclude pages or sites, you can do the following: • Set up a file named robots.txt on your Web server in the main (top-level) directory. • This file should contain a list of directories and files that you want to deny access to. • For instance, the file in the following example would exclude all spiders (*) from three directories (cgi-bin/sources, access_stats, and cafeteria/lunch_menus).
How To Exclude Pages or Sites
User-agent: *
Disallow: /cgi-bin/sources
Disallow: /access_stats
Disallow: /cafeteria/lunch_menus
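One way to check how a well-behaved robot would interpret these rules is Python's standard `urllib.robotparser` module. In the sketch below the rules are fed in as text rather than downloaded, and the host `www.example.com` and the page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/sources
Disallow: /access_stats
Disallow: /cafeteria/lunch_menus
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Ask whether a robot calling itself "Scooter" may fetch each URL.
stats_ok = parser.can_fetch("Scooter", "http://www.example.com/access_stats/today.html")
home_ok = parser.can_fetch("Scooter", "http://www.example.com/index.html")
print(stats_ok, home_ok)  # False True
```

Note that the file is only a request: it keeps out robots that honor the standard, but it does not password-protect anything.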
CSIT225 Internet Programming CHAPTER 2 PART B THE USE OF SEARCH ENGINES Prepared By Cem Yağlı
Getting Started with AltaVista Search • A search engine has the following: • Search Selector: allows you to choose whether you want to search the Web, news, or discussion groups. • Query Field: where you type in the search criteria. • Submit Button: what you click to submit the query for search.
Getting Started with AltaVista Search • A search engine returns the following as a result of your search: • Word Counts: tell approximately how many times each of your search terms is found in the Search Index. • Ignored List: shows which of your search terms are too common to be useful and are therefore not represented in the results.
Getting Started with AltaVista Search • A search engine returns the following as a result of your search: (Continued) • Approximate Number of Matches: indicates roughly how many documents the search engine found that match the criteria. • Title: might not appear in the body of the document, but is the name that the author has given to the document (<TITLE>…</TITLE>).
Getting Started with AltaVista Search • A search engine returns the following as a result of your search: (Continued) • First Text from the Document: gives you a brief idea of what information is in the document (META content=" "). • URL: the fully identified Web address of the document. Example: http://sct.emu.edu.tr/kansoy/csit225.htm
Getting Started with AltaVista Search • A search engine returns the following as a result of your search: (Continued) • Size of the Text within Document: shows how much disk space the actual text of the document requires. • (Web pages can consist entirely of text, or they can contain images and multimedia files. Search engines index only the text portion of Web pages, so other forms are not reflected in this number.)
Getting Started with AltaVista Search • A search engine returns the following as a result of your search: (Continued) • Date: when the file was last modified; helps you determine how old the information might be. • Page Number of Results: lets you move to additional screens of results.
Doing Simple Searches • You can do a simple search by entering your word or words in the Query Field and clicking the Submit Button or pressing the Enter key.
Saving Results • Saving the search result (you are saving the results Web document). • Bookmarking the search (you are saving the query string).
Improving The Simple Search Skills • These techniques help you search more efficiently: • Using rare words (use collie instead of dog or pet). • Using all words that might matter (use capital Alaska or Alaska capital to learn the capital of Alaska). • Using phrases when you know the exact word order (use Office 2000 to find the software package that you are looking for).
Improving The Simple Search Skills • These techniques help you search more efficiently: (Continued) • Using punctuation and spaces (486-DX, 486/DX, and “486 DX” are treated identically, but 486DX is not). • Using capitalization (if you enter red, all of the documents containing red, Red, RED, rEd, and reD will be found, but if you enter Turkey, the documents containing turkey will not be found).
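The capitalization rule can be stated as a small function: a query word typed entirely in lowercase matches any capitalization, while a query word containing capitals must match exactly. This is a sketch of the rule as described here; the function name `case_matches` is made up for the example.

```python
def case_matches(query_word, doc_word):
    """Lowercase queries match any capitalization; others must match exactly."""
    if query_word.islower():
        return query_word == doc_word.lower()
    return query_word == doc_word


print([w for w in ["red", "Red", "RED", "rEd"] if case_matches("red", w)])
print(case_matches("Turkey", "turkey"))  # False
print(case_matches("Turkey", "Turkey"))  # True
```

In practice this means typing a query in lowercase is the safe default, and capitals are worth using only when they help separate a proper noun from a common word.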
Improving The Simple Search Skills • These techniques help you search more efficiently: (Continued) • Using wildcards when you are not sure of the exact word (the search for col*r will match both the British spelling colour and the American color; Black * white will match “Black and white”, “Black red white”, “Black vs. white”, ...).
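A wildcard inside a word behaves much like a simple pattern. The sketch below translates a query such as col*r into a regular expression, assuming that * stands for any run of letters, including none; any limits the real search engine places on wildcards are not modeled, and `wildcard_to_regex` is a hypothetical helper name.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a wildcard word into a case-insensitive regular expression."""
    escaped = re.escape(pattern).replace(r"\*", "[a-z]*")
    return re.compile(escaped, re.IGNORECASE)


matcher = wildcard_to_regex("col*r")
words = ["color", "colour", "collar", "cooler", "colors"]
print([w for w in words if matcher.fullmatch(w)])  # ['color', 'colour', 'collar']
```

The match on "collar" shows the cost of a wildcard: it catches spelling variants, but it can also pull in unrelated words that happen to fit the pattern.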
Improving The Simple Search Skills • These techniques help you search more efficiently: (Continued) • Searching according to structural elements • AltaVista lets you search for nine kinds of Web page structural elements: 1) text: 2) title: 3) link: 4) anchor: 5) url: 6) host: 7) domain: 8) image: 9) applet: • Example: to search for all documents that have links to your Web page, use link:http://sct.emu.edu.tr/kansoy/index.htm
Improving The Simple Search Skills • These techniques help you search more efficiently: (Continued) • Requiring specific words or phrases: • Dear Hunting finds all documents containing either “Dear” or “Hunting”. • Dear +Hunting finds all documents containing “Hunting” and ranks those that also contain “Dear” first. • +Dear +Hunting finds only the documents that contain both “Dear” and “Hunting”.
Improving The Simple Search Skills • These techniques help you search more efficiently: (Continued) • Excluding specific words or phrases (Indian -India finds documents about Native Americans and excludes those that mention India). • Combining elements (+link:yourdomain.com -host:yourdomain.com will give all the Web pages outside your own site that have hypertext links to your site).
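The + and - operators can be read as a filter applied to each candidate document. The sketch below is a simplified model with assumed names (`parse_query`, `matches`): it treats a document as a bag of lowercase words, ignores ranking, phrases, and the capitalization rule, and does not handle structural prefixes such as link: or host:.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.append(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            optional.append(term)
    return required, excluded, optional


def matches(query, text):
    """Return True if the text satisfies the query's + and - constraints."""
    words = set(text.lower().split())
    required, excluded, optional = parse_query(query)
    if any(term in words for term in excluded):
        return False
    if not all(term in words for term in required):
        return False
    # With no required terms, at least one optional term must appear.
    return bool(required) or any(term in words for term in optional)


print(matches("Dear +Hunting", "hunting season opens"))         # True
print(matches("+Dear +Hunting", "hunting season opens"))        # False
print(matches("Indian -India", "indian recipes from india"))    # False
```

In this model an optional term next to a required one never changes whether a document matches; as the slide notes, it only affects the order in which matches are ranked.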