
SWE 363: WEB ENGINEERING & DEVELOPMENT

SWE 363: WEB ENGINEERING & DEVELOPMENT. Spring Semester 2012-2013 (2012-2). Module 1-2: Web Basics. Dr. Nasir Al-Darwish Computer Science Department King Fahd University of Petroleum and Minerals darwish@kfupm.edu.sa. Outline. Finding Information on the Web Search Engines Other means


Presentation Transcript


  1. SWE 363: WEB ENGINEERING & DEVELOPMENT Spring Semester 2012-2013 (2012-2) Module 1-2: Web Basics Dr. Nasir Al-Darwish Computer Science Department King Fahd University of Petroleum and Minerals darwish@kfupm.edu.sa

  2. Outline • Finding Information on the Web • Search Engines • Other means • Web 2.0

  3. Search Engines • The Web is a rich repository of information on almost any subject • To help find information quickly on a specific subject, use • Subject indexes (also known as search directories); note that the Web lacks a central directory • Search engines: the primary tools for finding information on the Web that is relevant to the user query • There are many search engines • Some are used for general search, while others (called vertical search engines) focus on specific topics (e.g. bioinformatics, medical, jobs, business, real estate, travel, etc.) • Examples of major search engines • Google, Yahoo, Bing, Ask.com, About, etc. • See a list of them at http://www.thesearchenginelist.com/ • They differ in their capabilities and the way they work

  4. Search Engines Market Share • Jan. 2013 statistics. Source: http://www.karmasnack.com/about/search-engine-market-share/

  5. How Web Search Engines Work • Search Engine • A search engine searches a database of previously stored information about web pages • It supports a web-based GUI that allows users to enter their search criteria (search query) • It finds the best matches for the search criteria and returns the results to the user • Web Crawler (spider) • Navigates the web and retrieves web pages that satisfy certain criteria • Starts with a popular Web site containing lots of links, such as Yahoo, then continues until it reaches a logical stop, e.g. a dead end with no external links, or a set number of levels inside the Web site’s structure • Indexing • Pages are analyzed and a list of words (extracted from titles, headings and special meta tags) is stored in indexes to facilitate quick retrieval
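The crawler behaviour described on slide 5 (start from a link-rich seed page, follow links, stop at dead ends or at a level limit) can be sketched as a breadth-first traversal. This is only an illustration: `TOY_WEB`, `crawl` and `max_depth` are hypothetical names, and the "web" here is an in-memory dictionary rather than real HTTP fetches.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
TOY_WEB = {
    "portal": ["news", "sports"],
    "news": ["article", "portal"],
    "sports": ["scores"],
    "article": [],            # dead end: no outgoing links
    "scores": ["archive"],
    "archive": [],
}

def crawl(web, seed, max_depth):
    """Breadth-first crawl from `seed`, following links at most `max_depth` levels."""
    visited = [seed]          # pages in the order they were discovered
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue          # level limit reached: do not follow this page's links
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited

pages_found = crawl(TOY_WEB, "portal", max_depth=2)
```

With a limit of two levels the crawl discovers every page except `archive`, which sits three links away from the seed.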

  6. How Web Search Engines Work … • Large search engines, such as Google, index hundreds of millions of web pages involving a large number of distinct terms (in different spoken languages), and answer millions of queries every hour. • Some search engines pre-process the user query to improve retrieval performance (a process called query expansion), e.g. • Correcting spelling errors • Searching for synonyms of the specified keyword • Stemming the given keywords to find all morphological forms • Some search engines rank (sort according to relevance) the pages that satisfy the criteria specified in the user query
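The three query-expansion steps listed on slide 6 can be illustrated with a toy sketch. The lookup tables `SPELLING` and `SYNONYMS` and the suffix list are made-up assumptions; the suffix stripping is deliberately crude and is not a real stemming algorithm.

```python
# Hypothetical lookup tables; a real engine would derive these from large corpora.
SPELLING = {"recieve": "receive", "serch": "search"}
SYNONYMS = {"car": ["automobile"], "job": ["career"]}
SUFFIXES = ("ing", "es", "s")   # crude suffix stripping, not a real stemmer

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def expand_query(query):
    """Return the corrected query terms plus their stems and synonyms."""
    terms = []
    for raw in query.lower().split():
        word = SPELLING.get(raw, raw)          # 1. correct known misspellings
        root = stem(word)                      # 2. reduce to a stem
        for candidate in [word, root, *SYNONYMS.get(root, [])]:  # 3. add synonyms
            if candidate not in terms:
                terms.append(candidate)
    return terms

expanded = expand_query("serch cars")
```

Here the misspelled `serch` is corrected to `search`, and `cars` contributes its stem `car` and the synonym `automobile`.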

  7. How Google Works • Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. • Parallel processing allows computations to execute concurrently, significantly speeding up data processing. • Google has three distinct parts: • Googlebot, a web crawler that finds and fetches web pages. • The indexer, which sorts every word on every page and stores the resulting index of words in a huge database. • The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
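The indexer and query processor described on slide 7 can be pictured with a small inverted index. This is a minimal sketch under simplifying assumptions, not Google's implementation: `PAGES`, `build_index` and `search` are hypothetical names, words are split on whitespace only, and a page matches only if it contains every query word.

```python
from collections import defaultdict

def build_index(pages):
    """Indexer: map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Query processor: return ids of pages containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

# Hypothetical pages, as if already fetched by a crawler.
PAGES = {
    "p1": "web search engines index pages",
    "p2": "a crawler fetches web pages",
    "p3": "search engines rank results",
}
INDEX = build_index(PAGES)
hits = search(INDEX, "search engines")
```

Because the index is built once, ahead of time, answering a query only requires set intersections rather than a scan of every page.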

  8. Google’s PageRank • PageRank uses a mathematical algorithm based on a graph, the webgraph, with the HTML pages as nodes and hyperlinks as edges • Google uses several measures, including PageRank, to order the hits of a search query. For example, Google offers personalized search results based on your web history and location.

  9. Google’s PageRank … • A hyperlink pointing to a web page counts as a vote of support. If there are no links pointing to a page, there is no support for that page. • The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). • A page that is linked to by many pages with high PageRank receives a high rank itself. • In the figure, Page C has a higher PageRank than Page E, even though there are fewer links to C; this is because the one link to C comes from a page with a high ranking.
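The recursive definition on slide 9 can be approximated by repeated iteration over the link graph. The sketch below is an assumption-laden illustration: the graph `LINKS` is invented (it is not the figure from the slide), the damping factor of 0.85 is the value commonly quoted from the Brin and Page paper, and the handling of pages without outgoing links is one common convention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank; `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}      # start from a uniform rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: C has one incoming link (from the well-supported B),
# while E has three incoming links from pages that nobody links to.
LINKS = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["B"],
    "F": ["B", "E"],
    "G": ["B", "E"],
}
ranks = pagerank(LINKS)
```

In this toy graph C ends up ranked well above E despite having fewer incoming links, which mirrors the point made about the figure: the quality of the linking page matters, not just the number of links.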

  10. Challenges Faced by Search Engines • Size of the Web • The Indexed Web contains at least 9.45 billion HTML pages (as of Wednesday, 05 September 2012). • Currency (Freshness) • Many Web pages are updated frequently, which forces the search engine to revisit them periodically. • Relevancy • The queries one can make are currently limited to searching for exact words; this can result in many false positives • Better results might be achieved by using a proximity-search option or organic search engines
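The proximity-search option mentioned on slide 10 can be illustrated as a check that two query words occur close together, rather than merely somewhere on the same page. This is a minimal sketch; `within_proximity` and `max_gap` are hypothetical names, and real engines define proximity operators in their own ways.

```python
def within_proximity(text, first, second, max_gap):
    """True if `first` and `second` occur at most `max_gap` word positions apart."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_gap
        for i in first_positions
        for j in second_positions
    )

sentence = "the search engine ranks pages"
near = within_proximity(sentence, "search", "pages", max_gap=3)
far = within_proximity(sentence, "search", "pages", max_gap=2)
```

A page that mentions both words far apart would pass an exact-word match but fail this check, which is how proximity search can cut down on false positives.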

  11. Shortcomings of Search Engines • Problems with dynamically generated Web sites • These sites may be slow or difficult to index, or may yield an excessive number of results • Search engines can be tricked • They can be made to return pages that favor the tricksters but contain little or no information about the matching phrases • This pushes the more relevant Web pages further down the results list • Indexing secured pages • Content hosted on HTTPS URLs poses a challenge for crawlers, which either can't browse the content for technical reasons or won't index it for privacy reasons

  12. Invisible Web • Also called deep Web, dark Web, hidden Web • Refers to web content that is not seen (or indexed) by general search engines • “It would be a site that's possibly reasonably designed, but they didn't bother to register it with any of the search engines. So, no one can find them! You're hidden. I call that the invisible Web.” Frank Garcia, http://en.wikipedia.org/wiki/Deep_Web

  13. Invisible Web • Deep Web resources may be classified into one or more of the following categories: • Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge. • Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. • Private Web: sites that require registration and login (password-protected resources). • Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence). • Scripted content: pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.

  14. Enhancing Site Visibility • Search engine optimization (SEO) • Developing or tuning a website to improve its ranking in non-paid search engine results, in order to maximize traffic to the site • SEO can be done using White hat SEO or Black hat SEO • White hat SEO: refers to methods approved by search engines (i.e. methods that do not attempt to deceive search engines), e.g. • Offering quality content • Using proper metadata and effective keywords • Having inbound links from relevant high-quality pages • Black hat SEO (spamdexing): methods that are used to deceive search engines (e.g. repeating certain words, especially with links); these can result in temporary improvement • A Google bomb is a black hat SEO technique that attempts to trick the search engine into giving a certain page a high ranking for searches on unrelated or off-topic keyword phrases (by creating a large number of links to the page from various places)

  15. Other Ways to Find Info. on the Web • Meta-search engines • have no databases or indexes of their own; they query multiple other search engines and aggregate their results, e.g. WebCrawler, DogPile • Web directories • human-edited databases that store information about links in a categorized manner (e.g. Yahoo! Directory, DMOZ, Business.com, etc.); they can also be created automatically by mining the output of search engines • Public Web portals • multi-service web sites that provide a single point of access to a variety of content and services. Example: http://msn.com • often include customizable pages, calendars, discussion groups, announcements, reports, searches, email and address books, and access to news, weather, maps, and shopping, as well as bookmarks. • often organize information into channels, where specific information or an application appears, to facilitate locating information of interest by content category.
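How a meta-search engine might aggregate the results of several underlying engines can be sketched with a simple position-based score. This merging rule is an assumption chosen for illustration; it is not claimed to be what WebCrawler or DogPile actually do, and the result lists and URLs are made up.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Combine ranked result lists: each URL scores 1/position in every list it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Highest combined score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical ranked results returned by two underlying engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example", "x.example"]
merged = merge_results([engine_a, engine_b])
```

A URL that several engines rank highly (here `y.example`) rises to the top of the merged list, while URLs returned by only one engine fall toward the bottom.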

  16. Web 2.0 • Web 1.0 is mostly a “brochure web”; companies and advertisers produce content for users to access • Web 2.0 provides collaborative community-based platforms that allow more user participation, interaction and community contributions • Users create content, help organize it, critique it, update it, etc. • Users create open source software and make it available for anyone to use and modify • Users direct how media is delivered and which news and information outlets to trust • Examples: Wikis, YouTube, Flickr, MySpace, Facebook, LinkedIn, Google, etc. • The growth of Web 2.0 can be attributed to some key factors • Improvements in hardware: cheaper and faster • Memory capacities and speeds increasing at a rapid rate • Faster Internet access • Availability of open source software, which has resulted in cheaper (and often free) customizable software options

  17. Web 2.0 … • User Generated Content • Allows users to edit existing content and add new information • Collaboration can result in smart ideas • But users might also deliberately submit false or faulty information • Web 2.0 companies rely on collaborative filtering to help police their sites • Let users promote valuable material and flag offensive or inappropriate material • Wikis (What I Know Is) and social networks, e.g. Wikipedia, MySpace, YouTube, Facebook, LinkedIn, Second Life, etc.

  18. Web 2.0 … • Blogs (“Web logs”) • Websites consisting of entries listed in reverse chronological order; exponential growth; bloggers = blog authors • Blogs also incorporate media, such as music or videos, e.g. Xanga, LiveJournal • Social Media • Allows users to decide which content (or news articles) is most significant, e.g. Digg, Reddit, StumbleUpon • Social Bookmarking • Allows users to recommend their favorite sites, e.g. del.icio.us (http://delicious.com/), ma.gnolia (http://gnolia.com/, now defunct) • Tagging • Labeling existing web content by subject or keywords so that anyone can locate information more effectively, pushing the content right to the user’s desktop • RSS feeds • RSS = Rich Site Summary (also expanded as Really Simple Syndication) • Allow users to receive new information as it is updated
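To make the RSS bullet on slide 18 concrete, the sketch below reads the items of a small RSS 2.0 document using Python's standard library. The feed content in `SAMPLE_FEED` is invented for illustration; a feed reader would download such a document periodically from a site and show any items it has not seen before.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; newest entry first, as is typical for blogs.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>First post</title><link>http://example.com/1</link></item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return (title, link) pairs for every <item> in the feed, in document order."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]

items = read_items(SAMPLE_FEED)
```

Each `<item>` element carries the title and link of one entry, which is all a reader needs to notify the user of new content.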

  19. Web 2.0 - Summary • The term “Web 2.0” defines an era; like “Dot Com” • Read/Write, two-way, anyone can be a publisher • Social Web, Social Networks (MySpace, Facebook, OpenSocial) • Enhanced Search (e.g. Google search by file type, file and specialized search engines) • Online Media (YouTube, Hulu, Last.fm) • Content Aggregation / Syndication (Bloglines, Google Reader, Techmeme, Topix) • Mashups (Google Maps, Flickr, Amazon) • Personalized Web pages, Widgets (e.g., iGoogle, My Yahoo)

  20. Web 3.0 • Semantic web (attach meaning to data), personalization (e.g. iGoogle), intelligent search and behavioral advertising among other things.

  21. References • Data Communications and Networking, 4/e. B.A. Forouzan, McGraw-Hill Higher Education 2007. http://www.mhhe.com/forouzan • The World Wide Web Consortium (W3C) • The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin & Lawrence Page at Stanford University • Dive Into Web 2.0, http://www.deitel.com/freeWeb20ebook/ • “Crawling the Hidden Web,” Proc. of the 27th International Conf. on Very Large Data Bases (VLDB), pp. 129-138.
